Fair Domain Generalization with Arbitrary Sensitive Attributes
We consider the problem of fairness transfer in domain generalization with sensitive attributes whose sensitivity changes across domains
Abstract
Reviews and Discussion
This paper studied a novel fair domain generalization problem where multiple sensitive attributes existed in different domains. The key challenge of this fair domain generalization problem is to deal with multiple potential sensitive attributes, where any combination of sensitive attributes can appear in the unseen testing domains. It then presented a feasible solution by learning a domain-invariant representation and a sensitive-attribute-invariant representation from the training domains. The objective function included four components: a domain-invariance loss, a fairness-aware invariance loss, a classification loss, and an equalized odds (fairness) loss. Experiments on two real-world data sets showed that the proposed method outperformed DG baselines in terms of generalization and fairness.
Strengths
Originality: This paper focused on a novel fair domain generalization problem with multiple sensitive attributes. It was a much more challenging problem setting than previous work due to the complicated interconnections of different sensitive attributes. The major technical novelty of this paper was to selectively learn the invariant representation based on the sensitive attributes, e.g., generate representations with respect to sensitive attributes. Experimental results demonstrated the effectiveness of the proposed SISA method over several baselines in terms of both generalization and fairness metrics.
Quality: The fair domain generalization problem was well-defined. The motivating example in Figure 1 also illustrated that fair domain generalization with multiple sensitive attributes was a challenging yet practical problem. In the derived objective function, both generalization performance and fairness were encouraged in different loss terms.
Clarity: Overall, the presentation of this paper was clear and the derived problem was well-motivated. Experiments also showed the training procedures and evaluation metrics for performance comparison between the proposed method and baselines. Ablation studies supported that the model performance was relatively robust to the hyper-parameters.
Significance: This paper extended previous fair domain generalization to a more general setting where multiple sensitive attributes could appear in different domains.
Weaknesses
W1: The technical novelties of this paper are unclear. The proposed SISA approach involves several techniques from previous works, e.g., invariant representation learning with domain density translators, equalized odds loss, contrastive loss, etc. The major technical contributions could be emphasized in the context of the derived fair domain generalization problem.
W2: The explanation of the fairness encoder in subsection 3.2.1 is not convincing. (1) It randomly chooses a single sensitive-attribute configuration to learn the representations of sensitive attributes. This can lead to biased and unstable solutions. More empirical evaluations of this sampling strategy could be provided. (2) It uses the concatenation of the input x and the attribute vector c. However, if x (e.g., a high-dimensional image) and c differ in dimensionality, would the concatenated vector be dominated by one of them?
W3: The equalized odds loss is confusing. It is not clear what the quantities inside it refer to for a sample x: what is P within Div(·), and why does the loss involve both the latent representation and the data distribution?
W4: The hyper-parameter setting is not explained. (1) The values reported as best in previous work are reused in this paper. However, since this paper and previous work involve different models, reusing those values might lead to sub-optimal solutions. (2) The sensitivity to the newly introduced hyper-parameters is analyzed. However, it is unclear whether the best hyper-parameters are selected based on the testing domains. Is any validation method adopted for hyper-parameter selection during training?
Questions
Q1: Figure 2 shows that the drop in performance is high when fairness is enforced on multiple attributes. This might indicate that it becomes more challenging to find the trade-off between generalization performance and fairness when increasing the number of sensitive attributes. Therefore, it would be better to provide some insights into understanding how to balance generalization performance and fairness when a large number of sensitive attributes exist.
Q2: Table 7 shows that the number of encoders can also affect the trade-off between performance and fairness. Why does a single encoder improve the fairness and multiple encoders help generalization performance in the proposed approach?
Q3: This paper considers the covariate shift among domains. Can the proposed SISA method be adapted to deal with other types of distribution shifts, e.g., label shifts, concept shifts, etc?
########################################
After reviewing the rebuttals, I would like to keep my rating unchanged, since most of my concerns have not been addressed. In most cases, the responses are not very convincing. More theoretical or empirical results could be added to support the explanations, e.g., the selection of sensitive-attribute configurations, hyper-parameter selection, the trade-off between performance and fairness, etc.
- Figure 2 shows that the drop in performance is high when fairness is enforced on multiple attributes. This might indicate that it becomes more challenging to find the trade-off between generalization performance and fairness when increasing the number of sensitive attributes. Therefore, it would be better to provide some insights into understanding how to balance generalization performance and fairness when a large number of sensitive attributes exist.
Our ablation studies in Tables 4, 5, and 6 of the paper show that our model's hyperparameters can be varied to manage the trade-off between fairness and accuracy quantitatively. For the CelebA dataset, which has the highest number of sensitive attributes (four), we noticed slightly better performance when the hyperparameter values are high and better fairness when they are low.
- Table 7 shows that the number of encoders can also affect the trade-off between performance and fairness. Why does a single encoder improve the fairness and multiple encoders help generalization performance in the proposed approach?
In the case of the single-encoder model, a single representation encodes both the fairness and the generalization information. Hence, that representation is implicitly shared between the loss for the sensitive attributes and the generalization loss. Because there are multiple sensitive attributes, the fairness signal overshadows the generalization information, since both rely on the same representation.
In the case of the two-encoder model, where one encoder stands for fairness and the other for generalization performance, the representation is explicitly split between the fairness loss and the generalization loss, so the generalization representation remains informative and does not get overshadowed by the fairness terms. Hence, the generalization performance (accuracy) is better with two encoders.
- This paper considers the covariate shift among domains. Can the proposed SISA method be adapted to deal with other types of distribution shifts, e.g., label shifts, concept shifts, etc?
We have only considered covariate shift as the distribution shift in the scope of this paper. We plan to extend our work to other types of shifts in the future.
We thank Reviewer 4 for their constructive and appreciative comments about our paper. We address their queries below.
- The technical novelties of this paper are unclear. The proposed SISA approach involves several techniques from previous works. The major technical contributions could be emphasized in the context of the derived fair domain generalization problem.
We thank Reviewer 4 for the suggestion and have revised our paper to elucidate our contributions better.
- a) The fairness encoder randomly chooses a single sensitive-attribute configuration to learn the representations of sensitive attributes. This can lead to biased and unstable solutions. More empirical evaluations of this sampling strategy could be provided.
- Initially, we had trained the model on all sensitivity configurations during each iteration. We found that it increased the training complexity. Then, we sampled several configurations during each training iteration and found that this did not compromise the stability of training. Even though in the algorithm we wrote it as sampling a single configuration per iteration, in practice we sampled multiple configurations in each iteration, eventually covering the configuration set over the course of training. We provide our code for a better understanding of our algorithm.
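A minimal sketch of this sampling strategy (function names and numbers are illustrative, not our actual code):

```python
import random

def sample_configurations(config_set, k):
    """Sample k sensitivity configurations per training iteration
    instead of iterating over the full set, to reduce training cost."""
    return random.sample(config_set, k=min(k, len(config_set)))

# Illustrative usage: two binary sensitive attributes -> four configurations.
all_configs = [(0, 0), (0, 1), (1, 0), (1, 1)]
for step in range(100):  # stand-in for the training loop
    batch_configs = sample_configurations(all_configs, k=2)
    # compute the fairness-related losses only for the sampled configurations
```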
b) The fairness encoder uses the concatenation of the input x and the attribute vector c. However, if x and c differ in dimensionality, would the concatenated vector be dominated by one of them?
- c is a binary vector whose length equals the number of possible sensitive attributes. We reshape it by repeating its values over the two spatial dimensions (adding a third dimension), and the result is concatenated to the input x as additional channels, so the total number of input channels grows by the length of c (the exact input sizes differ for MIMIC-CXR and CelebA). c will not dominate, as the variation between different c's is much lower than the variation between different x's.
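For illustration, a minimal sketch of this channel-wise concatenation (shapes and names below are chosen for the example only, not the paper's exact dimensions):

```python
import torch

def concat_attribute_channels(x, c):
    """x: image batch of shape (B, C, H, W);
    c: binary sensitivity vector of shape (B, K).
    Each of the K bits is broadcast to an H x W plane and appended
    to the image as an extra channel."""
    B, _, H, W = x.shape
    c_planes = c.float().view(B, -1, 1, 1).expand(-1, -1, H, W)
    return torch.cat([x, c_planes], dim=1)  # shape (B, C + K, H, W)

# Example with dummy data.
x = torch.randn(8, 3, 64, 64)          # batch of RGB images
c = torch.randint(0, 2, (8, 4))        # 4 possible sensitive attributes
xc = concat_attribute_channels(x, c)   # -> (8, 7, 64, 64)
```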
- The equalized odds loss is confusing. It is not clear what the quantities inside it refer to for a sample x: what is P within Div(·), and why does the loss involve both the latent representation and the data distribution?
z denotes the latent representation of x obtained through the fairness encoder and the generalization encoder (Eqs. 3 and 6 in the paper). Hence, even though the operation inside Div(·) is on z, it is a function of the training data x, which is sampled from P. We have simplified the notation for better readability.
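For reference, one schematic way such an equalized-odds-style divergence term is often written (an illustration under assumed notation, not necessarily the paper's exact definition):

$$
\mathcal{L}_{EO} \;=\; \sum_{y \in \mathcal{Y}} \mathrm{Div}\Big(P\big(h(\mathbf{z}) \mid Y=y,\, S=s\big) \,\Big\|\, P\big(h(\mathbf{z}) \mid Y=y,\, S=s'\big)\Big),
$$

where $\mathbf{z}$ is the latent representation of the input $\mathbf{x} \sim P$, $S$ is a sensitive attribute taking values $s, s'$, and $\mathrm{Div}$ is a divergence between the two conditional prediction distributions.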
- a) The hyper-parameter setting reuses the values reported as best in previous work. However, since this paper and previous work involve different models, reusing those values might lead to sub-optimal solutions.
We chose to use the best-reported hyperparameter values for both the baseline and our model in order to have a fair comparison between the models. We understand the reviewer's comment about tuning these hyperparameters for the baselines and our models, but we are unable to do so at the moment. However, we would like to highlight that, despite not tuning these values specifically for our model, we were still able to achieve good performance with the given hyperparameters.
b) The sensitivity to the newly introduced hyper-parameters is analyzed. However, it is unclear whether the best hyper-parameters are selected based on the testing domains. Is any validation method adopted for hyper-parameter selection during training?
The hyperparameter values are chosen based on the highest accuracy on the validation set. However, for the ablation studies, we have reported their respective accuracies obtained on the test set.
This paper introduces an approach aimed at achieving intersectional fairness within the context of domain generalization. Specifically, the proposed method focuses on acquiring two distinct invariant representations across domains, emphasizing both accuracy and fairness. Subsequently, a classifier is employed to make predictions based on these representations. To transfer fairness and accuracy into new domains, the authors train the model to minimize error and fairness loss across source domains.
Strengths
- This paper targets fairness, which is an important research topic in machine learning.
Weaknesses
- The clarity of the paper is lacking, with several important details omitted. For instance, the paper lacks comprehensive information about the training of the domain density translator for fairness generalization. Training is not straightforward due to the varying sensitive attributes across different datasets.
- The design choice of using a shared translator for all sensitive attributes appears questionable. Notably, given an input from a source domain, the translator only generates its counterpart in another domain, without considering which sensitive attributes are relevant to the translation. This implies that the model makes the same assumptions for every sensitive attribute, which is a strong assumption and may not hold in practical scenarios.
- The rationale behind learning distinct representations for each sensitive attribute is not well elucidated. Why is it necessary for the model to minimize the gap between domain representations with the same sensitive attribute configurations while maximizing the gap for those with different sensitive attributes? How does concatenating all representations contribute to accurate and fair predictions in target domains?
- The results in Tables 2 and 3 seem to be presented for a fixed setting of the trade-off hyperparameter. It would be more comprehensive if the authors varied it to explore the accuracy-fairness trade-off for the methods used in the experiments.
- The final objective, as defined in Equation (10), encompasses a blend of multiple loss components. I recommend that the authors carry out an ablation study, varying hyperparameters, to assess the impact of each loss on the model's performance.
- The technical novelty of the paper appears somewhat limited. The primary contribution appears to be the utilization of distinct representations for each sensitive attribute.
- In the introduction section (Fig. 1), the authors assert that the proposed method can accommodate the heterogeneity of sensitive attributes across domains. However, in the experimental section, the models seem to have access to all sensitive attributes in all domains, which may contradict the initial claim.
- The paper lacks the provision of code and supplementary documentation, which could significantly enhance clarity and reproducibility. Providing these resources would be beneficial for the reader to understand and replicate the methodology.
Questions
Please see the Weaknesses.
We thank Reviewer 3 for their constructive comments to improve our paper. We post responses to their queries below.
- The clarity of the paper is lacking, with several important details omitted. For instance, the paper lacks comprehensive information about the training of the domain density translator for fairness generalization. Training is not straightforward due to the varying sensitive attributes across different datasets.
We have added more details about the training of the domain density translator in the supplementary section of the revised version of the paper.
- The design choice of using a shared translator for all sensitive attributes appears questionable. Notably, given an input from a source domain, the translator only generates its counterpart in another domain, without considering which sensitive attributes are relevant to the translation. This implies that the model makes the same assumptions for every sensitive attribute, which is a strong assumption and may not hold in practical scenarios.
We follow the design choice of the domain density translator from FATDM, which makes the same assumption you have stated and proves with theoretical bounds that it does not compromise the fairness metrics.
- a) The rationale behind learning distinct representations for each sensitive attribute is not well elucidated.
We learn distinct representations for each sensitive attribute due to two reasons:
- If there exists a separate representation for a sensitive attribute S, and if two domains A and B both have S as a sensitive attribute, then one can seamlessly minimize the distance between the representations of A and B that pertain only to S, without affecting the representations of the other sensitive attributes in A and B.
- If there exists a separate representation for a sensitive attribute S, and if two domains A and B do not both treat S as sensitive, then one can seamlessly maximize the distance between the representations of A and B that pertain only to S.
b) Why is it necessary for the model to minimize the gap between domain representations with the same sensitive attribute configurations while maximizing the gap for those with different sensitive attributes?
When the sensitive attribute configurations are the same, that is, a domain A has its sensitive attribute set as gender and race and a domain B also has its sensitive attribute set as gender and race, we want the predictions of both A and B to be equalized across all values taken by gender and race. Hence, we minimize the distance between the fairness representations of domain A and domain B.
On the other hand, if the sensitive attribute configurations are different, that is, domain A has gender as its sensitive attribute and domain B has race, we do not want their predictions to be equalized in the same way, so we maximize the distance between the fairness representations of domains A and B.
c) How does concatenating all representations contribute to accurate and fair predictions in target domains?
The fairness representation is optimized for better fairness through the fairness loss, and the generalization representation is optimized for better accuracy through the generalization loss. We use concatenation as it is a simple operation that combines the two representations into a final representation, which is then sent to the classifier for a prediction that is both fair and accurate.
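A minimal sketch of this two-encoder design (module names and dimensions are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class TwoEncoderClassifier(nn.Module):
    """Toy version: one encoder for fairness, one for generalization;
    their outputs are concatenated and passed to a shared classifier."""
    def __init__(self, in_dim=128, rep_dim=32, num_classes=2):
        super().__init__()
        self.fair_encoder = nn.Sequential(nn.Linear(in_dim, rep_dim), nn.ReLU())
        self.gen_encoder = nn.Sequential(nn.Linear(in_dim, rep_dim), nn.ReLU())
        self.classifier = nn.Linear(2 * rep_dim, num_classes)

    def forward(self, x):
        z_fair = self.fair_encoder(x)   # representation optimized for fairness
        z_gen = self.gen_encoder(x)     # representation optimized for accuracy
        return self.classifier(torch.cat([z_fair, z_gen], dim=-1))

logits = TwoEncoderClassifier()(torch.randn(8, 128))  # -> (8, 2)
```

In the actual model, the fairness encoder additionally receives the sensitivity configuration concatenated to its input, as described in our response above.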
- The results in Tables 2 and 3 seem to be presented for a fixed setting of the trade-off hyperparameter. It would be more comprehensive if the authors varied it to explore the accuracy-fairness trade-off for the methods used in the experiments.
We chose its value via hyperparameter tuning and reported the results for the best value in Tables 2 and 3 of the paper. We provide an ablation study for this hyperparameter in Table 5 and show that varying it affects the accuracy-fairness trade-off.
- The final objective, as defined in Equation (10), encompasses a blend of multiple loss components. I recommend that the authors carry out an ablation study, varying hyperparameters, to assess the impact of each loss on the model's performance.
Equation 10 includes several hyperparameters. Some of them are already present in the baseline FATDM, which optimized their values in its paper; we adopted the same values in our model. For the other hyperparameters, which are introduced by our approach, we already provide an ablation study in Tables 4 and 5 of the paper.
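Schematically, the overall objective combines the four components described in the paper (classification, domain invariance, fairness-aware invariance, and equalized odds); the weight symbols below are placeholders for the hyperparameters discussed here, not necessarily the paper's notation:

$$
\mathcal{L}_{\mathrm{total}} \;=\; \mathcal{L}_{\mathrm{cls}} \;+\; \lambda_1\, \mathcal{L}_{\mathrm{DG}} \;+\; \lambda_2\, \mathcal{L}_{\mathrm{fair}} \;+\; \lambda_3\, \mathcal{L}_{\mathrm{EO}}.
$$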
- The technical novelty of the paper appears somewhat limited. The primary contribution appears to be the utilization of distinct representations for each sensitive attribute.
We introduced the idea of learning two different representations (which are combined at the end) for capturing fairness and generalizability to a target domain. We have also introduced a very practical problem of having multiple sensitive attributes across different domains in a domain generalization setting and have proposed a simple and clear solution for it. We believe that the simplicity of our method is one of its positive aspects, as it is very easy for a researcher or engineer to understand and implement in practice.
- In the introduction section (Fig. 1), the authors assert that the proposed method can accommodate the heterogeneity of sensitive attributes across domains. However, in the experimental section, the models seem to have access to all sensitive attributes in all domains, which may contradict the initial claim.
We assume that we have access to a set of possible sensitive attributes; information on whether an attribute is sensitive is not part of the observations and is specified by the user. Hence, we identify a possible set of sensitive attributes and train the model to face any subset of sensitive attributes from this set. As the sensitivity of an attribute is user-specified and not present in the observations, at training time we randomly combine any set of sensitive attributes with any domain, so that the model is prepared for any new combination of sensitive attributes. We have emphasized this in the Introduction section of our paper.
- The paper lacks the provision of code and supplementary documentation, which could significantly enhance clarity and reproducibility. Providing these resources would be beneficial for the reader to understand and replicate the methodology.
We have provided a link to the code along with the rebuttal.
The paper addresses the challenge of fairness transfer in domain generalization, particularly in contexts where multiple sensitive attributes are present and may vary across domains. Traditional domain generalization methods aim to generalize a model's performance to unseen domains but often ignore the aspect of fairness, especially when multiple sensitive attributes are involved.
The authors propose a novel framework capable of handling fairness with respect to multiple sensitive attributes across different domains, including unseen ones. This is achieved through the development of two types of representations: a domain-invariant representation for generalizing model performance and a selective domain-invariant representation for transferring fairness to domains with similar sensitive attributes. A key innovation of the proposed method is its ability to reduce computational complexity significantly.
Strengths
- The approach to handle multiple sensitive attributes in domain generalization is innovative and addresses a clear gap in existing literature.
- The use of real-world datasets for experimentation enhances the practical relevance of the research.
- Learning two types of representations for generalization and fairness is a thoughtful approach that could have broader applications.
- Reducing the number of required models from 2^n to just one is a significant improvement, making the solution more feasible in practical scenarios.
Weaknesses
- There is a lack of detail on the specific fairness metrics employed and how the trade-off between fairness and accuracy is quantitatively managed.
- While the reduction in model count is impressive, there are no details on the scalability of the approach with respect to the size of the data or the complexity of domain environments.
- The definition of "sensitive attributes" is rather arbitrary and vague - is there any specific reason certain attributes (i.e. smiling) count as sensitive?
Questions
- Could you elaborate on the robustness of your method against different types of distribution shifts compared to existing methods by providing ablation studies?
We thank the reviewer for their appreciation of our work and their constructive comments. We have posted responses to their queries below.
- There is a lack of detail on the specific fairness metrics employed and how the trade-off between fairness and accuracy is quantitatively managed.
Our ablation studies in Tables 4, 5, and 6 of the paper show that our model's hyperparameters can be varied to manage the trade-off between fairness and accuracy quantitatively (on the CelebA dataset). In general, we noticed slightly better performance when the hyperparameter values are high and better fairness when they are low.
- While the reduction in model count is impressive, there are no details on the scalability of the approach with respect to the size of the data or the complexity of domain environments.
a. Scalability of the approach with respect to the size of the data. We have performed our experiments on datasets of different sizes and show that our model performs well across these ranges. Below are more details about the data:
- CelebA. Domain: hair color. Number of training samples per domain: (black hair), (blonde), (brown); total:
- MIMIC-CXR-Cardiomegaly disease. Domain: age. Number of training samples per domain: (under 40), (40-60), (60-80), (80-100); total:
- MIMIC-CXR-Cardiomegaly disease. Domain: rotation. Number of training samples per domain: (), (), (); total:
b. Complexity of domain environments. We have shown that our method performs well across different types of domain shifts with different numbers of sensitive attributes.
- CelebA. Domain shift: hair color. No. of sensitive attributes: 4 (male, big nose, smiling and young).
- MIMIC-CXR-Cardiomegaly disease. Domain shift: age. No. of sensitive attributes: 2 (gender and race).
- MIMIC-CXR-Cardiomegaly disease. Domain shift: image rotations. No. of sensitive attributes: 3 (age, gender and race).
- The definition of "sensitive attributes" is rather arbitrary and vague - is there any specific reason certain attributes (i.e. smiling) count as sensitive?
Technically, sensitive attributes are those attributes that are subject to societal bias, like age, gender, race, etc. We run experiments on the MIMIC-CXR dataset, where such attributes are present. However, to have a comprehensive set of experiments, we also ran experiments on CelebA, which is a popular dataset in the field of fairness [4], [10], [11]. The target attribute for CelebA is Attractiveness. Past papers [4], [10], [11] have chosen Male as the sensitive attribute due to the smaller number of images of males versus females in the data. Since we consider multiple sensitive attributes, we randomly chose three other attributes: Big Nose, Smiling, and Young. Also, our intuition was that whether a person is Smiling should not contribute to whether they are Attractive.
[4] Fair Mixup: Fairness Via Interpolation, ICLR 2021
[10] Inclusivefacenet: Improving face attribute detection with race and gender diversity, FAT/ML 2018
[11] Leveling down in computer vision: Pareto inefficiencies in fair deep classifiers, CVPR 2022
- Could you elaborate on the robustness of your method against different types of distribution shifts compared to existing methods by providing ablation studies?
Currently, we only consider covariate shift as the distribution shift in the scope of this paper. We will consider other shifts in a future version of the work.
According to the authors, this work proposes a novel approach to handle multiple sensitive attributes, allowing any combination in the target domain. This approach involves learning two representations: one for general model performance and another for transferring fairness to unseen domains with similar sensitive attributes. The proposed method significantly reduces the model requirement from 2^n to just 1 for handling multiple attributes and outperforms existing methods in experiments with unseen target domains.
Strengths
- According to the authors, this paper introduces a new setting of fair domain generalization with multiple sensitive attributes.
- Based on the proposed setting, a comprehensive training approach is given.
- The paper is easy to follow.
Weaknesses
- Some statements are over-claimed. In the introduction, "FATDM is the only work that addresses..." this is not true. Several works, other than FATDM, address fairness-aware domain generalization but in various paradigms, such as [1], [2], and [3].
- Figure 2 is unclear to me. What is the unfairness metric "Mean"? Do you mean "mean difference" or others? How do you define "different level of fairness"? Also, the word "level" should be plural "levels". What is the take-home message when observing the drop in performance, and what is the connection between this drop and multiple attributes? How does this observation relate to various domains?
- In the second item of contributions in the Introduction, except the problem mentioned in the first contribution, what is the other problem when you say "both problems"?
- What is the relationship between the target domain \Tilde{d} and the source domains? Is the target domain shifted from the sources due to covariate shift, too? If not, what assumption do you make on target domains? This lacks clarification and is hence unclear to me. Besides, a brief introduction to covariate shift is necessary.
- I doubt the novelty of proposing the setting with multiple sensitive attributes. To me, a dataset with multiple sensitive attributes can be easily converted to one with a single sensitive attribute with multiple categories. For example, as stated in the paper, a sensitivity configuration set \mathcal{C}={[0,0], [0,1], [1,0], [1,1]} can be viewed as a set {1,2,3,4}, i.e., a single sensitive attribute with four distinct categorical values.
- Does data sample x include sensitive attribute c?
- In Eq.(1), "d'" should be replaced by "d''".
- How to ensure g_\theta encodes an invariant representation across domains? According to Eq.(5), the loss L_{DG} is defined as the expectation across all source domains. Therefore, it is not convincing to me that the generalization encoder can be generalized to an unseen target domain when a covariate shift occurs.
- In the fairness encoder, x is concatenated with c. I am wondering how to do it empirically when x is an image while c is one of the annotations of the image. Please explain your experiments for implementation using the CelebA as an example.
- Speaking of fair machine learning in general, it aims to mitigate spurious correlations between sensitive attributes and model outcomes. Although this work mentions fairness multiple times, it is unclear to me how to mitigate the spurious correlations during training. This work proposes that it "minimize the gap between the domain representations that have the same sensitive attribute configurations and maximize the gap for representations with different sensitive attributes". But this does not ensure unfairness is controllable.
[1] Elliot Creager, Jörn-Henrik Jacobsen, Richard Zemel. Environment Inference for Invariant Learning. ICML 2021.
[2] Changdae Oh, Heeji Won, Junhyuk So, Taero Kim, Yewon Kim, Hosik Choi, Kyungwoo Song. Learning Fair Representation via Distributional Contrastive Disentanglement. ACM SIGKDD 2022.
[3] Chen Zhao, Feng Mi, Xintao Wu, Kai Jiang, Latifur Khan, Christan Grant, Feng Chen. Towards Fair Disentangled Online Learning for Changing Environments. ACM SIGKDD 2023.
Questions
See weaknesses.
- I doubt the novelty of proposing the setting in multiple sensitive attributes. To me, a dataset with multiple sensitive attributes can be easily converted to one with a single sensitive attribute with multiple categories.
It may be possible to view multiple sensitive attributes as a single sensitive attribute with multiple categories. However, our contribution is not just to extend domain generalization and fairness to a multi-attribute setting. In a multi-attribute setting, each domain can have a different set of sensitive attributes (which we assume are decided by a user based on the application at hand). Hence, we also address the problem of the target domain having a different set of sensitive attributes from that of the source domains. We discuss these contributions in the introduction section of the paper.
Our baseline is a naive modification of the existing approach FATDM (designed for a single sensitive attribute) to cater to a multi-attribute setting. We compare it against our method SISA which considers the differences in the sensitivity of the attributes across different domains. From the experimental results, we show that our method is better at both performance on the target domain and fairness measures for all subsets from a possible set of sensitive attributes in the target domain.
- Does the data sample x include the sensitive attribute c?
No. x is an image (of dataset-specific size for MIMIC-CXR and CelebA), and c is a binary vector whose length is the total number of possible sensitive attributes. We provide c to the model as additional meta-information.
- In Eq.(1), "d'" should be replaced by "d''"
No. We deliberately use different notation to distinguish the density transformation model and its output; however, both domains involved are drawn from the same set of source domains.
- How to ensure encodes an invariant representation across domains? According to Eq.(5), the loss is defined as the expectation across all source domains. Therefore, it is not convincing to me that the generalization encoder can be generalized to an unseen target domain when a covariate shift occurs.
Prior works [7], [8], [9] have theoretically proved that the error on the target domain is upper bounded by the training error on the source domains, the pairwise divergences among the source domains, and the divergence between the sources and the target domain. As we do not have access to target domains, we follow previous works and minimize the pairwise divergences among the source domains through domain-invariant representation learning, together with the training error on the source domains, to obtain a lower error on the predictions for the target domain.
[7] Generalizing to unseen domains via distribution matching, 2019
[8] Fairness and accuracy under domain generalization, ICLR 2023
[9] On Learning Domain-Invariant Representations for Transfer Learning with Multiple Sources, NeurIPS 2021
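To make the argument above concrete, the bounds in [7], [8], [9] have roughly the following qualitative shape (the exact divergence measures and constants differ per paper; the symbols here are illustrative only):

$$
\varepsilon_{T}(h) \;\leq\; \frac{1}{m}\sum_{i=1}^{m} \varepsilon_{S_i}(h) \;+\; c_1 \max_{i,j}\, d\big(\mathcal{D}_{S_i}, \mathcal{D}_{S_j}\big) \;+\; c_2 \min_{i}\, d\big(\mathcal{D}_{T}, \mathcal{D}_{S_i}\big) \;+\; \lambda,
$$

where $\varepsilon_T$ and $\varepsilon_{S_i}$ are the target- and source-domain errors, $d(\cdot,\cdot)$ is a distribution divergence, and $\lambda$ collects terms not controllable during training. Minimizing the source training error and the pairwise source divergences (via L_{DG}) tightens the controllable part of this bound even though the target domain is unseen.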
- In the fairness encoder, x is concatenated with c. I am wondering how to do this empirically when x is an image while c is one of the annotations of the image. Please explain your experiments for implementation using CelebA as an example.
c is a binary vector whose length equals the number of possible sensitive attributes. We reshape it by repeating its values over the two spatial dimensions (adding a third dimension), and the result is concatenated to the input x as additional channels (the exact input sizes differ for MIMIC-CXR and CelebA). We have provided the code to give more insight into this.
- Speaking of fair machine learning in general, it aims to mitigate spurious correlations between sensitive attributes and model outcomes. Although this work mentions fairness multiple times, it is unclear to me how to mitigate the spurious correlations during training. This work proposes that it "minimize the gap between the domain representations that have the same sensitive attribute configurations and maximize the gap for representations with different sensitive attributes". But this does not ensure unfairness is controllable.
Our focus is on the notion of fairness that aims to balance classifier errors across population subgroups, for example, by matching error rates across genders or different racial groups.
However, based on this review comment, we have added Table 13 to the supplementary section of our paper, showing the Pearson correlation between the sensitive attributes and the target attribute predictions of our model and ERM. Our model reduces the correlation between the sensitive attributes and the model predictions.
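As an aside, a minimal sketch of how such a correlation can be computed (our illustration, not the paper's exact evaluation script):

```python
import numpy as np

def sensitive_prediction_correlation(preds, sens):
    """Pearson correlation between model predictions and one binary
    sensitive attribute, both given as 1-D arrays over a test set."""
    return float(np.corrcoef(preds, sens)[0, 1])

# Dummy example; a value near zero is the desired outcome.
rng = np.random.default_rng(0)
preds = rng.random(1000)            # predicted probabilities
sens = rng.integers(0, 2, 1000)     # binary sensitive attribute
print(sensitive_prediction_correlation(preds, sens))
```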
We thank Reviewer 1 for their constructive comments to improve our paper. We provide our responses to their queries below.
- Some statements are over-claimed. In the introduction, ”FATDM is the only work that addresses...” this is not true. Several works, other than FATDM, address fairness-aware domain generalization but in various paradigms, such as [1], [2], and [3]
We thank Reviewer 1 for pointing us to papers [1], [2], and [3]. We agree that they are all broadly under fairness-aware domain generalization and we have added these papers to our related works section in the revised version of the paper.
However, we would like to emphasize the difference between these papers and FATDM. Papers [1] and [2] reduce the correlations between target attributes and sensitive groups and their goal is to improve the test accuracy of the predictions of the worst-case represented sensitive group. (E.g. removing correlations of land birds with land).
FATDM lies in the line of works like [4], [5], [6], which achieve fairness by reducing bias against any instance of a sensitive group, equalizing the predictions for different instances of the sensitive groups by optimizing metrics like equal opportunity, equalized odds, or demographic parity (e.g., predicting with similar true positive and false positive rates whether gender = Male or Female). [3] optimizes equalized odds and is similar to FATDM; however, its focus is on the online learning problem.
[1] Environment Inference for Invariant Learning, ICML 2021
[2] Learning Fair Representation via Distributional Contrastive Disentanglement, KDD 2022
[3] Towards Fair Disentangled Online Learning for Changing Environments, KDD 2023
[4] Fair Mixup: Fairness Via Interpolation, ICLR 2021
[5] Equality of opportunity in supervised learning, NeurIPS 2016
[6] Empirical Risk Minimization Under Fairness Constraints, NeurIPS 2018
- Figure 2 is unclear to me. What is the unfairness metric "Mean"? Do you mean "mean difference" or others? How do you define "different level of fairness"? Also, the word "level" should be plural "levels". What is the take-home message when observing the drop in performance, and what is the connection between this drop and multiple attributes? How does this observation relate to various domains?
The unfairness metric Mean is the Mean difference metric (Equation below). We have revised Fig 2 in the paper based on this comment.
By different levels of fairness, we mean that including additional sensitive attributes adds additional fairness terms to optimize, so that performance may be compromised. We give an example below, first with gender alone and then with gender and race.
$$
\mathrm{MeanDifference}(g) = \frac{1}{N}\sum_{i=1}^{N}\sum_{y \in \mathcal{Y}}\big((h(\mathbf{z}_i)\mid y, g=\mathrm{F}) - (h(\mathbf{z}_i)\mid y, g=\mathrm{M})\big)^2
$$

$$
\begin{split}
\mathrm{MeanDifference}(g,r) = \frac{1}{N}\sum_{i=1}^{N}\sum_{y \in \mathcal{Y}} \Big[ &\big((h(\mathbf{z}_i)\mid y, g=\mathrm{F}, r=\mathrm{B}) - (h(\mathbf{z}_i)\mid y, g=\mathrm{F}, r=\mathrm{W})\big)^2 \\
&+ \big((h(\mathbf{z}_i)\mid y, g=\mathrm{F}, r=\mathrm{B}) - (h(\mathbf{z}_i)\mid y, g=\mathrm{M}, r=\mathrm{B})\big)^2 \\
&+ \big((h(\mathbf{z}_i)\mid y, g=\mathrm{F}, r=\mathrm{B}) - (h(\mathbf{z}_i)\mid y, g=\mathrm{M}, r=\mathrm{W})\big)^2 \\
&+ \big((h(\mathbf{z}_i)\mid y, g=\mathrm{F}, r=\mathrm{W}) - (h(\mathbf{z}_i)\mid y, g=\mathrm{M}, r=\mathrm{B})\big)^2 \\
&+ \big((h(\mathbf{z}_i)\mid y, g=\mathrm{F}, r=\mathrm{W}) - (h(\mathbf{z}_i)\mid y, g=\mathrm{M}, r=\mathrm{W})\big)^2 \\
&+ \big((h(\mathbf{z}_i)\mid y, g=\mathrm{M}, r=\mathrm{B}) - (h(\mathbf{z}_i)\mid y, g=\mathrm{M}, r=\mathrm{W})\big)^2 \Big]
\end{split}
$$
where $\mathbf{z}_i$ is the latent representation of the input $\mathbf{x}_i$.
Take home message when observing the drop in performance:
Increasing the number of sensitive attributes can compromise the accuracy as shown in Figure 2 in the paper. We assume that an attribute's sensitivity is a user preference and do not expect each domain (of domain generalization) in the data to have the same set of sensitive attributes. When only a subset of attributes (gender or race or none) is sensitive in a domain, enforcing fairness on all possible sets of sensitive attributes (gender and race) can lead to an unnecessary dip in performance.
- What is the other problem when you say both problems
The first problem is getting a good generalization performance on a target domain. The second problem is getting a good fairness measure with respect to all the sensitive attributes in the target domain.
- Is the target domain shifted from sources due to covariate shift, too?
Yes. However, the sensitive attributes of the target domain are decided by the user; we do not assume they arise from the covariate shift.
We provide an anonymous Google Drive link for sharing the source code and a few trained models for reproducibility.
This paper received the following ratings: 1, 6, 3, 3. The main issue, raised by all reviewers, concerns the poor clarity of the presentation and organization of the paper, which even lacks details, preventing a full understanding of the work. Further, the motivations are not well addressed, the novelty aspects are unclear, and the experimental validation is insufficient, including poor ablations. Overall, the number of flaws raised by the reviewers is indeed large. The authors provided answers to these issues, but they did not succeed in convincing the reviewers to raise their scores.
In the end, given all these problems, this paper cannot be considered acceptable for publication at ICLR 2024.
Why not a higher score
The number of issues is too large, and the rebuttal was not considered adequate, even though only one reviewer seemed to acknowledge reading it. Overall, I think this work cannot be accepted in its current condition.
Why not a lower score
N/A
Reject