Fairness without Harm: An Influence-Guided Active Sampling Approach
We develop a sampling algorithm such that an ML model jointly trained on the original training data and newly acquired influential unlabeled data is fairer while maintaining high accuracy.
Abstract
Reviews and Discussion
In this paper, the authors propose Fair Influential Sampling (FIS) to achieve a better Pareto frontier of the fairness-accuracy trade-off by sampling training data. To maintain fairness without diminishing accuracy, candidate training examples are scored against a validation dataset. Privacy is preserved by annotating sensitive attributes only for the validation dataset, not for the large training dataset. An example is selected for training based on its influence on accuracy and fairness, computed using the validation dataset. The paper also claims that, by acquiring new data, the proposed algorithm can decrease the group gap via the fairness influence component and can prevent the negative impact of distribution shift via the accuracy influence component, which in theory should improve fairness without harming accuracy.
Strengths
The proposed sampling approach attempts to mitigate group fairness disparities without compromising either accuracy or privacy.
Because FIS does not require sensitive-attribute annotations for the large training dataset, a substantial amount of cost and labor is avoided.
The work addresses potential distribution shift issues effectively to maintain fairness and accuracy.
Even though the method is validated on a few specific datasets, the basic mechanism of FIS has the potential to be applicable across different sectors where maintaining fairness is a crucial factor.
The maintenance of privacy supports ethical AI application development practices.
Weaknesses
The success of FIS depends heavily on the quality of the validation set. If the validation set does not properly represent the large training dataset, both accuracy and fairness may be compromised.
The paper's sampling strategy depends mostly on particular influence measurements. The sensitivity of these metrics to different data distributions should be thoroughly explored.
If there is bias present in the validation dataset, unfairness may be amplified by the FIS method.
The proposed method relies substantially on a clean and informative validation dataset, which might be hard to obtain in certain cases.
The paper does not mention several recent studies in the related work section, which could provide additional context or alternative methodologies for addressing fairness without demographic data. Here are a few examples:
[1] Zhao, Tianxiang, et al. "Towards fair classifiers without sensitive attributes: Exploring biases in related features." Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 2022.
[2] Yan, Shen, Hsien-te Kao, and Emilio Ferrara. "Fair class balancing: Enhancing model fairness without observing sensitive attributes." Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2020.
[3] Chai, Junyi, Taeuk Jang, and Xiaoqian Wang. "Fairness without demographics through knowledge distillation." Advances in Neural Information Processing Systems 35 (2022): 19152-19164.
The authors' method overlooks the presence of non-demographic features that could still relate to demographic information, posing a privacy issue even without direct inclusion of demographic information.
Although the “Empirical Results” section states that 4 datasets were used, according to Appendix E only 3 datasets (CelebA, Adult, and COMPAS) were used for the experiments.
In Section 4.2, it is stated that line 9 updates U and S_original, but according to the pseudocode this step is executed in line 11.
Questions
Please refer to the “Weaknesses” section.
Limitations
In this paper, the authors propose FIS to preserve the fairness of the model and the privacy of individuals. However, several limitations can be observed in the sampling process. First, the authors should consider the computational cost of the algorithm on datasets of different scales. Moreover, since sampling relies on the validation dataset, fairness may be jeopardized if the validation dataset is skewed. Additionally, it would be more effective to develop an algorithm for selecting a validation dataset that contains little to no bias. The sampling method also does not explain how fairness is maintained as the dataset expands.
We want to thank the reviewer for their positive feedback and comments. We will address individual comments below.
Response to the refinement of the validation set: Thank you for raising this concern. We conducted an ablation study in Section 6.3 to explore the impact of the validation set size on our algorithm’s performance, shown in Fig. 2. Our experimental results also demonstrate that our sampling method remains effective across different validation set sizes.
Response to fairness metric sensitivity of sampling strategy: Thank you for this comment. We want to highlight that our method is general and suitable for all fairness-metric-based fairness losses. It allows for the adjustment of the fairness loss component to suit various fairness metrics as needed.
Response to the biased validation set: Thank you for your feedback. In this work, we make the mild assumption that the validation and test sets come from the same distribution. Consequently, any bias present in the validation set will also be reflected in the test set, and it is therefore impossible to accurately evaluate fairness on a biased test dataset. Nonetheless, we can discuss cases where the validation set is biased while the test set is not. Such biases might arise from the small size of the validation set. To illustrate the impact of validation set size, we conducted the experiments shown in Fig. 2. If the bias is incurred by noisy sensitive attributes, we can apply loss correction techniques to these attributes. We welcome the reviewer to provide a practical case of bias for further discussion.
Response to the clean and informative validation set: Thank you for raising this concern. In practice, when dealing with a noisy validation set, we can effectively address the issue by applying a loss correction method to the fairness loss. This adjustment is based on the label transition matrix [1], which helps account for the noise in the labels. Therefore, this practical issue can be largely alleviated; we will address it further in future work.
Response to recent studies: Thank you for providing more references. We will update accordingly in the revised version.
Response to the correlation of non-demographic features and demographic information: Thank you for bringing attention to this important aspect. We acknowledge this as an important consideration in data privacy. It is true that some non-demographic features may correlate with sensitive attributes. However, it is crucial to clarify that our method limits its queries to the data: we only query the true labels for a selected subset of unlabeled examples. This means that while correlations between non-demographic variables and sensitive attributes may exist in the underlying data, our method itself does not introduce any additional privacy leakage beyond what is inherent in the original dataset. On the other hand, our work’s primary focus is not on addressing privacy concerns related to demographic information. Rather, our main objective is to reduce fairness disparities while maintaining a favorable utility trade-off. We achieve this without explicitly incorporating any additional sensitive information into the process and without directly engaging with the complex privacy implications of demographic data usage.
To address this privacy concern, we can analyze it through the lens of differential privacy. Consider a function $f$ that maps non-demographic information (features $x$ and label $y$) to the sensitive attribute $a$. Due to insufficient information, $f$ is not deterministic, so $a$ cannot be estimated precisely. If a deterministic mapping function existed, it would unavoidably lead to privacy leakage. Therefore, we assume that, given the same features and label, the output of $f$ may vary. For example, in the Adult dataset, $x$ includes age and credit history, $y$ is the income, and $a$ is the gender. In practice, both men and women have some probability of earning more than 50K (i.e., $y = 1$) given the same age, credit history, etc. (the same features $x$). Suppose the probability that $f$ recovers the sensitive attribute from $(x, y)$ is bounded away from one for all $(x, y)$; then the information leakage can be bounded in the differential-privacy sense. In practice, if the mapping function $f$ is too strong, i.e., this probability is too large, we can add additional noise to reduce its informativeness and thereby better protect privacy.
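As an illustrative sketch only (a randomized-response-style argument with an assumed constant $c$; this is not the exact bound elided above): suppose that for a binary sensitive attribute the mapping satisfies $P(f(x,y) = a') \le c$ for every value $a'$ and every input $(x,y)$, with $c \in [1/2, 1)$. Then for any two inputs and any output value,
$$\frac{P\big(f(x,y) = a'\big)}{P\big(f(x',y') = a'\big)} \;\le\; \frac{c}{1-c},$$
so the mapping leaks information about $a$ at most at the level of $\epsilon$-differential privacy with $\epsilon = \log\frac{c}{1-c}$; a stronger mapping (larger $c$) yields a larger $\epsilon$, which is when adding extra noise becomes useful.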
Response to line number and number of datasets: Thank you for pointing out these typos. We will update it in the revised version.
[1] Weak proxies are sufficient and preferable for fairness with missing sensitive attributes, ICML 2023.
This paper addresses the challenge of achieving fairness in machine learning (ML) without compromising model accuracy. The authors propose a novel active data sampling algorithm that mitigates group fairness disparity by acquiring more data without requiring sensitive attribute annotations for training, thus protecting privacy. The algorithm scores each new example based on its influence on fairness and accuracy, evaluated on a small validation set, and selects examples for training accordingly. Theoretical analysis demonstrates how acquiring more data can improve fairness without degrading accuracy, and provides upper bounds for generalization error and risk disparity. Extensive experiments on real-world data validate the effectiveness of the proposed approach.
Strengths
Clear & meaningful motivation: The proposed fair influential sampling (FIS) method avoids the need for sensitive attribute annotations during the sampling or training phases, thereby protecting privacy.
Sound Theoretical Contributions: The paper includes a thorough theoretical analysis, showing how the proposed algorithm can improve fairness without harming accuracy. It also provides upper bounds for generalization error and risk disparity, enhancing the method's credibility.
Applicability: Based on the method's design and setup, it appears the approach is not limited to a single sensitive variable but also handles multiple discrete ones. This gives the method considerably better applicability compared to most other fairness methods.
Weaknesses
- Insufficient discussion of fairness metrics: The choice and justification of the fairness metrics used in the evaluation could be more thoroughly discussed. A deeper explanation of why these particular metrics were selected and how they effectively measure fairness would provide a clearer understanding of how fairness is quantified in this study and the implications of these choices. For example, why choose risk disparity as the fairness definition, and how does it compare and contrast with other metrics, such as equalized odds or counterfactual fairness metrics?
- As the authors point out, the proposed algorithm relies on a clean and informative validation set that contains the sensitive attributes of data examples. This dependency may not always be practical or feasible in real-world scenarios, where obtaining such validation sets can be challenging due to privacy concerns and data availability.
- Unclear presentation and significance of experimental results: I find the key results in Table 1 confusing. For example, how is fairness_violation defined, what is the highlighting scheme, and why is the row for DP of CelebA-Attractive not highlighted?
Questions
- Could you provide more discussion of the choice of fairness definition used in your evaluation? I have skimmed Section B of your appendix, but why is risk disparity selected as the fairness definition throughout, and how does risk disparity compare with other fairness metrics such as equalized odds or counterfactual fairness metrics?
- Your proposed algorithm relies on a clean and informative validation set that contains sensitive attributes. How do you address the potential challenges in obtaining such validation sets in practical scenarios, especially considering privacy concerns and data availability? What are the implications if the validation set is not fully representative or contains noisy labels, and how might this affect the performance and fairness of your algorithm?
- Based on my understanding of Equation 5, your method is compatible with multiple (discrete) sensitive variables. Is that correct?
- For Table 1: How is fairness_violation defined? Why is the row for DP of CelebA-Attractive not highlighted? What is the highlighting rule?
Writing suggestion: I am confused by the highlighting scheme of Table 1. It would be better if it were explained in the caption right above the table.
Limitations
The authors mention the limitation that the method needs a validation set, though they did not attempt to address it.
We want to thank the reviewer for their positive feedback and comments. We will address individual comments below.
Re to W1: Thank you for raising this concern. We want to clarify that the definitions of well-known fairness metrics, such as DP and EOd, are provided in Section 3 (Preliminaries). These metrics are used for the experimental evaluation in Section 6. In this work, we focus solely on active sampling to build a fairer dataset, which is then used to train the model through standard Empirical Risk Minimization (ERM) with the Cross-Entropy (CE) loss. Without relying on any additional assumptions about the model or dataset, an intuitive approach is to analyze the loss function; risk disparity can thus serve as an intermediate term for the theoretical analysis. Proposition 3.1 (Line 162) clearly illustrates the connections between risk disparity and other metrics. Additional theoretical proofs can be found in Appendix B.
Re to W2: In practice, when dealing with a noisy validation set, we can effectively address the issue by applying a loss correction method to the fairness loss. This adjustment is based on the label transition matrix [1], which helps account for the noise in the labels. Therefore, this practical issue can be largely alleviated; we will address it further in future work.
Re to W3: Thank you for raising this concern. The term "fairness violation" is widely used to quantify the absolute differences based on fairness metrics such as DP. As outlined in the caption of Table 1, we specifically bold results that achieve a lower fairness violation while the accuracy remains almost the same as that of the random baseline. For example, in the row for DP on CelebA-Smiling, compared to the random baseline results (0.837, 0.133), we highlight FIS because it achieves a lower DP value (0.084 (FIS) < 0.133 (random)) while maintaining high accuracy (0.848 (FIS) > 0.837 (random)). Another baseline, JTT, achieves a lower DP value (0.077 (JTT-20) < 0.133 (random)), but its accuracy drops significantly (0.698 (JTT-20) < 0.837 (random)). When applying this criterion to the row for DP on CelebA-Attractive, we observe that no results meet our highlighting scheme.
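In code form, the bolding criterion described above amounts to something like the following (illustrative only; the tolerance value is an assumption, not the exact threshold used in the paper):

```python
# Sketch of the Table 1 bolding rule (illustrative; `tol` is an assumed tolerance).
def should_bold(acc, fair_viol, acc_random, fair_viol_random, tol=0.01):
    """Bold a method's (accuracy, fairness_violation) entry if it is fairer than
    the random baseline while keeping accuracy close to or above the baseline."""
    return fair_viol < fair_viol_random and acc >= acc_random - tol
```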
Re to Q1: Please refer to the response to W1.
Re to Q2: Please refer to the response to W2.
Re to Q3: Thank you for your attention. Your understanding is correct: Eq. (5) is compatible with multiple (discrete) sensitive variables.
Re to Q4 & writing suggestion: Please refer to the response to W3.
[1] Weak proxies are sufficient and preferable for fairness with missing sensitive attributes, ICML 2023.
For your response to W1
I see that the connection with these metrics is mentioned in Line 162. But what I see missing is why risk disparity is of interest in this paper. Or what is the context? This is important for readers, not for technical reasons, but to understand the significance of the metric and thus the proposed method. For example, I can tell the authors try to explain it with Figure 1, but it is still not clear after reading the introduction. The later text jumps into technical details.
For your response to W2 and your response to reviewer cxN7
It is not clear what the authors mean by a correction method, nor is it mentioned in the paper. This is an important piece and should be included at least in the appendix.
For your response to W3
I still find that the presentation of the results can be improved.
For your response to Q3
I see that this is an advantage of the proposed method, as many methods can only deal with a single sensitive variable. But wouldn't it be more convincing to use datasets with multiple sensitive variables (both COMPAS and Adult have only one)?
Overall, I appreciate the rebuttal, but I am inclined to keep my score.
We would like to thank Reviewer wZhW for the time and effort invested in reviewing this work. Regrettably, we are not allowed to update the manuscript during the rebuttal period. Overall, to prevent any confusion or misunderstanding, we will include more specific clarifications in the future version of our work to clearly address these concerns, especially regarding the presentation.
I see that the connection with these metrics is mentioned in Line 162. But what I see missing is why risk disparity is of interest in this paper. Or what is the context? This is important for readers, not for technical reasons, but to understand the significance of the metric and thus the proposed method. For example, I can tell the authors try to explain it with Figure 1, but it is still not clear after reading the introduction. The later text jumps into technical details.
Thank you for your further feedback about this. In the above response to W1, we discuss and clarify why we use risk disparity. Do the statements above effectively address the reviewer's concern? We will add these statements in the future version to clearly highlight this.
It is not clear what the authors mean by a correction method, nor is it mentioned in the paper. This is an important piece and should be included at least in the appendix.
Loss correction methods are techniques used in machine learning to address and mitigate the impact of errors or biases in the training data, particularly when the labels are noisy or incorrect. These methods aim to improve the performance and generalization ability of models by adjusting the loss function during training so that the model can better handle the presence of mislabeled data [1, 2]. We will include and discuss the case of a noisy validation set in more detail in the revised version; a brief sketch of the idea is given after the references below.
[1] Learning with Noisy Labels, NeurIPS 2013.
[2] Peer Loss Functions: Learning from Noisy Labels without Knowing Noise Rates, ICML 2020.
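As a minimal sketch of one such correction (a forward correction with an assumed known label-transition matrix `T`; this illustrates the general idea rather than reproducing the exact estimators of [1, 2]):

```python
import torch
import torch.nn.functional as F

def forward_corrected_ce(logits, noisy_labels, T):
    """Forward loss correction: push the model's clean-label probabilities
    through the transition matrix T (T[i, j] = P(noisy = j | clean = i))
    before computing cross-entropy against the observed noisy labels."""
    clean_probs = F.softmax(logits, dim=1)   # model's posteriors over clean labels
    noisy_probs = clean_probs @ T            # implied posteriors over noisy labels
    return F.nll_loss(torch.log(noisy_probs + 1e-12), noisy_labels)
```

In the setting discussed above, the same kind of correction would be applied to the fairness loss computed on the noisy validation annotations.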
I still find that the presentation of the results can be improved.
We will include a clear definition of fairness violation in the paper. Additionally, we will report test_acc directly in the table to improve the presentation in the future version.
I see that this is an advantage of the proposed method, as many methods can only deal with a single sensitive variable. But wouldn't it be more convincing to use datasets with multiple sensitive variables (both COMPAS and Adult have only one)?
Thank you for your insightful comment! First, we want to clarify that the main goal of our work is to improve fairness without using training-set sensitive attributes, rather than to handle multiple sensitive variables. Nonetheless, the perspective on compatibility with multiple sensitive variables is interesting and meaningful. We believe that focusing on multiple sensitive attributes and conducting the corresponding experiments is a valuable avenue for future research.
Group/statistical notions of fairness are known to be at odds with the accuracy metric. The authors propose an active sampling strategy to improve the tradeoff between fairness and predictive performance. This sampling approach only requires the sensitive information to be known in a small validation set. They estimate the influence of a potential instance on improving the accuracy and fairness, and choose those that most positively influence both objectives.
Strengths
S1 - Improving the accuracy fairness trade-off is a very relevant problem. Besides, doing it in the scenario when only minimal sensitive information requires to be known extends its applicability to real-world scenarios.
S2 - Figure 1 offers a nice overview to situate the problem and provide an intuition regarding the objective of the work.
S3 - In general terms, the readability of the text is quite good. The ideas are presented clearly, with cohesive explanations throughout. The text is well divided into sections and paragraphs, which helps maintain a consistent flow and makes it easy to follow and understand.
S4 - The paper demonstrates commendable attention to reproducibility by providing detailed information regarding the experimental setup.
Weaknesses
W1 - The related work section needs improvement, starting from the most general topics (active learning and fair classification) and moving to more specific areas of research. Additionally, the claim made by the authors in lines 107-108 is not true: those approaches do not strive to achieve uniform accuracy over all groups, but rather to improve the predictive performance of the classifier on the worst-performing groups. Indeed, it is in the metrics considered by the authors where parity among groups is required. Besides, the acronyms DP and EOp are used before being defined. Moreover, when talking about [39] in line 92, it is not clear which problem that recent work aims to address.
W2 - In section 4.1 you introduce the influence of the accuracy component but what you are looking into is the loss, and the loss need not be equal to accuracy. Maybe you could rename it and call it the loss/risk component, or change the definition otherwise.
W3 - When active sampling is used, the iid condition is no longer satisfied, but the authors neither mention this in the work nor address this challenge.
W4 - One of the main claims made in the work is that this strategy is able to improve the fairness guarantees of classifiers without sacrificing accuracy. The authors even claim the latter can be seen from Table 1. However, in the empirical validation there are no numerical results regarding the accuracy, only fairness-related metrics are shown. Consequently, this claim is impossible to validate from the results provided in the main paper.
W5 - The comparison with respect to the JTT method is unfair. In fact, this procedure has a strategy to choose the optimal weight that needs to be given to the misclassified instances, and is learned in the training process. Thus determining the weight to be 20 is unfair.
W6 - It is not very clear how much data/information is given to each of the methods, which makes it difficult to verify whether the empirical validation is fair.
W7 - The main objective of the active sampling paradigm is to achieve superior performance (in this case better accuracy-fairness trade-off) with less datapoints, as they are chosen effectively. However, there is no empirical validation nor theoretical statement that supports the latter.
Additional minor comments:
W - In the labeling paragraph of Section 4.2, the explanation that introduces the paragraph is not very clear and is in fact misleading… it would be nice to improve its clarity.
W - In line 235 you talk about ‘most confident labels’, but the word ‘confidence’ may not be the most appropriate here.
W - The authors do not explicitly state that the fairness loss is something to be minimized.
W - The authors do not clearly state how the hyperparameters of the other methods are tuned.
W - In the proposed algorithm paragraph of Section 4.2, the authors talk about a tolerance, but there is no such parameter in Algorithm 1.
W - You are talking about the trade-off between accuracy and fairness but in line 89 you use the word generalization, which is not a synonym but a word with a completely different meaning.
W - The text contains several typos. For instance, between lines 83-85, ‘work(s)’ needs to be plural, and ‘train examples’ is better written as ‘training examples’. Also, the authors use ‘to us’ in line 113; it would be better to replace it with a more formal phrase such as ‘to this work’.
W - In line 86 when you talk about the trade-off it would be nice to include some references (as in the introduction).
W - Check how the references are written: missing capital letters in [49] and [23], no publication information in [57], etc.
W - Regarding the titles of the sections, only the first letter of the first word needs to go in uppercase.
W - is not defined.
Questions
Q1 - What do the $\langle \cdot, \cdot \rangle$ symbols refer to in equations (2) and (3)? A vector product?
Q2 - According to the definitions in Lemmas 4.1 and 4.2, the value of the influence depends on the size of the validation set. Why don’t you use an average so that this dependence disappears?
Q3 - In the section proposed algorithm you describe an iterative procedure to find the top r instances whose fairness influence based on the true labels meet the conditions, but you are using proxy labels and not the true labels. Then, how do you know the latter?
Q4 - Why do you need to make the assumption that the train/test distributions are drawn from a series of component distributions?
Q5 - Regarding Theorem 5.2 you talk about the group gap, and claim that the more balanced the data is (...). But which gap/balance do you refer to? Is it in terms of representation or performance?
Q6 - Regarding the tabular data, you resample the data to balance the class and group membership; but what is the reason behind this procedure? What happens if there is no such balance?
Limitations
Limitations are clearly stated by the authors in Section 7.
We want to thank the reviewer for their detailed feedback and comments. We will address individual comments below.
Re to W1: We want to clarify that we do not assert that "Those approaches do not strive to achieve uniform accuracy over all groups." Rather, our claim is that these approaches primarily focus on enhancing the worst-affected group performance, and they assess this improvement using accuracy-level metrics. However, this does not necessarily align with popular fairness metrics such as DP. We appreciate you pointing out the need to define the acronym DP clearly. We will ensure this is corrected in the revised version of the paper. Regarding reference [39], as indicated by its title, this paper's starting point is to achieve fairness at no utility cost (predictive utility, synonymous with accuracy). This recent work also motivates us to do active sampling without retraining. Therefore, this paper is within the scope of the fairness-accuracy tradeoff.
Re to W2: Thank you for your feedback. We use the terms "accuracy" and "fairness" to highlight the goals of the different components within our method, which helps distinguish their roles more clearly. Renaming one of them the ‘loss’ component could be inaccurate because both the fairness and accuracy components are based on losses.
Re to W3: We want to clarify that we do not make any additional assumptions about the iid condition between the train and test data distributions. Our default setting assumes the potential for a distribution shift between two distributions. Our main observations (Lines 297-308) confirm that the proposed method demonstrates robustness under non-iid conditions, effectively addressing potential distribution shifts.
Re to W4: Thanks for your feedback. To illustrate the tradeoff between accuracy and fairness, we consistently present the results in the format: (test_accuracy, fairness_violation). We highlight this form in the captions of tables.
Re to W5: We want to clarify that the JTT weight is a hyperparameter, which can be easily verified because the weight for misclassified examples is an input to Algorithm 1. The value of 20 follows the setting used in the JTT paper. Here is the original statement presented in the JTT paper: “For JTT, we additionally tune the number of epochs of training the identification model and the upsampling factor . For the final experiments we tune over .”
Re to W6: All baselines including our method are done completely in the same data settings. Due to the space limit, we provide only a brief overview of the three datasets in Section 6.2 (line 331 and line 343). More data information can be found in Appendix E.1 (dataset and parameter settings).
Re to W7: We want to clarify that our main goal is to reach a better accuracy-fairness trade-off point while not disclosing more sensitive attributes, rather than to use fewer data points. Nevertheless, in Section 6.3 we conduct experiments to illustrate the impact of the solicited data size, shown in Fig. 2.
Re to m-W1: We first use pseudo-labels as in Eq. (5) to find appropriate examples due to the labeling budget, then query the ground-truth labels of the selected examples for training purposes.
Re to m-W3: We want to clarify that the CE loss is the only objective minimized during training, not the fairness loss. The fairness loss is only used to compute the fairness influence score. As we highlight in this work, the selected examples should contribute to a decrease in the fairness loss even though only the CE loss is minimized.
Re to m-W4: There are not many hyperparameters for tuning. In our work, the learning rate and warm-up epoch are fixed.
Re to m-W5: Thank you for pointing this out. The tolerance is used for the final model selection.
Re to m-W2 and m-W6 to m-W10: Thank you for pointing out these typos. We will correct them in the revised version.
Re to Q1: The $\langle \cdot, \cdot \rangle$ symbol refers to the inner product of two vectors.
Re to Q2: We can use an average value for calculation. Mathematically, the only difference between calculating the average gradient and the summation of gradients from the validation set is the normalization factor. However, the absolute values of influence scores are not critical in themselves because these scores are only for ranking purposes.
Re to Q3: In the active sampling setting, the true labels are queried only for the selected data. Due to the large size of the unlabeled set, we first use proxy labels to select potentially “good” examples before querying their true labels, which reduces the labeling budget. Once the true labels are received, because true and proxy labels may differ, we double-check whether the influence scores of those samples, recalculated with the true labels, still meet the condition.
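A minimal sketch of this two-stage procedure (illustrative only; `score_fn` and `query_true_label` are assumed helpers, not the authors' implementation):

```python
# Pre-select with proxy labels, query true labels only for the shortlist,
# then re-check the influence condition with the true labels.
def select_and_verify(candidates, model, score_fn, query_true_label, r):
    # 1) Rank unlabeled candidates by influence computed with proxy labels.
    shortlist = sorted(candidates,
                       key=lambda x: score_fn(model, x, label=None),
                       reverse=True)[: 2 * r]   # over-select to allow rejections

    accepted = []
    for x in shortlist:
        y_true = query_true_label(x)            # labeling budget is spent here only
        # 2) Double-check the influence condition using the true label.
        if score_fn(model, x, label=y_true) > 0:
            accepted.append((x, y_true))
        if len(accepted) == r:
            break
    return accepted
```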
Re to Q4: We want to clarify that it is not an assumption but rather a general distribution property. Without making any additional assumptions about the training or testing distributions, this property acts as a direct tool for assessing the differences between the train and test distributions.
Re to Q5: The group gap refers to the corresponding term in Theorem 5.2. Achieving more balance means reducing the distance, i.e., the disparity between the two group-conditional distributions, thus allowing for a reduction in the first term of the bound.
Re to Q6: Resampling the data to balance it across classes and groups is a standard preprocessing step that helps prevent overfitting to the majority class. Class imbalance can introduce biases between groups, affecting model performance and fairness. By ensuring a balanced distribution of classes within each group, we can evaluate the model's performance more accurately and fairly.
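A minimal pandas sketch of such group/class-balanced resampling (the column names `label` and `group` are assumptions for illustration):

```python
import pandas as pd

def balance_by_group_and_class(df, label_col="label", group_col="group", seed=0):
    """Downsample every (group, class) cell to the size of the smallest cell,
    so each group/class combination is equally represented."""
    cell_size = df.groupby([group_col, label_col]).size().min()
    return (df.groupby([group_col, label_col], group_keys=False)
              .apply(lambda cell: cell.sample(n=cell_size, random_state=seed))
              .reset_index(drop=True))
```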
The authors propose a method called Fair Influential Sampling (FIS), which selects candidate data points for training based on their estimated improvement on both fairness and accuracy. This estimation considers the influence of a new data point as the average gain in accuracy and fairness achieved by performing a gradient descent step with the new data point. This average is empirically computed using a labeled validation dataset. The gain in loss (or fairness) is approximated as the inner product between the gradient of the loss (or fairness) with respect to the parameters evaluated on the new data point and each validation point.
The proposed influence metric does not require access to the sensitive attributes of the candidate data points. Additionally, when labels for the target variable are unavailable for candidate data points, the authors suggest producing proxy labels by annotating the most confident class. This allows the estimation of the accuracy influence score. If a data point is selected for further training, a user must then perform the actual labeling.
The selected data points are those that improve the fairness influence score without negatively impacting the average accuracy score on the validation set. Experimental results demonstrate that the proposed approach can select training data points that enhance fairness without reducing accuracy compared to a random sampling strategy. Furthermore, under the constraint of maintaining an accuracy equal to or better than a "random" model, FIS achieves superior fairness performance compared to other competing methods.
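To make the gradient-inner-product scoring concrete, here is a minimal PyTorch-style sketch (not the authors' code; the `fairness_loss` callable and the selection rule are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def flat_grad(loss, params):
    """Gradient of `loss` w.r.t. `params`, flattened into a single vector."""
    grads = torch.autograd.grad(loss, params, retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

def influence_scores(model, x_cand, val_x, val_y, val_a, fairness_loss):
    """Score one candidate (using its most-confident proxy label) against the
    labeled validation set; returns (accuracy_influence, fairness_influence)."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Validation-set gradients of the accuracy (CE) and fairness losses.
    val_logits = model(val_x)
    g_acc_val = flat_grad(F.cross_entropy(val_logits, val_y), params)
    g_fair_val = flat_grad(fairness_loss(val_logits, val_y, val_a), params)

    # Candidate gradient of the CE loss with its proxy (most-confident) label.
    logits = model(x_cand.unsqueeze(0))
    proxy_y = logits.argmax(dim=1)
    g_cand = flat_grad(F.cross_entropy(logits, proxy_y), params)

    # A positive inner product means one gradient step on the candidate is
    # expected to reduce the corresponding validation loss.
    return torch.dot(g_cand, g_acc_val).item(), torch.dot(g_cand, g_fair_val).item()
```

Candidates with a positive fairness influence and a non-negative accuracy influence would then be shortlisted for true-label queries.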
Strengths
The paper is well-written, addressing a relevant problem with clear formulation and proposed methods. The authors provide theoretical justification for the approximations used in computing their influence scores and offer formal analysis on the impact of adding more data on model generalization.
Experimental results demonstrate that the proposed active sampling approach performs as expected. It does not harm accuracy compared to a base ERM model or random sampling. Moreover, the selected data points improve fairness under this constraint. Additionally, the proposed approach does not necessarily require labeled data, which is advantageous as only ground truth labels for the accepted data are needed. This can motivate the use of this method.
That being said, I am not entirely familiar with the extensive state-of-the-art literature on active sampling to fully evaluate the novelty of the proposed approach.
Weaknesses
It seems to me that the data selection strategy based on the proposed influence scores is approximating the process of training with the validation set. I believe it would be beneficial to compare the performance of a model trained with this labeled validation set (without sampling and with random sampling) to demonstrate the advantage of using a data sampling strategy based on the validation set instead of simply including it as part of the training data.
Questions
Could the authors indicate if the results in Table 1 are computed with respect to a test dataset that does not include the one used for validation and sample selection? I ask this because in line 330, the paper states that "10% of the test data is randomly set aside as a hold-out validation set." I assume this portion is not used for evaluation.
Typo in Eq 4, shouldn’t \ell be \phi?
Limitations
yes.
We want to thank the reviewer for their positive feedback and comments. We will address individual comments below.
Response to weakness: Thank you for raising this good point. We appreciate the feedback and apologize for not emphasizing this aspect in the main text of our paper. We have conducted several experiments, detailed in the appendix, to highlight this key point. Specifically, in Appendix E.5, we introduced a new baseline, termed 'random+val', which trains the model on a dataset created through random sampling combined with the full validation set. As shown in Table 8, including the validation data as part of the training set increases the EOp violation. Our experimental results demonstrate the superiority of our sampling method over this new baseline.
Response to Q1: Thank you for pointing out the concern regarding data leakage. The "original" test dataset has been carefully divided into two independent portions: a new test dataset and a validation set. Therefore, the validation data will not be used for evaluation.
Response to Q2: Equation (4) is correct. In this equation, $\ell$ represents the typical cross-entropy loss, while $\phi$ denotes the fairness loss. Intuitively, we use the inner product of the gradients of these two loss functions for sampling, as shown in Eq. (4). For instance, if the gradient of the fairness loss for a particular sample aligns with that of the accuracy loss, incorporating this sample into further training could simultaneously reduce both the accuracy loss and the fairness loss.
This paper proposes FIS, a tractable active data sampling method that only requires group annotations on the validation set rather than the entire training set. The proposed algorithm evaluates and scores each data sample based on its influence on fairness and accuracy using the validation set, then selects a subset of examples to be added to the training set. The influence is evaluated by comparing the sample's gradient to the one derived from the validation set. This work formally analyses how more data improves fairness, and provides an upper bound on the generalisation error and risk disparity. The effectiveness of FIS is empirically validated on real-world datasets against various baselines.
- Merits: The reviewers found the paper to be well-written with a clear and strong motivation, addressing a relevant problem with minimal sensitive information required, enhancing its real-world applicability. It presents solid theoretical analysis, clear formulations, and sufficient empirical results. The paper also provides enough details to ensure reproducibility.
Revisions for the camera-ready: The authors should incorporate their responses from the rebuttal and discussion periods into the revised version. In particular:
- Related work: The related work section needs enhancement with the missing literature mentioned by the reviewers [1]-[3] and additional context on risk disparity (i.e., W3 from Reviewer wZhW) in the early sections.
- Text revisions: Some claims need revision (e.g., line 330 (Q1 from Reviewer KagD), lines 107-108 (W1 from Reviewer Pisk)), and a clear definition of fairness violation should be included (Re to W3, Reviewer wZhW). Also, please fix typos such as m-W2 and m-W6 to m-W10 from Reviewer Pisk, and the number of datasets in the main text.
- Results presentation: SAC, Reviewers wZhW and Pisk, and AC highlighted that it is difficult to assess the significance of the proposed approach based on how the trade-off results are presented in Table 1. The presentation of Table 1 needs improvement. This could be addressed by turning these results into a trade-off curve (using data also from Figures 2, 4, 5, and 6 if needed) or into points in the trade-off plane. If that is not feasible, please adjust the bolding scheme and provide more context in the main text and figure caption based on your discussions with Reviewer wZhW in W3. Additionally, the authors should include pointers to, and a brief description of, the results presented in the appendix within the main manuscript.
- Limitations: Some reviewers raised concerns about the need for a representative validation dataset, but this issue is already discussed by the authors (lines 372-373) and in the rebuttal. However, the authors should add their response on the clean and informative validation set to Reviewer cxN7 to the respective discussion in the main text, or to the appendix.
- Appendix: Include the response to W5 for Reviewer Pisk and the response on the correlation of non-demographic features and demographic information for Reviewer cxN7 in the appendix, along with relevant pointers in the main text.
[1] Zhao, Tianxiang, et al. "Towards fair classifiers without sensitive attributes: Exploring biases in related features." Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 2022.
[2] Yan, Shen, Hsien-te Kao, and Emilio Ferrara. "Fair class balancing: Enhancing model fairness without observing sensitive attributes." Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2020.
[3] Chai, Junyi, Taeuk Jang, and Xiaoqian Wang. "Fairness without demographics through knowledge distillation." Advances in Neural Information Processing Systems 35 (2022): 19152-19164.