PaperHub
Overall rating: 5.3/10 (Rejected, 3 reviewers)
Individual ratings: 5, 6, 5 (min 5, max 6, std 0.5)
Confidence: 3.0 · Correctness: 2.3 · Contribution: 3.0 · Presentation: 3.0
ICLR 2025

Towards Real World Debiasing: A Fine-grained Analysis On Spurious Correlation

OpenReview · PDF
Submitted: 2024-09-25 · Updated: 2025-02-05
TL;DR

We raise and explore the problem of debiasing under real-world scenarios.

Abstract

Keywords
spurious correlation, dataset bias, debias

Reviews and Discussion

Review (Rating: 5)

This paper studies the spurious correlation problem. The authors first identify unrealistic spurious correlation patterns in benchmark datasets. Based on this observation, the authors reveal that the gap between benchmark and real-world datasets misleads the development of debiasing methods. The authors further propose a new debiasing method that enhances existing methods across different biases.

Strengths

  1. The authors identify a key problem in existing benchmark datasets, which misleads the development of existing methods.
  2. The analysis of existing metrics provides insight into understanding existing benchmarks and developing better methods to handle realistic bias.
  3. The authors propose a new debiasing method named bias capture with feature destruction to destroy the effect of target features.

Weaknesses

  1. The motivation for the different choices of metrics is not clear.
  2. Some of the assumptions in the analysis are too strong or unrealistic.

Questions

  1. Line 198: why choose KL divergence here, but use total variation in the later analysis?
  2. Line 198: since the KL divergence is non-symmetric (and probably ill-defined when encountering zero probability), would other metrics, e.g., JS divergence, be a better choice?
  3. Eq. (1): can you switch the roles of $y^t$ and $y^s$, i.e., use $KL(p(y^s), P(y^s|y^t))$? What is the difference between these two formulations?
  4. Is there any justification on whether the two proposed metrics, i.e., magnitude and prevalence, are (1) complementary/orthogonal, i.e., they depict dataset properties from quite different aspects, and (2) complete, i.e., they can completely depict the dataset properties.
  5. Line 269, data distribution: the analysis assumes that both target and spurious attributes are binary; can your analysis be generalized to multi-class/non-binary settings?
Comment

Weakness 2: Concern for the assumptions in theoretical analysis

Question 5: Line 269, data distribution: the analysis assumes that both target and spurious attributes are binary; can your analysis be generalized to multi-class/non-binary settings?

Using binary tasks as the setup for theoretical analysis is not our "strong assumption" but rather a common practice in the field of machine learning. In fact, many works in this field that conduct theoretical analysis are based on binary tasks. To name a few that have recently been accepted by ICLR:

  • "Entropy is not Enough for Test-Time Adaptation: From the Perspective of Disentangled Factors", ICLR 2024 (spotlight)
  • "On the Foundations of Shortcut Learning", ICLR 2024 (spotlight)
  • "Feature reconstruction from outputs can mitigate simplicity bias in neural networks", ICLR 2023

In fact, the multi-class task can be considered a special and simplified case of the multi-label classification task, where multiple labels/classes can be true simultaneously. Specifically, the multi-label task is essentially an ensemble of binary classification tasks, and is implemented by conducting binary classification for each of the multiple classes. This is also one of the reasons why theoretical analysis in the binary setting is a common practice in this field and considered sufficient.

Furthermore, as stated in Section 2.2, the magnitude of bias for each feature/class varies greatly in both binary and multi-class tasks, thus bias should be analyzed for each feature/class separately rather than averaged across all features. Specifically, for multi-class tasks, we should also cast them into binary settings according to the specific feature/class we are analyzing, i.e., the feature of interest vs. not the feature of interest, similar to the idea of a multi-label task rather than a multi-class task.

Additionally, thanks to the advice from reviewers yHqh and WrYr, we have notably broadened the scope and impact of this work to the NLP domain and bias detection tasks as well; details can be seen in Section 5 of the revised paper. Thus we kindly ask you to take these notable improvements into consideration as well when revisiting the score.

Comment

Thank you for your comments and questions! We provide point-by-point responses below.

Weakness 1: Request for clarification on the motivation of different choices of metrics.

Question 1: Line 198: why choose KL divergence here, but use total variation in the later analysis?

In Section 2, we define the empirical version of bias magnitude with KL divergence because it is general and applicable not only to discrete distributions but to continuous distributions as well. This is also why KL divergence is widely adopted by researchers in many fields.

However, when conducting theoretical analysis, the key is to capture the essence of the problem with clean expressions rather than pursuing redundant complexity. Thus we further defined a simplified version of bias magnitude based on total variation, which is appropriate for the binary setting of our theoretical setup. (Note that the binary theoretical setup is not our assumption but rather a common practice for theoretical analysis in machine learning; we elaborate on this further in our response to Weakness 2.) After all, simplification is a common practice in theoretical analysis.

Furthermore, in our case, the total variation distance is a reasonable simplification of the KL-divergence-based metric under the binary setting, for it also fully satisfies the 4 key points we proposed in Section 2.2 that an appropriate measure should satisfy.
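To make the connection concrete, consider the binary setting of our analysis, where both distributions are Bernoulli. For $p=\mathrm{Ber}(a)$ and $q=\mathrm{Ber}(b)$ (an illustrative relation rather than a restatement of Eq. (1) in the paper),

$$\mathrm{TV}(p,q)=|a-b|, \qquad KL(p\,\|\,q)=a\log\frac{a}{b}+(1-a)\log\frac{1-a}{1-b},$$

and Pinsker's inequality gives $\mathrm{TV}(p,q)\le\sqrt{KL(p\,\|\,q)/2}$. The two quantities are thus tightly coupled in the binary case, while total variation admits the simpler closed form $|a-b|$.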

Question 2: Line 198: since the KL divergence is non-symmetric (and probably ill-defined when encountering zero probability), would other metrics, e.g., JS divergence, be a better choice?

As stated in the 4th of the 4 key points we proposed in Section 2.2 that an appropriate measure should satisfy, the conditional distributions on spurious features can be viewed as distributions that have diverged from the marginal distribution; the relationship is therefore directional, and measuring it in the reverse direction is not reasonable. Consequently, given this directional relationship, the asymmetric KL divergence is an appropriate choice.

As for zero probabilities, this case is unlikely to arise, at least not within the scope of interest of this work, as also suggested by the popularity of KL divergence across various fields and problems.

Question 3: Eq. (1): can you switch the roles of $y^t$ and $y^s$, i.e., use $KL(p(y^s), P(y^s|y^t))$? What is the difference between these two formulations?

As stated in the 2nd of the 4 key points we proposed in Section 2.2 that an appropriate measure should satisfy, the statuses of the spurious attribute and the target attribute are also asymmetric. As stated in various works [1,2,3], the spurious features should be easier/more available than the target features in order to act as a shortcut [1], and the correlation will not be captured by the model in the reverse direction [2]. Consequently, when measuring spurious correlations, the spurious features should be given as the condition because they are more available to the model when learning the biases.

[1] "On the foundations of shortcut learning", ICLR 2024

[2] "Learning from failure: Debiasing classifier from biased classifier", NeurIPS 2020

[3] "What shapes feature representations? Exploring datasets, architectures, and training", NeurIPS 2020

Question 4: Is there any justification on whether the two proposed metrics, i.e., magnitude and prevalence, are (1) complementary/orthogonal, i.e., they depict dataset property from quite different aspects, and (2) complete, i.e., they can completely depict the dataset properties.

In terms of complementary/orthogonal, bias magnitude and prevalence are not completely orthogonal, as the prevalence is dependent on the magnitude. Specifically, magnitude measures the level of bias at the feature level and prevalence at the dataset level, and it's obvious that a dataset cannot be biased if all of its features are unbiased.

As for the completeness of the metrics, we do not consider bias magnitude and prevalence complete, given the fact that data is an extremely complicated subject that remains to be further studied. Neither do we believe any metrics can "completely depict" the properties of data.

In conclusion, regarding Weakness 1, we believe we have clearly and explicitly stated the motivation of the metrics in the 4 key points we proposed in Section 2.2. Please let us know if there are any further questions.

Comment

Thank you for the detailed response. I will keep my score for now.

Comment

Thank you for your response. Can the reviewer please elaborate on the reasons why a negative attitude is still held towards the paper? We believe all concerns proposed by the reviewer are well-addressed (if not, please let us know), and the contributions of the paper are further solidified by the extension to the NLP domain and bias detection task, which broadens the scope and impact of the work.

Furthermore, we suspect that the 4 key points in Section 2 were not emphasized boldly enough given their importance, which may have caused confusion about the logic and motivation of the work. Consequently, we have further highlighted the 4 key points in Section 2 to avoid any further confusion for readers.

Comment

Dear reviewer vvhf:

Again, we appreciate the time and effort the reviewer has devoted to reviewing the paper and would like to ask for detailed feedback on the remaining concerns for acceptance of the paper.

We suspect that the reviewer was previously worried about whether the concerns raised by the other two reviewers are well-addressed, thus keeping the score "for now".

To keep the reviewer updated, we note that both reviewers yHqh and WrYr have considered most of the concerns addressed without any further concerns or questions after reading the rebuttal, and both reviewers yHqh and WrYr have raised the score.

Specifically, reviewer WrYr raised the score and is in favor of acceptance. Reviewer yHqh considers most of the concerns well addressed and raised the score by two points without proposing any further questions or concerns.

Despite our suspicion, if there are any specific additional questions or concerns regarding the paper, we highly value the reviewer's feedback and are more than willing to address them during the extended discussion stage.

Best regards,

The authors

Review (Rating: 6)

The authors mainly aim to address the discrepancy between benchmark datasets and real-world datasets. Specifically, they observe that real-world datasets exhibit low prevalence, while commonly used datasets have high prevalence. They theoretically support this observation by demonstrating that high prevalence can occur under two unrealistic assumptions. However, in low-prevalence scenarios, the auxiliary models might fail because bias-neutral and bias-conflicting samples constitute the majority, leading these models to learn target attributes too quickly. To address this issue, the authors propose DiD, which disrupts the auxiliary models from learning target attributes through patch shuffling. In experiments, they demonstrate that their method significantly improves upon existing methods across all settings.

Strengths

  • By analyzing the datasets from the perspective of prevalence and magnitude, the authors identify significant differences between real-world datasets and benchmark datasets.
  • The authors provide theoretical support for this observation.
  • The paper is well-written.

Weaknesses

  • Although a theoretical background is provided, the experimental evidence appears limited. It would strengthen the work to demonstrate low prevalence on additional datasets, such as an NLP dataset and CelebA, commonly used in the spurious correlation domain. In particular, CelebA is not suggested for evaluating the methods for handling spurious correlations, yet it has a substantial amount of data and exhibits strong biases despite being a natural dataset. Demonstrating that its prevalence aligns with the claim would strengthen the argument. Furthermore, it would be helpful to show the performance in CelebA and NLP tasks as well.
  • For B2T [1], which does not rely on the easy-to-learn property of bias, there seems to be no reason for it to fail in low prevalence settings. A discussion of the advantages of the proposed method over B2T in such scenarios would be required. Additionally, since B2T is a recent approach, it would enhance the paper to include it as a baseline on natural datasets.
  • DiD lacks general applicability. Specifically, it relies on the assumption that bias will be insensitive to patch-shuffling, while the target will be sensitive to it. However, this approach may not work for other types of semantic biases beyond background or perturbation. Furthermore, this method seems to be difficult to use in NLP tasks.
  • The attempt to improve the performance in low-bias settings while keeping the performance in high-bias settings is not the first [2].
  • In Table 3, although the proposed method is designed to perform well in low prevalence settings, it shows lower performance than ERM in the unbiased setting.
  • I couldn’t find a clear description of the metric used for Waterbird in Table 2 in the main paper. Given the reported performance, it seems that either a different metric from the commonly reported worst-group accuracy was used, or the setting doesn’t align with prior methods. It would enhance the evaluation to show whether worst-group accuracy improves when using the same experimental setup as JTT [3].

[1] Kim, Younghyun, et al. "Discovering and Mitigating Visual Biases through Keyword Explanation." CVPR, 2024.

[2] Jung, Yeonsung, et al. "Post-Training Recovery from Injected Bias with Self-Influence.".

[3] Liu, Evan Z., et al. "Just train twice: Improving group robustness without training group information." ICML, 2021.

Questions

  • Including results from GroupDRO [1] as an upper bound for comparison would be helpful.

[1] Sagawa, Shiori, et al. "Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization." ICLR, 2020.

Comment

Weakness 5: Concern for the performance in the unbiased setting.

In Table 3, although the proposed method is designed to perform well in low prevalence settings, it shows lower performance than ERM in the unbiased setting.

We note that DiD is not a self-contained debiasing method but rather a plug-in module for existing debiasing methods that improves performance over the base method. In other words, the performance/effectiveness of DiD should be measured as the performance gain over the base debiasing method.

Once we establish the correct measure for the effectiveness of DiD, we can see from Table 4 that DiD is in fact particularly effective in improving performance in the unbiased setting. Additionally, in theory, the ERM method should be optimal under the unbiased setting, where no distribution shift between the train and test set exists, and any sample reweighting in fact introduces an artificial distribution shift in this case, causing performance degradation. However, in reality, we do not know how severe the bias is or whether the data is biased at all, thus bias severity should not be assumed as a prior in order to conclude that ERM is better.

Furthermore, as you pointed out, the final debiasing performance is still lower than ERM in the unbiased setting despite the large performance boost from DiD. We believe the message here is not the ineffectiveness of DiD but rather the following:

  • Existing methods place too much emphasis on hypothesized HP distributions that deviate from reality.
  • Despite DiD being an effective fix to existing methods, it is still not completely satisfactory, requiring future effort on this problem.

These are the two key messages we are attempting to send to the community, representing the core contributions of this work.

Weakness 6: Request for the evaluation of WaterBirds using the same experimental setup as JTT.

I couldn’t find a clear description of the metric used for Waterbird in Table 2 in the main paper. Given the reported performance, it seems that either a different metric from the commonly reported worst-group accuracy was used, or the setting doesn’t align with prior methods. It would enhance the evaluation to show whether worst-group accuracy improves when using the same experimental setup as JTT [3].

The metric we used for WaterBirds is worst-group accuracy. However, the experimental setup differs from that in JTT, as we used ResNet18 for the experiments while JTT used ResNet50. We have added the metric details for Table 2 in our revised paper.
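For clarity, worst-group accuracy is computed as sketched below (a minimal illustration with hypothetical array names, not our evaluation code):

```python
import numpy as np

def worst_group_accuracy(preds: np.ndarray, labels: np.ndarray, groups: np.ndarray) -> float:
    """Worst-group accuracy: the minimum per-group accuracy, where a group is a
    (target label, spurious attribute) combination, e.g. the 4 groups of WaterBirds."""
    accs = []
    for g in np.unique(groups):
        mask = groups == g
        accs.append(float(np.mean(preds[mask] == labels[mask])))
    return min(accs)
```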

Following the reviewer's advice, we further provide results on WaterBirds following the exact same experimental setup and metrics as JTT to enhance the evaluation of DiD. As shown in the table below, DiD is still effective under the exact same experimental setup as JTT.

| Method | Bias supervision | WaterBirds Avg Acc. | WaterBirds Worst-group Acc. |
| --- | --- | --- | --- |
| ERM | No | 78.82 | 31 |
| JTT | No | 90.99 | 65.26 |
| +DiD | No | +3.45 | +17.45 |
| Group DRO | Yes | 92.89 | 83.49 |

Question: Request for adding GroupDRO as a baseline.

Including results from GroupDRO [1] as an upper bound for comparison would be helpful.

As shown above, we have adopted GroupDRO as a baseline for all the additional experimental results across image and NLP datasets. For more details, please refer to Section 5.2 of our revised paper.

Comment

I thank the authors for their detailed replies to my questions and concerns. Most of my concerns are addressed and I will increase my score.

Comment

Weakness 3: Concern for the general applicability of DiD on NLP tasks.

DiD lacks general applicability. Specifically, it relies on the assumption that bias will be insensitive to patch-shuffling, while the target will be sensitive to it. However, this approach may not work for other types of semantic biases beyond background or perturbation. Furthermore, this method seems to be difficult to use in NLP tasks.

We further demonstrate the adaptability of the DiD method in the NLP domain. Specifically, we first introduce the common biases within the NLP domain, followed by a simple design of a feature destruction method for NLP. Then, we validate the effectiveness of the destruction method with empirical results.

The commonly used NLP datasets for debiasing are the MultiNLI and CivilComments-WILDS datasets. Specifically, the bias within the MultiNLI dataset is the correlation between negation words and the entailment labels, and the bias within the CivilComments-WILDS dataset is the correlation between words implying demographic identities and the toxicity labels. The target features of both datasets are the semantic information of the sentences, where the position of words matters, while the spurious features are the individual words, which are insensitive to position. Furthermore, such a position-sensitivity difference between target and spurious features in NLP biases is not limited to these two datasets but is rather quite common. For example, CLIP has also been found to exhibit the "bag of words" phenomenon [1], ignoring the semantic meaning of the inputs and relying on words individually for prediction. As a result, a straightforward approach for feature destruction is to shuffle the words within the sentences.
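For concreteness, both destruction operations (patch shuffling for images, as summarized by the reviewer, and the word shuffling described above for sentences) can be sketched as follows; this is a minimal illustration with hypothetical function names rather than our exact implementation:

```python
import random
import torch

def shuffle_patches(images: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Destroy position-sensitive (target) features by randomly permuting
    non-overlapping patches; position-insensitive spurious features such as
    color, texture, or background statistics largely survive."""
    b, c, h, w = images.shape
    ph, pw = h // patch_size, w // patch_size
    patches = images.reshape(b, c, ph, patch_size, pw, patch_size)
    patches = patches.permute(0, 2, 4, 1, 3, 5).reshape(b, ph * pw, c, patch_size, patch_size)
    patches = patches[:, torch.randperm(ph * pw)]  # same permutation for the whole batch
    patches = patches.reshape(b, ph, pw, c, patch_size, patch_size)
    return patches.permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w)

def shuffle_words(sentence: str) -> str:
    """Destroy word-order-dependent (target) semantics while keeping the bag of
    words, which carries the position-insensitive spurious features."""
    words = sentence.split()
    random.shuffle(words)
    return " ".join(words)
```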

With the appropriate feature destruction method for NLP tasks, we further examine the effectiveness of DiD following the settings of JTT, which is a classic DBAM method applied to the NLP domain. As shown in the following table, DiD effectively improves the debiasing performance even in the NLP domain.

| Method | Bias supervision | MultiNLI Avg Acc. | MultiNLI Worst Acc. | CivilComments-WILDS Avg Acc. | CivilComments-WILDS Worst Acc. | CelebA Avg Acc. | CelebA Worst Acc. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ERM | No | 80.1 | 76.41 | 92.06 | 50.87 | 95.75 | 45.56 |
| JTT | No | 80.51 | 73.02 | 91.25 | 59.49 | 80.49 | 73.13 |
| +DiD | No | +1.06 | +2.71 | +0.38 | +6.41 | +6.43 | +8.5 |
| Group DRO | Yes | 82.11 | 78.67 | 83.92 | 80.2 | 91.96 | 91.49 |

Please refer to Section 5.2 of the revised paper for more details. Again, we believe the adaptability of our approach to real-world datasets from various modalities considerably strengthens the impact of this work, thanks to the reviewers' advice, as most previous works focus on a single modality.

[1] "When and Why Vision-Language Models Behave like Bags-Of-Words, and What to Do About It? ", ICLR 2023

Weakness 4: Concern about comparison to previous work focusing on low-bias settings.

The attempt to improve the performance in low-bias settings while keeping the performance in high-bias settings is not the first [2].

While [2] does stress the problem of bias severity and the problem of precision in bias detection, it does not dive deeper into these problems and thus fails to answer the critical questions underlying them, which we put considerable effort into. To name a few of these critical questions:

  • When we talk about debiasing in low-bias settings, why should we emphasize low-bias settings in the first place? (In comparison, our work provides grounded empirical and theoretical results to answer this question, i.e., low-bias settings are more aligned with real-world biases.)
  • When we talk about the severity of biases, is the naive measure of bias severity in previous works really convincing, and does it capture the complexities of real-world biases? (In comparison, we provide a fine-grained analysis and evaluation framework for measuring the "severity" of the datasets.)

In conclusion, while [2] shares a motivation similar to that of our technical design in Section 3, their motivation is taken for granted. In comparison, our motivation is derived from our observation of real-world datasets and is well established through thorough empirical and theoretical results.

In fact, as the motivation in [2] is taken for granted and lacks further justification, we believe the results in our work strongly support and justify their motivation.

[2] Jung, Yeonsung, et al. "Post-Training Recovery from Injected Bias with Self-Influence.".

Comment

Thank you for your comments and questions! We provide point-by-point responses below.

Weakness 1: Request for strengthening the experimental evidence with CelebA and NLP datasets.

Although a theoretical background is provided, the experimental evidence appears limited. It would strengthen the work to demonstrate low prevalence on additional datasets, such as an NLP dataset and CelebA, commonly used in the spurious correlation domain. ... Furthermore, it would be helpful to show the performance in CelebA and NLP tasks as well.

Following the reviewer's advice, we further demonstrated the low-prevalence characteristic on 1 additional real-world image dataset (CelebA) and 2 real-world NLP datasets (MultiNLI and CivilComments-WILDS). Please refer to Figure 2(b) of the revised paper.

The results further demonstrate that the low bias magnitude and low bias prevalence characteristics are not mere exceptions but a manifestation of underlying principles that generalize to various modalities (image, language, and structured data). This further broadens the impact of this work, thanks to the reviewers' advice.

Furthermore, as the reviewer suggested, we further validated the effectiveness of DiD on CelebA and the two NLP datasets in our following response to Weakness 2 and Weakness 3. Please refer to Section 5.2 of the revised paper for detailed results.

Weakness 2: Request for comparison to the recently proposed B2T method.

For B2T [1], which does not rely on the easy-to-learn property of bias, there seems to be no reason for it to fail in low prevalence settings. A discussion of the advantages of the proposed method over B2T in such scenarios would be required. Additionally, since B2T is a recent approach, it would enhance the paper to include it as a baseline on natural datasets.

Firstly, we point out that B2T is also a debiasing-with-biased-auxiliary-model (DBAM) method and thus still implicitly relies on the easy-to-learn property of bias. Specifically, the CLIP-score-based bias keyword identification proposed in B2T also requires training a biased auxiliary model to define the error dataset, which is a critical part of the CLIP score calculation. Although the auxiliary model is trained with ERM, i.e., with CE loss rather than the GCE loss used in other works, B2T still implicitly assumes that the bias is learned or captured by the auxiliary model. This is because only if the auxiliary model is biased can the error set defined based on it contain bias information, which can then be extracted based on the CLIP score.

In comparison, DiD is not a self-contained debiasing method, but rather a plug-in module for any given DBAM method, including B2T. Consequently, the improvements of B2T and DiD are orthogonal and compatible with each other, making a direct comparison between the two methods rather difficult (if necessary at all).

Furthermore, as suggested by the reviewer, we evaluate the effectiveness of DiD when applied to B2T on the natural dataset CelebA. As B2T is an upstream method focusing on bias identification for downstream supervised debiasing methods, we evaluate how DiD further improves the bias identification ability of B2T. Specifically, in the CelebA dataset, the bias keyword "actor" (a proxy for male) for the Blond class and the bias keyword "actress" (a proxy for female) for the Not Blond class can be viewed as ground truth for the bias identification task defined in B2T. We use the CLIP score and subgroup accuracy to indicate the confidence of B2T on the ground truth. As shown in the following table, the bias identification ability of B2T is also effectively boosted by DiD.

| Method | Blond: "Actor" CLIP Score ↑ | Blond: "Actor" Subgroup Acc. ↓ | Not Blond: "Actress" CLIP Score ↑ | Not Blond: "Actress" Subgroup Acc. ↓ |
| --- | --- | --- | --- | --- |
| B2T | 0.125 | 86.71 | 2.188 | 97.11 |
| B2T + DiD | 0.188 | 85.29 | 2.297 | 95.81 |

To further demonstrate that DiD indeed effectively improves the quality of the error dataset through the biased auxiliary model, we adopt the worst-group precision and recall in the error set as metrics, following JTT. As shown in Appendix Figure 3 of the revised paper, DiD consistently improves both the precision and recall during the training of the auxiliary model.
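For reference, these metrics are computed in the spirit of JTT's error-set analysis, as sketched below (a minimal illustration with hypothetical array names, not our evaluation code):

```python
import numpy as np

def error_set_precision_recall(in_error_set: np.ndarray, is_worst_group: np.ndarray):
    """Precision: fraction of error-set samples that truly belong to the worst
    (bias-conflicting) group. Recall: fraction of worst-group samples that the
    biased auxiliary model places in the error set. Both inputs are boolean masks."""
    tp = np.sum(in_error_set & is_worst_group)
    precision = tp / max(np.sum(in_error_set), 1)
    recall = tp / max(np.sum(is_worst_group), 1)
    return float(precision), float(recall)
```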

For more details, please refer to Appendix E.3 of our revised paper. We believe showcasing that the effectiveness of DiD is not limited to the debiasing task, but also adaptable in the relevant bias detection task further solidifies the contribution of the work, broadening the scope and impact, thanks to the reviewers' advice.

Comment

We sincerely appreciate the reviewer’s insightful suggestions and the decision to raise the scores.

Given that all the sub-scores are rated "3: good", yet the overall rating is "6: marginally above the acceptance threshold", we wonder if there are any further concerns from the reviewer. Any further suggestions or questions regarding the paper are welcomed and we are more than willing to address them in the remaining time of the rebuttal stage.

Review (Rating: 5)

This paper examines whether existing benchmarks can effectively capture biases present in real-world datasets. To address this, the authors introduce a new evaluation framework that assesses dataset bias from the perspectives of magnitude and prevalence. Experimental results indicate that current real-world datasets primarily exhibit a low magnitude, low prevalence distribution, highlighting a gap compared to synthetic benchmarks. Furthermore, the authors discuss the debiasing effects of existing DBAM methods under three different bias distributions. They also propose a new debiasing method, DiD, which enhances the bias capture module and improves the accuracy of bias feature learning through a feature destruction approach.

Strengths

  1. The motivation for this research is sufficient, as there is widespread concern about whether existing benchmarks can truly reflect the biases present in real-world datasets.
  2. This paper presents a fine-grained framework that evaluates dataset bias through the metrics of magnitude and prevalence, offering a multi-dimensional approach.
  3. The authors highlight unique characteristics of real data distribution, which is essential for understanding the effectiveness of debiasing methods in practical scenarios.
  4. They enrich the diversity of synthetic datasets, adding valuable resources for further research in debiasing.

Weaknesses

  1. Clarity and Presentation Issues
    • Line 160: Are you referring to "measure spurious correlation according to the probability of the correlated class $a_t$ within samples with biased feature $a_s$"?
    • Figure 2b may contain a scaling error on the x-axis that could impact result interpretation.
    • Some descriptions are ambiguous. For instance, the term “realistic” in Table 2 might misleadingly imply real-world data, whereas it refers to synthetic datasets. Clarifying this distinction would improve reader understanding. In Appendix D.2, the BAR dataset is described as "a real-world dataset," while in Section 2.1, it is referred to as a "semi-synthetic dataset."
  2. Limitations in the Scope of the DiD Method
    • The feature destruction in the DiD method has a limited scope of applicability. While we acknowledge its rationale in visual contexts, identifying or destroying target features is challenging in more general scenarios. Could the authors provide potential solutions for feature destruction in structured real-world datasets (e.g., Adult and COMPAS) or within the NLP domain?
  3. The experiments are insufficient
    • The selection of baselines is limited. Why are comparisons made only with DBAM methods? Would it not be beneficial to include comparisons with data generation methods as well?
    • As mentioned in Section 2, real-world data is primarily characterized by LP bias distributions. In the results presented in Table 1, why does the DiD method underperform compared to the ERM method under LP distribution?
    • Considering the importance of BC samples in debiasing, why does the DiD method exhibit negative transfer in accuracy on BC samples in some experiments?

Questions

  1. In Section 4.2, although BN samples have lower weights under LP conditions, is it possible that, due to their being learned more frequently than other samples, the model still captures the information contained in BN samples without overlooking it?
  2. As mentioned, real-world data primarily exhibits LMLP bias distributions. Are there experimental results for Emphasis on BN samples under LMLP conditions?
  3. Could the authors provide experimental results for the DiD method on real-world datasets to validate its effectiveness on actual data?

If the authors are willing to address our concerns, we are open to revising the scores during the rebuttal stage.

Comment

Question 1:

In Section 4.2, although BN samples have lower weights under LP conditions, is it possible that, due to their being learned more frequently than other samples, the model still captures the information contained in BN samples without overlooking it?

As you pointed out, due to the large proportion of BN samples in LP distributions, the model can still learn from BN samples despite their lower weights. However, the final model is learned from all the training samples. This means that while BA samples guide the model towards relying on spurious features, BN and BC samples guide the model towards target features. Thus, from an adversarial perspective, the weight of the BN samples matters; at the very least, lowering their weights is likely to cause performance degradation.
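For concreteness, recall the relative-difficulty weight used by LfF-style reweighting (sketched from the LfF paper; notation may differ slightly from ours):

$$W(x)=\frac{\mathrm{CE}\big(f_B(x),y\big)}{\mathrm{CE}\big(f_B(x),y\big)+\mathrm{CE}\big(f_D(x),y\big)},$$

where $f_B$ is the biased auxiliary model and $f_D$ the debiased model. When $f_B$ fits BN samples well, as it tends to under LP distributions where BN samples dominate, their numerator is small and their weights are pushed down, which is exactly the down-weighting effect discussed above.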

Additionally, from an empirical perspective, as shown in Table 1, LfF indeed underperforms ERM in LP distributions, and the reweighted data is its only difference from ERM. This means that if the model were capable of effectively learning from BN samples with low weights, such a large performance gap should not exist.

In conclusion, we consider the hypothesis proposed by the reviewer not likely to be true given our analysis and the empirical evidence shown.

Question 2:

As mentioned, real-world data primarily exhibits LMLP bias distributions. Are there experimental results for Emphasis on BN samples under LMLP conditions?

To answer the reviewer's question, we have further examined the effectiveness of DiD in emphasizing BN samples under the LMLP conditions. As shown in Figure 9 in the Appendix of the revised paper (page 27), DiD consistently emphasizes BN samples under the LMLP distribution across datasets and algorithms.

Question 3:

Could the authors provide experimental results for the DiD method on real-world datasets to validate its effectiveness on actual data?

We further provide experimental results on 2 real-world NLP datasets (MultiNLI and CivilComments-WILDS) and 1 real-world image dataset (CelebA) as follows. The results consistently demonstrate the effectiveness of DiD on real-world datasets across different modalities.

| Method | Bias supervision | MultiNLI Avg Acc. | MultiNLI Worst Acc. | CivilComments-WILDS Avg Acc. | CivilComments-WILDS Worst Acc. | CelebA Avg Acc. | CelebA Worst Acc. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ERM | No | 80.1 | 76.41 | 92.06 | 50.87 | 95.75 | 45.56 |
| JTT | No | 80.51 | 73.02 | 91.25 | 59.49 | 80.49 | 73.13 |
| +DiD | No | +1.06 | +2.71 | +0.38 | +6.41 | +6.43 | +8.5 |
| Group DRO | Yes | 82.11 | 78.67 | 83.92 | 80.2 | 91.96 | 91.49 |

Please refer to Section 5.2 of the revised paper for more details. Again, we believe the adaptability of our approach to real-world datasets from various modalities considerably strengthens the impact of this work, as most previous works focus on a single modality.

Comment

Weakness 3.2: Concern about the performance compared to ERM

As mentioned in Section 2, real-world data is primarily characterized by LP bias distributions. In the results presented in Table 1, why does the DiD method underperform compared to the ERM method under LP distribution?

The statement that "the DiD method underperform compared to the ERM method under LP distribution" is in fact inaccurate and misleading. We note that DiD is not a self-contained debiasing method but rather a plug-in module for existing debiasing methods that improves performance over the base method. In other words, the performance/effectiveness of DiD should be measured as the performance gain over the base debiasing method.

Once we establish the correct measure for the effectiveness of DiD, we can see from Table 1 that DiD is in fact particularly effective under LP distributions, which characterize real-world data.

Furthermore, as you pointed out, the final debiasing performance is still lower than ERM under LP distributions despite the large performance boost from DiD. We believe the message here is not the ineffectiveness of DiD but rather the following:

  1. Existing methods place too much emphasis on hypothesized HP distributions that deviate from reality.
  2. Despite DiD being an effective fix to existing methods, it is still not completely satisfactory, requiring future effort on this problem.

These are the two key messages we are attempting to send to the community, representing the core contributions of this work.

Weakness 3.3: Concern for negative transfer in accuracy on BC samples

Considering the importance of BC samples in debiasing, why does the DiD method exhibit negative transfer in accuracy on BC samples in some experiments?

The negative transfer phenomenon appears for LfF when combined with BiasEnsemble and DiD. The root cause is likely that, as an early work, LfF has been found to easily overfit to upweighted BC samples [6,7,8], which is precisely the motivation of many follow-up works such as DisEnt. As a result, when LfF is combined with BiasEnsemble and DiD, both of which are targeted at identifying BC samples more accurately, the exclusive emphasis on BC samples is further strengthened, causing LfF to overfit and underperform relative to its original version. In comparison, DisEnt does not suffer from this problem due to its design to avoid overfitting. Additionally, this phenomenon could likely be resolved by simply lowering the BC emphasis level of LfF through hyper-parameters, but we keep all experiments at the default settings for consistency.

In conclusion, the phenomenon is mainly due to the limitations of the early work LfF, as discussed in the follow-up works [6,7,8], rather than to our approach.

[6] "Learning debiased representation via disentangled feature augmentation.", NeurIPS 2021

[7] "BiaSwap: Removing Dataset Bias with Bias-Tailored Swapping Augmentation", ICCV 2021

[8] "BiasAdv: Bias-Adversarial Augmentation for Model Debiasing", CVPR 2023

Comment

Weakness 3.1: Concern about limited baselines

The selection of baselines is limited. Why are comparisons made only with DBAM methods? Would it not be beneficial to include comparisons with data generation methods as well?

We suspect that it is some of our inaccurate and misleading statements in the limitation section that have caused the misunderstandings of the reviewer. We have corrected the statements along with further clarification in the revised paper. Here, we would like to clarify the misunderstanding as follows.

First of all, and most importantly, DBAM is not merely one of many lines of work in the field that we chose to work on, but rather our summarization of, as far as we know, most (if not all) of the methods in the field of debiasing without bias supervision. In other words, despite the various technical designs discussed in the related work section, most (if not all) existing debiasing-without-bias-supervision methods involve training an auxiliary model for bias capturing. Even the more recent method B2T [3] mentioned by reviewer WrYr still falls within the scope of DBAM. In fact, such a commonly shared DBAM paradigm makes sense, as the task provides no information concerning the bias. Thus, based on the safest assumption that biases must be learnable by a model, training an auxiliary model to capture the biases and further provide some form of supervision for debiasing is reasonable, if not unavoidable.

As for the data generation method mentioned in the limitation section of the previous version, we originally referred to BiaSwap [4] at the early stage of this work. However, as we looked into the design details, we found that BiaSwap is in fact a typical DBAM method sharing the same bias capturing design as LfF. Furthermore, in terms of data generation, the DisEnt baseline adopted in this paper is also a data generation method and is very similar to BiaSwap. Nonetheless, it was our negligence that the limitation section was not updated as this paper evolved, and we apologize for our mistake.

Despite the above, the JTT baseline we added for NLP tasks is also a DBAM method but differs from LfF in its auxiliary model training scheme and sample reweighting method (results in our response to Weakness 2). We believe this can also be considered as diversifying our baselines in some sense.

Additionally, we further test the effectiveness of DiD by applying it to the relevant task of bias detection [3, 5] (whose methods can be considered baselines from another relevant task). Such methods also involve a biased auxiliary model for the identification. To test the effectiveness of DiD on bias identification tasks, we apply DiD to the recently proposed B2T [3] method. Specifically, B2T identifies keywords by calculating their CLIP score, whose calculation involves a biased auxiliary model to define an error dataset, similar to JTT. A keyword is identified as biased if it has a higher CLIP score, and the subgroup defined by it should have lower accuracy.
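As a rough illustration of the kind of score involved (a minimal sketch over precomputed CLIP embeddings with hypothetical names; a proxy rather than B2T's exact formula):

```python
import numpy as np

def keyword_bias_score(keyword_emb: np.ndarray,
                       error_img_embs: np.ndarray,
                       class_img_embs: np.ndarray) -> float:
    """How much more similar a keyword's CLIP text embedding is to the images the
    biased auxiliary model misclassifies (the error set) than to the class as a
    whole; a higher value flags the keyword as a candidate bias keyword."""
    def mean_cosine(text_emb, img_embs):
        text_emb = text_emb / np.linalg.norm(text_emb)
        img_embs = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
        return float(np.mean(img_embs @ text_emb))

    return mean_cosine(keyword_emb, error_img_embs) - mean_cosine(keyword_emb, class_img_embs)
```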

Following [3], we use CelebA as the dataset for bias detection, where the keyword "Actor" (a proxy for Male) is considered ground truth for the class Blond, and the keyword "Actress" (a proxy for Female) is considered ground truth for the class Not Blond. As shown in the table below, by applying DiD to the training of the auxiliary model, we effectively improve both metrics (CLIP score and subgroup accuracy), enhancing B2T's bias detection ability.

| Method | Blond: "Actor" CLIP Score ↑ | Blond: "Actor" Subgroup Acc. ↓ | Not Blond: "Actress" CLIP Score ↑ | Not Blond: "Actress" Subgroup Acc. ↓ |
| --- | --- | --- | --- | --- |
| B2T | 0.125 | 86.71 | 2.188 | 97.11 |
| B2T + DiD | 0.188 | 85.29 | 2.297 | 95.81 |

For more detailed results on our adaptation to the bias identification task, please refer to Appendix E.3 of our revised paper.

To sum up, it is our inaccurate statement in the limitation section that misled the reviewer into questioning the diversity of the baselines included in this paper, while in fact our baselines cover various technical routes in the domain. To further address the reviewer's concern, we additionally include baselines in the NLP domain (JTT) and the supervised debiasing task (Group DRO). We even applied DiD to the relevant bias detection task (B2T), adding up to 8 baselines.

We believe our adaptation to methods beyond the debiasing task and the image modality notably broadens the impact of this work, thanks to the reviewers' advice.

[3] "Discovering and Mitigating Visual Biases through Keyword Explanation", CVPR 2024

[4] "BiaSwap: Removing dataset bias with bias-tailored swapping augmentation", ICCV 2021

[5] "FACTS: First amplify correlations and then slice to discover bias", ICCV 2023

Comment

Thank you for your comments and questions! We provide point-by-point responses below.

Weakness 1: clarity and presentation issues

Thanks again for pointing out the typos and ambiguity within the paper. We have corrected and clarified them in the revised version of the paper marked in blue, specifically:

  • The typo in line 160 is corrected to "measure spurious correlation according to the probability of the correlated class $a_t$ within samples with biased feature $a_s$".
  • We do not observe a scaling error on the x-axis and suspect that the misunderstanding is due to the dashed lines representing real-world datasets not being bold enough and thus not noticed by the reviewer. To avoid such misunderstanding, we increased the line width and made the legend smaller to avoid occlusion within the figure.
  • We have rephrased our description of the BAR and NICO datasets and clearly defined them as semi-synthetic datasets to avoid ambiguity. In Section 5, where Table 2 is, we now refer to these two datasets as "datasets with more complex visual features". We believe this is more accurate and prevents further misunderstanding.

Weakness 2: Concern about the scope of the DiD method

The feature destruction in DiD method has a limited scope of applicability. While we acknowledge its rationale in visual contexts, identifying or destructing target features is challenging in more general scenarios. Could the authors provide potential solutions for feature destruction in structured real-world datasets (e.g., Adult and COMPAS) or within the NLP domain?

We further demonstrate the adaptability of the DiD method in the NLP domain. Specifically, we first introduce the common biases within the NLP domain, followed by a simple design of a feature destruction method for NLP. Then, we validate the effectiveness of the destruction method with empirical results.

The commonly used NLP datasets for debiasing are the MultiNLI and CivilComments-WILDS datasets. Specifically, the bias within the MultiNLI dataset is the correlation between negation words and the entailment labels, and the bias within the CivilComments-WILDS dataset is the correlation between words implying demographic identities and the toxicity labels. The target features of both datasets are the semantic information of the sentences, where the position of words matters, while the spurious features are the individual words, which are insensitive to position. Furthermore, such a position-sensitivity difference between target and spurious features in NLP biases is not limited to these two datasets but is rather quite common. For example, CLIP has also been found to exhibit the "bag of words" phenomenon [1], ignoring the semantic meaning of the inputs and relying on words individually for prediction. As a result, a straightforward approach for feature destruction is to shuffle the words within the sentences.

With the appropriate feature destruction method for NLP tasks, we further examine the effectiveness of DiD following the settings of JTT[2], which is a classic DBAM method applied to the NLP domain. As shown in the following table, DiD effectively improves the debiasing performance even in the NLP domain.

| Method | Bias supervision | MultiNLI Avg Acc. | MultiNLI Worst Acc. | CivilComments-WILDS Avg Acc. | CivilComments-WILDS Worst Acc. |
| --- | --- | --- | --- | --- | --- |
| ERM | No | 80.1 | 76.41 | 92.06 | 50.87 |
| JTT | No | 80.51 | 73.02 | 91.25 | 59.49 |
| +DiD | No | +1.06 | +2.71 | +0.38 | +6.41 |
| Group DRO | Yes | 82.11 | 78.67 | 83.92 | 80.2 |

For more details, the above content has been added to Section 5 of the revised version of the paper, where changes are marked in blue. We believe the adaptation to the NLP domain considerably broadens the impact of this work, thanks to the reviewers' advice.

[1] "When and Why Vision-Language Models Behave like Bags-Of-Words, and What to Do About It? ", ICLR 2023

[2] "Just Train Twice: Improving Group Robustness without Training Group Information", ICML 2021

Comment

Dear reviewer yHqh:

We hope that all the misunderstandings, concerns, and questions have been well-addressed in our detailed response. If the reviewer has any further questions regarding the paper, we are more than willing to address them in the remaining time of the rebuttal stage.

Thank you again for your attention and the time you've invested in reviewing our work. We look forward to your valuable feedback.

Best regards,

The authors

Comment

Thank you for the response and the time you've invested in reviewing the paper! We are glad that the reviewer considered most of the concerns addressed and decided to raise the score.

However, we notice that a negative attitude is still held towards the acceptance of the paper, even though the motivation/strengths/contributions are all well acknowledged by the reviewer (which we sincerely appreciate) and most of the concerns have been addressed.

In fact, among the reviewers, the strengths and contributions are consistently acknowledged, and concerns are addressed without any further questions or concerns from any of the reviewers after reading the rebuttal.

  • In terms of contribution, all reviewers rated "3: good".
  • In terms of presentation, all reviewers rated "3: good". Reviewer WrYr considers "The paper is well-written."
  • In terms of soundness, we carefully addressed all the reviewers' concerns and questions without any further questions or concerns from any of the reviewers after reading the rebuttal. Reviewer WrYr rated "3: good" after the rebuttal.

Thus, we are eager to know the major concerns the reviewer still has that make the reviewer consider the paper below the threshold of acceptance. Your feedback is highly valued. Thanks to the extension of the discussion stage, we are more than willing to address any additional concerns or questions from the reviewer.

Comment

Thanks for the authors' responses, which addressed most of my concerns. During the rebuttal phase, the authors supplemented a significant number of experiments, especially in the NLP domain, making the paper more comprehensive. Regarding the ERM experimental results, the authors provide appropriate explanations. Overall, the motivation behind the analysis is sufficient and meaningful, and the proposed method seems effective, though not perfect. I will raise my score to 5. If there are any questions, I can have further discussions with the other reviewers.

Comment

As the deadline for the authors to upload a revised paper has passed and the second stage of the discussion starts, we summarize the major revisions of the paper along with the discussions we have had with the reviewers so far.

Revisions of the paper

  • We have further demonstrated the scope and general effectiveness of the proposed method by extending the method from the image to the language domain. (Thanks to reviewers yHqh and WrYr)
  • We further included evaluations of the proposed method on 1 real-world image dataset and 2 real-world language datasets. (Thanks to reviewers yHqh and WrYr)
  • We have included 3 additional baselines from various domains and tasks as suggested by reviewers yHqh and WrYr, which further strengthens our evaluation.
  • We have included additional empirical analysis to address concerns/questions from reviewers yHqh and WrYr. Specifically, BN weights in LMLP distributions, distribution analysis of 3 additional real-world datasets, and evaluation on the WaterBirds dataset in the setting of JTT.
  • Additional clarifications are included to avoid misunderstandings similar to the reviewers'. Typos and ambiguities are corrected thanks to suggestions from reviewers yHqh and vvhf.

Feedback from the reviewers so far

Both reviewers yHqh and WrYr considered most of their concerns addressed with no additional concerns/questions and raised the scores.

Reviewer vvhf keeps the score without mentioning any unaddressed or additional concerns/questions.

We briefly sum up the reviews and comments from the reviewers so far as follows:

  1. In terms of contribution, all reviewers rated "3: good".

    • Reviewer yHqh considers "the motivation behind the analysis is sufficient and meaningful", "essential for understanding the effectiveness of debiasing methods in practical scenarios", and "adding valuable resources for further research in debiasing".
    • Reviewer WrYr considers "the authors identify significant differences between real-world datasets and benchmark datasets", and "they demonstrate that their method significantly improves upon existing methods across all settings".
    • Reviewer vvhf considers "The authors identify a key problem in existing benchmark datasets, which mislead the development of existing methods.".
  2. In terms of presentation, all reviewers rated "3: good". Reviewer WrYr considers "The paper is well-written."

  3. In terms of soundness, we carefully addressed all the reviewers' concerns and questions without any further questions or concerns from any of the reviewers after reading the rebuttal. Reviewer WrYr rated "3: good" after the rebuttal.

We are truly grateful for all the reviewers' valuable advice and the reviewers' appreciation of the work, especially the consistent acknowledgment in terms of contribution, considering our motivation "sufficient and meaningful".

Looking forward to further feedback

Despite what we and the reviewers have accomplished so far, as mentioned above, we want to check whether there are any further concerns/questions from the reviewers, especially from reviewers yHqh and vvhf, since they still hold a negative attitude towards acceptance without mentioning any unaddressed or additional concerns/questions. If there are any, we are more than willing to address further questions or concerns regarding the paper.

Again, we sincerely appreciate all the reviewers for their time and effort devoted to reviewing and discussing the paper.

AC Meta-Review

This paper examines dataset bias in practical contexts, pinpointing the shortcomings of current benchmarks. The authors put forward a detailed framework for bias assessment, introduce two bias metrics, and a debiasing technique called Debias in Destruction (DiD). After the rebuttal phase, the paper ended up with a borderline score below the acceptance criteria. I have read the comments and the paper. I believe that there are still some issues to be addressed:

  1. Limited Theoretical Scope (Reviewer vvhf): The theoretical discussion is mainly confined to binary classification tasks using a simple metric. It's unclear if the theory could be extended to multiclass scenarios, creating a substantial gap between theory and application. The authors did not provide proof or even a sketch to extend the theoretical result in the rebuttal phase.

  2. Insufficient Novelty and Generalizability (Reviewers WrYr, yHqh): Though DiD provides a new viewpoint on debiasing, its dependence on conventional methodologies (CE and GCE), like feature destruction, limits its novelty. Additionally, the paper does not adequately clarify how effective feature destruction is in debiasing compared to theoretical expectations.

  3. Limited Empirical Evidence: The experimental analysis is relatively narrow, with few and simple competitors.

Above all, I tend to recommend rejecting this paper. The authors are encouraged to resubmit their work after considering the suggestions.

Additional Comments from the Reviewer Discussion

My decision is based on the following major issues:

  1. Limited Theoretical Scope (Reviewer vvhf): The theoretical discussion is mainly confined to binary classification tasks using a simple metric. It's unclear if the theory could be extended to multiclass scenarios, creating a substantial gap between theory and application. The authors did not provide proof or even a sketch to extend the theoretical result in the rebuttal phase.

  2. Insufficient Novelty and Generalizability (Reviewers WrYr, yHqh): Though DiD provides a new viewpoint on debiasing, its dependence on conventional methodologies (CE and GCE), like feature destruction, limits its novelty. Additionally, the paper does not adequately clarify how effective feature destruction is in debiasing compared to theoretical expectations.

  3. Limited Empirical Evidence: The experimental analysis is relatively narrow, with few and simple competitors.

Final Decision

Reject