Rethinking the Bias of Foundation Model under Long-tailed Distribution
We explore how the imbalance of foundation models impacts downstream imbalanced tasks in PEFT-based methods
Abstract
Reviews and Discussion
This paper addresses the challenge of learning on long-tailed data (and the bias of foundation models). The authors define the imbalance problem in terms of parameter imbalance and data imbalance, and propose a backdoor adjustment method to address it. Experiments conducted on different long-tailed datasets demonstrate the effectiveness of the proposed method.
Questions for the Authors
- How do you formulate the incomplete semantic factors? In your experiments, is the maximum number limited to three? Are there any limitations?
- Why did you select CLIP, OpenCLIP, and MetaCLIP to approximate the incomplete semantic factor? Which models can be used and which cannot?
- The final equation sounds rather like a simple mixture of large FMs, similar to MoE, so what is the benefit of the proposed method compared to typical MoE methods? How does it compare with an MoE-style combination of CLIP, OpenCLIP, and MetaCLIP? For example, you could take the inference results from the fine-tuned CLIP, OpenCLIP, and MetaCLIP, build a voting machine from their results (without further training), and compare it with your model.
- You define the imbalance factor; how and where can it be used? Will this score change significantly across datasets even if they have the same categories?
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
N/A
Experimental Design and Analysis
Yes
Supplementary Material
Yes
Relation to Existing Literature
The authors define the imbalance problem in terms of parameter imbalance and data imbalance, and propose a method to address both.
Missing Important References
N/A
Other Strengths and Weaknesses
- The concepts of parameter imbalance and data imbalance are quite important to the community and should be well addressed.
- This paper provides sufficient details and experiments, and I can imagine the authors have invested significant time in it. However, some definitions remain unclear or not well formulated, and further revisions may be needed.
Other Comments or Suggestions
N/A
Thank you for your insightful feedback. Below, we summarize your points in quotes, followed by our corresponding replies.
"How to formulate incomplete semantic factors, and what is their maximum number?"
The incomplete semantic factor represents the semantic region in the image that the foundation model prefers (i.e., the region the model relies on to make its final prediction). For example, in the first row of Fig. 4, OpenCLIP predominantly attends to the head, whereas MetaCLIP primarily focuses on the body, which is further supported by the experimental evidence in Fig. 8. In fact, different values of the incomplete semantic factor correspond to specific semantic regions, and its possible values are infinite, not limited to 0 (head) or 1 (body). In this way, the granularity of the factor can be further refined, depending on the number of models. However, we limit the maximum number to three as a trade-off between performance and cost. More details are in Sec. E.5.
"The selection of foundation models: which models can be used and which cannot?"
As shown in Fig. 5 of the paper, the incomplete semantic factor arises from parameter imbalance, which itself stems from the imbalance in the pre-training data of the foundation model. Consequently, we chose CLIP, OpenCLIP, and MetaCLIP because they are pre-trained on distinct datasets, resulting in varying degrees of parameter imbalance and differing incomplete semantic factors.
In causal theory, Eq. 7 should cover the value space of the confounder as fully as possible for an accurate estimate of the causal effect. Since the incomplete semantic factor has infinitely many possible values, we approximate it with a finite set, and broader coverage improves the estimation accuracy. Accordingly, model selection should take the above path (from pre-training data imbalance to the incomplete semantic factor) into account: if two foundation models are trained on similar pre-training datasets, only one should be selected, as choosing both would not significantly increase the coverage of the confounder's value space.
"Comparison with MoE voting."
In Eq. 7, the prior distribution of the confounder appears as a weighting term. In large-scale datasets, it can be assumed uniform, making the final implementation resemble a voting process (like MoE) among fine-tuned CLIP models. However, unlike MoE, backdoor adjustment provides theoretical guidance for expert selection. For example, parameter imbalance causes different experts to exhibit distinct tendencies in their prediction distributions: OpenCLIP shows a "head of the object" preference, while MetaCLIP shows a distinct "body of the object" preference, as shown in Fig. 8. Guided by backdoor adjustment, if an additional foundation model that prioritizes the body is introduced, it should be combined with OpenCLIP rather than MetaCLIP, because pairing it with OpenCLIP broadens the range of confounders accounted for, whereas pairing it with MetaCLIP does not. To verify this, we introduce CLIP-CP, pre-trained on the CommonPool dataset, and combine it with OpenCLIP and MetaCLIP, respectively. The results are shown as follows:
|  | OpenCLIP | MetaCLIP |
|---|---|---|
| w/o CLIP-CP | 51.6 | 51.6 |
| +CLIP-CP | 51.9 | 51.6 |
This experiment shows that combining CLIP-CP with OpenCLIP is more effective. Additionally, following the experiments in Sec. E.6, we calculate the average confidence scores of CLIP-CP for the head images (0.2732) and body images (0.7122). These results confirm that CLIP-CP exhibits a "body of the object" preference, consistent with MetaCLIP, and demonstrate the effectiveness of backdoor adjustment in model selection.
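For clarity, the uniform-prior approximation described above can be written as follows (a sketch in generic notation, with $z$ for the confounder, rather than the paper's exact symbols):

$$
P(Y \mid \mathrm{do}(X)) \;=\; \sum_{z} P(Y \mid X, z)\,P(z) \;\approx\; \sum_{m=1}^{M} P(Y \mid X, z_m)\,P(z_m) \;\approx\; \frac{1}{M}\sum_{m=1}^{M} P(Y \mid X, z_m),
$$

where each $z_m$ is the incomplete semantic factor captured by one fine-tuned foundation model (e.g., CLIP, OpenCLIP, MetaCLIP); the first approximation replaces the infinite value space of the confounder with the $M$ selected factors, and the second applies the uniform prior $P(z_m)=1/M$.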
"Questions about the imbalance factor."
The imbalance factor (IF) measures dataset imbalance, particularly in downstream tasks. For example, the imbalance factors of ImageNet-LT, Places365-LT, and iNaturalist2018 are 256, 996, and 500, respectively. Datasets with higher IFs tend to show larger performance disparities between head and tail classes, so addressing these imbalances is crucial for better performance.
For parameter imbalance from pre-training, since the pre-training data is inaccessible, we can only measure this imbalance with respect to a specific downstream dataset. For example, we estimate the label prior of each foundation model using Eq. 3 of the paper on the Places365-LT dataset and then calculate the IF. As shown below, different foundation models exhibit varying degrees of imbalance when evaluated on the same downstream dataset.
|  | CLIP | OpenCLIP | MetaCLIP |
|---|---|---|---|
| IF | 57.50 | 63.25 | 60.20 |
This metric can vary when samples are drawn from different distributions (domains), even if the category set is the same. For example, in a food classification task, dumplings and noodles may be more common in images from China, while hamburgers are more frequent in images from America. These cultural differences can lead to varying levels of imbalance.
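As a point of reference, below is a minimal sketch of how an IF can be computed, assuming the standard definition as the ratio of the largest to the smallest class frequency (or estimated class prior); `estimate_label_prior` is a hypothetical stand-in for Eq. 3 of the paper.

```python
import numpy as np

def imbalance_factor(class_freqs) -> float:
    """Ratio of the most frequent to the least frequent class (count or prior mass)."""
    freqs = np.asarray(class_freqs, dtype=np.float64)
    return float(freqs.max() / freqs.min())

# Dataset-level IF from raw class counts, e.g. Places365-LT has
# 4980 images in its largest class and 5 in its smallest: IF = 996.
print(imbalance_factor([4980, 1200, 300, 40, 5]))  # 996.0

# Model-level IF from a label prior estimated on a downstream dataset
# (hypothetical helper standing in for Eq. 3 of the paper):
# prior = estimate_label_prior(model, downstream_dataset)
# print(imbalance_factor(prior))
```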
We will incorporate all of these discussions into the paper.
Thanks for the clarification! Most of my concerns are well addressed.
Thank you for your recognition. We are glad that our explanations have addressed your concerns and improved the quality of our work.
This paper examines the inherent biases introduced by the imbalanced training data used to pre-train foundation models, and how these biases affect downstream long-tailed learning tasks. The authors find that during fine-tuning, parameter imbalance (the imbalance in the pre-trained model parameters) plays a more critical role than data imbalance, and existing re-balancing techniques are ineffective at addressing parameter imbalance. To tackle both parameter and data imbalances, the authors propose a causal learning-based backdoor adjustment method that learns the true causal effect between input samples and labels, rather than just fitting the correlations in the data. This method achieves significant performance improvements on several long-tailed benchmark datasets compared to state-of-the-art methods.
Questions for the Authors
It seems that the tasks are all on image datasets; what do the authors think about parameter imbalance in LLMs?
Claims and Evidence
The claims made in the paper appear to be well supported by the analysis and experimental evidence provided. The authors have conducted a thorough investigation of the biases in foundation models and proposed a novel causal learning-based solution that demonstrates clear performance gains on the evaluated benchmarks.
Methods and Evaluation Criteria
Yes, the methods and evaluation criteria employed in the paper are well-suited for the problem of long-tailed learning in the context of fine-tuning foundation models, as they address the key challenges and biases identified in the introduction.
Theoretical Claims
I did not see any theoretical claims in this paper; definitions and simple derivations are not theoretical claims.
Experimental Design and Analysis
I checked all of them, and overall the experimental designs and analyses presented in the paper appear to be sound and well justified, with a thorough investigation of the proposed method's performance and its comparison to related work.
Supplementary Material
Yes, all.
Relation to Existing Literature
The paper formally defines parameter imbalance and data imbalance, providing a structured way to analyze the impact of these biases, which extends the approach introduced in OLTR (Liu et al., 2019).
The paper proposes a novel backdoor adjustment method to mitigate the negative effects of parameter imbalance and data imbalance, which is distinct from previous causal-based approaches, such as those in (Tang et al., 2020) and (Zhu et al., 2022).
The paper's exploration of the causal relationships between incomplete semantic factors, input samples, and labels contributes to the growing body of work on applying causal reasoning to address challenges in long-tailed learning.
Overall, the paper builds upon and extends the existing literature on long-tailed learning, foundation model biases, and the application of causal reasoning to address imbalance-related challenges in machine learning.
Missing Important References
There are some related papers, but they are not published, so it is fine.
Other Strengths and Weaknesses
I don’t see any major drawbacks, but I have a few concerns:
- The improvement seems minor—only 1.5%. Could this improvement be purely due to randomness?
- Figures should be self-explanatory, allowing readers to understand them without referring to the main text. Please make all figure legends clearer.
- While the intuition and motivation of the paper are strong, the methodology remains unclear. The explanation of the proposed method in Sec. 4.2 is not easy to follow. I suggest making it more straightforward, for example, by clearly outlining the entire fine-tuning pipeline or algorithm.
Other Comments or Suggestions
None.
We sincerely appreciate your thoughtful feedback. In the following, your questions are summarized in quotes, followed by our point-by-point responses.
"Could the improvement be purely due to randomness?"
To ensure that the observed performance improvement is attributable to the advantages of our method and not random variation, we conduct experiments with five different random seeds and report the mean and standard deviation for both LIFT and our method on Places365-LT and ImageNet-LT. As shown below, our method outperforms LIFT with a smaller standard deviation (reported in brackets), demonstrating that the performance gains are not due to randomness.
|  | Places365-LT | ImageNet-LT |
|---|---|---|
| LIFT | 51.4 (0.09) | 77.0 (0.04) |
| Ours | 53.03 (0.08) | 79.59 (0.03) |
In addition, we conduct a significance test (t-test) between the results of our method and LIFT on Places365-LT and ImageNet-LT. The resulting p-values are below 0.05 for both datasets, indicating that the improvement of our method is statistically significant at the 5% level.
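For reproducibility, a minimal sketch of this comparison is shown below, assuming the per-seed accuracies are available as arrays; the numbers and variable names are illustrative, not the actual runs.

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed top-1 accuracies on one dataset (five random seeds).
lift_acc = np.array([51.3, 51.4, 51.5, 51.4, 51.4])  # baseline (LIFT)
ours_acc = np.array([52.9, 53.0, 53.1, 53.1, 53.0])  # our method

print(f"LIFT: {lift_acc.mean():.2f} ({lift_acc.std():.2f})")
print(f"Ours: {ours_acc.mean():.2f} ({ours_acc.std():.2f})")

# Two-sample t-test: is the difference in mean accuracy statistically significant?
t_stat, p_value = stats.ttest_ind(ours_acc, lift_acc)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # significant at the 5% level if p < 0.05
```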
"Please make all figure legends clearer."
After conducting a comprehensive review, we have carefully refined the captions and legends of each figure. Since Figures 3, 5, and 6 previously lacked sufficient detail, we present the revised captions below.
Figure 3: The performance of different groups with (a) CE and (b) LA on the Places365-LT dataset. The three rows, from top to bottom, represent the performance of P-Many, P-Medium, and P-Few, respectively. The three columns, from left to right, represent the performance of D-Many, D-Medium, and D-Few, respectively.
Figure 5: The framework of our proposed method. (a) In the confounded setting, the incomplete semantic factor influences both the input sample and the label, creating a backdoor path and thereby introducing confounding bias. (b) After intervening on the input sample, the edges from its parent nodes are severed, eliminating the unstable backdoor path and leading to a more reliable estimation of the causal effect between the input and the label.
Figure 6: The performance of different groups with our method on Places365-LT. After applying backdoor adjustment, performance improves across different groups, particularly for samples in both P-Few and D-Few.
Due to word limits, additional revisions of figure captions are not listed here. We promise that we will carefully rewrite all figure legends and captions in the original paper.
"Clearly outlining the entire fine-tuning pipeline or algorithm."
During the fine-tuning phase, our method can be decomposed into two stages. In the first stage, we apply logit adjustment to fine-tune various foundation models (CLIP, OpenCLIP, and MetaCLIP), addressing the data imbalance present in the downstream dataset. In the second stage, we input each test sample into the fine-tuned models, obtaining output logits without additional fine-tuning. As described in Eq. 7, we then ensemble these logits using the importance weight to correct for parameter imbalance, ultimately producing the final prediction score.
We promise to give an entire algorithm pipeline in the original paper.
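To make this concrete, here is a minimal sketch of the two-stage pipeline in PyTorch-style code; the helper names are hypothetical, and details such as the PEFT modules and the exact importance weights of Eq. 7 follow the paper rather than this sketch.

```python
import torch
import torch.nn.functional as F

# ----- Stage 1: fine-tune each foundation model with logit adjustment -----
# `log_prior` is the log class prior of the downstream long-tailed dataset.
def logit_adjusted_loss(logits, targets, log_prior, tau=1.0):
    # Logit adjustment: shift logits by the (scaled) log class prior before CE.
    return F.cross_entropy(logits + tau * log_prior, targets)

def finetune(model, loader, log_prior, optimizer):
    model.train()
    for images, targets in loader:
        loss = logit_adjusted_loss(model(images), targets, log_prior)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model

# ----- Stage 2: ensemble the fine-tuned models at test time (no extra training) -----
@torch.no_grad()
def predict(models, images, weights=None):
    # `weights` play the role of the importance weights in Eq. 7;
    # a uniform prior reduces this to a simple average over the models.
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    probs = sum(w * m(images).softmax(dim=-1) for w, m in zip(weights, models))
    return probs.argmax(dim=-1)

# Usage sketch (hypothetical objects):
# models = [finetune(m, loader, log_prior, make_optimizer(m)) for m in (clip, openclip, metaclip)]
# preds = predict(models, test_images)
```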
"What do the authors think about parameter imbalance in LLMs?"
Thank you for your insightful comments! We also believe that current LLMs suffer from parameter imbalance. We analyze this issue from two perspectives: the influence of parameter imbalance without fine-tuning and with fine-tuning on downstream data.
(1) Without Fine-Tuning: Since LLMs are trained on diverse corpora, some domains (e.g., news articles, Wikipedia, and general web content) dominate the pre-training process. As a result, the model parameters may encode more precise representations of frequent patterns while underrepresenting rare or specialized knowledge. This imbalance becomes apparent when applying LLMs to highly domain-specific tasks, such as specialized scientific research or low-resource languages, where the model struggles due to insufficient parameter representation.
(2) Fine-tuning: When we use LLMs for customized tasks, such as role-playing or simulating a specific individual's tone, parameter imbalance can also have an impact. Since pre-training data is often biased toward general linguistic patterns, the model may struggle to capture highly personalized or domain-specific styles. For example, if an LLM is fine-tuned to simulate the conversational style of a historical figure, but the pre-training corpus contains limited examples of their actual patterns, the model may default to generic language structures instead of faithfully mimicking the target style, making it difficult to fully adapt to the new task.
We promise that we will add this discussion to the original paper.
This paper studies the impact of foundation models' biases—trained on disproportionately distributed data—on downstream tasks with imbalanced labels. The authors characterize two types of imbalance: (1) parameter imbalance, which has its roots in the pre-training stage, and (2) data imbalance, which exists in the downstream task. They show through experiments that different numbers of parameters affect performance more than imbalanced data. They found that standard methods to remedy this, like Logit Adjustment, are not a fix. To repair these issues, the paper suggests a new backdoor adjustment method based on causal inference. This method tries to mitigate the confounding effect of what the authors refer to as "incomplete semantic factors" (an artifact of parameter imbalance) by ensembling semantic signals from a diverse set of foundation models. Large-scale experiments on ImageNet-LT, Places365-LT, and iNaturalist2018—and ablation studies—demonstrate that the method improves accuracy, particularly for tail classes, and obtains a more balanced performance overall.
Update after rebuttal
I keep my original rating, as I agree with the authors' rebuttal.
Questions for the Authors
.
Claims and Evidence
The submission's main points are:
- The observation that downstream data imbalance matters less than parameter imbalance (which comes from pre-training) is backed by binning classes into groups (D-Many, D-Medium, D-Few for data and P-Many, P-Medium, P-Few for parameters) and comparing performance under different training strategies.
- The experiments reveal that methods such as Logit Adjustment can neutralize data imbalance but are not effective in the case of parameter imbalance.
- A novel backdoor adjustment method is presented that combines predictions from several models, and it is found to work better in the majority of tests.
Methods and Evaluation Criteria
The method of conducting this research is novel and well-suited to the problem. Applying causal inference—a backdoor adjustment, in particular—to address the two causes of imbalance is a sensible approach to enhancing existing re-balancing techniques. Evaluation on established long-tailed benchmarks (ImageNet-LT, Places365-LT, iNaturalist2018) using conventional metrics (overall accuracy and class-wise breakdowns) is appropriate and provides a solid basis for assessing the efficacy of the proposed approach.
Theoretical Claims
There are no proofs; (this is fine). The theoretical claims are straightforward intuitions about causal ML that I agree with.
Experimental Design and Analysis
The experimental design is sufficiently comprehensive. Three standard long-tailed datasets are used. Measuring performance across different class groupings (D-Many, D-Medium, D-Few) shows how well bias is reduced. The method is compared with appropriate baseline methods. Analyses of the number of incomplete semantic factors (M) help elucidate the method's effectiveness.
Supplementary Material
There is no supplementary material.
Relation to Existing Literature
The paper builds upon prior work in long-tailed learning and bias reduction. It mentions current methods that tackle data imbalance, like Logit Adjustment, re-weighting, and re-sampling procedures. However, it puts emphasis on the bias caused by foundation models. The causal method is in line with current trends in the application of causal inference for machine learning tasks (as seen in the work of Tang et al. and Zhu et al.). By leveraging concepts from foundation model adaptation (including PEFT approaches) and long-tailed learning, the paper highlights its contributions in these research areas.
A broader discussion encompassing invariant risk minimization or other causally motivated approaches to bias mitigation might better position this work within the current literature.
Missing Important References
While the paper cites many pertinent sources, it could be enhanced by citing invariant risk minimization approaches to mitigate bias in deep learning.
Other Strengths and Weaknesses
Strengths: The paper presents a novel causal framework for analyzing bias in foundation models by focusing on an unexplored dimension (parameter imbalance). Experiments are conducted on a range of benchmarks with detailed class-wise comparisons and ablation studies. The use of backdoor adjustment in the training process is a sensible way of reducing bias.
Weaknesses: I don't see any glaring weaknesses.
Other Comments or Suggestions
.
Thank you for your valuable comments and acknowledgement of our work! In the following, we summarize a series of works on Invariant Risk Minimization (IRM) to reduce bias and briefly introduce the differences between our work and existing studies.
Invariant Risk Minimization (IRM) aims to enhance out-of-distribution generalization by identifying and optimizing invariant features, thereby reducing bias in deep learning models [1, 2, 3]. In linear systems, IRM has strong theoretical guarantees and a clear connection to causal theory. Building upon IRM, numerous variants have emerged [4, 5, 6, 7], aiming to address some of IRM’s challenges, such as its failure in nonlinear tasks [8], the requirement for extensive domain information [2], and optimization difficulties in deep neural networks [9].
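For reference, many of these variants build on the IRMv1 objective of [1], which can be written as:

$$
\min_{\Phi}\; \sum_{e \in \mathcal{E}_{tr}} \Big[\, R^{e}(\Phi) \;+\; \lambda \,\big\lVert \nabla_{w \mid w=1.0}\, R^{e}(w \cdot \Phi) \big\rVert^{2} \,\Big],
$$

where $R^{e}$ is the empirical risk in training environment $e$, $\Phi$ is the shared predictor, and the gradient penalty with respect to the fixed dummy classifier $w=1.0$ encourages $\Phi$ to be simultaneously optimal across environments.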
While IRM aims to identify and optimize invariant features across domains to reduce bias, our method adopts a causal learning framework that identifies and mitigates biases introduced by spurious correlations, which are induced by the backdoor path through the confounder. Rather than assuming the existence of invariant features, we treat the incomplete semantic factor as a confounder and apply a backdoor adjustment method to learn the true causal effect, offering a more flexible solution to both parameter and data imbalance.
[1] Arjovsky, Martin, et al. "Invariant risk minimization." arXiv preprint arXiv:1907.02893 (2019).
[2] Lin, Yong, et al. "ZIN: When and how to learn invariance without environment partition?." Advances in Neural Information Processing Systems 35 (2022): 24529-24542.
[3] Deng, Yihe, et al. "Robust learning with progressive data expansion against spurious correlation." Advances in Neural Information Processing Systems 36 (2023): 1390-1402.
[4] Ahuja, Kartik, et al. "Invariant risk minimization games." International Conference on Machine Learning. PMLR, 2020.
[5] Krueger, David, et al. "Out-of-distribution generalization via risk extrapolation (REx)." International Conference on Machine Learning. PMLR, 2021.
[6] Robey, Alexander, George J. Pappas, and Hamed Hassani. "Model-based domain generalization." Advances in Neural Information Processing Systems 34 (2021): 20210-20229.
[7] Ahuja, Kartik, et al. "Invariance principle meets information bottleneck for out-of-distribution generalization." Advances in Neural Information Processing Systems 34 (2021): 3438-3450.
[8] Rosenfeld, Elan, Pradeep Ravikumar, and Andrej Risteski. "The risks of invariant risk minimization." International Conference on Learning Representations (2021).
[9] Chen, Yongqiang, et al. "Pareto invariant risk minimization: Towards mitigating the optimization dilemma in out-of-distribution generalization." The Eleventh International Conference on Learning Representations (2023).
This submission rethinks the inherent bias of foundation models on downstream tasks. Specifically, the authors examine the bias from foundation models in terms of parameter imbalance and data imbalance, and find that the latter can be alleviated by conventional methods while the former cannot. Based on this observation, the authors propose a new method from the perspective of causal learning, which yields about a 1.67% improvement on each dataset.
Three reviewers reviewed this submission and gave ratings of 4, 3, and 4. After the rebuttal, one reviewer acknowledged the authors' substantial effort, while the other two reviewers remained silent. After the AC carefully checked the reviewers' comments and the authors' feedback on improving the clarity and adding more experiments, it appears that most questions about this submission have been addressed. Therefore, based on the reviewers' suggestions and the authors' rebuttal, the AC tends to recommend acceptance.