LT-Soups: Bridging Head and Tail Classes via Subsampled Model Soups
LT-Soups merges CLIP models fine-tuned on balanced subsets and retrains the classifier on the full dataset, achieving SOTA head/tail accuracy trade-offs across five benchmarks.
Abstract
Reviews and Discussion
The paper introduces LT-Soups, a two-stage model soups framework designed to address long-tailed (LT) class imbalance in machine learning. It focuses on the problem where head classes dominate and tail classes are severely underrepresented, a scenario common in real-world datasets. Previous work on parameter-efficient fine-tuning (PEFT) methods such as LoRA and AdaptFormer shows improved tail-class performance, but often at the cost of head-class accuracy. The authors propose LT-Soups, which averts this trade-off by combining model soups with tailored fine-tuning techniques. The framework works in two stages: first, full models are fine-tuned on subsets with progressively increasing imbalance ratios and merged via weight averaging into a single feature extractor; second, the merged backbone is frozen and only the classifier is retrained on the full dataset to recover head-class performance.
Strengths and Weaknesses
Strengths:
- Originality: The introduction of LT-Soups as a two-stage framework that balances performance across both tail and head classes is novel. It builds on the understanding that class imbalance is a complex, multi-dimensional problem and uses a dual-axis framework for imbalance characterization.
- Robustness: The method shows significant improvement across both synthetic and real-world long-tailed datasets, including scenarios with varying imbalance ratios and head-tail class distributions.
- Practical Application: The proposed method shows better adaptability and performance stability across diverse datasets, making it applicable to real-world tasks, especially those with imbalanced data distributions.
Weaknesses:
- Complexity: While the approach is robust, it involves two stages of fine-tuning, which could increase computational costs and complexity compared to simpler methods like PEFT or traditional model soups.
- Limited Statistical Analysis: The paper lacks error bars or statistical significance tests in the experimental results, which could make it difficult to fully assess the variability and robustness of the reported results.
- Assumptions: The reliance on the head-tail ratio and imbalance ratio, while insightful, might oversimplify the complexity of real-world class distributions. A more flexible or generalized model could be explored to account for other potential distributional shifts.
Questions
Real-World Generalization: How does LT-Soups perform when tested on highly skewed datasets or under conditions with noisy labels or missing data? Would the approach still provide strong results?
Statistical Robustness: The paper lacks statistical significance measures such as error bars. Could you provide additional experiments or statistical tests to support the stability of your findings across multiple runs or data splits?
Alternative Subsampling Strategies: Have you considered experimenting with more advanced or adaptive subsampling techniques, such as those based on active learning or dynamic class balancing during training? Could they offer better performance than the current progressive subsampling strategy?
Limitations
yes
Justification for Final Rating
The authors’ response addressed my concerns, and I will maintain my positive rating in the final evaluation.
Formatting Issues
n/a
We thank the reviewer for their positive feedback on the originality, robustness, and practical applicability of our method. Below, we address their concerns in detail.
Statistical Robustness: The paper lacks statistical significance measures such as error bars. Could you provide additional experiments or statistical tests to support the stability of your findings across multiple runs or data splits?
Please note that we followed the experimental protocol established in prior benchmarks [1,2]. While it is not feasible to run statistical tests across all datasets within the rebuttal timeline, we report results on TinyImageNet-LT using three different random seeds in the table below. As shown, the performance variance is negligible, supporting the robustness of our method. If the reviewers believe it would add value, we would be happy to include statistical tests on the remaining datasets in the final version of the paper.
| Method | All | Head | Tail |
|---|---|---|---|
| PEFT | 77.0 ± 0.31 | 82.7 ± 0.44 | 73.8 ± 0.38 |
| Model Soups | 77.5 ± 0.04 | 86.0 ± 0.04 | 72.9 ± 0.03 |
| LT-Soups | 78.5 ± 0.16 | 85.4 ± 0.4 | 74.6 ± 0.15 |
[1] Liu et al. CVPR 2019. [2] Shi et al. ICML 2024.
Real-World Generalization: How does LT-Soups perform when tested on highly skewed datasets or under conditions with noisy labels or missing data? Would the approach still provide strong results?
Please note that two of the benchmark datasets (iNaturalist2018 and NIH-CXR-LT) are real-world datasets with naturally long-tailed distributions, and they are characterized by markedly different values of the imbalance ratio (ρ) and head-to-tail ratio (η). While incorporating additional distribution shifts, such as noisy labels or missing data, is a promising direction for future work, we follow prior work in the long-tailed learning literature [1,2] and focus solely on the imbalance aspect.
[1] Liu et al. CVPR 2019. [2] Shi et al. ICML 2024.
Alternative Subsampling Strategies: Have you considered experimenting with more advanced or adaptive subsampling techniques, such as those based on active learning or dynamic class balancing during training? Could they offer better performance than the current progressive subsampling strategy?
Thank you for the suggestion. We agree that more advanced subsampling strategies, such as selecting samples based on information content (e.g., based on entropy), could potentially improve performance and represent a promising direction for future work. However, in this study, we prioritize simplicity to keep our core message focused and interpretable. Notably, even our basic subsampling strategy already outperforms existing baselines, demonstrating the effectiveness of our overall approach.
This paper addresses the trade-off between head- and tail-class performance when fine-tuning large pre-trained models on long-tailed (LT) datasets. The authors begin by introducing a new metric, the head-tail ratio (η), which they use alongside the standard imbalance ratio (ρ) to more comprehensively characterize the structure of long-tailed distributions. Through experiments, they show that Parameter-Efficient Fine-Tuning (PEFT) methods excel in tail-heavy scenarios, while full fine-tuning is more advantageous when head classes are dominant.
To resolve this issue, the authors propose LT-Soups, a two-stage model merging framework. In the first stage, instead of training models on the entire imbalanced dataset, LT-Soups creates a series of data subsets with progressively increasing imbalance ratios. It fine-tunes a full model on each subset and then merges the weights of these "specialist" models via recursive weighted averaging to form a unified feature extractor. In the second stage, to recover head-class information potentially lost during subsampling, the framework freezes the merged backbone and retrains only the classifier layer on the full training set. Experimental results across several synthetic and real-world long-tailed datasets demonstrate that LT-Soups achieves a superior performance balance compared to PEFT and traditional model soups across diverse long-tailed regimes.
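To make the two-stage recipe concrete, below is a minimal conceptual sketch in Python. It is not the authors' code: `make_subset`, `finetune`, and `retrain_head` are hypothetical placeholder functions, and the subset schedule and merging coefficient are illustrative assumptions rather than the paper's exact settings.

```python
import copy

# Hypothetical helpers (placeholders, not the paper's implementation):
#   make_subset(data, imbalance_ratio) -> subsample capped at the target ratio
#   finetune(model, data)              -> a fully fine-tuned copy of the model
#   retrain_head(model, data)          -> linear classifier fit with a logit-adjusted loss
def lt_soups_sketch(backbone, train_set, rhos=(1, 2, 4, 8, 16, 32, 64), alpha=0.7, m=2):
    soup = copy.deepcopy(backbone).state_dict()  # running average, seeded with pre-trained weights
    for rho in rhos:  # Stage 1: subsets with progressively increasing imbalance
        subset = make_subset(train_set, imbalance_ratio=rho)
        runs = [finetune(copy.deepcopy(backbone), subset).state_dict() for _ in range(m)]
        avg = {k: sum(r[k] for r in runs) / m for k in runs[0]}           # uniform soup within one subset
        soup = {k: alpha * soup[k] + (1 - alpha) * avg[k] for k in soup}  # recursive merge across subsets
    merged = copy.deepcopy(backbone)
    merged.load_state_dict(soup)
    for p in merged.parameters():  # Stage 2: freeze the merged feature extractor
        p.requires_grad = False
    head = retrain_head(merged, train_set)  # retrain only the classifier on the full dataset
    return merged, head
```

The sketch is only meant to capture the structure described above: specialists trained on increasingly imbalanced subsets are folded into a single backbone, and the classifier is then re-fit on the full data.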
Strengths and Weaknesses
Strengths
- The paper's analysis of the head-tail performance trade-off is insightful. The introduction of the head-tail ratio (η) as a new analytical dimension provides a more fine-grained framework for understanding the complexities of long-tailed recognition.
- The paper's experimental section is robust. The authors validate their method on a wide range of standard benchmarks and use well-designed ablation studies to demonstrate the contribution of each component. The visualizations effectively support the core claim that different methods are optimal for different imbalance structures.
Weaknesses
- While LT-Soups is an effective framework, its core ideas are more of an intelligent combination of existing techniques—Model Soups, data subsampling, and two-stage training—rather than a fundamentally new concept. The innovation lies in the specific recipe, which, while successful, may not meet the high bar for conceptual novelty at a top-tier venue.
- The framework requires training N×M models in the first stage. Although parallelizable and trained on smaller subsets, the total computational budget is significantly higher than for single-run methods like PEFT or Full-FT. A direct comparison of total GPU hours is missing, making it difficult to assess the practical cost-benefit trade-off.
- The recursive, ordered merging strategy is shown empirically to be superior to alternatives. However, the paper lacks a deep, intuitive explanation for why this curriculum-like fusion (from balanced to imbalanced) is the optimal approach. It remains unclear if this is a robust principle or an empirical finding specific to this setup.
- Lack of discussion on recent literature. For ensemble long-tail recognition: Self-supervised aggregation of diverse experts for test-agnostic long-tailed recognition, MDCS: More diverse experts with consistency self-distillation for long-tailed recognition. For recent literature: LTRL: Boosting Long-tail Recognition via Reflective Learning, DiffuLT: Diffusion for Long-tail Recognition Without External Knowledge, Harnessing Hierarchical Label Distribution Variations in Test Agnostic Long-tail Recognition.
Questions
None
Limitations
yes
Justification for Final Rating
I thank the authors for their detailed and timely response. After reviewing their feedback, I believe that most of my concerns have been adequately addressed. I have therefore decided to raise my original rating, leaning toward acceptance.
Formatting Issues
None
We thank the reviewer for their thoughtful feedback and encouraging comments on our analysis, rigorous methodology, and well-designed ablation studies. We address their questions below:
W1. While LT-Soups is an effective framework, its core ideas are more of an intelligent combination of existing techniques.
We emphasize that our contribution goes beyond simply proposing a final model. Specifically:
- A new lens for analyzing long-tailed learning: We introduce the head-to-tail ratio (η) as a complementary axis to the standard imbalance ratio (ρ), offering a more comprehensive view of class imbalance. We argue that effective long-tailed methods must perform well across the full (ρ, η) spectrum—a perspective that enhances understanding of real-world imbalance. Through this lens, we examine state-of-the-art methods such as a very strong competitor, LIFT [1], which advocates parameter-efficient fine-tuning (PEFT) for long-tailed tasks, and Model Soups [2], which uniformly averages fully fine-tuned models trained on the full imbalanced dataset. This analysis reveals performance trade-offs that are often overlooked in conventional evaluations.
- LT-Soups (a principled adaptation of Model Soups): Building on these insights, we propose LT-Soups, which adapts the strengths of Model Soups to long-tailed settings. Model Soups boosts head-class performance but significantly degrades tail-class accuracy due to overexposure to head-heavy distributions. PEFT, on the other hand, limits parameter updates and struggles to adapt to head classes. LT-Soups addresses both issues by fully fine-tuning models on subsets with progressively increasing imbalance and combining them via recursive weight averaging. This balances head-class adaptation with tail-class generalization.
- Comprehensive evaluation: Across six benchmark datasets, LT-Soups outperforms PEFT in overall accuracy (by 1.3 to 2.3 percentage points) on five of them, while achieving comparable tail-class performance. It also consistently surpasses Model Soups in overall accuracy (by 1.3 to 2.3 points) and yields substantial gains in tail-class accuracy (by 2.6 to 10.9 points), with only a modest drop in head-class performance (up to 3.2 points).
[1] Shi et al. ICML 2024. [2] Wortsman et al. ICML 2022.
W2. A direct comparison of total GPU hours is missing, making it difficult to assess the practical cost-benefit trade-off.
The full computational analysis, including total wall-clock time for the parallel version of LT-Soups, is already provided in Appendix B. In response to the reviewers’ request, we also report the total GPU hours, assuming access to only a single GPU with modest memory.
To recap, LT-Soups consists of two stages: Stage 1 trains a series of specialist models on subsampled datasets with varying imbalance ratios. Stage 2 involves retraining only a linear classifier on the full dataset, which is computationally lightweight. Both stages use the same number of epochs.
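As a rough illustration of how such subsamples could be constructed (our assumption about the mechanics, not necessarily the paper's exact recipe), one option is to cap every class at `n_min * rho` examples, so that `rho = 1` yields a balanced subset and larger `rho` admits progressively more head-class data:

```python
import random
from collections import defaultdict

def subsample(samples, labels, rho, seed=0):
    """Cap each class at n_min * rho examples, where n_min is the size of the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    n_min = min(len(xs) for xs in by_class.values())
    subset = []
    for y, xs in by_class.items():
        rng.shuffle(xs)
        subset.extend((x, y) for x in xs[: n_min * rho])  # head classes are truncated, tail classes kept
    return subset
```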
Tables 1 and 2 report the Stage 1 fine-tuning time per subsample (each column corresponds to one subsample) and the total GPU hours when LT-Soups is trained on a single GPU for the ImageNet-LT and NIH-CXR-LT datasets. As shown, the computational cost of LT-Soups depends primarily on the degree of imbalance in the dataset. For example, on ImageNet-LT (ρ=256), following our subsampling strategy, the total absolute GPU time is ~8 hours. LT-Soups is costly in this case because our largest subset (ρ=64) includes 80% of the entire training data. In contrast, on NIH-CXR-LT, the largest subset (ρ=256) includes only 24% of the data, resulting in substantial savings, up to 18x lower GPU hours compared to Model Soups of the same size.
Table 1: Stage 1 computational cost per ImageNet-LT subset for LT-Soups.
| Subset (ρ) | 1 | 2 | 4 | 8 | 16 | 32 | 64 | Total |
|---|---|---|---|---|---|---|---|---|
| GPU hours (H:M:S) | 0:12:19 | 0:18:44 | 0:32:00 | 0:54:00 | 1:26:32 | 2:02:00 | 2:31:09 | 7:56:50 |
Table 2: Stage 1 computational cost per CXR-LT subset for LT-Soups.
| Subset (ρ) | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| GPU hours (H:M:S) | 0:01:01 | 0:01:05 | 0:01:21 | 0:02:38 | 0:04:06 | 0:06:16 | 0:10:26 | 0:16:46 | 0:25:04 | 1:08:12 |
We conclude by referencing Table 9 in the Appendix, which compares the total computational cost of LT-Soups and Model Soups (both with parallelization) against other baselines on ImageNet-LT and NIH-CXR-LT. Notably, although LT-Soups is a full-rank method, it converges within 10 epochs on CXR-LT, while LoRA—despite its parameter efficiency—requires 50 epochs, resulting in longer wall-clock time and higher total GPU hours. Thus, in response to the reviewer’s concern, the claim that “the total computational budget is significantly higher than single-run methods” does not always hold; LT-Soups often reduces computational cost through subsampling, with the budget depending on the chosen subsampling schedule (Table 8 in the Appendix).
| Method | Wall-clock time (H:M:S) | Epochs |
|---|---|---|
| ImageNet-LT | ||
| Full-FT | 1:37:56 | 10 |
| Model Soups | 1:37:56 | 10 |
| LoRA (rank=64) | 1:25:33 | 10 |
| LT-Soups | 1:45:38 | 10 |
| CXR-LT | ||
| Full-FT | 0:53:43 | 10 |
| Model Soups | 0:53:43 | 10 |
| LoRA (rank=64) | 2:14:32 | 50 |
| LT-Soups | 0:32:17 | 10 |
W3. The paper lacks a deep, intuitive explanation for why this curriculum-like fusion (from balanced to imbalanced) is the optimal approach.
We believe our recursive weight averaging (WA) is a principled approach, as it enables an effective trade-off between stronger adaptation—potentially biased toward head classes—and more balanced generalization. It can be interpreted as an exponential moving average (EMA) over fine-tuned models sorted by increasing imbalance severity, with a tunable coefficient α that adjusts the influence of more balanced (but smaller) versus less balanced (but larger) subsets. In contrast, uniform WA applies a simple arithmetic mean, giving equal weight to all models regardless of their imbalance level.
In all of our experiments in the paper, we use only two values for α: 0.3 and 0.7, corresponding to high and low adaptation needs, respectively. Intuitively, when the target dataset is close to the pre-training weights, the value of α becomes less important, as even small datasets are enough for adaptation. However, when the shift becomes larger, subsets with more data (albeit biased towards head classes) become crucial.
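To make the contrast with uniform averaging concrete, here is a minimal sketch of the two schemes over a list of checkpoints (function names and the treatment of the initialization are our assumptions, not the paper's code):

```python
from typing import Dict, List
import torch

def recursive_soup(ckpts: List[Dict[str, torch.Tensor]], alpha: float) -> Dict[str, torch.Tensor]:
    """EMA-style merge. ckpts[0] is the starting point (e.g., pre-trained weights),
    followed by fine-tuned checkpoints ordered by increasing subset imbalance ratio."""
    soup = {k: v.clone() for k, v in ckpts[0].items()}
    for sd in ckpts[1:]:
        # alpha keeps the running (more balanced) average; (1 - alpha) admits the newer,
        # larger and more head-heavy checkpoint
        soup = {k: alpha * soup[k] + (1 - alpha) * sd[k] for k in soup}
    return soup

def uniform_soup(ckpts: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    """Plain arithmetic mean over all checkpoints (standard Model Soups)."""
    return {k: sum(sd[k] for sd in ckpts) / len(ckpts) for k in ckpts[0]}
```

In the uniform case every checkpoint contributes equally; in the recursive case a checkpoint's contribution decays geometrically the earlier it appears in the sequence, which is what lets the coefficient trade off balanced and data-rich subsets.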
The table below confirms our hypotheses. In particular, we compare recursive WA and uniform WA across two datasets with different degrees of similarity to the CLIP-pretrained weights (according to their zero-shot performance). On TinyImageNet-LT, which is already well-aligned with CLIP-pretrained features, there is little to no difference between the two averaging schemes. However, for datasets that require significant adaptation [1], such as iNaturalist2018, recursive WA yields clear benefits by leveraging information from more data-rich subsets.
Table 1: Performance comparison under uniform and recursive WA across two datasets.
| Method | TinyImageNet-LT | | | | iNat2018 | | | |
|---|---|---|---|---|---|---|---|---|
| | All | Many | Med. | Few | All | Many | Med. | Few |
| Uniform | 78.5 | 83.4 | 78.4 | 72.9 | 74.7 | 67.4 | 75.8 | 75.3 |
| Ours | 78.6 | 85.0 | 78.3 | 71.5 | 78.2 | 76.7 | 78.5 | 78.2 |
W4. Lack of discussion on recent literature.
Thank you for the suggestions. Given that our focus was on using pre-trained CLIP models for LT recognition, we have restricted the comparison to the recent literature that uses CLIP for LT recognition. In response to the reviewer’s request, we have expanded the related work section and now provide a brief comparison of LT-Soups with the methods mentioned.
SADE [1], Mdcs [2], and DirMixE [5] are Mixture-of-Experts (MoE) methods that aggregate diverse experts trained with different logit adjustment (LA), targeting distributions like uniform, long-tail, and inverse long-tail. Unlike these methods, which require deploying all experts at inference, LT-Soups uses weight averaging to collapse models into a single, inference-efficient solution. Moreover, as noted in [6], fully fine-tuning foundation models like CLIP with LA loss alone is insufficient, as it can lead to inconsistent class-conditional distributions, particularly for tail classes.
Reflective Learning [3] promotes consistency across training iterations by minimizing KL divergence between predictions and soft labels induced from feature similarity. In this light, the EMA mechanism in LT-Soups can be viewed as a lightweight form of Reflective Learning or self-distillation [7].
[1] Zhang et al., NeurIPS 2022. [2] Zhao et al., ICCV 2023. [3] Zhao et al., ECCV 2024. [5] Yang et al., ICML 2024. [6] Shi et al., ICML 2024. [7] Allen-Zhu et al., ICLR 2023.
I thank the authors for their detailed and timely response. After reviewing their feedback, I believe that most of my concerns have been adequately addressed. I have therefore decided to raise my original rating, leaning toward acceptance.
The authors propose LT-Soups, a two-stage procedure, where:
- Progressive subsampling + model-averaging. Multiple CLIP copies are fully fine-tuned (in parallel) on subsets whose imbalance ratios grow geometrically, then exponentially averaged to couple "tail-friendly" and "head-friendly" specialists.
- Classifier re-tuning. Only the final linear head is re-trained on the full data with logit adjustment to regain head-class sharpness.
Strengths and Weaknesses
Strengths:
- it's a very simple approach
- it's efficient since it can be parallelized
- consistently outperforms PEFT and classical model-soup baselines on overall, head and tail accuracies
- i really admire that the authors released code in the supplementary
Weakness:
- although the method works well, it's currently hard for me to pinpoint exactly where the gains are coming from -- see the questions section; some ablations would help disentangle the multiple components of the method
Questions
- i'd like to see some small ablations if the authors can provide these :)
- Performing PEFT, followed by the same final-layer-only tuning with logit adjustment that is performed in LT-Soups -- would this help close the gap in head accuracy for PEFT?
- Performing Model Soups as done in Wortsman et al. [2022], followed by the same final-layer-only tuning
The above two ablations would tell whether it's the new model soup strategy that is driving the gains or the final logit adjustment.
- perhaps one more thing to test would be:
- take a subsampled dataset that has the least imbalance, perform vanilla model soups (Wortsman et al. [2022]) on it or PEFT, followed by final-layer-only tuning -- this ablation would tell us whether one needs multiple models at different levels of imbalance in the dataset, or whether the least imbalanced dataset is enough
let me know if these make sense.
Limitations
yes
Justification for Final Rating
i'd say the paper is borderline accept.
Reasons for not higher score:
- the novelty is a little limited, in the sense that the method seems like a mix-and-match of existing techniques -- this is not a bad thing per se. But mix-and-match in a very specific way adds to the complexity of the method. Overall the method is still simple, but it's not "aesthetic".
Reasons for not lower score:
- authors provide extensive experiments
- authors did perform the clarifying experiments during rebuttal which helped in understanding
- authors released their code during submission
Formatting Issues
N/A
We thank the reviewer for their insightful feedback and for recognizing the simplicity and effectiveness of our work. Our responses are below:
Q1. Performing PEFT and Model Soups, followed by the same final-layer-only tuning with logit adjustment that is performed in LT-Soups -- would this help close the gap in head accuracy for PEFT/Model Soups?
We found that additional final-layer tuning with logit adjustment on PEFT and Model Soups has little to no effect. The table below summarizes the results on TinyImageNet-LT. We hypothesize that, unlike these baselines, LT-Soups does not fully exploit the entire training set, due to the downweighting effect introduced by weight averaging. Consequently, fine-tuning the final layer helps LT-Soups recover head-class sharpness and improves overall performance.
Table 1: Comparison of baselines with and without Classifier Re-training
| Method | All | Head | Tail |
|---|---|---|---|
| PEFT | 77.1 | 83.0 | 73.9 |
| PEFT + Classifier re-training | 77.0 | 83.0 | 73.8 |
| Model Soups | 77.6 | 85.9 | 73.0 |
| Model Soups + Classifier re-training | 77.6 | 85.5 | 73.4 |
| LT-Soups Stage 1 (without Classifier re-training) | 78.1 | 84.9 | 74.5 |
| LT-Soups | 78.6 | 85.0 | 75.2 |
Q2. Take a subsampled dataset that has the least imbalance, perform vanilla model soups on it or PEFT, followed by final-layer-only tuning -- this ablation would tell us whether one needs multiple models at different levels of imbalance in the dataset, or whether the least imbalanced dataset is enough.
Table 3 in the main paper has already addressed this question in the context of Model Soups. Per your request, we extend the same experiment to the PEFT baseline, with full results shown in the table below.
Specifically, we compare LT-Soups with Model Soups and PEFT baselines that follow the same two-stage framework as LT-Soups. The only difference lies in the first stage: PEFT trains a single model on a subset with a fixed imbalance ratio, while Model Soups averages 16 models, all trained on subsets sharing the same imbalance ratio.
Results show that performance varies significantly depending on the imbalance ratio used. For instance, Model Soups with ρ=8 yields the highest tail accuracy (75.0), while using the full dataset (ρ=100) results in the highest head accuracy (85.5). A similar trade-off is observed with PEFT. However, under any single imbalance setting, both baselines fall short of LT-Soups, which achieves 78.6 (overall), 85.0 (head), and 75.2 (tail).
Rather than optimizing for a single point on the imbalance spectrum, LT-Soups averages across subsets with varying imbalance levels. This enables it to integrate the strengths of both low-ρ and high-ρ models, resulting in a more balanced head–tail trade-off overall.
| Imbalance ratio (ρ) | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 100 |
|---|---|---|---|---|---|---|---|---|
| Model Soups | ||||||||
| All | 71.7 | 75.9 | 76.0 | 77.2 | 77.2 | 77.3 | 77.9 | 77.6 |
| Head | 74.6 | 78.6 | 78.7 | 81.0 | 82.8 | 84.7 | 85.5 | 85.5 |
| Tail | 70.1 | 74.4 | 74.6 | 75.0 | 74.1 | 73.3 | 73.7 | 73.4 |
| PEFT | ||||||||
| All | 73.9 | 74.3 | 74.8 | 76.2 | 76.4 | 77.1 | 77.0 | 77.0 |
| Head | 75.9 | 75.8 | 77.0 | 78.2 | 80.4 | 81.8 | 81.9 | 83.0 |
| Tail | 72.8 | 73.5 | 73.6 | 75.1 | 74.2 | 74.5 | 74.4 | 73.8 |
Thanks for providing these!
I’d say add a discussion in the paper about this — it will help clarify things to the reader about what is important in making LT soups work.
I will maintain my positive score, thanks :)
Thank you for the suggestion; we will include it in the final version. Thanks for appreciating our work.
As I notified the AC in June 2nd, I deeply apologize but due to an unexpected change in my professional schedule, I will not be able to perform my reviews this year.
Strengths and Weaknesses
As I notified the AC in June 2nd, I deeply apologize but due to an unexpected change in my professional schedule, I will not be able to perform my reviews this year.
Questions
As I notified the AC in June 2nd, I deeply apologize but due to an unexpected change in my professional schedule, I will not be able to perform my reviews this year.
Limitations
As I notified the AC in June 2nd, I deeply apologize but due to an unexpected change in my professional schedule, I will not be able to perform my reviews this year.
Justification for Final Rating
Please ignore my rating. I still don't understand why I have to go through this even though I notified early on that I would not be able to perform my reviews this year.
Formatting Issues
As I notified the AC in June 2nd, I deeply apologize but due to an unexpected change in my professional schedule, I will not be able to perform my reviews this year.
In this paper, the authors conduct experiments to show that previous methods for the long-tail problem cannot achieve strong performance on both head and tail classes. Therefore, the authors propose to adapt model soups to this problem by training different models on sub-datasets with different class imbalance ratios and averaging their weights to obtain better representations.
Strengths and Weaknesses
Strength:
- The authors identify the problem of head-tail imbalance ratio which is neglected by previous works.
- The motivation of the paper is stated clearly by conducting preliminary experiments.
Weakness:
- The writing is not polished enough. For example, throughout the paper, how the Logit Adjustment is performed is not mathematically formulated. Although it has been proposed in previous works, rephrasing it in the current paper would help others understand it better. The caption of Figure 5 is neglected as well. The momentum coefficient in line 210 is not defined either. If it is the same as above, then the last weight of the averaged model is almost the same as the first model, which would make the proposed method meaningless.
- It seems that the proposed method cannot solve the targeted problem. From the experiments, the proposed method improves the performance of model soups on tail classes, but the performance on head classes still degrades, which is the very problem the authors stated they would solve.
- The experimental setting is not comprehensive enough. For example, one of the two settings (imbalance ratio or head-tail ratio) in Tables 1 and 2 should be kept fixed while the other changes. Moreover, the results illustrated in Figure 2 are not convincing enough because they are averaged over the head-tail ratio, which does not show the detailed results when the head-tail ratio is high or low.
- Can you show how the performance changes when the total number of models changes?
Questions
Please refer to the weakness part.
Limitations
Please refer to the weakness part.
Justification for Final Rating
The authors' response solves most of my problems. Considering the simplicity and effectiveness of the proposed method, I recommend acceptance.
Formatting Issues
There is no paper formatting problem.
We thank the reviewer for their feedback and address their concerns in detail below:
W2. It seems that the proposed method cannot solve the targeted problem. From the experiments, the proposed methods can improve the performance of model soup on tail classes, but the performance on head classes still degrades which is the problem that the authors stated to solve.
It is well established in the class imbalance literature [1,2,3,4] that, given an imbalanced dataset, the goal is to maximize balanced accuracy (see L122–124) on a uniformly distributed test set. This, in turn, requires achieving a well-balanced performance across both head and tail classes.
While Model Soups favours head-class performance by fully fine-tuning on the original imbalanced distribution, this improvement comes at the cost of reduced accuracy on tail classes. On the other hand, PEFT limits weight updates to a small subset of parameters, which negatively impacts head-class performance. In contrast, LT-Soups fine-tunes and averages models trained on distributions with progressively increasing imbalance ratios. This strategy not only allows adaptation to head classes but also preserves the balanced representations needed for tail-class generalization. Across all six benchmarked datasets, LT-Soups consistently outperforms Model Soups in overall accuracy (by 1.3 to 2.3 percentage points) and shows substantial improvements in tail-class accuracy (by 2.6 to 10.9 points), with only a modest decrease in head-class performance (up to 3.2 points).
[1] Cao et al. Neurips 2019. [2] Ren et al. Neurips 2020. [3] Menon et al. ICLR 2020. [4] Wang et al. ICLR 2020.
W3. The experimental setting is not comprehensive enough. For example, one of the setting of imbalance ratio or head-tail ratio in Table 1 and 2 should be kept when the other one is changing.
Please note that we followed the experimental protocol established in prior benchmarks [1,2]. In particular, we conduct a comprehensive analysis across six datasets: four synthetic long-tailed benchmarks (CIFAR100-LT, ImageNet-LT, TinyImageNet-LT, and Places-LT) and two real-world long-tailed datasets (NIH-CXR-LT and iNaturalist2018). Each dataset features a distinct combination of imbalance ratio and head-to-tail ratio, providing a diverse representation of imbalance scenarios.
Furthermore, in Section 3.2 of the paper, we systematically vary the imbalance ratio (ρ) across three levels and the head-to-tail ratio (η) across eleven values by synthetically modifying the CIFAR-100 distribution. Each configuration is evaluated across four baselines, including LT-Soups, resulting in a total of 396 training runs. While we expect the trends observed on CIFAR-100 to generalize to other datasets such as TinyImageNet-LT and those in Tables 1 and 2, we are happy to include additional results in the Appendix if the reviewer believes it would strengthen our findings.
[1] Liu et al. CVPR 2019. [2] Shi et al. ICML 2024.
W4. Can you show how the performance changes when the total number of models changes?
Increasing the total number of models consistently improves overall performance, as already presented in Figure 5b and L296 of the main paper and in Table 7 of the Appendix. For the reviewer’s convenience, we summarize these results in the two tables below.
To recap, the total number of models used in LT-Soups is governed by two key hyperparameters: N, the number of subsamples with varying imbalance ratios, and M, the number of models trained per subsample.
Table 1 explores the effect of increasing the total number of models by varying N, from the most balanced case (N=1) to including the full dataset (N=8), while keeping the number of models per subsample fixed at M=2. Table 2 examines the impact of increasing M while holding N fixed. Overall, we observe consistent improvements in performance as the total number of models increases. To keep the experiments manageable, we fix M=2 in the main benchmarks.
Table 1: Effect of increasing the total number of models on performance by varying the number of subsamples (N)
| | N = 1 | N = 2 | N = 3 | N = 4 | N = 5 | N = 6 | N = 7 | N = 8 |
|---|---|---|---|---|---|---|---|---|
| Total Models | 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16 |
| All | 71.7 | 74.0 | 75.9 | 77.9 | 77.6 | 78.0 | 78.1 | 78.6 |
| Head | 74.6 | 76.6 | 78.5 | 81.2 | 82.0 | 83.0 | 84.5 | 85.0 |
| Tail | 70.1 | 72.5 | 74.5 | 76.1 | 75.2 | 75.4 | 74.7 | 75.2 |
Table 2: Effect of increasing the total number of models on performance by varying the number of models trained per subsample (M)
| | M = 1 | M = 2 | M = 12 |
|---|---|---|---|
| Total Models | 8 | 16 | 96 |
| All | 78.2 | 78.6 | 78.8 |
| Head | 84.8 | 85.0 | 85.5 |
| Tail | 74.6 | 75.2 | 75.5 |
W1.1. The results illustrated in Figure 2 are not convincing enough because they are averaged over the head-tail ratio, which does not show the detailed results when the head-tail ratio is high or low.
We would like to clarify that the analysis requested by the reviewer corresponds to Figure 1 of the main paper. This figure reports the performance of four methods, including LT-Soups, on synthetic CIFAR100 across three levels of imbalance ratio (ρ) and eleven head-to-tail ratio (η) values. Figure 2 is derived from the same data by marginalizing over both ρ and η to emphasize a central insight of our work: an effective LT method must perform well across the full spectrum of imbalance conditions.
W1.2. Throughout the paper, how the Logit Adjustment is performed is not mathematically formulated. Although it has been proposed in previous works, rephasing it in the current paper helps others understand it better.
We appreciate your suggestion and will revise the description of the Logit Adjustment loss function in Section 3.1 for greater clarity.
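For reference, the standard logit-adjusted cross-entropy from the literature (Menon et al.) has the following form; whether the paper uses exactly this variant, including the value of the scaling parameter $\tau$, is our assumption rather than something stated here:

$$
\ell\big(y, f(x)\big) = -\log \frac{\exp\!\big(f_y(x) + \tau \log \pi_y\big)}{\sum_{y'} \exp\!\big(f_{y'}(x) + \tau \log \pi_{y'}\big)},
$$

where $\pi_y$ is the empirical class prior and $\tau > 0$ (typically 1) controls the strength of the adjustment.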
W1.3. The caption of Figure 5 is neglected as well.
To improve clarity, we split the original Figure 5 into two separate figures, each with its own caption.
W1.4. The momentum coefficient in line 210 is not defined either. If it is the same as above, then the last weight of the averaged model is almost the same as the first model, so that the proposed method is meaningless.
We use the following formulation for the EMA over fine-tuned models, ordered by increasing subset imbalance ratio:

$$
\bar{\theta}_i = \alpha\,\bar{\theta}_{i-1} + (1-\alpha)\,\theta_i, \qquad \bar{\theta}_0 = \theta_0,
$$

where $\theta_0$ is the initialization weights and $\theta_i$ denotes the fine-tuned parameters of the $i$-th model. We fix $\alpha$ to one of the two values discussed above (0.3 or 0.7) throughout all the experiments. Therefore, the last weight of the averaged model is clearly not the same as the first model. We will clarify this in the revised manuscript.
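Unrolling this recursion makes the point explicit (a worked expansion of the formula above, using the same notation):

$$
\bar{\theta}_N = \sum_{i=1}^{N} (1-\alpha)\,\alpha^{N-i}\,\theta_i + \alpha^{N}\,\theta_0 ,
$$

so with $N = 8$ subsets the coefficient on the initialization, $\alpha^{8}$, is about $6.6\times 10^{-5}$ for $\alpha = 0.3$ and about $0.058$ for $\alpha = 0.7$; in either case the final averaged weights are far from both the initialization and the first fine-tuned model.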
Thanks to the authors' responses. Most of my problems have been solved. Sorry that I overlooked some parts of the paper and had some misunderstandings. I have raised my rating to recommend acceptance.
Dear Reviewers,
Thank you very much for your time and efforts. As we are approaching the deadline, we kindly ask you to review the rebuttal and share any remaining concerns with the authors for discussion.
Best regards, AC
Despite some concerns about conceptual novelty and computational efficiency, the paper makes a meaningful and well-validated contribution to long-tailed recognition with foundation models. The introduction of the head-tail ratio as an analytical tool and the empirical success of LT-Soups across diverse imbalance regimes provide strong reasons for acceptance. The authors’ detailed rebuttal and additional experiments effectively addressed reviewer concerns, and multiple reviewers raised their scores accordingly. The authors should address all reviewers' concerns in the final version.