PaperHub

Rating: 5.0 / 10 — Poster · 3 reviewers (min 4, max 6, std. dev. 0.8)
Individual ratings: 4, 5, 6
Confidence: 3.7
Soundness: 2.7 · Contribution: 2.3 · Presentation: 2.7
NeurIPS 2024

DiffuLT: Diffusion for Long-tail Recognition Without External Knowledge

OpenReview · PDF
Submitted: 2024-05-10 · Updated: 2024-11-06
TL;DR

We introduce a pipeline for long-tail recognition that uses a diffusion model trained solely on the long-tailed dataset to generate a balanced proxy.

Abstract

Keywords
Long-tail learning; long-tail classification; diffusion model

Reviews and Discussion

Review
Rating: 4

This article uses a diffusion model to generate tail-class samples, achieving balance among classes at the data level and thereby addressing the long-tail distribution problem. Additionally, the authors argue that AID samples are the most beneficial for enhancing classifier performance; therefore, they design an L_AID loss to constrain the diffusion model to generate richer AID samples. Qualitative and quantitative comparisons have been made on three benchmark datasets, showing the effectiveness of their method.

Strengths

  1. The paper is clearly organized and well written.
  2. The experimental results look good.
  3. The motivation is reasonable.

Weaknesses

  1. The innovation is limited. Essentially, the authors' method designs a reconstruction loss to constrain the generative model to learn a distribution more consistent with real distributions, and many existing studies have proposed similar ideas.
  2. The authors emphasize that this method has generalizability in real-world applications; however, I hold a negative view on this point. I believe that the performance improvement observed in this paper is fundamentally due to training data and test data having identical distributions. Therefore, constraining the diffusion model to learn a distribution closer to training data benefits classifier learning. However, in real-world applications where limited training data may not share identical distributions with potentially infinite test data, using this method might lead to negative impacts.
  3. On what principle was Equation 4 determined? Why define $[d_f, 2d_f]$ as AID? What impact would modifying this equation have? For instance, would defining $[d_f, \sqrt{2}d_f]$ as AID affect performance?
  4. In Table 6, why does combining ID and AID result in worse performance than using only AID for CBDM?
  5. The authors should provide ablation studies for $N_t$.

Questions

Please refer to weaknesses for corresponding questions.

Limitations

The authors have addressed the limitations of this method: it will significantly increase training time.

Author Response

W1

The innovation is limited. Essentially, the authors' method designs a reconstruction loss to constrain the generative model to learn a distribution more consistent with real distributions, and many existing studies have proposed similar ideas.

The focus of our approach is not on training a generative model to learn a distribution that closely matches real-world data. Instead, we emphasize the relationship between generative models and discriminative models, addressing two questions that are yet to be fully explored: Can a generative model without external knowledge improve a classification model? And what type of generated data is most beneficial?

Previous research has primarily relied on pretrained diffusion models, which we believe limits practical applicability and may lead to data leakage. We would appreciate it if you could provide examples of existing studies with similar concepts. To the best of our knowledge, no such prior art exists.

W2

The authors emphasize that this method has generalizability in real-world applications; however, I hold a negative view on this point. I believe that the performance improvement observed in this paper is fundamentally due to training data and test data having identical distributions. Therefore, constraining the diffusion model to learn a distribution closer to training data benefits classifier learning. However, in real-world applications where limited training data may not share identical distributions with potentially infinite test data, using this method might lead to negative impacts.

As mentioned in a previous CVPR publication [1], the long-tailed setting assumes that the label distribution $p(y)$ differs between the training and test sets, while the conditional distribution $p(x|y)$ remains unchanged. These default assumptions hold in current long-tailed recognition settings and also serve as the potential theoretical foundation for much of the long-tailed literature [2][3]. The scenario you mention, where training and test samples do not share the same distribution, differs from the standard definition of long-tailed learning.

To address concerns about the potential negative impact of distribution shifts, we conducted an experiment: we trained a ResNet-50 backbone on the ImageNet-LT dataset and on ImageNet-LT augmented with the data $\mathcal{D}_{gen}$ generated by our DiffuLT method. We then performed linear probing (a transfer-learning setting) on CIFAR100-LT with an imbalance ratio of 0.01.

| DiffuLT | CIFAR100-LT Acc (%) |
|---------|---------------------|
| ×       | 62.5                |
| ✓       | 70.9                |

Despite the distribution differences between ImageNet-LT and CIFAR100-LT, our method still showed improvement. Therefore, we believe our method does not negatively impact model performance when distributions change.
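For readers unfamiliar with the protocol mentioned above: a linear probe freezes the pretrained backbone and fits only a linear classifier on its features. The sketch below illustrates the idea with a closed-form least-squares fit in NumPy; the function name and the regression-based fit are our illustrative assumptions (the actual evaluation likely trains a linear layer with SGD on real backbone features):

```python
import numpy as np

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels, n_classes):
    """Fit a linear classifier on frozen backbone features and report test accuracy.

    Illustrative stand-in for a linear-probing transfer evaluation:
    one-vs-all least-squares regression onto one-hot targets.
    """
    Y = np.eye(n_classes)[train_labels]                           # one-hot targets
    X = np.hstack([train_feats, np.ones((len(train_feats), 1))])  # append bias column
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)                     # closed-form fit
    Xt = np.hstack([test_feats, np.ones((len(test_feats), 1))])
    preds = (Xt @ W).argmax(axis=1)                               # predicted class per row
    return (preds == np.asarray(test_labels)).mean()
```

On well-separated features this recovers the correct labels; in the experiment above, the probe would be fit on CIFAR100-LT features extracted by the frozen ImageNet-LT backbone.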

Regarding real-world applications, our paper demonstrates that our methods can benefit medical imaging, remote sensing, and general image classification under long-tailed settings. We recognize that the understanding of "real-world application" may vary, and we are open to further discussion if you could specify your interpretation of this term.

W3

On what principle was Equation 4 determined? Why define $[d_f, 2d_f]$ as AID? What impact would modifying this equation have? For instance, would defining $[d_f, \sqrt{2}d_f]$ as AID affect performance?

The definition of AID samples in the paper can be seen as "somewhat arbitrary". However, we prioritized efficiency, performance, and simplicity in arriving at it. An ablation study shows that this definition achieves the best results with the highest efficiency.

| AID range | $\|\mathcal{D}_{gen}\|$ | Acc (%) | $\Delta\text{Acc}/\|\mathcal{D}_{gen}\|$ |
|---|---|---|---|
| $[d_f, \sqrt{2}d_f]$ | 5,212 | 41.2 | $5.51\times 10^{-4}$ |
| $[d_f, 2d_f]$ | 11,886 | 45.2 | $5.78\times 10^{-4}$ |
| $[d_f, 3d_f]$ | 15,384 | 41.7 | $2.23\times 10^{-4}$ |

We opted not to use the narrower first definition because the range should be as wide as possible to ensure efficient generation, while avoiding the production of too many unused samples. Although a more precise search might yield better results, we prioritize simplicity and avoid complex parameter tuning, leading us to set the region as defined.
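To make the region definition concrete, here is a small NumPy sketch of how generated samples could be bucketed into ID / AID / OOD by their feature-space distance to the real class centroid. The centroid choice, the use of the farthest real sample as $d_f$, and the names are our assumptions for illustration, not the paper's exact Equation 4:

```python
import numpy as np

def classify_generated(feats_gen, feats_real, aid_scale=2.0):
    """Label each generated sample ID / AID / OOD by its distance to the
    real-class centroid, relative to the farthest real sample (d_f).

    Illustrative sketch: samples within d_f are ID, within (d_f, aid_scale*d_f]
    are AID, and anything farther is OOD.
    """
    center = feats_real.mean(axis=0)
    d_f = np.linalg.norm(feats_real - center, axis=1).max()  # farthest real sample
    d = np.linalg.norm(feats_gen - center, axis=1)
    labels = np.where(d <= d_f, "ID",
             np.where(d <= aid_scale * d_f, "AID", "OOD"))
    return labels, d_f
```

With `aid_scale=2.0` this mirrors the $[d_f, 2d_f]$ choice; setting `aid_scale` to $\sqrt{2}$ or 3 reproduces the other rows of the ablation.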

W4

In Table 6, why does combining ID and AID result in worse performance than using only AID for CBDM?

This is actually one of the key ideas of our paper. In this setting, the total number of generated samples is consistent, although the generation frequency varies due to the different distributions of data types. Our key observation, as shown in Table 3 of the paper, is that AID samples contribute more to classification performance than ID samples. Since the "ID & AID" setting includes fewer AID samples than the "AID" setting, it results in a worse outcome.

W5

The authors should provide ablation studies for $N_t$.

We conducted an ablation experiment for $N_t$ under the CIFAR100-LT setting with an imbalance ratio of 0.01.

| $N_t$ | CIFAR100-LT Acc (%) |
|---|---|
| 300 | 48.5 |
| 400 | 50.4 |
| 500 | 51.5 |
| 600 | 51.8 |
| 700 | 49.2 |

The performance shows only a slight improvement beyond $N_t > 500$ and begins to decrease after $N_t > 600$. We believe the main reason is that the class with the most samples has only 500 instances, so choosing $N_t > 500$ inevitably introduces noise. When $N_t$ is too large, the impact of this noise outweighs the benefits of the newly added information.
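As a hedged sketch of the role $N_t$ plays, the snippet below tops every class up to $N_t$ samples with generated data (classes already at or above $N_t$ are subsampled). All names here are hypothetical, and the real pipeline may select generated samples by filtering rather than at random:

```python
import random

def build_balanced_proxy(real_by_class, gen_by_class, n_t, seed=0):
    """Assemble a class-balanced proxy dataset of (sample, is_synthetic) pairs.

    Illustrative sketch: each class is topped up to n_t samples with generated
    data; classes that already have >= n_t real samples keep only n_t of them.
    """
    rng = random.Random(seed)
    proxy = {}
    for cls, real in real_by_class.items():
        if len(real) >= n_t:
            proxy[cls] = [(x, False) for x in rng.sample(real, n_t)]
        else:
            need = n_t - len(real)
            gen = gen_by_class.get(cls, [])
            picked = rng.sample(gen, min(need, len(gen)))  # cap at what was generated
            proxy[cls] = [(x, False) for x in real] + [(x, True) for x in picked]
    return proxy
```

Under this reading, choosing $N_t$ above the largest real class count forces even head classes to absorb synthetic samples, which is consistent with the noise explanation above.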

[1] Disentangling Label Distribution for Long-tailed Visual Recognition. CVPR 2021.

[2] Long-tail Learning via Logit Adjustment. ICLR 2021.

[3] Balanced Meta-Softmax for Long-Tailed Visual Recognition. NeurIPS 2020.

Comment

Thank you for the reply; I have read your rebuttal and the comments from the other reviewers. The novelty and methodological design of this paper do not yet convince me to revise my score immediately. I will wait until the Reviewer-AC discussion phase to decide on the final score.

Comment

Dear Reviewer,

We hope that our rebuttal has effectively clarified any confusion and that the additional experiments we provided have strengthened the validity of our approach. We eagerly await your feedback on whether our response has adequately resolved your concerns or if further clarification is needed.

Comment

Dear Reviewer,

Thank you for your feedback and patience. We would like to address your concerns regarding the novelty and methodological design of our work.

Novelty

The novelty of our approach lies in two main areas: primarily in the long-tailed learning domain and, to a lesser extent, in the generative modeling domain.

Long-Tailed Learning:

  • Diffusion Model without External Knowledge: Our key contribution is the novel application of diffusion-based models to long-tailed learning without relying on external knowledge. While recent works like [1] and [2] utilize Stable Diffusion to address data scarcity in classification tasks, they often face criticism due to the extensive pretraining on external large datasets (much larger than those for classification tasks), which can overshadow the novelty of their approach. We specifically address the question of whether a generative model, trained only on the limited data available in classification tasks, can still be effective. Our work demonstrates that this is indeed possible for the first time, providing a significant insight, especially when the training data is not composed of natural images (as discussed in our response to W3 of reviewer #vaue). This contribution may open new research avenues within the long-tailed learning community, offering a fresh perspective on the integration of generative models for classification tasks.

  • Understanding Diffusion Models: We also explain why diffusion models, even without external knowledge, are effective in this context. Our work is the first to introduce the concept of AID samples and the role of the generative model in blending information across classes to generate beneficial and novel samples.

Generative Modeling:

  • A New Objective for Generative Models: Traditionally, models like VQ-GAN are trained with objectives such as reconstruction loss and perceptual loss to generate realistic images. Our work introduces a novel objective: optimizing the generative model to produce images that are beneficial for classification tasks. This represents a significant shift from the traditional goal of generating visually appealing images to a more practical application, where the generated images directly enhance discriminative tasks. This objective is inherently challenging to optimize, and our introduction of AID samples provides a novel pathway for achieving this goal.

Methodological Design

To address your concerns, we highlight the key designs of our method:

New Designs:

  • Classification of Generated Images: We classify generated images based on their distribution in the feature space relative to real data.
  • Novel Loss Function: We introduce a new loss function that uses classifiers to guide the training of the generative model. This simultaneously addresses the long-tailed generation problem and enhances the utility of the generative model for classification tasks.
  • Diffusion Model without External Knowledge: As previously mentioned, our method employs a diffusion model trained without external knowledge to generate images that aid long-tailed learning.
  • Weighted Synthetic Samples: We apply weights to synthetic samples during backpropagation to improve training efficacy.

Adapted Designs:

  • Sample Filtering: We use a trained classifier to filter out harmful samples, similar to existing methods but without relying on models like CLIP.
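A minimal NumPy illustration of the "Weighted Synthetic Samples" design listed above: synthetic samples receive a smaller weight in the cross-entropy average during backpropagation. The specific weight value and the uniform down-weighting scheme are our assumptions for illustration; the paper's actual weighting may differ per sample.

```python
import numpy as np

def weighted_ce(logits, labels, is_synthetic, w_syn=0.5):
    """Cross-entropy in which synthetic samples contribute with a reduced
    weight, sketching the 'weighted synthetic samples' idea.
    """
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]     # per-sample NLL
    w = np.where(is_synthetic, w_syn, 1.0)               # down-weight synthetic rows
    return (w * nll).sum() / w.sum()
```

With `w_syn=1.0` this reduces to standard cross-entropy; lowering it shrinks the gradient contribution of generated samples relative to real ones.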

We hope this addresses your concerns. Please feel free to reach out if further discussion or additional experiments are needed. We look forward to your feedback. Thank you again!

[1] Expanding Small-Scale Datasets with Guided Imagination. NeurIPS 2023.

[2] Feedback-guided Data Synthesis for Imbalanced Classification. NeurIPS Workshop 2023.

Review
Rating: 5

This paper introduces a novel pipeline for long-tail recognition that differs from conventional strategies by focusing on dataset composition: it introduces a diffusion model, namely a DDPM, to generate data for tail classes. The analysis reveals that approximately-in-distribution (AID) samples, which slightly deviate from the real data distribution and blend class information, are essential for the generative model's contribution to long-tail classification. Experiments conducted on CIFAR10-LT, CIFAR100-LT, and ImageNet-LT demonstrate significant performance improvements.

Strengths

  1. The paper is well-organized and easy to understand.
  2. The paper explores a novel perspective to tackle imbalance issue, by generating data for tail classes before training the classifier, which shows promising results compared to baseline results.
  3. The paper explores the data composition, namely ID, AID and OOD, and gives a detailed analysis of the correlation between generated data composition and classification performance.

Weaknesses

  1. Some typos: line 137 ‘am input’
  2. The baseline setup in Table 1,2,3 requires further explanation.
  3. The paper mentions the time cost is four times longer than baseline, indicating that this method may be impractical for datasets with a significantly large imbalance ratio due to the extended generation and retraining times.
  4. Some baseline results in Tables 8 and 9 are indicated as ‘-’ and ‘ ’; an explanation for these forms of absence is missing.
  5. This method focuses on generating data for tail classes, resulting in significant improvements for these classes compared to other baseline methods. A fairer comparison would be with the same pipeline using different generative models.
  6. A long-tailed diffusion model baseline[1] is missing.
  7. The experiments are mainly conducted on manually constructed long-tailed datasets; including real-world long-tailed datasets such as iNaturalist would improve the robustness of the results.

[1] Long-tailed Diffusion Models with Oriented Calibration. ICLR 2024.

Questions

Please refer to the weakness part.

Limitations

The limitations are discussed in this paper. No potential negative social impact.

Author Response

W1

Some typos: line 137 ‘am input’

Thank you for your feedback; we'll fix it in the next version.

W2

The baseline setup in Table 1,2,3 requires further explanation.

We are sorry for the confusion. The baseline setup adheres to the code and protocols from [1], with a 200-epoch training schedule. Training is conducted with a batch size of 128 using an SGD optimizer with a learning rate of 0.1. These detailed settings can also be found in our appendix.

W3

The paper mentions the time cost is four times longer than baseline, indicating that this method may be impractical for datasets with a significantly large imbalance ratio due to the extended generation and retraining times.

Thanks for your review. We argue that since DiffuLT doesn't rely on external knowledge such as pretrained models or additional data, it is highly practical in scenarios where data is limited and acquiring more training data is costly or impossible. In such cases, training time is not the primary concern; achieving satisfactory accuracy is.

To support this argument, we showcase two quantitative examples: medical imaging using the KVASIR dataset and remote sensing with the EuroSAT dataset. We train the classifier on long-tail versions of these datasets with an imbalance ratio of 0.01, where models typically struggle due to the skewed distributions.

| Model    | Long-tailed | KVASIR Acc (%) | EuroSAT Acc (%) |
|----------|-------------|----------------|-----------------|
| Baseline | ×           | 83.8           | 93.2            |
| Baseline | ✓           | 52.1           | 88.9            |
| DiffuLT  | ✓           | 66.8           | 92.0            |

As shown in the table, our method significantly mitigates the long-tail issue in these important scenarios, demonstrating strong generalization for practical use. We will provide more such discussion in the final version.

W4

Some baseline results in Tables 8 and 9 are indicated as ‘-’ and ‘ ’; an explanation for these forms of absence is missing.

Thanks for your careful review. Since the comparison results are based on the reported values in their respective papers, any results not shown in the original papers are indicated as “-”.

W5 & W6

This method focuses on generating data for tail classes, resulting in significant improvements for these classes compared to other baseline methods. A fairer comparison would be with the same pipeline using different generative models.

A long-tailed diffusion model baseline is missing.

Following your suggestions, we compare our method with different generative models using the same pipeline in Table 5 of the paper. While T2H, as you mentioned, shows slightly better performance than CBDM, it falls short of our proposed model because it is not primarily designed for discriminative tasks.

| Method   | CIFAR100-LT Acc (%) |
|----------|---------------------|
| Baseline | 38.3                |
| DDPM     | 43.8                |
| CBDM     | 46.6                |
| T2H      | 46.8                |
| DiffuLT  | 51.5                |

Additionally, using pretrained diffusion models like Stable Diffusion could lead to data leakage, resulting in an unfair comparison, and they are not well-suited for data beyond natural images. We will add these comparisons in our final version.

W7

The experiments are mainly conducted on manually constructed long-tailed datasets; including real-world long-tailed datasets such as iNaturalist would improve the robustness of the results.

You have raised an interesting point. We have tried our best to train a generation model on the iNaturalist dataset, but we found that it is so large that it consumed all of our GPU computing power. We are running the DiffuLT pipeline on a manually sampled subset of iNaturalist, but due to time constraints, these experiments are not yet finished. We promise to showcase the analysis and experimental results in our final version.

In the meantime, since iNaturalist is similar to ImageNet-LT (both being natural image datasets), we believe the effectiveness of our model on this type of data is validated. Additionally, the results on KVASIR and EuroSAT, real-world long-tailed datasets with entirely different distributions, further demonstrate the robustness of our model.

[1] BBN: Bilateral-Branch Network with Cumulative Learning for Long-Tailed Visual Recognition. CVPR 2020.

[2] KVASIR: A Multi-Class Image Dataset for Computer Aided Gastrointestinal Disease Detection. MMsys 2017.

[3] EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2019.

Comment

Dear Reviewer,

We hope that our rebuttal has effectively clarified any confusion and that the additional experiments we provided have strengthened the validity of our approach. We eagerly await your feedback on whether our response has adequately resolved your concerns or if further clarification is needed.

Comment

Thanks for the response. Most of my concerns have been addressed and I would like to raise my score to '5'.

Comment

Dear Reviewer,

Thank you for raising your score! We're encouraged that our rebuttal addressed your concerns, and we sincerely appreciate your support for our work.

Review
Rating: 6

This paper introduces a diffusion-based long-tail recognition method termed DiffuLT. DiffuLT uses a diffusion model to generate supplementary samples in order to balance the long-tailed dataset. The authors first discover that approximately-in-distribution (AID) samples are crucial for enhancing performance in long-tail classification, as they blend useful information from other classes into the tail classes. Inspired by this discovery, the authors propose to first train a feature extractor $\varphi_0$ solely on the long-tail dataset, then utilize $\varphi_0$ to guide the generation of AID samples, and finally train the desired classification model using the original dataset and the newly generated AID samples together. Experiments validate the enhanced performance of DiffuLT.

Strengths

This paper is well-written in the following dimensions:

  1. Originality: The originality of this paper comes mainly in two aspects. First, the authors discover that AID samples are crucial in enriching tail class samples and enhancing model performance. Second, the authors incorporate revised diffusion model to generate AID samples. This method is original and effective.
  2. Quality: This paper is in good quality. The proposed method is both convincing and effective.
  3. Clarity: From introduction to discovery of the effectiveness of AID samples, then the pipeline of DiffuLT and finally the experiments and ablation studies, the structure is clear and logical.
  4. Significance: The discovery that AID samples are crucial in long-tail recognition performance is quite useful.

Weaknesses

  1. It would be better if you show how “centralized” figure 2 is, for example, add support data of the proportion of ID, AID and OOD data. This figure is not intuitive enough.
  2. Citation suggestions. I believe your work would benefit from referencing some additional literature to provide a more comprehensive context for your study. Specifically, I recommend citing the following articles:
  • A Unified Generalization Analysis of Re-Weighting and Logit-Adjustment for Imbalanced Learning (NeurIPS 23)
  • Deep Long-tailed Learning: A Survey (TPAMI 23)
  • Effective Data Augmentation With Diffusion Models (ICLR 24)
  • Harnessing Hierarchical Label Distribution Variations in Test Agnostic Long-tail Recognition (ICML 24)
  3. Minor typos. Line 137 “as an input”. Line 227 “because it doesn’t rely on any external dataset or model”.

Questions

Could you explain how your method differs from other data augmentation methods?

Limitations

See Weaknesses.

Author Response

W1

It would be better if you show how “centralized” figure 2 is, for example, add support data of the proportion of ID, AID and OOD data. This figure is not intuitive enough.

Thank you for your feedback. We agree that the centralization of data is not immediately apparent in Figure 2. It becomes clear through the statistics presented in Table 2. We believe that adding annotations to indicate each data type's region in the figure would help readers understand it more effectively.

W2

Citation suggestions. I believe your work would benefit from referencing some additional literature to provide a more comprehensive context for your study.

Thank you. We will include the first and last papers in our experimental comparisons, and we will incorporate the middle two works into our related work section in the future version.

W3

Minor typos. Line 137 “as an input”. Line 227 “because it doesn’t rely on any external dataset or model”.

Thanks for your patient review. We’ll address those issues accordingly.

Comment

Thank you for your response. I would keep my score.

Comment

Dear Reviewer,

Thank you for your prompt response and patience. We greatly appreciate your valuable advice and your support for our paper.

Due to an oversight, we did not provide a satisfactory answer to your question. Recognizing its importance, we would like to address it comprehensively here.

Q1

Could you explain how your method differs from other data augmentation methods?

We conducted a thorough survey of data augmentation methods for long-tail learning, categorizing them into two groups: feature augmentation and image augmentation.

Feature augmentation methods, such as RSG [1] and OFA [2], primarily focus on biasing the decision boundary towards tail classes. These methods are similar to re-weighting or re-sampling techniques and do not inherently increase data diversity. Moreover, because features are indirect and often difficult to map back to images, these methods differ fundamentally from image augmentation techniques.

Image augmentation methods share our motivation of increasing dataset richness. We specifically examined M2m[3], MiSLAS[4], Remix[5], CMO[6], and CSA[7], discussing each as follows:

  • M2m: This method augments tail classes by transforming head-class samples into tail-class ones through perturbation-based optimization. However, the generated samples often closely resemble the originals (the head images), failing to truly enrich the diversity of tail classes.
  • MiSLAS and Remix: These methods use Mixup for augmentation. However, MiSLAS experiments indicate that applying Mixup during classifier learning does not yield significant improvements and may even harm performance, requiring additional strategies to mitigate this issue.
  • CMO and CSA: Both methods introduce rich contexts from majority samples into minority samples. CMO uses CutMix, while CSA employs a more precise segmentation approach. While effective in enriching context, they do not enhance the diversity of objects within tail classes.

Data richness, particularly in tail classes, is crucial for addressing the long-tail problem. While recent methods have acknowledged this and used data augmentation to tackle it, their effectiveness is limited to context enrichment. In contrast, the diffusion models in our pipeline represent the most powerful data augmentation method. Unlike mixup-based approaches, diffusion models avoid domain shift by accurately estimating the data distribution. Additionally, they enrich both context and objects, offering greater diversity since they combine information from the entire dataset rather than just two images as CMO and CSA do. Therefore, with proper utilization, diffusion models hold the highest potential. Below is a performance comparison of various data augmentation methods, including ours, on CIFAR100-LT with an imbalance ratio of 100, using ResNet-32 as the backbone.

| Method | CIFAR100-LT Acc (%) |
|--------|---------------------|
| RSG    | 43.1                |
| OFA    | 42.9                |
| M2m    | 43.5                |
| MiSLAS | 47.0                |
| Remix  | 46.8                |
| CMO    | 47.2                |
| CSA    | 46.6                |
| Ours   | 51.5                |

[1] RSG: A Simple but Effective Module for Learning Imbalanced Datasets. CVPR 2021.

[2] Feature Space Augmentation for Long-Tailed Data. ECCV 2020.

[3] M2m: Imbalanced Classification via Major-to-minor Translation. CVPR 2020.

[4] Improving Calibration for Long-Tailed Recognition. CVPR 2021.

[5] Remix: Rebalanced Mixup. ECCV workshop 2020.

[6] The Majority Can Help the Minority: Context-rich Minority Oversampling for Long-tailed Classification. CVPR 2022.

[7] How Re-sampling Helps for Long-Tail Learning? NeurIPS 2023.

Final Decision

This work designs a diffusion-based long-tail recognition method, DiffuLT, leveraging diffusion models to generate samples that balance the long-tailed dataset. The authors found that approximately-in-distribution (AID) samples are important for enriching tail-class samples, and they propose incorporating a revised diffusion model to generate such samples. The idea is new and the performance gain is good. The authors' rebuttals have addressed most of the reviewers' concerns.