PaperHub
ICLR 2024 · Rejected
Average rating: 4.7 / 10 (3 reviewers; ratings 3, 5, 6; min 3, max 6, std 1.2)
Average confidence: 3.0

Feedback-guided Data Synthesis for Imbalanced Classification

OpenReview | PDF
Submitted: 2023-09-23 · Updated: 2024-02-11
TL;DR

We introduce a framework with classifier-generator feedback mechanism for data synthesis that improves the performance on imbalanced classification problems.

Keywords
Generative Models, Imbalanced Classification, Data Synthesis, Diffusion Models

Reviews and Discussion

Review (Rating: 3)

This paper utilises recent advances in generative modelling to address the shortcomings of synthetic data in representation learning, introducing feedback from downstream classifier models to guide the data generation process. To augment static datasets with useful synthetic samples, the work designs a framework that uses pre-trained image generation models to provide useful and diverse synthetic samples that are close to the support of the real data distribution, improving the representation learning task. The paper lays the groundwork for the effective use of state-of-the-art text-to-image models as data sources that can be queried to improve downstream applications.

Strengths

  • Originality. The paper designs a diffusion-model sampling strategy that uses feedback from the pre-trained classifier to generate samples that help improve that classifier's own performance, which improves classification performance to a certain extent and shows a certain degree of innovation.
  • Quality. The experimental design is reasonable, and the feasibility of the method is verified on ImageNet-LT and NICO++.
  • Clarity. The paper is well-organized and clearly written.
  • Significance. The ideas proposed in this paper make a certain contribution to the field.

Weaknesses

  1. The font formatting of the article is not uniform. Do the italicized words carry any special meaning? They make the paper difficult to read.
  2. The charts are mixed up; for example, is Figure 5 a table or a figure? The sizes of some figures also do not match.
  3. What is the time complexity of this method?
  4. Are there additional evaluation metrics to compare the performance of the proposed method against the baseline methods?

Questions

Please refer to the weaknesses.

Comment

Thank you for your feedback on our paper. We note a mismatch between your positive evaluation and your recommendation. Below, we address all your concerns and believe that upon reviewing our responses, you might consider revising your score.

The font formatting of the article is not uniform. Do the italicized words carry any special meaning? They make the paper difficult to read.

We have followed ICLR 2024's official style guidelines, including the use of italics for introducing new concepts (e.g., “usefulness”) or emphasizing key points (e.g., “close to the support of the real data distribution”). If any instances create readability issues, please let us know and we will modify them. However, these formatting concerns can easily be rectified and do not reflect the quality of the research presented.

The charts are mixed up; for example, is Figure 5 a table or a figure? The sizes of some figures also do not match.

We have now made the size of the tables consistent across the paper, modified Figure 5 (now Table 1) to include only the table, and moved the figure to the appendix.

What is the time complexity of this method?

Thank you for raising this question. We have now included an additional section (G.1) in the Appendix which compares the time complexity of feedback guidance versus regular sampling.

Are there additional evaluation metrics to compare the performance of the proposed method against the baseline methods?

We used the standard metrics for reporting results on ImageNet-LT and NICO++ [1, 2], such as average accuracy, many/medium/few accuracy, and worst-group accuracy. In addition to the benchmarks' metrics, our submission includes standard generative-model metrics to evaluate the effect of our criteria on the quality of the generated images, such as FID, coverage, and density (see Table 2). Thus, we argue that our evaluation metrics are in line with both the representation learning and generative modelling literatures. If the reviewer has suggestions on which metrics they would like to see included in the paper, please let us know and we would be happy to include them.
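For readers unfamiliar with the many/medium/few protocol, the sketch below is our own illustration (not code from the paper) of how per-split accuracy is commonly computed on ImageNet-LT-style benchmarks, assuming the usual thresholds of more than 100 training images for "many", 20-100 for "medium", and fewer than 20 for "few".

```python
# Hypothetical illustration of the standard many/medium/few accuracy protocol;
# the thresholds below follow the common ImageNet-LT convention and may differ
# from the paper's exact setup.
import numpy as np

def split_accuracies(y_true, y_pred, train_class_counts):
    """Compute overall and many/medium/few accuracies.

    y_true, y_pred: 1-D integer arrays of test labels and predictions.
    train_class_counts: entry c is the number of *training* images of class c,
    used only to assign classes to the many/medium/few splits.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    counts = np.asarray(train_class_counts)

    # Per-class accuracy on the test set.
    classes = np.unique(y_true)
    per_class_acc = np.array(
        [np.mean(y_pred[y_true == c] == c) for c in classes]
    )

    splits = {
        "many": counts[classes] > 100,
        "medium": (counts[classes] >= 20) & (counts[classes] <= 100),
        "few": counts[classes] < 20,
    }
    results = {"overall": per_class_acc.mean()}
    for name, mask in splits.items():
        results[name] = per_class_acc[mask].mean() if mask.any() else float("nan")
    return results

# Toy usage: 3 classes with 500, 50, and 5 training images respectively.
acc = split_accuracies(
    y_true=[0, 0, 1, 1, 2, 2],
    y_pred=[0, 0, 1, 0, 2, 1],
    train_class_counts=[500, 50, 5],
)
print(acc)  # {'overall': 0.67, 'many': 1.0, 'medium': 0.5, 'few': 0.5} (approx.)
```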

---

[1] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X Yu. Large-scale long-tailed recognition in an open world. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[2] Xingxuan Zhang, Yue He, Renzhe Xu, Han Yu, Zheyan Shen, and Peng Cui. Nico++: Towards better benchmarking for domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16036–16047, 2023.

Review (Rating: 5)

With the recent advances in generative models, researchers have started augmenting these static datasets with synthetic data, reporting moderate performance improvements on long-tailed classification tasks. The authors hypothesize that these performance gains are limited by the lack of feedback from the classifier to the generative model, which would promote the usefulness of the generated samples to improve the classifier’s performance. In this work, the authors introduce a framework for augmenting static datasets with useful synthetic samples, which leverages one-shot feedback from the classifier to drive the sampling of the generative model. For the framework to be effective, they find that the samples must be close to the support of the real data of the task at hand and be sufficiently diverse. The authors validate three feedback criteria on a long-tailed dataset (ImageNet-LT) and a group-imbalanced dataset.

Strengths

  1. The problem formulation of encouraging the generated samples to be helpful to the classifier, inspired by active learning frameworks, is novel.
  2. The proposed method performs better than previous sample-synthesis-based imbalanced classification methods.

Weaknesses

  • The proposed solution for the problem definition is too naïve. For active learning methods, in addition to confidence-based or entropy-based approaches, a margin-based approach is also possible, as are more recent active learning criteria such as BALD [1], VAAL [2], or MCDAL [3]. To claim the contribution of a complete research paper, the authors should devise an idea to leverage such recent active learning methods to find more novel solutions suitable for this problem. [1] Deep Bayesian Active Learning with Image Data. ICML 2017. [2] Variational Adversarial Active Learning. ICCV 2019. [3] MCDAL: Maximum Classifier Discrepancy for Active Learning. TNNLS 2022.

  • Also, instead of simply comparing among naïve active learning criteria, how about combining multiple losses (at least a linear combination in the loss)? That would be more novel than the proposed solution.

  • The experiments are also too weak. For datasets, the authors only use ImageNet and NICO++. However, recent long-tailed recognition papers usually evaluate their methods on the iNaturalist and Places-LT datasets to demonstrate scalability. At the very least, the authors should have evaluated their method on the CIFAR datasets to show its effectiveness on other datasets.

  • Also, a comparison with more recent state-of-the-art long-tailed recognition papers is missing. For example, CMO [4] is a recent long-tailed recognition method based on sample synthesis. To claim the usefulness of the proposed method, the authors should compare it with recent long-tailed recognition papers, including [4]. [4] The Majority Can Help The Minority: Context-rich Minority Oversampling for Long-Tailed Classification. CVPR 2022.

  • More analysis of the detailed design choices is needed. For example, how are the hyper-parameters decided, such as w in Eqs. (5), (6), and (8)? As the authors propose to add additional criteria, it would be necessary to analyze the effect of w on the performance.

Questions

Please refer to the questions in the weaknesses.

Comment

We would like to thank the reviewer for their insightful review. We are glad that the reviewer finds our framework novel.

We would like to highlight important differences between Active Learning (AL) and our method. The goal of AL is to interactively query a user (or a model) to label new data points from a given static dataset of unlabelled data points; thus the dataset is known but the labels are unknown. In our setup, by contrast, the labels are given and the dataset is unknown. The dataset has to be obtained by sampling from a parametric generative model: our framework describes how to class-conditionally sample from a generative model to obtain useful data points to train a classifier.

We agree with the reviewer that our criteria bear some similarities with AL acquisition functions. However, we argue that our main contribution is the design of a queryable framework for generative models that “smartly” balances the training data for a classification model, not the criteria themselves. Note that we do not claim the criteria as our contribution (see the list of contributions at the end of the introduction). Moreover, our experiments show that the criteria we use are effective enough to obtain state-of-the-art results on three challenging datasets (ImageNet-LT, NICO++, and Places-LT). Thus, we disagree with the main premise of the reviewer's critique that our simple and effective criteria weaken our main contributions. On the contrary, we see this as proof of our framework's strength, since it reaches the state of the art with simple and broadly used criteria. We therefore kindly ask the reviewer to reconsider their initial recommendation. Below, we provide detailed responses to the reviewer's questions.

.. using the recent active learning criteria, such as BALD [1], VAAL [2], or MCDAL [3]

Thank you for suggesting the relevant literature. We have included these references in the future work section (Section 6) of our paper. In general, any differentiable criterion that is a function of the classifier can be applied as feedback guidance in the sampling process, as long as it is computationally efficient. Among the mentioned methods, both BALD and MCDAL could in theory be applied as feedback criteria. However, compared to entropy, they are roughly 10x and 3x more computationally expensive, respectively, limiting their application in practice. The suggested VAAL method is not applicable in our framework since it is task-agnostic and its acquisition function does not depend on the task's classifier.
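To make the idea of a "differentiable criterion as feedback guidance" concrete, here is a minimal sketch, assuming an entropy criterion and a generic denoiser; the update rule, function names, and stand-in networks are our own illustration and not the paper's actual implementation.

```python
# Minimal sketch of classifier-feedback guidance in a reverse-diffusion step,
# assuming an entropy criterion. All names and the update rule are illustrative,
# not the paper's implementation.
import torch
import torch.nn.functional as F

def entropy_criterion(classifier, x):
    """Differentiable criterion: mean predictive entropy of the classifier on x."""
    log_p = F.log_softmax(classifier(x), dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1).mean()

def guided_denoise_step(denoiser, classifier, x_t, t, omega=0.03):
    """One sampling step with an extra gradient term from the criterion.

    denoiser(x_t, t) is assumed to return an estimate of the clean image;
    omega scales the feedback term (cf. the omega ablation later in the thread).
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)
    # Gradient of the criterion with respect to the current noisy sample.
    crit = entropy_criterion(classifier, x0_hat)
    grad = torch.autograd.grad(crit, x_t)[0]
    # Nudge the estimate toward higher-criterion (more informative) regions;
    # a real sampler would fold this term into its own update equations.
    return (x0_hat + omega * grad).detach()

# Toy usage with stand-in networks (purely illustrative).
denoiser = lambda x, t: x * 0.9  # pretend denoiser
classifier = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
x_t = torch.randn(4, 3, 8, 8)
x_next = guided_denoise_step(denoiser, classifier, x_t, t=0)
print(x_next.shape)  # torch.Size([4, 3, 8, 8])
```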

...according to other recent long-tailed recognition papers, they usually evaluate their methods on iNaturalist and Places-LT datasets to demonstrate the scalability. At least the authors should have evaluated their method on CIFAR datasets...

Thank you for your feedback regarding our experimental approach. Our paper focused on ImageNet-LT, which has a substantial number of classes (1000), and NICO++, featuring 360 groups. To provide further evidence of our methodology's effectiveness on large-scale datasets, we have applied our framework to the Places-LT [1] dataset. This is a highly imbalanced dataset with 365 classes, where the smallest class has only 5 examples and the largest has 4980. We upsample this dataset so that every class has 4980 samples, resulting in roughly 1.8 million examples (365 × 4980 ≈ 1.82M). We compare against the Fill-Up [2] work, which also uses 1.8 million synthetic samples, and against other baselines that do not use synthetic data. We observe that feedback guidance with entropy achieves state-of-the-art results on this dataset. Below is a summary of our results; see Section G.4 of the paper for full details.

| Method | # Syn. data | Overall | Many | Medium | Few |
|---|---|---|---|---|---|
| ERM | No | 30.2 | 45.7 | 27.3 | 8.2 |
| Decouple-LWS | No | 37.6 | 40.6 | 39.1 | 28.6 |
| Balanced Softmax | No | 38.6 | 42.0 | 39.3 | 30.5 |
| ResLT | No | 39.8 | 39.8 | 43.6 | 31.4 |
| MiSLAS | No | 40.4 | 39.6 | 43.3 | 36.1 |
| PaCo | No | 41.2 | 37.5 | 47.2 | 33.9 |
| Fill-Up | 1.8M | 42.6 | 45.7 | 43.7 | 35.1 |
| LDM-FG (Entropy) | 1.8M | 42.8 | 41.7 | 44.9 | 40.0 |
Comment

...how about combining multiple losses (at least a linear combination in the loss)? That would be more novel than the proposed solution.

Thank you for this suggestion. Although our framework has proven practical and efficient in the extensive experimental analysis provided in the paper, we considered your suggestion and ran an experiment that studies how combining the different criteria affects the performance of the model.

In this setup, for the three criteria studied in the paper (loss, hardness, entropy), we generate a dataset of size 1.3M by selecting each criterion with probability 0.33 when generating a sample. We tune the hyper-parameters of the classifier. Below is a table that summarizes the results and compares them with the previous analysis in the paper (a small sketch of the per-sample criterion selection follows the table); this result and further details can also be found in Section G.2 of the paper. Overall, the entropy criterion proves to be the best on all the datasets we studied and outperforms the combination of criteria.

| Method | # Syn. data | Overall | Many | Medium | Few |
|---|---|---|---|---|---|
| LDM-FG (Loss) | 1.3M | 60.41 | 66.14 | 57.68 | 54.10 |
| LDM-FG (Hardness) | 1.3M | 56.70 | 58.07 | 55.38 | 57.32 |
| LDM-FG (Entropy) | 1.3M | 64.70 | 69.80 | 62.30 | 59.10 |
| LDM-FG (Combined) | 1.3M | 62.38 | 67.66 | 59.96 | 56.23 |
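As a sketch of the per-sample criterion selection described above, the snippet below draws one of the three criteria uniformly at random for each generated sample; generate_with_feedback is a hypothetical placeholder for the actual feedback-guided sampler, not the paper's code.

```python
# Sketch of the combined-criteria setup: for each synthetic sample, one of the
# three criteria is drawn uniformly at random. generate_with_feedback is a
# placeholder for the real feedback-guided diffusion sampler.
import random

CRITERIA = ["loss", "hardness", "entropy"]

def generate_with_feedback(class_label, criterion):
    # Placeholder: the real pipeline would run guided sampling here.
    return {"label": class_label, "criterion": criterion}

def generate_combined_dataset(class_labels, num_samples, seed=0):
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_samples):
        label = rng.choice(class_labels)
        criterion = rng.choice(CRITERIA)  # each criterion picked with prob. 1/3
        dataset.append(generate_with_feedback(label, criterion))
    return dataset

samples = generate_combined_dataset(class_labels=list(range(1000)), num_samples=10)
print([s["criterion"] for s in samples])
```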

...how are the hyper-parameters decided... analyze the effect of w on the performance.

As mentioned in Appendix Section G, the feedback guidance coefficient ω is tuned based on the validation accuracy of the classifier. We have now included an ablation study on ω for NICO++ in Table 5, Appendix Section G.3. In general, all of the values of ω that we tried improved over the case ω = 0; however, larger gains are achieved through careful tuning. Below is a summary of the results:

| ω | Worst-group accuracy |
|---|---|
| 0 | 32.66 ± 1.33 |
| 0.01 | 45.60 ± 2.63 |
| 0.03 | 49.20 ± 0.97 |
| 0.05 | 42.10 ± 1.84 |

...authors should compare the proposed method with recent long-tailed recognition papers, including [4]...

Thank you for suggesting relevant literature. We have included [4] in our literature review (Section A.2) and compared against it in Table 1. The CMO [4] method achieves state-of-the-art results on the “many” classes among methods that do not use synthetic data. However, our method with the entropy criterion outperforms CMO in all cases, including many, medium, few, and overall accuracy.

---

[1] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X Yu. Large-scale long-tailed recognition in an open world. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[2] Joonghyuk Shin, Minguk Kang, and Jaesik Park. Fill-Up: Balancing Long-Tailed Data with Generative Models. arXiv preprint arXiv:2306.07200, 2023.

Review (Rating: 6)

The effectiveness of utilizing synthesized data is limited by the lack of feedback. This work proposes a framework to drive the sampling process of a generative model, thereby improving the usefulness of the generated samples.

Strengths

  • The experimental results were stunning, achieving state-of-the-art on ImageNet-LT.
  • The writing is clear and easy to follow.
  • The experiments are comprehensive, comparing three types of feedback criteria.

Weaknesses

ImageNet-LT is essentially a pseudo long-tail dataset, where the tail classes may not necessarily be the minority in the actual data distribution. Therefore, generative models can sample relatively well. However, for real-world long-tail distributions, is it also difficult for generative models to obtain sufficiently good samples?

Questions

ImageNet-LT is essentially a pseudo long-tail dataset, where the tail classes may not necessarily be the minority in the actual data distribution. Therefore, generative models can sample relatively well. However, for real-world long-tail distributions, is it also difficult for generative models to obtain sufficiently good samples?

Comment

Thank you for your valuable feedback. We are glad that you found our experimental results “comprehensive” and “stunning” and the writing “easy to follow”. Below we address your question and hope that you will consider increasing your score.

ImageNet-LT is essentially a pseudo long-tail dataset, where the tail classes may not necessarily be the minority in the actual data distribution. Therefore, generative models can sample relatively well. However, for real-world long-tail distributions, is it also difficult for generative models to obtain sufficiently good samples?

Thank you for raising this point. As discussed in the limitations paragraph of the conclusion section in our initial submission: “...our guidance mechanism can only explore the data manifold already captured by the generative model”; this limitation is imposed by the generative model itself.

Yet, we would like to add that state-of-the-art generative models are capable of generating data points that are very unlikely under the real-world distribution of images - e.g., they can generate the rare concept of "an astronaut riding a horse" by combining the common concepts of “astronaut” and “horse”. This is largely what we observed in our NICO++ experiments, where generative models can indeed generate new combinations of object and background, which is useful for improving generalization.

Our findings on ImageNet-LT, NICO++, and Places-LT (recently added; see the new results in our answer to reviewer tjGi) show that this inherent potential can be harnessed to achieve tangible improvements in classification tasks.

Comment

We thank all the reviewers for their valuable feedback. We appreciate that the reviewers found our paper “well-organized and clearly written”, with “stunning” experimental results “achieving state-of-the-art on ImageNet-LT”, that “the experimental design of the paper is reasonable”, and that “the problem definition … is novel”.

Below is a summary of the changes that are marked in blue in the updated pdf document:

  • A new set of experimental analyses on the Places-LT dataset (Section G.4, Table 7), where we achieve state-of-the-art results on average accuracy.
  • An ablation study on the ω hyper-parameter on the NICO++ dataset (Section G.3).
  • An ablation study on using a combination of the three criteria functions (Section G.2).
  • A time-complexity analysis of our method (Section G.1).
  • Updates to the related literature (Section A.2), future work (Section 6), and references.
AC Meta-Review

This work studies long-tailed (LT) recognition, i.e., an imbalanced classification scenario in which different classes or groups are unequally represented. The proposed framework falls under the umbrella of synthetic data augmentation and proposes to leverage one-shot feedback from the classifier to drive the sampling of the generative model.

While the reviewers acknowledged the importance of the problem, they raised several concerns: 1) ImageNet-LT being a pseudo long-tail dataset (see Reviewer SFA5's comment); 2) comparison to margin-based approaches to evaluate the effectiveness of the proposed approach (see Reviewer tjGi's concern); 3) the lack of evaluation on more recent benchmarks, which was addressed in the rebuttal using the Places-LT dataset; 4) presentation clarity (see Reviewer Ajag's comments).

The rebuttal clarified some questions but did not manage to sway any of the reviewers. In light of a unanimous lack of enthusiasm for this work, a general consensus among the reviewers and the AC was reached to reject the paper. We hope the reviews are useful for improving and revising the paper.

Why Not a Higher Score

This is not a borderline paper. The only positive reviewer raised an important concern that ImageNet-LT is essentially a pseudo long-tail dataset, where the tail classes may not necessarily be the minority in the actual data distribution. The authors tried to address this verbally and by means of a new dataset; however, the lack of clarity on this question makes it very difficult to assess the benefits of the proposed work. The other reviewers raised many concerns. The rebuttal clarified some of them but did not manage to sway any of the reviewers. In light of a unanimous lack of enthusiasm for this work, a decision was reached to reject the paper.

Why Not a Lower Score

N/A

Final Decision

Reject