PaperHub
Average Rating: 4.7 / 10 (min: 3, max: 6, std: 1.2)
Individual Ratings: 6, 5, 3 · Confidence: 3.7
Decision: Rejected (3 reviewers)
ICLR 2024

BLG: BALANCED LANGUAGE DISTRIBUTION AS GUIDANCE FOR ROBUST LONG-TAILED VISION CLASSIFICATION

OpenReview · PDF
Submitted: 2023-09-18 · Updated: 2024-02-11

Abstract

Keywords
Long-tailed vision recognition · multi-modality · optimal transport

Reviews and Discussion

Review
Rating: 6

After the advent of vision-language pre-training, numerous works have adapted the pre-trained vision-language model to various vision tasks, including long-tailed recognition. This paper first presents empirical evidence that textual features remain balanced even after fine-tuning in the context of long-tailed classification. Based on this, the authors propose a framework that leverages balanced textual features as a guide to obtain more robust visual features.

Strengths

  1. The empirical finding that, during the fine-tuning of the entire vision-language pre-trained model on long-tailed data, textual features tend to achieve balance is quite intriguing. This paper goes beyond this observation and contributes to the community by proposing a concrete methodology that leverages balanced textual features to rectify imbalanced visual features.
  2. The thorough ablation study conducted on the elements comprising "Phase B," proposed in this work, effectively underscores that the suggested $L_{\text{OT}}$ and $L_{\text{SCT}}$ indeed enhance performance.

Weaknesses

  1. The overall structure of this paper, which deals with challenges in contrastive learning methods due to class imbalance and suggests remedies, evokes thoughts of Suh and Seo (2023). Nevertheless, the current paper does not include any discourse on this topic.
  2. Moreover, while one could mention Kang et al. (2021) as a seminal work on achieving a balanced and discriminative feature space in long-tailed classification scenarios, this is also not discussed.

Kang et al., 2021, Exploring Balanced Feature Spaces for Representation Learning.
Suh and Seo, 2023, Long-Tailed Recognition by Mutual Information Maximization between Latent Features and Ground-Truth Labels.

Questions

  1. Could you offer some informed speculation about why there is a tendency for textual features to be balanced?
  2. Since comparing performance between different architectures does not hold much significance, it would be better to provide results for RN50 and ViT-B/16 in separate groups.
  3. Does the proposed approach result in any additional training expenses? For instance, what are the costs associated with setting up an optimal transport plan?
Comment

Connection with contrastive learning.

We sincerely appreciate the thoughtful reviews and the valuable insights provided by each reviewer. We have carefully considered your comments regarding the connections and differences between our method and previous works based on contrastive learning for imbalanced image classification. We acknowledge the similarities but would like to emphasize the essential differences that set our approach apart.

(1) Distinguishing from Contrastive Learning:

Our method is not solely based on contrastive learning. While we use CLIP and propose the use of a supervised contrastive learning loss function, leveraging the optimal transport matrix to enhance model learning and improve visual features, this aspect represents only a supplementary component within our broader method. Furthermore, our experiments demonstrate that even without the proposed contrastive supervision loss function, our method achieves state-of-the-art performance.

(2) Optimal Transport Matching vs. Contrastive Learning:

When addressing the challenge of improving biased visual features to achieve better balance and separability, we uniquely formulate this problem as an optimal transport matching problem. This approach is fundamentally different from traditional contrastive learning. Our objective is to minimize the optimal transport distance, ensuring a closer alignment between visual features of the same class and their prototypes while creating a more significant separation between different classes.
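To make the formulation concrete, the following is a minimal, illustrative sketch (our own code, not the authors' released implementation) of how an entropy-regularized transport plan between a batch of visual features and the class prototypes could be computed with Sinkhorn iterations, and how the resulting OT distance could act as an alignment loss. All function names, the regularization strength, and the iteration count are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def sinkhorn_plan(cost, eps=0.05, n_iters=50):
    """Entropy-regularized OT plan for a (B x K) cost matrix with uniform marginals."""
    B, K = cost.shape
    log_mu = torch.full((B,), -math.log(B), device=cost.device)  # uniform mass over the batch
    log_nu = torch.full((K,), -math.log(K), device=cost.device)  # uniform mass over prototypes
    log_kernel = -cost / eps                                     # Gibbs kernel in log space
    u = torch.zeros(B, device=cost.device)
    v = torch.zeros(K, device=cost.device)
    for _ in range(n_iters):                                     # log-domain Sinkhorn updates
        u = log_mu - torch.logsumexp(log_kernel + v[None, :], dim=1)
        v = log_nu - torch.logsumexp(log_kernel + u[:, None], dim=0)
    return torch.exp(log_kernel + u[:, None] + v[None, :])       # transport plan T, shape (B, K)

def ot_alignment_loss(visual_feats, prototypes):
    """OT distance <T, C> between a batch of visual features and the class prototypes."""
    v = F.normalize(visual_feats, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    cost = 1.0 - v @ p.t()            # cosine-based transport cost
    with torch.no_grad():
        plan = sinkhorn_plan(cost)    # the plan is treated as a constant w.r.t. the features
    return (plan * cost).sum()        # minimizing this pulls features toward their prototypes
```

Unlike a contrastive objective, no explicit positive/negative pairs are needed here: the transport plan decides how much each visual feature should move toward each prototype.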

(3) Leveraging Complementarity of Multi-Modal Information:

In the context of imbalanced image classification, effectively leveraging the complementarity between multi-modal information for robust visual feature learning remains an open challenge. Our experiments in rebuttal reveal that naively employing text features as traditional positive samples or anchor points in contrastive learning methods does not directly contribute to effective model learning. Therefore, we find it imperative to design and implement more nuanced methods to address imbalanced image classification under multi-modal conditions.

We believe these distinctions underscore the uniqueness of our method and its contributions to addressing the complexities of imbalanced image classification. We appreciate your thorough evaluations and welcome any further suggestions or inquiries.

Comment

My main issue was the absence of discussions regarding prior works. It is crucial for the researchers to clearly identify any similar existing studies concerning their newly proposed ideas, delineate the points of similarity, and distinctly outline what aspects are novel. The authors addressed this concern through the revision, so I am accordingly raising the score.

Additional Note: it seems that Figure 7 incurs some delays in the image rendering process. One possible solution may be substituting it with a PNG format.

Comment

Dear Reviewer V37f,

Thanks a lot for your comments and for raising the score.

Figure 7 exhibits the visualization of our optimal transport matrix, offering a visual testament to the effectiveness of our proposed optimal transport (OT) loss function. It visually clarifies the OT-based loss function's adeptness in guiding the model to align visual features with their respective textual prototypes. Consequently, this illustration stands as evidence for the rationale underlying the refinement of biased visual features through our optimal transport-based approach. Moreover, it underscores the judicious application of the transport matrix to further augment the similarity between akin visual features.

We will update the revision of our paper by incorporating your informative suggestions.

Best regards, The Authors

Comment

Thanks for your kind suggestion and for providing related work to help refine our work.

  1. Reply to #W1 and #W2:

Thank you for sharing these papers. We had noticed them before, as they share a similar motivation to ours: learning balanced and discriminative visual features for long-tailed recognition. Despite these similarities, there are significant differences between our approaches.

The first paper maximizes mutual information between visual latent features and ground-truth labels, employing a contrastive learning method based on the image modality alone. In contrast, our method leverages complementary textual information to facilitate the learning of robust visual features. Notably, our approach does not explicitly model or utilize label frequency, as the first paper does, where it is used to compute logit adjustment terms and construct class-wise queues. Instead, our focus is on achieving high separability between classes based on textual features. Moreover, our method does not rely on a teacher-student network structure. Furthermore, our proposed supervised contrastive learning loss is only part of our contribution and serves as a complement to the OT-based prototype-guided matching. As shown in Table 3, our method still performs best even without the proposed $\mathcal{L}_{\text{SCT}}$.

The second paper employs k-positive samples as anchors for contrastive loss, deviating from using augmented instances. This approach, dependent on the number of samples in the minority class, cannot guarantee the distance between k-positive samples from different classes, potentially causing tail classes to be pushed closer together. In contrast, our method is independent of the number of samples and consistently identifies prototypes with strong discriminability based on textual features. Additionally, our approach is multimodal, utilizing optimal transport to align visual features and prototypes in an unsupervised manner. We introduce class-level supervision signals based on the transport matrix to further enhance feature learning.

In summary, while these referenced papers effectively aim to learn more discriminative visual distributions, our work distinguishes itself through its focus on multimodal data, the incorporation of optimal transport for balanced prototype-guided visual representation learning, and class-level supervision signals based on the transport plan. We will discuss the differences in more detail in the related work section.

  2. Reply to #Q1:

Firstly, we argue that text features are higher-level, abstract semantic representations, which are inherently more discriminative. They can therefore serve as queries in the vision-language model to perform tasks such as classifying the input image features. In our work, this discriminability manifests as balance: the language encoder is less affected by the tail classes. As shown in the visualization in Appendix A, textual features always maintain good separability and balance, regardless of whether the language encoder of CLIP has been fine-tuned or not.

Secondly, when we construct the text prompt for an input image, we use the class-name information, which already provides high separability for textual features. From the text features, we can often directly infer which object a sentence describes from the class name. They can therefore serve as prototypes with good classification properties, and we aim to make visual features cluster more tightly around the corresponding prototypes.

In addition, when we use balanced textual features to build prototypes, each class uses the same number of text prompts. From this perspective, the prompts are balanced across classes: the head classes do not have more prompts than the tail classes.

Finally, when we fine-tune CLIP, we only use the InfoNCE loss in the image-retrieval direction, because our goal is to use text to classify images; this also reduces the negative impact on textual features during Phase A to some extent.
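To illustrate the "same number of prompts per class" point, here is a minimal sketch of building one balanced textual prototype per class with the open-source `clip` package; the prompt templates are placeholders, not necessarily the ones used in the paper.

```python
import torch
import clip  # OpenAI CLIP package (https://github.com/openai/CLIP)

# Placeholder templates -- the paper's exact prompt set may differ.
TEMPLATES = ["a photo of a {}.", "a photo of the {}.", "a blurry photo of a {}."]

@torch.no_grad()
def build_text_prototypes(class_names, device="cuda"):
    """One prototype per class; every class uses the same number of prompts,
    so the prototype set stays balanced regardless of image frequency."""
    model, _ = clip.load("ViT-B/16", device=device)
    prototypes = []
    for name in class_names:
        prompts = [t.format(name) for t in TEMPLATES]      # identical count for head and tail classes
        tokens = clip.tokenize(prompts).to(device)
        feats = model.encode_text(tokens).float()
        feats = feats / feats.norm(dim=-1, keepdim=True)
        prototypes.append(feats.mean(dim=0))               # average the prompt embeddings
    prototypes = torch.stack(prototypes)
    return prototypes / prototypes.norm(dim=-1, keepdim=True)   # (num_classes, dim)
```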

  3. Reply to #Q2.

Thanks for your kind reminder. We will update our tables for a clear comparison in the revision.

  4. Reply to #Q3.

Yes, the OT loss introduces a modest additional computational overhead, mainly from computing the optimal transport (OT) distance when aligning visual features and prototypes. We have comprehensively compared our method with BALLAD and VL-LTR in Appendix G. As shown in Table 14, our method is comparable to other methods in terms of training cost. Compared with BALLAD, we add 10 epochs to the training phase and require more training time, but we need less inference time and achieve a significant improvement in performance (51.9% vs 49.5%). Compared with VL-LTR, we require fewer training epochs, less training time, and no additional data, yet we still achieve better performance (51.9% vs 50.1%).

Review
Rating: 5

This study discovers that the fine-tuned CLIP's textual features are more balanced and discriminative than its visual counterparts. Building on this, the research proposes utilizing balanced textual features as prototypes to guide the learning of robust representations for biased visual features. CLIP is further fine-tuned through contrastive learning, followed by the optimization of biased visual representations using linear adapters and the introduction of an optimal transport distance to help decouple biased visual features. Additionally, a supervised contrastive learning loss based on the transport plan is designed. Experimental results indicate that the approach excels in leveraging visual-language information for imbalanced visual recognition, achieving state-of-the-art performance.

Strengths

  1. Extensive experiments on ImageNet-LT, Places-LT, and iNaturalist 2018 have demonstrated the effectiveness of the proposed method.
  2. Comprehensive visualizations and ablation studies were conducted to validate the impact of the proposed method.

Weaknesses

  1. The experiment results indicate that the method underperforms for “many” classes in long-tail data.
  2. The proposed method employs a two-stage training process and fine-tunes the Full-CLIP, which requires significant computational resources and has a prolonged training duration.
  3. The proposed method doesn't seem to have a specific design tailored for long-tail data. The approach of using textual features as guidance for better image features can be applied to situations with limited image feature quality for various reasons, such as long-tail, few-shot, noisy data, generated data, low-resolution data, and so forth.

Questions

  1. Why does the proposed method underperform in “many” classes of LT dataset? An analysis of the underlying reasons would be appreciated.
  2. The experimental results show that the proposed method underperforms in “many” classes of LT dataset. Does this imply that the method is primarily effective for situations with the few-shot scenario (i.e., “medium” and “few” classes in long-tail datasets)?
  3. Balanced sampling is a fundamental operation in long-tail methods. Why is random sampling used when fine-tuning the CLIP in the initial stage? Is it to intentionally obtain a CLIP encoder with strong biases caused by imbalance?
  4. Reference [1] also leverages CLIP's text features to enhance the discriminative power of image features. In [1], directly using text features and image features together to train a linear classifier achieves significant improvements in few-shot tasks. However, compared to the method in this paper, the method in [1] is much simpler, with much faster computation and far less computational overhead. [1] https://arxiv.org/abs/2301.06267
Comment

Thanks for your kind suggestion and for providing related work to help refine our work.

  1. Reply to #W1, #Q1 and #Q2:

This reflects a performance trade-off among the three shot divisions in long-tailed recognition. Despite this, our method offers several key advantages:

(1) We view imbalanced recognition as a task primarily concerned with enhancing the classification performance of tail classes, minimizing the performance gap with head classes, and achieving superior overall classification performance. While BALLAD and VL-LTR excel in overall performance by mainly improving head class results, they fall short in addressing the tail. For instance, when based on RN50 and additional texts for training, VL-LTR achieves 77.8% accuracy in the Many-shot in ImageNet-LT but only 50.8% accuracy in the Few-shot, resulting in a substantial 27.0% performance gap. In contrast, our method exhibits a significantly lower performance gap of 8.7%.

(2) Our approach outperforms in medium-shot, few-shot, and overall performance, effectively narrowing the performance gap between these divisions and many-shot. This aligns with our motivation to attain a more balanced visual feature distribution, resulting in a more balanced classification performance. Notably, we achieve a more substantial performance improvement in few-shot scenarios compared to previous methods. On ImageNet-LT, our method showcases a 10% increase in the few-shot category. Similarly, on Places-LT, the improvement exceeds 12%.

Furthermore, we wish to draw a distinction between the few-shot division in long-tailed recognition and few-shot learning in the general sense. The former is a means of performance evaluation after data division under the long-tail problem, indicating performance on minority-class samples. In contrast, the latter focuses more on the disparity between base classes and novel classes, involving learning from limited samples after new classes are introduced. In the former, the model is trained on the entire imbalanced dataset and must learn to classify all classes. Ultimately, our evaluation considers performance under the three divisions as well as overall performance.

  2. Reply to #W2:

Fully fine-tuning the CLIP encoders indeed introduces a large computational cost. We notice that some parameter-efficient methods have been developed for tuning CLIP encoders. We will focus on this issue and try more efficient ways to improve our method.

  3. Reply to #W3:

Thanks for your kind suggestion. We will try our method in more scenarios, such as few-shot learning and noisy data. Due to the rebuttal time limit, we may not be able to provide comprehensive results now.

  4. Reply to #Q3:

We employ a random sampling strategy when fine-tuning CLIP encoders in the first stage, driven by three key considerations.

Firstly, as indicated in related works [1,2], the feature extractor's impact on classification performance may not be significant. The pivotal factor in imbalanced classification accuracy lies in the learning of the classifier. Consequently, utilizing random sampling during encoder training or fine-tuning, even with skewed visual features, remains a viable approach. Following this, methods often choose to fix encoders and employ new sampling strategies to train the classifier. Based on this concept, we adopt a random sampling strategy during encoder fine-tuning in the first stage. In the second stage, we shift to a balanced sampling strategy to refine visual features and train a balanced classifier. This decision aligns with our motivation.

Secondly, experiment results indicate that using a balanced sampling strategy to fine-tune encoders in the first stage is not the optimal approach. Updated experiment results on sampling strategy in Appendix H support this observation. Notably, employing random sampling in the first stage and balanced sampling in the second stage yields the best results, outperforming the strategy of using balanced sampling in both stages. This finding is consistent with conclusions drawn in previous related work, such as BALLAD.

Finally, a new t-SNE visualization result in Appendix H compares visual features extracted from the same class under two different sampling strategies. Even when a balanced sampling strategy is used in Phase A, it does not significantly enhance tail classes' feature distribution distinctiveness. The visual feature distribution obtained by balanced sampling does not markedly differ from that obtained by random sampling on tail samples. However, for head samples, visual features of the head class are less densely clustered than those obtained by random sampling. This underscores the necessity of employing different sampling strategies in distinct stages from an alternative perspective.

[1] Decoupling representation and classifier for long-tailed recognition.
[2] Improving calibration for long-tailed recognition.
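For reference, a minimal sketch (our own illustration, not the authors' code) of the two sampling strategies discussed above: Phase A uses a standard instance-level random (shuffled) sampler, whereas Phase B can use a class-balanced sampler that first draws a class uniformly and then an instance from it.

```python
import random
from collections import defaultdict
from torch.utils.data import Sampler

class ClassBalancedSampler(Sampler):
    """Draw a class uniformly at random, then draw an instance from that class."""
    def __init__(self, labels, num_samples=None):
        self.by_class = defaultdict(list)
        for idx, y in enumerate(labels):
            self.by_class[y].append(idx)
        self.classes = list(self.by_class.keys())
        self.num_samples = num_samples or len(labels)

    def __iter__(self):
        for _ in range(self.num_samples):
            c = random.choice(self.classes)        # uniform over classes, not over instances
            yield random.choice(self.by_class[c])  # tail instances are re-sampled more often

    def __len__(self):
        return self.num_samples

# Phase A (random sampling):   DataLoader(train_set, batch_size=..., shuffle=True)
# Phase B (balanced sampling): DataLoader(train_set, batch_size=..., sampler=ClassBalancedSampler(train_labels))
```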

Comment
  5. Reply to #Q4:

Thank you for introducing this related work; we noticed it before and have conducted experiments using a similar approach. Initially, we anticipated that incorporating textual features as input would enhance our classifier's performance. However, in practical implementation, we observed that this method did not yield positive results for our specific task and, in fact, led to some performance degradation, potentially due to overfitting.

As illustrated in the table below, we introduced text features as input while keeping other structures unchanged to facilitate classifier learning, based on ViT-B/16 and the Places-LT dataset. The results indicate that our performance under the few-shot division is slightly worse than the outcome achieved when text prompts are included as part of the input. This observation suggests that utilizing text prompts as input may offer some assistance in learning tail classes, which aligns with the premise of [1].

| Methods | Many-Shot | Medium-Shot | Few-Shot | Overall |
| --- | --- | --- | --- | --- |
| OURS | 49.8 | 52.6 | 54.2 | 51.9 |
| OURS + Text Inputs | 49.3 | 51.4 | 54.4 | 51.3 |

However, it is noteworthy that while this approach may benefit the learning of tail classes, it exerts a significant negative impact on many-shot and medium-shot performance, ultimately resulting in an overall decrease in performance. Given that the primary objective of long-tailed vision recognition tasks is to achieve balanced performance across each class, these methods may not be directly applicable.

We posit two possible reasons for this discrepancy. Firstly, our method targets imbalanced image classification tasks rather than few-shot learning. Consequently, our dataset is much larger, containing more head class samples. In datasets with ample data, adding text prompts as part of the input may not yield additional performance gains. Secondly, the semantic information provided by the text prompt as input is relatively simple. In the context of large and imbalanced vision datasets, relying solely on text prompts as input may not be sufficient to support robust and balanced decision boundary learning. Consequently, we believe there is a need for improved alignment methods to achieve better discriminability and balance between classes.

[1] https://arxiv.org/abs/2301.06267

Comment

Dear Reviewer nBTe,

Thank you for your suggestion regarding the applicability of our method to other limited data scenarios in visual reasoning tasks. Following your advice, we integrated our method with Tip-Adapter[1] and PLOT[2] separately. The experimental results demonstrate that our method consistently contributes to improved performance for these few-shot learning approaches.

| Methods | 1 shot | 2 shots | 4 shots | 8 shots | 16 shots |
| --- | --- | --- | --- | --- | --- |
| Tip-Adapter-F | 89.33 | 89.74 | 90.56 | 91.44 | 92.86 |
| Tip-Adapter-F + OURS | 89.61 | 90.98 | 91.35 | 92.08 | 93.27 |
| PLOT | 89.83 | 90.67 | 90.80 | 91.54 | 92.24 |
| PLOT + OURS | 89.90 | 91.56 | 91.20 | 92.90 | 93.47 |

This indicates that our motivation and the proposed method are not only effective in enhancing biased visual features for better performance in imbalanced image classification, but are also applicable to few-shot learning tasks. This significantly broadens the universality of our approach. Detailed implementation processes and experimental records have been updated in Appendix I for reference.

Comment

Dear Reviewer nBTe,

Given the constrained timeframe for further discussion, we genuinely appreciate any feedback you may have on our revised submission. If there are additional questions or points of clarification needed, we are more than willing to promptly address them.

Again, we express our gratitude for your time, extensive efforts, and valuable insights. Thank you for the opportunity to engage in this discussion.

Best regards, The Authors

Comment

Thank you to the authors for your diligent efforts and feedback, which have resolved some of my queries. However, I still hold reservations regarding the following points:

  1. Same as what I mentioned in the weaknesses. As a paper aiming to solve the LTR problem, the proposed method does not appear to have a specific design tailored for handling long-tail data distributions.

  2. The authors have only compared the results of combining their method with [1], without comparing it with the results of using [1] in isolation. Additionally, the results indicate that the method in [1] only slightly decreases the performance for 'many' and 'medium' split sets while enhancing the 'few' set. However, the complexity of the method in [1] is significantly lower than that of the proposed method. Moreover, it employs a CLIP linear probe without the need to fine-tune the entire CLIP model, making it a more efficient and direct approach to leveraging textual features.

[1] https://arxiv.org/abs/2301.06267

  3. There is a substantial domain gap between textual and image features. Is it reasonable to use textual features as initial centroids to guide the clustering of image features? Specifically, while textual features possess strong discriminative power, the distance between textual features and image features (in the test set) remains considerably larger than that between image features in the training and test sets. Using textual features to guide the training-set image features might yield more discriminative features and decision boundaries. However, this approach could increase the distance between the original (training set) image features and the (test set) image features, potentially leading to overall poorer performance.
Comment

Dear Reviewer nBTe,

Thanks for your insightful question and informative feedback to help refine the paper.

  1. Reply to #W1

Allow us to elaborate on how our approach aligns with the challenges posed by long-tailed recognition (LTR), both in motivation, method design, and experimental outcomes.

(1) Motivation:

Our motivation is grounded in the persistent balance and separability observed in text feature distributions even after processing imbalanced datasets through CLIP encoders. Recognizing the inherent challenges of long-tailed data, we aim to harness this balance in text features to guide the learning of highly discriminative visual representations, thereby enhancing the performance of long-tailed vision recognition.

(2) Method Design:

In the design of our method, we initiate the classifier (prototypes) with balanced text features and employ a balanced data sampling strategy during model training. By utilizing optimal transport to match visual features with their prototypes, our approach aims at achieving a delicate balance between learning discriminative visual features and addressing the imbalances inherent in long-tailed datasets. The incorporation of a residual structure refines biased visual features, promoting their balance and separability in the long-tailed context.

(3) Experimental Insights:

The experimental outcomes corroborate the effectiveness of our method in the long-tailed recognition scenario, showcasing state-of-the-art performance. In response to the reviewer's inquiry about the applicability to limited data scenarios, our additional experiments demonstrate the adaptability of our method to few-shot learning tasks. While the versatility of our approach is evident, it is essential to underscore that the primary strength of our method lies in its substantial improvement in addressing the challenges posed by long-tailed datasets in visual recognition.

In summary, our method exhibits a nuanced approach tailored to the intricacies of long-tailed recognition. By leveraging the balance in text features, our method not only specifically excels in imbalanced vision recognition but also extends its effectiveness to the nuanced domain of few-shot learning, reinforcing its applicability across various challenges in visual recognition tasks.

  2. Reply to #W2

We appreciate the opportunity to provide further clarification and additional insights regarding the comparison with the referenced work [1], and we acknowledge the need for a more comprehensive and accurate depiction of our method's nuances.

Firstly, we would like to revisit the essence of [1]. The primary objective of [1] is to enhance CLIP's performance in few-shot learning tasks by incorporating paired visual and textual features as input. This method utilizes cross-modal information to train a robust classifier, ultimately improving CLIP's classification performance on few-shot tasks. The core of [1] lies in the introduction of textual information as input, a strategy we also explored in our experiments.
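As a rough illustration of that core idea (our reading of [1], not its official code), text embeddings of the class names can simply be appended to the image embeddings as extra labeled training samples when fitting a single linear classifier:

```python
import torch
import torch.nn.functional as F

def cross_modal_probe_loss(classifier, image_feats, image_labels, text_feats, text_labels):
    """Cross-modal linear probing sketch: one shared classifier is trained on
    both image embeddings and class-name text embeddings treated as samples."""
    feats = torch.cat([F.normalize(image_feats, dim=-1),
                       F.normalize(text_feats, dim=-1)], dim=0)
    labels = torch.cat([image_labels, text_labels], dim=0)
    logits = classifier(feats)                 # same linear head for both modalities
    return F.cross_entropy(logits, labels)
```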

In our previous response, we conducted experiments incorporating the idea of introducing text as input into our existing method, which involves full fine-tuning of the CLIP encoders, the vision adapter, and the prototype-guided matching loss. As reported there, these experiments revealed a marginal decrease in performance compared to our original method.

To clarify the differences further, we conducted additional experiments. We utilized the released code from [1] and evaluated its performance on the long-tailed recognition (LTR) task. Without fine-tuning the CLIP encoders and without employing any of our proposed losses, [1] achieves an overall accuracy of only 42.9% on Places-LT with ViT-B/16. This result underscores that while [1] is effective and insightful for few-shot learning, it is not inherently tailored to the challenges posed by LTR tasks, and it does not significantly improve classification accuracy in this setting.

Furthermore, when fine-tuning CLIP encoders and exclusively using cross-modal input for training the classifier, [1] achieved an overall performance of 48.20%. This outcome, although indicative of the effectiveness of [1] in few-shot learning, falls short in comparison to the performance achieved by our method in LTR tasks.

In summary, while [1] offers a valuable approach for few-shot learning, its adaptability to LTR tasks is limited, as evidenced by the comparative experiments. Our method, designed specifically for LTR tasks, demonstrates superior performance and effectiveness in addressing the challenges inherent in long-tailed datasets.

| Methods | Overall |
| --- | --- |
| Linear Probing + [1] | 42.89 |
| Fully Fine-tuned + [1] | 48.20 |
| OURS | 51.90 |
Comment

We refrain from directly manipulating or altering the data. Instead, our method is crafted with the objective of optimizing the adapter to refine biased visual features, enhancing their balance and discriminative qualities. Subsequently, these refined visual features undergo classification by the classifier. The adapter and classifier undergo continuous updates to adapt to the current input. In essence, our approach initially optimizes the model to acquire improved visual representations. During testing, the trained model fully processes the input data, ensuring that the network maps the test data to the same distribution as the training data, a guarantee established through model optimization.

During Phase B, we implement a balanced sampling strategy to rectify the initial imbalance in the training dataset. This ensures that both the adapter and classifier undergo training in a balanced setting. Our introduced method is then incorporated on top of this balanced training setting. In testing, the dataset itself is balanced, preserving the equilibrium maintained during training and minimizing the gap between training and testing. It's crucial to note that our classification relies on the classifier rather than utilizing the distance between visual features and text features. In addition, the observed performance enhancements across three benchmark datasets underscore the effectiveness of our method. Further ablation studies into the initialization of the classifier are detailed in Appendix E.1, where the experiments show the textual features as the initialization significantly benefits the performance.

Review
Rating: 3

This study introduces a framework designed to enhance CLIP in addressing long-tailed visual recognition challenges. This framework integrates a supervised contrastive loss mechanism, grounded on the transport plan, to fortify visual feature extraction. Several evaluations conducted on benchmarks corroborate that this proposed method significantly facilitates discriminative visual feature learning and achieves SOTA performance in long-tailed recognition tasks.

Strengths

  1. The idea of this paper is clear and easy to follow.
  2. Experimental results show the effectiveness of the proposed method.

Weaknesses

  1. Lack of innovation. The approach in this paper provides a more balanced prototype for visual pre-trained models to guide the learning of the visual feature extractor, and designs a supervised contrastive learning loss. However, a similar approach has appeared in previous long-tail methods [1]. The differences in this paper are: (1) a more robust pre-trained model, CLIP, is utilized; (2) a text-based prototype design is used to replace the target anchor. These innovations are rather limited.

  2. Some modules are without good motivation. For example:

  • Why do we need a learnable linear classifier? The purpose of its existence seems to be the matching of visual features with textual features. However, the weights of the classifier will change during training, which does not narrow the gap between template label text and image feature distributions.
  • In the unsupervised prototype-guided feature learning part, why choose cosine similarity as the distance metric instead of other metrics such as minimum entropy?
  • For the modules of the learnable classifier, unsupervised prototype-guided feature learning, and supervised contrastive loss, these seem to be incremental improvements rather than complementary components, so why use them to train the model together?

Reference [1] Li T, Cao P, Yuan Y, et al. Targeted supervised contrastive learning for long-tailed recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 6918-6928.

Questions

Please refer to Weaknesses.

Comment

Thanks for your kind suggestion and for providing related work to help refine our work.

  1. Reply to #W1.

The referenced paper introduces a targeted supervised contrastive learning loss (TSC) aimed at aligning class features with target features on the vertices of a regular simplex. We identify four key distinctions between our approach and TSC:

(1) Learning Targets and Complexity:

TSC employs SGD to learn targets, posing challenges in finding the optimal solution on a hypersphere [1,2]. This method imposes a demanding requirement that is difficult to fulfill perfectly. In contrast, our approach is more straightforward, avoiding complex assumptions and calculations to obtain balanced prototypes. Our prototypes, derived from textual features, carry rich semantic information, enhancing the discriminative power of image features. Unlike TSC, which constructs targets without class semantics, we leverage a more abstract and balanced text modality. Therefore, our methods differ fundamentally, despite sharing a general motivation. Additionally, TSC requires prior knowledge of the number of classes, whereas our method is more flexible.

(2) Implementation Complexity:

TSC faces the challenge of assigning class indices to each unlabeled anchor point, adding complexity to the method. In contrast, our approach constructs prototypes in a more natural and intuitive manner.

(3) Incorporation of Components:

TSC relies solely on supervised contrastive learning loss, while our method incorporates various components. TSC updates the model by minimizing the distance between visual features and assigned targets. In contrast, our method transforms balanced textual prototypes into a linear classifier, employing optimal transport for unsupervised alignment of visual features and prototypes. Furthermore, we introduce a supervised contrastive learning loss based on the transport matrix to enhance supervision signals and intra-class information. Therefore, our contribution extends beyond a supervised contrastive learning loss. From the ablation study, our method can still perform best even without the proposed supervised contrastive loss.

(4) Modality and Multimodality:

TSC operates as an image-only method, while our approach is multimodal. To assess TSC's impact, we integrated it into our framework and conducted a performance comparison. Denoting our proposed three losses as $L_{CE}$, $L_{OT}$ and $L_{SCT}$, we compare the performance when replacing $L_{SCT}$ with $L_{TSC}$, based on ViT-B/16 and the Places-LT dataset. As shown in the table below, using textual prototypes to initialize anchor points and applying TSC in Phase B led to a significant performance drop, demonstrating that our method is not a mere substitution of TSC's anchor points with textual features.

| Methods | Many-Shot | Medium-Shot | Few-Shot | Overall |
| --- | --- | --- | --- | --- |
| OURS | 49.8 | 52.6 | 54.2 | 51.9 |
| $L_{CE}$ + $L_{TSC}$ | 48.9 | 50.8 | 50.5 | 50.1 |
| $L_{CE}$ + $L_{OT}$ + $L_{TSC}$ | 48.9 | 51.5 | 53.8 | 50.5 |

In summary, these four aspects highlight the substantial differences between our method and TSC, reinforcing the unique contributions of our approach.

[1] https://arxiv.org/pdf/physics/0609231.pdf
[2] https://arxiv.org/abs/cond-mat/0506616

Comment
  2. Reply to #W2.

The linear classifier plays a pivotal role in this context, serving as a crucial bridge for maintaining coherence between visual and textual features. This is necessary because, in the second phase, an additional residual module is introduced to refine biased visual features. The linear classifier then serves a dual purpose: it performs the final classification, and it guides the refinement of biased visual features using prototypes initialized from balanced textual features.

When we use textual features to initialize the classifier, our classification process no longer depends on the text prompt or the text encoder. In learning the linear classifier, we initialize it with textual features and optimize it using one-hot labels and the cross-entropy loss. This means that the decision boundary constantly adapts to the input mini-batch. As training progresses, the text-initialized classification boundary and the visual features become increasingly aligned, so the task can be understood as reducing the gap between textual and visual features. Finally, the parameters of the classifier are used for prototype-guided matching to enhance the visual features.
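For clarity, a minimal sketch (our own, with assumed shapes) of how a learnable linear classifier can be initialized from the balanced textual prototypes and then trained with cross-entropy on the refined visual features:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_text_initialized_classifier(text_prototypes):
    """Linear classifier whose weights start from the balanced textual prototypes
    (shape: num_classes x dim); the weights remain learnable afterwards."""
    num_classes, dim = text_prototypes.shape
    classifier = nn.Linear(dim, num_classes, bias=False)
    with torch.no_grad():
        classifier.weight.copy_(F.normalize(text_prototypes, dim=-1))
    return classifier

# Usage sketch for one step of classifier learning:
#   classifier = make_text_initialized_classifier(prototypes)
#   logits = classifier(refined_visual_feats)
#   loss_ce = F.cross_entropy(logits, labels)   # one-hot labels via cross-entropy
```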

In instances where the linear classifier is not employed, our model structure aligns with BALLAD. The results of the ablation study in Table 4 reveal that without the use of a trainable linear classifier, our approach outperforms BALLAD on ImageNet-LT and Places-LT by 1.1% and 0.3%, respectively. Nevertheless, the incorporation of the classifier, accompanied by continuous refinement of the classification boundary to strengthen the coherence between textual and visual features, leads to further performance enhancements.

Furthermore, we supplement our findings with experimental results for the setting where the classifier is frozen and only the residual adapter is optimized to improve the visual features. As shown in the table below, freezing the classifier has a noticeable impact, resulting in a 1.5% decrease. Therefore, a learnable classifier helps to continually and adaptively maintain consistency between the textual features (prototypes) and the input visual features, from the perspective of both motivation and experimental results.

| Methods | Many-Shot | Medium-Shot | Few-Shot | Overall |
| --- | --- | --- | --- | --- |
| Learnable classifier | 49.8 | 52.6 | 54.2 | 51.9 |
| Frozen classifier | 49.0 | 51.4 | 50.8 | 50.4 |

Besides, we have added a visualization in Appendix E.3 to demonstrate how the learnable linear classifier benefits our method and maintains the consistency between visual features and the corresponding textual prototypes (the classifier weights).

  3. Reply to #W3.

A cost matrix represents the cost function in a discrete optimal transport problem, where goods or resources are distributed among a finite number of sources and destinations. The matrix has a row for each source and a column for each destination, and each entry denotes the cost of transporting one unit from the corresponding source to the respective destination. In practice, various measures of the distance between two feature vectors can be used to build the cost matrix, including cosine similarity, Euclidean distance, or a matrix obtained through learning (the inverse OT problem).

It is worth noting that minimum entropy might be unsuitable as a measure for the cost matrix in this context. For our purposes, we opt for cosine similarity due to its generality and superior overall performance compared to other distance metrics. The comparison between cosine similarity and Euclidean distance as the cost matrix, based on Places-LT and ViT-B/16, is presented in the following table:

| Methods | Many-Shot | Medium-Shot | Few-Shot | Overall |
| --- | --- | --- | --- | --- |
| Cosine Similarity | 49.8 | 52.6 | 54.2 | 51.9 |
| Euclidean Distance | 49.0 | 52.1 | 54.5 | 51.7 |
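As a small companion to the table, either metric can be plugged in as the transport cost before running the Sinkhorn iterations sketched earlier (again illustrative code, not the released implementation):

```python
import torch
import torch.nn.functional as F

def build_cost_matrix(visual_feats, prototypes, metric="cosine"):
    """Interchangeable cost matrices for the OT problem; cosine is the default here."""
    if metric == "cosine":
        v = F.normalize(visual_feats, dim=-1)
        p = F.normalize(prototypes, dim=-1)
        return 1.0 - v @ p.t()                              # low cost when feature and prototype align
    return torch.cdist(visual_feats, prototypes, p=2)       # Euclidean alternative
```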
Comment

Reply to #W4.

Our motivation stems from the observation that the output quality of CLIP's textual encoder remains robust even in the presence of imbalanced datasets. We aim to leverage this resilience to guide the learning and enhancement of discriminative visual features, addressing CLIP's performance on imbalanced image classification tasks. In our practical implementation, after fine-tuning CLIP, we freeze the encoders and introduce an additional residual structure with a linear layer. This augmentation aims to refine biased visual features, making them as balanced and separable as their textual counterparts through the application of balanced sampling.
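One common form of such a residual adapter is sketched below (our own illustration; the exact layer sizes, activation, and mixing weight `alpha` are assumptions, not the paper's reported configuration):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualAdapter(nn.Module):
    """Linear adapter with a residual connection over frozen CLIP visual features."""
    def __init__(self, dim, alpha=0.2):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.alpha = alpha                      # residual mixing weight (assumed value)

    def forward(self, visual_feats):
        refined = F.relu(self.fc(visual_feats))
        out = self.alpha * refined + (1.0 - self.alpha) * visual_feats  # residual mix
        return F.normalize(out, dim=-1)
```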

Throughout the implementation process, we utilize textual features to initialize a learnable classifier. Using cross-entropy loss, we optimize the classifier with refined visual features as input. The objective here is to ensure that the refined visual features maintain consistency with their corresponding textual features, ultimately contributing to the final classification. Simultaneously, we treat the classifier parameters as prototypes for visual features. In each input mini-batch, we employ optimal transport to align the two distributions in an unsupervised manner. This approach serves to avoid introducing additional prototype parameters while ensuring that the classifier parameters (prototypes) adapt closely to the current input data during the learning process. This adaptation aims to achieve superior performance and a more compact classification surface. Besides, experiment results also show the proposed prototype-guided matching loss helps the training convergence.

Considering that the optimal transport matching process is instance-level and unsupervised, we then incorporate a supervised contrastive learning loss based on the transport plan. This step introduces more class-level information and supervision signals into the learning process.
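One plausible way to inject the transport plan into a supervised contrastive loss is sketched below; this is purely illustrative and not necessarily the exact form of the paper's $L_{SCT}$ — positives are same-class pairs, and their contribution is weighted by how much plan mass the two samples share across prototypes.

```python
import torch
import torch.nn.functional as F

def plan_guided_supcon_loss(visual_feats, labels, plan, temperature=0.1):
    """Supervised contrastive loss with positive pairs weighted by the OT plan (illustrative)."""
    z = F.normalize(visual_feats, dim=-1)
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = (z @ z.t() / temperature).masked_fill(self_mask, -1e9)   # drop self-comparisons
    same_class = labels[:, None].eq(labels[None, :]).float().masked_fill(self_mask, 0.0)
    weight = plan / plan.sum(dim=1, keepdim=True)                  # row-normalized plan mass
    pos_weight = same_class * (weight @ weight.t())                # class-level, plan-weighted positives
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    denom = pos_weight.sum(dim=1).clamp_min(1e-8)
    return -((pos_weight * log_prob).sum(dim=1) / denom).mean()
```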

In summary, the learnable classifier, prototype-guided refinement of biased visual features, and the supervised contrastive loss are designed to be progressive and complementary components. The learnable classifier focuses on classification; prototype-guided feature learning closely connects the optimization of the classifier parameters (prototypes) with the input data, ensuring adaptability; and the supervised contrastive loss introduces class information and supervision signals to compensate for the purely unsupervised information used in prototype-guided feature learning.

Comment

In addition, it is important to align the representations learned by the model with the distribution of the test data. How do you ensure this while aligning the long-tailed data features with the textual features?

Comment

Dear Reviewer pMKo,

Thanks for your insightful question and informative feedback to help refine the paper.

We refrain from directly manipulating or altering the data. Instead, our method is crafted with the objective of optimizing the adapter to refine biased visual features, enhancing their balance and discriminative qualities. Subsequently, these refined visual features undergo classification by the classifier. The adapter and classifier undergo continuous updates to adapt to the current input. In essence, our approach initially optimizes the model to acquire improved visual representations. During testing, the trained model fully processes the input data, ensuring that the network maps the test data to the same distribution as the training data, a guarantee established through model optimization.

During Phase B, we implement a balanced sampling strategy to rectify the initial imbalance in the training dataset. This ensures that both the adapter and classifier undergo training in a balanced setting. Our introduced method is then incorporated on top of this balanced training setting. In testing, the dataset itself is balanced, preserving the equilibrium maintained during training and minimizing the gap between training and testing. It's crucial to note that our classification relies on the classifier rather than utilizing the distance between visual features and text features. In addition, the observed performance enhancements across three benchmark datasets underscore the effectiveness of our method. Further ablation studies into the initialization of the classifier are detailed in Appendix E.1, where the experiments show the textual features as the initialization significantly benefits the performance.

Comment

For #W2, from the above explanation, there is no convincing reason why using text initialization would achieve your goal. It is mentioned that the method "initializes with textual features and uses one-hot labels and cross-entropy loss to optimize it." This suggests that the method uses long-tailed data as input and employs a CE loss as the optimization objective. This objective may not be appropriate for addressing the mismatch between the refined visual features and the textual features. Can you provide a more plausible explanation?

Comment

Dear Reviewer pMKo,

In this clarification and additional information, we sincerely apologize for any lack of comprehensiveness and accuracy in our prior response. As articulated in our motivation, we observed balanced textual features, prompting our desire to utilize them for fostering more discriminative visual feature learning. The goal was to achieve feature disentanglement, particularly for the tail classes within the imbalanced dataset, ultimately leading to improved classification results.

During Phase B, detailed in Section 4.2 of the paper, we initially implemented a balanced sampling strategy to guarantee balanced input data for model training. The application of a vision adapter aimed to refine biased visual features derived from the CLIP vision encoder, with the intention of obtaining balanced refined visual features through this module. Subsequently, we initialized a learnable linear classifier with balanced textual features. The design incorporated a prototype-guided loss based on optimal transport (OT) matching, facilitating the learning of the adapter and classifier and ensuring continuous balance maintenance. Additionally, we introduced a supervised contrastive loss to further enhance this learning process.

Moreover, as depicted in Equation (9) of the paper, we employ $L_{CE}$, $L_{OT}$ and $L_{SCT}$ to optimize the adapter and classifier. This comprehensive approach aims to yield more discriminative refined visual features and a more robust classifier, ultimately contributing to effective classification. This stands in contrast to a simplistic use of CE alone for updating the model and classifier.
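Putting the sketched pieces together, a single Phase-B optimization step in the spirit of Equation (9) could look like the following; it reuses the illustrative helpers from the earlier sketches (`build_cost_matrix`, `sinkhorn_plan`, `plan_guided_supcon_loss`), and the loss weights are placeholders rather than the paper's tuned values.

```python
import torch.nn.functional as F

def phase_b_step(adapter, classifier, frozen_visual_feats, labels,
                 lambda_ot=1.0, lambda_sct=1.0):
    """One illustrative Phase-B step over the adapter and text-initialized classifier."""
    refined = adapter(frozen_visual_feats)                 # refine biased visual features
    prototypes = F.normalize(classifier.weight, dim=-1)    # classifier weights act as prototypes
    loss_ce = F.cross_entropy(classifier(refined), labels)

    cost = build_cost_matrix(refined, prototypes)          # cosine transport cost (assumed helper)
    plan = sinkhorn_plan(cost.detach())                    # transport plan (assumed helper)
    loss_ot = (plan * cost).sum()                          # prototype-guided matching term

    loss_sct = plan_guided_supcon_loss(refined, labels, plan)
    return loss_ce + lambda_ot * loss_ot + lambda_sct * loss_sct
```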

In summary, our approach in Phase B received balanced input data by a balanced sampling strategy, departing from the original imbalanced training set. The deployment of the adapter sought to enhance biased visual features, with the goal of achieving balance through optimization. The initialization of a balanced classifier and the design of our proposed losses further ensured and reinforced this balance, in order to avoid the classifier being biased by the original imbalanced dataset.

Comment

Dear Reviewer pMKo,

Given the constrained timeframe for further discussion, we genuinely appreciate any feedback you may have on our revised submission. If there are additional questions or points of clarification needed, we are more than willing to promptly address them.

Again, we express our gratitude for your time, extensive efforts, and valuable insights. Thank you for the opportunity to engage in this discussion.

Best regards, The Authors

Comment

We thank the reviewers for their constructive comments, which are valuable for revising the paper. The typos have been fixed, and the relevant missing references and baselines have been cited, discussed, and compared in the revised version. In response to the common concerns and specific reviewer comments, we would like to emphasize the contribution and novelty of our work in addressing the challenges of imbalanced vision recognition based on CLIP.

Contribution and novelty. Please let us emphasize our method's contribution and novelty. Our paper studies how to better leverage vision-language knowledge for robust imbalanced classification based on CLIP.

1. Leveraging Multi-Modality Knowledge for Imbalanced Vision Recognition:

Our primary motivation stems from the robustness and balance observed in CLIP's textual encoder even in the presence of imbalanced datasets. We introduce a unique approach to leverage multi-modality knowledge, specifically aiming to guide the learning of discriminative visual features by capitalizing on the balance inherent in textual features.

2. Disentangled Visual Representation Learning Guided by Textual Prototypes:

We propose a series of modules designed to facilitate disentangled visual representation learning. We first fine-tune the CLIP encoders and then fix them. We treat the classifier parameters as prototypes for the visual features, and for each input mini-batch we formulate the disentanglement of visual features as an optimal transport problem: the input visual features are mapped close to their corresponding prototypes (classifier weights) by minimizing the OT distance in an unsupervised manner. Finally, as a supplement, we design a supervised contrastive learning loss to enhance the prototype-guided discriminative visual feature learning. This emphasis on leveraging textual prototypes distinguishes our method from existing approaches.

3. Flexible Framework with Other Components for Performance Improvement:

Our framework exhibits flexibility and adaptability, allowing for the integration of image-only methods designed for imbalanced classification. This includes the incorporation of reweighting loss, decoupling methods, and test-time aggregation. This flexibility enhances the potential for further performance improvement, making our method a versatile solution for imbalanced vision recognition.

In summary, our method goes beyond the general idea of learning balanced visual features by introducing multi-modality knowledge, disentangled visual representation learning guided by textual prototypes, and a new simple but flexible framework. Empirical results demonstrate state-of-the-art performance and ablation studies confirm the effectiveness of our proposed modules.

We hope that these clarifications address your concerns and provide a more comprehensive understanding of the contributions and novelty of our work.

Comment

Connection with contrastive learning.

We sincerely appreciate the thoughtful reviews and the valuable insights provided by each of you. We have carefully considered your comments regarding the connections and differences between our method and previous works based on contrastive learning for imbalanced image classification. We acknowledge the similarities but would like to emphasize the essential differences that set our approach apart.

(1) Distinguishing from Contrastive Learning:

Our method is not solely based on contrastive learning. While we use CLIP and propose the use of a supervised contrastive learning loss function, leveraging the optimal transport matrix to enhance model learning and improve visual features, this aspect represents only a supplementary component within our broader method. Furthermore, our experiments demonstrate that even without the proposed contrastive supervision loss function, our method achieves state-of-the-art performance.

(2) Optimal Transport Matching vs. Contrastive Learning:

When addressing the challenge of improving biased visual features to achieve better balance and separability, we uniquely formulate this problem as an optimal transport matching problem. This approach is fundamentally different from traditional contrastive learning. Our objective is to minimize the optimal transport distance, ensuring a closer alignment between visual features of the same class and their prototypes while creating a more significant separation between different classes.

(3) Leveraging Complementarity of Multi-Modal Information:

In the context of imbalanced image classification, effectively leveraging the complementarity between multi-modal information for robust visual feature learning remains an open challenge. Our experiments in rebuttal reveal that naively employing text features as traditional positive samples or anchor points in contrastive learning methods does not directly contribute to effective model learning. Therefore, we find it imperative to design and implement more nuanced methods to address imbalanced image classification under multi-modal conditions.

We believe these distinctions underscore the uniqueness of our method and its contributions to addressing the complexities of imbalanced image classification. We appreciate your thorough evaluations and welcome any further suggestions or inquiries.

AC Meta-Review

Summary

Based on the finding that the text embeddings are more balanced than the image embeddings in the CLIP model, the authors propose to leverage the text embeddings to refine the image encoder, making it more robust in the long-tailed recognition setting. Specifically, a supervised contrastive loss guided by the optimal transport plan is proposed for training the model.

Strengths

  • Extensive experiments on long-tail benchmarks; many ablations and visualizations are provided.
  • High-level motivation of the approach is clear and easy to follow

Weaknesses

  • Limited novelty: using textual initialization for image guidance is not new.
  • The proposed technique in this submission does not obtain significant improvement along this direction.
  • Two-stage training poses significant computation resource requirement; however, the marginal improvement in final accuracy seems not worth it.
  • Detailed design choices are not well explained/justified. Specifically, 1) a theoretical guarantee or other justification for the text initialization of the linear layer (beyond empirical effectiveness) is missing, and 2) the relationship between $L_{CE}$ with text-initialized linear weights and the supervised contrastive loss $L_{SCT}$ is not clear. (by reviewer pMKo)
  • Contradicting results are found after careful inspection of the experiment log. “For example, in ./code/log/BEST_ImageNet-LT_RN50.csv, the initial test accuracies for each part are test_many: 0.645, test_median: 0.673, test_low: 0.740, and the final results have accuracies of test_many: 0.719, test_median: 0.708, and test_low: 0.626, which verifies that the method does not effectively balance the accuracies of each category during the training process but instead reduces the tail category accuracies” by pMKo

AC’s additional comment

Other than the above weaknesses raised by the reviewer, I find the basic assumption of leveraging balanced text embeddings questionable, at least in the current form. In Fig.1, the authors first plotted the t-SNE of test images and found that the features are badly clustered, and then went back to design the training algorithm. This seems problematic to me, since the authors directly observed the test data, and deliberately pass the inferred knowledge from test data to the training algorithm (although in an implicit way). In my opinion, all decisions on the design choices should be based on dev(val) split and perhaps train split, and test data should be kept almost as a black box and evaluated as few times as possible. The test data’s only purpose is to provide a fair and reproducible way to quantitatively compare with previous work, not for making design decisions. After a quick scan of this submission, I did not find any information on dev(val) split. Therefore, I had to assume test set leakage, thus remaining cautious on the true effectiveness / generalization of the proposed techniques.

Why not a higher score

This submission received both negative and (slightly) positive reviews. After carefully reading the reviews, I am convinced by the negative review that this submission still has significant issues to be addressed. Therefore, my recommendation is reject. For the authors to improve this work, please carefully think about my comment in the meta-review: is the proposed method truly tackling the challenges in long-tail recognition? It seems to me that the results presented could be an artifact of test-set leakage.

An additional question I have is: Is balanced embedding all we need? If so, it might not be important that the embeddings to initialize the linear weight come from CLIP text encoder, and perhaps any well-disentangled initialization could work. I hope the above question can inspire some ideas for improving this work.

Why not a lower score

N/A

Final Decision

Reject