GroupCoOp: Group-robust Fine-tuning via Group Prompt Learning
摘要
评审与讨论
In this paper, the authors provide a pseudo labeling strategy for debiased training. Specifically, the proposed method utilizes decoupled CoOp prompts to initialize the group embeddings for classification, and then proposes a heuristic group-level pseudo labeling technique via pretrained VLMs. The proposed method achieves the state-of-the-art performance on various debias datasets.
优点
-
The proposed method is simple yet effective. Without group-level ground-truth, GroupCoOp can effectively mitigate the bias in training data within each class and obtain better results.
-
The writing is polished and easy to understand.
缺点
My main concern is the ablation study with different pseudo labeling strategies (Table 8). Specifically, the quality of pseudo labels from k-means should be better than other strategies intuitively, since it can accurately localize different groups from a specific class. Nevertheless the experimental result is opposite. However, the explanation in line 517 cannot convince me, since class imbalance can be adjusted by sampling or logit adjustment technique. I recommend the authors to provide more empirical analysis (e.g., extend Fig. 4 by comparison between group result from GroupCoOp and that from k-means, or classical failure cases) to prove the effectiveness of GroupCoOp pseudo labeling strategy. Moreover, theoretical proof of why GroupCoOp is better than k-means is also preferred (optional).
问题
I'm also curious about the results in Table 6 (c). According to Sec. 4, you optimize a learnable prompt for each group. In theory, the learnable prompt should be equal with a linear vector regarding the same group. Nevertheless, the experimental result is opposite. I recommend the authors to provide a more detailed analysis for this phenomenon.
This article presents a novel GroupCoOp method for alleviating the negative effect of spurious correlation between objects and backgrounds in VLM classification. GroupCoOp adopts a parameter-efficient learning paradigm, where group-wise learnable text prompts are trained for group-wise class classification. Experiment results on 5 different spurious correlation benchmarks demonstrate that GroupCoOp achieves state-of-the-art results compared to prior arts. Rich analysis experiments demonstrate the effectiveness of leveraging rich text features in GroupCoOp.
优点
- Novelty: Although leveraging PEFT for group robustness is not proposed for the first time [1], GroupCoOp proves the effectiveness of leveraging the rich text features in VLMs for group robustness. Analysis experiments proved that GroupCoOp is especially effective under label-scarce scenarios, and is not an ensemble of multiple CoOp models.
- Strong experiment results: On a series of group robustness benchmarks, GroupCoOp consistently outperforms existing works by a large margin with a highly limited training parameter budget.
[1] Contrastive adapters for foundation model group robustness
缺点
- Insights toward learning a group-wise prompt collection: Learning a group-wise prompt collection is proved successful in GroupCoOp. The authors attribute this to the rich text features in the VLM. In my opinion, the correlation between the proposed method and the underlying motivation appears tenuous, lacking a clear alignment that would strengthen their connection.
问题
Things that WILL affect my overall rating towards this article:
- For Weakness 1: The author could consider adding visualization experiments to demonstrate the correlation between text prompt tuning and group robustness, such as visualization of text embeddings and corresponding embedding of the group labels, or analysis of the learned text prompt tokens (i.e., whether it actually resembles real text tokens).
Things that WILL NOT affect my overall rating towards this article:
- I wonder whether group-wise Adapter/LoRA (i.e., use group label as the text input and train a set of Adapter/LoRA on the text encoder for each group) leads to similar results as GroupCoOp, since Adapter/LoRA are proved to be more effective than prompt tuning in transformer-based models.
- In the field of CLIP prompt tuning, more effective methods such as MaPLe [1] are available, and should be considered as competing methods in the experiments.
[1] MaPLe: Multi-modal Prompt Learning
This paper combines the strategies of group prompts and pseudo-label generation to achieve debiased optimization, speciafically without explicit group labels, thereby improving group robustness. The proposed method significantly reduces computational requirements by fine-tuning a very small number of parameters (only 0.016%), but shows results close to or even exceeding those of the full model fine-tuning method on multiple datasets, showing extremely high parameter efficiency. The experimental results verify the robustness performance on minority groups on multiple datasets, which is better than some existing debias-based algorithms, showing its effectiveness in solving group imbalance and bias problems.
优点
This paper demonstrates a relatively solid body of work, reflecting the author's deep understanding of the research topic. The study presents innovative approaches that are effective. The experimental results validate the effectiveness of these methods, with data analysis and comparisons illustrating performance improvements, offering valuable insights for the community. The well-structured presentation and thorough explanation of the research process help readers grasp the contributions made. However, I still have some major concerns and please see the below comments.
缺点
- The paper fails to fully explore the motivation, especially why existing VLMs are susceptible to group bias during the fine-tuning stage. Although the paper mentions that existing methods have bias problems, there is no specific analysis of the causes of the bias, nor is there a discussion of the differences in performance across tasks. In addition, the difference between this paper's method and previous methods is not clearly stated, and there is a lack of theoretical or practical evidence to support the proposed debiasing strategy.
- The writing of the method section is quite inadequate. The paper should ensure that all the audience can clearly understand its content, but the questions raised in this paper require readers to consult a large number of references to understand, resulting in unclear expression. It is recommended to provide more detailed explanations of key concepts and methods to improve readability.
- The subscript of Fig. 1 seems inconsistent with the meaning expressed in lines 186-187 and 215. It is recommended to check the chart and text description to ensure consistency and accuracy to avoid confusion for readers.
- The experimental section lacks qualitative analysis of the results, and does not provide visualization or in-depth explanation of the specific role of group prompts and pseudo-label generation in reducing bias. Specifically, it is unclear where the proposed text prompt plays a role. Although the visualization results of the category classification are provided in the supplementary materials, this is not enough to fully demonstrate the effectiveness of the proposed method. In addition, the results may have errors in parameter settings or code implementation. It is recommended to further verify the reliability of the experimental results.
- The proposed method is relatively naive, although it seems effective, while the author did not show the good analysis of the problems and results. As far as I know, leveraging the extra learnable components is very conventional and also popular direction[1,2,3]. So the author could refer to more top-tier papers to learn how to well present and support a good idea. For example, considering the group problems, I would suggest you to mathematically visualise some key points to show the problem and analysis, which could enhance the contributions and significance of the method.
[1] Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J. and Zhang, L., 2023. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499. [2] Qiu, H., Xia, M., Zhang, Y., He, Y., Wang, X., Shan, Y. and Liu, Z., 2023. Freenoise: Tuning-free longer video diffusion via noise rescheduling. arXiv preprint arXiv:2310.15169. [3] Xiao, A., Xuan, W., Qi, H., Xing, Y., Ren, R., Zhang, X., Shao, L. and Lu, S., CAT-SAM: Conditional Tuning for Few-Shot Adaptation of Segment Anything Model.
问题
see above
This paper proposes Group Context Optimization (GroupCoOp) which is a debiased parameter-efficient fine-tuning (PEFT) algorithm tackling the lack of group robustness of vision-language models (VLMs) on downstream datasets. Group robustness refers to the ability of a VLM not being deluded by spurious correlation between true class and irrelevant data feature arising from data distribution bias.
The main contributions claimed by this work includes
- Proposal of GroupCoOp, which is an innovative PEFT algorithm that does not require ground-truth group labels and retraining of the entire parameters of a VLM.
- Proposal of pseudo group label generation scheme to capture data distribution bias.
- Experimental results showing SOTA performance of GroupCoOp in five benchmarks.
- Proposal of effective debiased fine-tuning even with an extremely limited number of training samples.
优点
- The technical quality of this paper was on par with previous accepted papers. Explanations for related works or proposed method were accurate and proven to be effective.
缺点
-
This paper addresses group robustness, a specific instance of the broader issue of distribution shifts in downstream tasks. Numerous existing studies tackle distribution shift or domain gap challenges for VLMs, so it would be valuable to clarify how this work stands out.
-
GroupCoOp leverages CoOp's prompt-tuning framework but extends it to incorporate group-specific prompts aimed at subgroup-level classification. However, while this approach is indeed unique in how it applies prompt-based tuning to groups, its novelty could be perceived as limited without specific comparisons. It would be beneficial to discuss aspects of GroupCoOp that differentiate it from existing subgroup classification strategies within prompt learning frameworks, such as CoOp-CSC or even ERM-based methods that employ subgroup recognition. Including these comparisons would help clarify the novel contributions of the method, particularly how GroupCoOp advances beyond traditional class-based prompts.
-
The structure of the 'Proposed Method' section could be enhanced by providing a clearer breakdown of the GroupCoOp pipeline, specifically the inference process. For instance, it would be useful to clarify how group-specific prompts are generated and matched against test images, particularly highlighting steps such as calculating cosine similarity with group vectors and selecting the final prediction.
-
Overall, GroupCoOp doesn’t seem to introduce a fundamentally new idea to address major limitations of VLMs.
问题
-
Consolidation of certain sections, such as the discussions on parameter efficiency and group robustness, would improve the readability. For instance, merging the explanation of GroupCoOp’s parameter efficiency in both the introduction and experiments could streamline the paper. Additionally, it would be beneficial to ensure consistency regarding the benchmarks used. Specifically, clarifying whether four or five datasets were used in the final evaluation would address any ambiguity and make the experimental setup clearer.
-
GroupCoOp currently distinguishes only two groups per class. Could it be extended to capture finer intra-class variations by further subdividing each class?
-
The benchmarks used in the experiments involve two classes, resulting in four possible combinations of class and spurious attribute. How would GroupCoOp's performance vary when tested with a greater number of classes and attributes?
We would like to express our gratitude to the reviewers for their insightful feedback. We have decided to withdraw our submission to focus on improving our method, particularly in the pseudo group labeling strategy. Thank you again for your thoughtful comments.