Rethinking Misalignment in Vision-Language Model Adaptation from a Causal Perspective
Guided by the theory of causation, we propose semantic decoupling and uncertainty modeling to conduct prompt tuning on CLIP for downstream tasks.
Abstract
Reviews and Discussion
This paper investigates two different misalignment issues between CLIP and downstream tasks, i.e., task misalignment and data misalignment. The authors design several experiments demonstrating that overfitting occurs when tuning with learnable prompts. They propose the Causality-Guided Semantic Decoupling and Classification (CDC) method to mitigate the impact of task-irrelevant generative factors on downstream tasks. Extensive experiments demonstrate that the proposed CDC method is effective.
Strengths
This paper investigates the difficulty of adapting CLIP to downstream tasks through the lens of two-level misalignment, which is an insightful idea that helps the community understand the working mechanism. The authors then provide comprehensive experiments revealing how overfitting occurs and how it impacts the recognition of new classes. They adopt a causal-inference perspective to alleviate data misalignment and propose CDC, implemented via front-door adjustment, which predicts with explicit evidence.
Weaknesses
- There is no clear evidence showing in which cases CDC improves the prediction, i.e., which categories were previously misclassified and are now correctly classified with CDC. It would be better to present several comparative failure cases to demonstrate at which level CDC solves the problem and which cases still require more advanced methods.
- It is interesting to ask how the misalignment between CLIP and downstream tasks changes as the model's capability increases, e.g., with ViT-Large, and whether a more powerful model can solve the misalignment problem. Hence, it would be better if some comparisons across models were added.
- As we all know, MLLMs are already sweeping through the multimodal community. It would be better to expand the discussion on misalignment in this paradigm and on whether the current method generalizes easily to it.
- How many parameters are tuned during downstream adaptation? Compared to the pre-trained model, what is the ratio of tuned parameters?
Questions
see weaknesses
Limitations
Yes
We thank Reviewer 7jjJ for the valuable feedback and constructive suggestions. The mentioned issues are addressed as follows:
W1: There is no clear evidence showing in which cases CDC improves the prediction, i.e., which categories were previously misclassified and are now correctly classified with CDC. It would be better to present several comparative failure cases to demonstrate at which level CDC solves the problem and which cases still require more advanced methods.
A1: Thank you for your suggestions. We provide several examples in the attached PDF of the global rebuttal to compare the prediction results of CDC with those of the baseline method MaPLe. We will include these analyses in the appendix of the final version of our paper to provide an intuitive understanding of our proposed CDC.
In the success cases in Figure 1, we observe that:
(1) Based on different prompt templates, the model captures different semantic information of the samples and generates diverse prediction results. This validates the effectiveness of our semantic decoupling strategy.
(2) By fusing the prediction results based on different templates (i.e., different semantic sets), CDC can obtain correct classification results. As shown in Figure 1 (e), the fused result is not a simple average of all predictions and may differ from all of them. This indicates that the fusion process comprehensively considers the credibility and consistency of each prediction result, enabling CDC to correctly classify samples that MaPLe fails to recognize.
For the failure cases in Figure 2, we discover that when the predictions based on each semantic set consistently lean towards the wrong category, CDC also struggles to correct the samples misclassified by MaPLe. Taking Figure 2 (a) as an example, intuitively, from different semantics such as the character's posture and the appearance of the yo-yo, the image is easily misidentified as playing the flute, making it difficult to classify. In the future, richer training data may enhance the model's ability to distinguish different objects and scenes, further improving CDC's performance.
W2: It is interesting to ask how the misalignment between CLIP and downstream tasks changes as the model's capability increases, e.g., with ViT-Large, and whether a more powerful model can solve the misalignment problem. Hence, it would be better if some comparisons across models were added.
A2: Refer to the global rebuttal for the details.
W3: As we all know, MLLMs are already sweeping through the multimodal community. It would be better to expand the discussion on misalignment in this paradigm and on whether the current method generalizes easily to it.
A3: Thank you for your suggestions. The misalignment in MLLMs is consistent with that presented in our paper, and in principle our approach can be applied to MLLMs.
MLLMs have demonstrated remarkable potential across a wide range of tasks. After aligning features from different modalities through pre-training, MLLMs often employ techniques such as instruction tuning to encourage the model to complete downstream tasks effectively.
Take Instruct-BLIP as an example. During the instruction tuning phase, Instruct-BLIP learns from specific datasets for certain tasks to build an instruction-aware Q-Former, thus assisting the model in extracting informative features tailored to the given instruction. The learned Q-Former is expected to generalize to unseen datasets and unseen tasks. However, for different tasks, the informative features for similar instructions can be diverse. Therefore, the learned Q-Former risks overfitting the training data. This is similar to prompt tuning analyzed in our paper, where prompts overfit base classes. Therefore, we argue that the discrepancies between training and testing in Instruct-BLIP introduce "data misalignment" in a general sense.
Our proposed CDC has the potential to address the data misalignment issue in Instruct-BLIP. To mitigate the interference of visual features that overfit training tasks, we can decouple the obtained features and estimate the importance of the semantics in the features. By performing a weighted fusion, we can assist the test tasks in extracting more task-relevant visual features, thereby improving overall task performance.
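For intuition only, below is a minimal PyTorch sketch of the weighted fusion described above; the tensor names (`semantic_feats`, `importance_logits`) and the softmax weighting are our illustrative assumptions, not part of Instruct-BLIP or of our released implementation.

```python
import torch
import torch.nn.functional as F

def fuse_decoupled_features(semantic_feats: torch.Tensor,
                            importance_logits: torch.Tensor) -> torch.Tensor:
    """Weighted fusion of decoupled semantic features.

    semantic_feats:    (num_semantics, batch, dim) features, one per decoupled semantic.
    importance_logits: (num_semantics, batch) estimated task-relevance of each semantic.
    Returns fused features of shape (batch, dim).
    """
    weights = F.softmax(importance_logits, dim=0)             # normalize over semantics
    fused = (weights.unsqueeze(-1) * semantic_feats).sum(0)   # convex combination
    return F.normalize(fused, dim=-1)                         # keep unit norm, as in CLIP-style spaces

# Toy usage: 4 decoupled semantics, batch of 2, 512-d features.
feats = torch.randn(4, 2, 512)
logits = torch.randn(4, 2)
print(fuse_decoupled_features(feats, logits).shape)  # torch.Size([2, 512])
```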
In conclusion, when aiming to enhance the zero-shot generalization performance of a model, the misalignment between training and testing data is a crucial issue that must be carefully considered. Our proposed SCM comprehensively models the misalignment problem in training and testing. CDC has the potential to be transferred to scenarios where misalignment exists, providing a valuable framework for addressing this common challenge in machine learning. In future work, we will investigate the specific implementation of CDC in MLLM to solve the data misalignment problem in it.
W4: How many parameters are tuned during downstream adaptation? Compared to the pre-trained model, what is the ratio of tuned parameters?
A4: Thank you for your valuable comments. In Table 5, we provide the number of learnable parameters of our method. Compared to MaPLe, our proposed CDC increases the number of learnable parameters from 3.55M to 14.20M, which brings an average performance improvement of 1.70% in the Base-to-New setting.
In the implementation, we can reduce the model overhead by having all templates share the V-L functions, i.e., CDC* in Table 5. CDC* adds 0.027M parameters compared to MaPLe, bringing an average performance improvement of 1.14%. The analysis of the parameters highlights the efficiency of CDC.
Table 5. Comparison of prompting complexity among different methods.
| Method | Learnable Params | Params (% of CLIP) | HM |
|---|---|---|---|
| CoOp | 2048 | 0.002 | 71.66 |
| CoCoOp | 35360 | 0.03 | 75.83 |
| Independent V-L | 31488 | 0.02 | 78.55 |
| MaPLe | 3.55 M | 2.85 | 78.55 |
| CDC(Ours) | 14.20 M | 11.40 | 80.25 |
| CDC*(Ours) | 3.58 M | 2.87 | 79.69 |
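For clarity, the percentage column in Table 5 answers exactly the question above: the number of tuned parameters divided by the parameter count of the pre-trained model. A generic PyTorch sketch for obtaining these counts (assuming the frozen backbone weights are marked with `requires_grad=False`, as is typical in prompt tuning; this is an illustration, not our exact accounting):

```python
import torch.nn as nn

def tuned_param_stats(model: nn.Module):
    """Count learnable vs. total parameters of a prompt-tuned model.

    Assumes the frozen CLIP backbone has requires_grad=False and only the
    prompt / coupling parameters remain trainable.
    """
    tuned = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return tuned, total, 100.0 * tuned / total

# Usage: tuned, total, ratio = tuned_param_stats(prompt_tuned_clip)
```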
Thank you for your efforts and response. You've addressed my main concern, and I believe this paper will make a valuable contribution to the NeurIPS community. I will maintain my current rating.
This paper addresses the two-level misalignment (task and data) issue in adapting CLIP to specific tasks. The authors develop a structural causal model to analyze CLIP's pre-training and adaptation processes, revealing how task-irrelevant knowledge interferes with predictions. To mitigate this, they propose Causality-Guided Semantic Decoupling and Classification (CDC), which implements front-door adjustment. CDC includes Visual-Language Dual Semantic Decoupling (VSD) to represent different semantics through multiple prompt templates, and Decoupled Semantic Trusted Classification (DSTC) to perform classification based on each decoupled semantic while estimating uncertainties. Experiments demonstrate CDC's effectiveness in enhancing CLIP's performance across various settings and tasks, addressing the challenge of data misalignment in vision-language model adaptation.
Strengths
- CDC is well motivated from a causal perspective and has significant technical novelty.
- Clear writing and well organized.
- Experiment results show the effectiveness of CDC.
Weaknesses
- Figure 1(a) appears to illustrate task misalignment. Consider enhancing the caption of Figure 1 with more detailed explanations to clarify this concept.
- Regarding data misalignment, it would be beneficial to provide a more precise definition. Does it specifically refer to discrepancies in classes between training and testing processes? It's important to clarify that data misalignment encompasses both label misalignment and distribution misalignment. A brief explanation of each type would improve understanding.
- In Figure 3, the term "fuse" is used. It would be helpful to clarify the meaning and context of this term within the figure.
- What is the accuracy if we directly use CDC in a zero-shot test?
Questions
See Weaknesses
Limitations
See Weaknesses
We thank Reviewer JJqY for the valuable comments and constructive suggestions. The mentioned issues are addressed as follows:
W1: Figure 1(a) appears to illustrate task misalignment. Consider enhancing the caption of Figure 1 with more detailed explanations to clarify this concept.
A1: Thanks for your valuable suggestions. The example we present in Figure 1(a) is indeed to illustrate the task misalignment problem.
In the final version, we will modify the caption of Figure 1 to: "The motivating examples of the two-level misalignment. (a) Task Misalignment. Determined by the contrastive learning mechanism of CLIP, a given image is more similar to the entire textual description in the embedding space than to a single semantic element, which is inconsistent with the demands of the classification task. (b) Data Misalignment. The learned model intends to overfit on base classes. On the DTD dataset, as the number of training epochs increases, the accuracy of base classes rises, while the accuracy of new classes first rises and then drops."
W2: Regarding data misalignment, it would be beneficial to provide a more precise definition. Does it specifically refer to discrepancies in classes between training and testing processes? It's important to clarify that data misalignment encompasses both label misalignment and distribution misalignment. A brief explanation of each type would improve understanding.
A2: Thank you for your suggestion. In our paper, data misalignment encompasses label inconsistency and distribution inconsistency. In the final version, we will further clarify the concept of data misalignment in the appendix to avoid confusion.
Data misalignment refers to the inconsistency between the distribution of training data and testing data. This inconsistency can arise due to two main reasons:
(1) Label Inconsistency: The training and testing classes do not completely overlap. For instance, some classes present in the training data might not appear in the testing data and vice versa. We refer to the classes that appear in training as base classes and the classes that appear in testing as new classes.
(2) Distribution Inconsistency: Even if they share the same class names, the distributions of the classes in the training and testing data may differ, resulting in distribution inconsistency. In such cases, the testing classes are essentially also new classes.
Label inconsistency and distribution inconsistency together constitute the problem of data misalignment.
Furthermore, we have considered the effectiveness of CDC under both types of inconsistencies. The base-to-novel experimental setup mainly involves label inconsistency, while the cross-dataset and cross-domain setups address both label inconsistency and distribution inconsistency. Our proposed CDC has demonstrated effectiveness in all three experimental setups, highlighting its capability to address both types of inconsistencies.
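To make label inconsistency concrete, the base-to-new protocol splits each dataset's class list into two disjoint halves, tunes prompts only on the first (base) half, and tests on both halves. A minimal sketch of such a split (the even half-split follows common practice in prompt-tuning benchmarks and is shown here only for illustration):

```python
from typing import List, Tuple

def base_new_split(classnames: List[str]) -> Tuple[List[str], List[str]]:
    """Split class names into disjoint base (seen in training) and new (test-only) sets."""
    mid = len(classnames) // 2  # even half-split, as commonly done in base-to-new benchmarks
    return classnames[:mid], classnames[mid:]

base, new = base_new_split(["airplane", "car", "cat", "dog", "flower", "tree"])
print(base)  # ['airplane', 'car', 'cat']
print(new)   # ['dog', 'flower', 'tree']
```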
W3: In Figure 3, the term "fuse" is used. It would be helpful to clarify the meaning and context of this term within the figure.
A3: Thank you for your constructive suggestions. In Figure 3, "fuse" refers to iteratively combining the evidence from all templates according to Equation (7) to obtain the final classification results. We will add an explanation of the "fuse" operation to the caption of Figure 3: "fuse" represents the process of iteratively combining all evidence according to Equation (7) to obtain the final results.
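For intuition, the sketch below shows one standard way to iteratively combine per-template evidence, namely the reduced Dempster's rule used in evidential (subjective-logic) classification; the variable names are illustrative, and the exact update used by CDC is the one given in Equation (7) of the paper.

```python
import torch

def evidence_to_belief(evidence: torch.Tensor):
    """Convert non-negative class evidence (shape: K) into per-class beliefs and an uncertainty mass."""
    K = evidence.numel()
    alpha = evidence + 1.0        # Dirichlet parameters
    S = alpha.sum()
    return evidence / S, K / S    # beliefs b_k = e_k / S, uncertainty u = K / S

def combine(b1, u1, b2, u2):
    """Reduced Dempster's rule: fuse two (belief, uncertainty) opinions."""
    conflict = b1.sum() * b2.sum() - (b1 * b2).sum()   # mass assigned to conflicting classes
    scale = 1.0 - conflict
    b = (b1 * b2 + b1 * u2 + b2 * u1) / scale
    u = (u1 * u2) / scale
    return b, u

# Toy example: evidence over 3 classes produced under 3 different prompt templates.
per_template_evidence = [torch.tensor([5.0, 1.0, 0.5]),
                         torch.tensor([2.0, 2.5, 0.5]),
                         torch.tensor([6.0, 0.5, 0.2])]

b, u = evidence_to_belief(per_template_evidence[0])
for e in per_template_evidence[1:]:
    b_i, u_i = evidence_to_belief(e)
    b, u = combine(b, u, b_i, u_i)

print(b, u)  # fused per-class beliefs and the remaining uncertainty
```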
W4: What is the accuracy if we directly use CDC in a zero-shot test?
A4: Thanks for your valuable comment. The CDC method we proposed preserves the zero-shot ability of CLIP. In the cross-dataset transfer experiments, we use exactly zero-shot testing for the target datasets. To further verify the performance of our method under the zero-shot setting, in addition to the experiments in Table 2 of our manuscript, we have employed another architecture, i.e., ViT-L/14, to evaluate the zero-shot performance of CDC. The results of the additional experiments are presented in Table 4.
As shown in the table, compared to the baseline method MaPLe, our CDC method achieves an average performance gain of 0.43% under the zero-shot setting when using the ViT-B/16 structure, and an average zero-shot performance improvement of 1.05% when using the ViT-L/14 structure. These results further demonstrate the effectiveness of the CDC method in preserving the model's zero-shot ability.
Table 4. Comparison of CDC on the zero-shot classification to evaluate the out-of-distribution generalization of CDC in the cross-dataset setting.
| Method | Architecture | Caltech | Pets | Cars | Flowers | Food | Aircraft | SUN | DTD | SAT | UCF | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | ViT-B/16 | 93.30 | 89.10 | 65.60 | 70.70 | 85.90 | 24.70 | 62.60 | 44.00 | 48.40 | 67.70 | 65.20 |
| CLIP+MaPLe | ViT-B/16 | 93.53 | 90.49 | 65.57 | 72.23 | 86.20 | 24.74 | 67.01 | 46.49 | 48.06 | 68.69 | 66.30 |
| CLIP+CDC | ViT-B/16 | 94.47 | 90.77 | 66.27 | 72.67 | 86.27 | 24.50 | 68.07 | 46.60 | 49.13 | 68.60 | 66.73 |
| CLIP | ViT-L/14 | 95.20 | 93.50 | 76.80 | 79.50 | 90.90 | 32.70 | 67.70 | 53.00 | 60.30 | 75.10 | 72.47 |
| CLIP+MaPLe | ViT-L/14 | 96.23 | 92.57 | 77.60 | 76.90 | 91.40 | 30.07 | 71.63 | 54.23 | 53.83 | 75.90 | 72.04 |
| CLIP+CDC | ViT-L/14 | 96.70 | 93.10 | 77.33 | 75.87 | 91.53 | 31.67 | 72.87 | 57.63 | 58.00 | 76.20 | 73.09 |
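As a sanity check, the averages and gains quoted above can be recomputed directly from the per-dataset numbers in Table 4 (values copied from the MaPLe and CDC rows):

```python
maple_b16 = [93.53, 90.49, 65.57, 72.23, 86.20, 24.74, 67.01, 46.49, 48.06, 68.69]
cdc_b16   = [94.47, 90.77, 66.27, 72.67, 86.27, 24.50, 68.07, 46.60, 49.13, 68.60]
maple_l14 = [96.23, 92.57, 77.60, 76.90, 91.40, 30.07, 71.63, 54.23, 53.83, 75.90]
cdc_l14   = [96.70, 93.10, 77.33, 75.87, 91.53, 31.67, 72.87, 57.63, 58.00, 76.20]

avg = lambda xs: sum(xs) / len(xs)
print(round(avg(cdc_b16) - avg(maple_b16), 2))  # 0.43 (ViT-B/16 gain over MaPLe)
print(round(avg(cdc_l14) - avg(maple_l14), 2))  # 1.05 (ViT-L/14 gain over MaPLe)
```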
Thanks for your response. I decide to keep my score.
This paper investigates the task and data misalignment issues in pre-trained vision-language models such as CLIP. It discovers that the task-irrelevant information significantly affects the prediction of CLIP and soft prompt tuning cannot mitigate the data misalignment issue. The authors propose a novel Causality-Guided Semantic Decoupling and Classification method to mitigate the interference of task-irrelevant information. The experimental results show that the proposed method effectively mitigates the data misalignment and improves the generalization of CLIP.
Strengths
- The paper is well-organized. The introduction of the method and the figures are clear and easy to understand. The description of the experiment setting is detailed, which makes the paper reproducible.
- The proposed methods to mitigate the task and data misalignment of CLIP are well motivated and intuitive.
- The authors design and conduct exhaustive experiments to demonstrate the effectiveness of the proposed method. The proposed methods provide significant improvements in the generalization of CLIP.
Weaknesses
- In the experiments section, the method is currently adapted solely to the CLIP model. This limitation may not fully demonstrate the model's universality. The authors can adapt the method to various vision-language models with different architectures to showcase broader applicability.
- The experiments are exclusively conducted on image classification tasks. The authors can explore adapting vision-language models (VLMs) to a wider range of tasks, such as object detection, image captioning, or visual question answering, to further validate the model's versatility and performance across diverse applications.
Questions
Is it feasible to adapt the proposed method to various tasks such as object detection, image captioning, or visual question answering?
Limitations
The authors have discussed the limitations in the paper.
We thank Reviewer Ptf4 for the valuable suggestions. The mentioned issues are addressed as follows:
W1: In the experiments section, the method is currently adapted solely to the CLIP model. This limitation may not fully demonstrate the model's universality. The authors can adapt the method to various vision-language models with different architectures to showcase broader applicability.
A1: Thank you for your valuable suggestions. We believe that most current VLMs suffer from the misalignment problems we analyzed in our paper. The SCM we developed effectively models the overfitting problems that VLMs could encounter when adapting to downstream tasks. The proposed CDC, guided by the SCM, can effectively alleviate the interference of task-irrelevant information in VLMs on downstream tasks, thereby enhancing performance.
To validate the universality of our proposed method, we conduct further experiments based on two additional models:
(1) CLIP based on ViT-L/14. ViT-B/16 is the most common setting in prompt tuning and is also employed in our paper. Compared with CLIP based on ViT-B/16, CLIP based on ViT-L/14 is a more powerful model, and experiments on ViT-L/14 can verify the effectiveness of CDC across VLMs of different capacities;
(2) ALIP based on ViT-B/32. ALIP introduces a bi-path model that integrates raw text supervision and synthetic caption supervision. We conduct experiments on ALIP to demonstrate the generalization of our CDC to VLMs with different pre-training objectives.
Refer to the global rebuttal for the experimental results and more analysis. The experimental results demonstrate that our proposed CDC can effectively improve the performance of various VLMs on downstream classification tasks.
W2: The experiments are exclusively conducted on image classification tasks. The authors can explore adapting vision-language models (VLMs) to a wider range of tasks, such as object detection, image captioning, or visual question answering, to further validate the model's versatility and performance across diverse applications.
A2: We appreciate your valuable suggestions. To validate whether our proposed method is beneficial for applying VLMs to a wider range of tasks, we explore the application of our method in one-shot object detection. As shown in Table 3, our approach achieves state-of-the-art performance. The experimental results confirm the effectiveness of our method.
In one-shot object detection, each foreground object category has only one labeled instance available for training. Due to the limited amount of training data, this instance is often inconsistent with the true data distribution of the dataset. Therefore, there exists a serious misalignment between the training data and the testing data, which is in line with our definition of data misalignment. Our proposed SCM framework is highly effective in addressing this challenge, which is achieved by transferring our SCM (Figure 2 in the original manuscript) to the scenario of one-shot object detection. Specifically, the knowledge variable in our SCM can be interpreted as the knowledge contained within the base model of one-shot object detection, which encompasses both knowledge relevant to one-shot detection and knowledge irrelevant to it. The process of fine-tuning the base model on a single instance can be seen as an attempt to extract the relevant knowledge while eliminating the irrelevant knowledge. However, due to the scarcity of fine-tuning samples, the irrelevant knowledge is often not accurately identified, thus interfering with the modeling of the true causal relationships between the testing instances and their corresponding categories; it becomes a confounder.
Our proposed CDC effectively mitigates such confounders. Concretely, we generate feature vectors for the foreground object categories using the text encoder of CLIP and align the visual features from the pre-trained base model with the corresponding text features of CLIP through a projector. Additionally, we enhance the feature alignment via prompt tuning and CDC within the CLIP text encoder. As shown in Table 3, our proposed CDC achieves state-of-the-art performance among recent works. Compared to the baseline method DeFRCN [R2], our approach yields an average performance improvement of 1.61% across three different data splits, demonstrating its effectiveness in extending VLMs to other tasks.
In future work, we will continue to explore the potential of CDC for enhancing the application of VLMs, e.g., CLIP, across a wider range of tasks.
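To make the alignment described in this response more concrete, here is a minimal sketch of projecting detector region features into the CLIP text embedding space and classifying them by cosine similarity; the module and variable names (`Projector`, `roi_feats`, `text_feats`) and the temperature value are our illustrative assumptions rather than the exact detection pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Maps detector RoI features into the CLIP text embedding space."""
    def __init__(self, in_dim: int, clip_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, clip_dim)

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(roi_feats), dim=-1)

def classify_rois(roi_feats, text_feats, projector, temperature=0.01):
    """Cosine-similarity classification of projected RoI features against class text features."""
    v = projector(roi_feats)                        # (num_rois, clip_dim), unit norm
    t = F.normalize(text_feats, dim=-1)             # (num_classes, clip_dim)
    return (v @ t.T / temperature).softmax(dim=-1)  # class probabilities per RoI

# Toy usage: 5 RoIs with 1024-d detector features, 20 classes in a 512-d CLIP text space.
probs = classify_rois(torch.randn(5, 1024), torch.randn(20, 512), Projector(1024, 512))
print(probs.shape)  # torch.Size([5, 20])
```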
Table 3. One-shot experimental results (AP50, %) on the VOC dataset [R3].
| Methods | Novel Set 1 | Novel Set 2 | Novel Set 3 |
|---|---|---|---|
| Pseudo-Labelling [R4] | 54.50 | 32.80 | 48.40 |
| DeFRCN [R2] | 57.03 | 35.82 | 52.49 |
| ICPE [R5] | 54.30 | 33.50 | 50.90 |
| DiGeo [R6] | 37.90 | 26.60 | 30.40 |
| CDC (Ours) | 59.62 | 37.67 | 52.89 |
[R2] L. Qiao, Y. Zhao, Z. Li, et al. DeFRCN: Decoupled Faster R-CNN for few-shot object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 8681-8690.
[R3] M. Everingham, L. Van Gool, C. K. I. Williams, J. M. Winn, A. Zisserman. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338, 2010.
[R4] P. Kaul, W. Xie, A. Zisserman. Label, verify, correct: A simple few-shot object detection method. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 14237-14247.
[R5] X. Lu, W. Diao, Y. Mao, J. Li, P. Wang, X. Sun, K. Fu. Breaking immutable: Information-coupled prototype elaboration for few-shot object detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2023, pp. 1844-1852.
[R6] J. Ma, Y. Niu, J. Xu, S. Huang, G. Han, S. Chang. DiGeo: Discriminative geometry-aware learning for generalized few-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 3208-3218.
Response to Weakness 1 of Reviewer Ptf4 and Weakness 2 of Reviewer 7jjJ.
Thank you for your suggestions on exploring the impact of the misalignment issues we proposed across different models. We believe that most current VLMs can suffer from the misalignment problem when adapting to downstream tasks, regardless of the model's capacity and architecture.
Firstly, our proposed data misalignment arises from the distribution differences between training data and testing data during the adaptation process of VLMs, which is an inherent attribute of the task. During the process of tuning prompts based on training data, regardless of the model's architecture and pre-training method, the distribution of testing data remains unknown, causing the learned prompts to overfit the training data. Therefore, the data misalignment problem will persist.
Furthermore, to validate the effectiveness of our proposed CDC in addressing the data misalignment problem across different VLMs and model architectures, we conduct experiments under two experimental settings:
(1) Base-to-New Generalization. As shown in Table 1, we adopt an additional more powerful architecture, i.e., ViT-L/14, and another VLM, i.e., ALIP [R1], to validate the effectiveness of our CDC. Table 1 reports the harmonic mean (HM) of accuracy on base and new classes in the base-to-new experimental setup, based on CLIP and ALIP. As shown in the table, with ViT-L/14, CDC achieves a 5.78% average performance improvement over CLIP and a 2.66% improvement over MaPLe. With ViT-B/32, CDC achieves a 13.07% improvement over ALIP, and a 1.37% improvement over MaPLe.
(2) Cross-Dataset Out-of-Distribution Generalization. We also validate the effectiveness of CDC in the cross-dataset experimental setting. Table 2 reports the zero-shot test results of MaPLe and CDC on 10 target datasets. As shown in the table, CDC achieves a 1.05% average performance improvement over MaPLe.
The experimental results indicate that even with the use of a more powerful architecture and another VLM, CDC can still improve model performance by mitigating data misalignment issues during the prompt tuning process.
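For reference, the HM reported in Table 1 (and in Table 5 above) is the harmonic mean of the base-class and new-class accuracies:

$$\mathrm{HM} = \frac{2 \cdot \mathrm{Acc}_{\text{base}} \cdot \mathrm{Acc}_{\text{new}}}{\mathrm{Acc}_{\text{base}} + \mathrm{Acc}_{\text{new}}}$$

For example, a base-class accuracy of 84.00% and a new-class accuracy of 76.00% give HM = 2·84·76/(84+76) ≈ 79.80%.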
Table 1. The comparison with baseline methods on base-to-new generalization setting based on different VLMs and different architectures.
| Method | Architecture | Avg | ImageNet | Caltech | Pets | Cars | Flowers | Food | Aircraft | SUN | DTD | SAT | UCF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | ViT-L/14 | 78.75 | 76.51 | 96.69 | 96.63 | 79.23 | 82.75 | 94.35 | 40.62 | 75.54 | 65.84 | 76.70 | 80.30 |
| CLIP+MaPLe | ViT-L/14 | 81.86 | 79.76 | 97.22 | 97.55 | 83.09 | 87.70 | 94.65 | 44.81 | 82.52 | 72.03 | 72.98 | 84.75 |
| CLIP+CDC | ViT-L/14 | 84.52 | 80.61 | 97.14 | 97.76 | 83.42 | 88.90 | 94.91 | 46.10 | 84.31 | 78.34 | 90.78 | 86.15 |
| ALIP | ViT-B/32 | 41.10 | 38.50 | 80.05 | 41.76 | 4.89 | 60.36 | 50.91 | 4.05 | 53.87 | 28.60 | 43.60 | 43.94 |
| ALIP+MaPLe | ViT-B/32 | 52.80 | 45.84 | 88.44 | 60.64 | 12.29 | 74.70 | 61.14 | 7.10 | 67.30 | 36.27 | 63.42 | 56.43 |
| ALIP+CDC | ViT-B/32 | 54.17 | 45.67 | 90.06 | 62.69 | 12.44 | 77.97 | 62.31 | 7.20 | 67.91 | 41.32 | 65.99 | 57.96 |
Table 2. Comparison of CDC on cross-dataset evaluation based on ViT-L/14.
| Method | Architecture | Caltech | Pets | Cars | Flowers | Food | Aircraft | SUN | DTD | SAT | UCF | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | ViT-L/14 | 95.20 | 93.50 | 76.80 | 79.50 | 90.90 | 32.70 | 67.70 | 53.00 | 60.30 | 75.10 | 72.47 |
| CLIP+MaPLe | ViT-L/14 | 96.23 | 92.57 | 77.60 | 76.90 | 91.40 | 30.07 | 71.63 | 54.23 | 53.83 | 75.90 | 72.04 |
| CLIP+CDC | ViT-L/14 | 96.70 | 93.10 | 77.33 | 75.87 | 91.53 | 31.67 | 72.87 | 57.63 | 58.00 | 76.20 | 73.09 |
[R1] K. Yang, J. Deng, X. An, et al. ALIP: Adaptive language-image pre-training with synthetic caption. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 2922-2931.
This paper effectively addresses the critical issues of task and data misalignment in adapting CLIP and similar models to downstream tasks. The proposed Causality-Guided Semantic Decoupling and Classification (CDC) method is well-founded in causal theory and demonstrates significant improvements in mitigating the interference of task-irrelevant information. Reviewers appreciated the thorough experimental validation, which consistently showed the effectiveness of CDC across various settings. However, some concerns were raised about the generalizability of the method beyond image classification tasks and across different vision-language models. The authors responded by providing additional experiments and clarifications, demonstrating CDC's broader applicability and robustness. They also addressed specific concerns regarding the explanation of concepts and the impact of the model’s increasing capabilities. Despite these minor concerns, the paper’s strengths, including its technical novelty, clear motivation, and strong empirical results, make it a valuable contribution to the field. Therefore, I recommend acceptance of this paper.