PaperHub
4.9 / 10
Poster · 4 reviewers
Lowest 1 · Highest 4 · Std. dev. 1.1
Ratings: 1, 4, 3, 3
ICML 2025

Learning Invariant Causal Mechanism from Vision-Language Models

OpenReview · PDF
Submitted: 2025-01-10 · Updated: 2025-07-24

Abstract

Keywords
Vision-Language Models, Causal Representation Learning, Out-of-Distribution Generalization, Representation Learning

Reviews and Discussion

Review
Rating: 1

This work aims to leverage Invariant Causal Mechanisms in causality to improve prediction under distribution shifts. However, a detailed summary is challenging for me due to several fundamental issues, including an unclear problem formulation, misconceptions of key concepts, and unrealistic theoretical assumptions.

Questions For Authors

Overall,

  1. the problem setting is unclear, and some fundamental concepts in causality are misused (see Claims And Evidence).

  2. The identifiability analysis is unrealistic and nearly flawed (Theoretical Claims), which undermines confidence in the proposed methods.

  3. Additionally, claims of OOD generalization based on CLIP experiments should be approached with caution and carefully considered (see Experimental Designs Or Analyses).

Claims And Evidence

There are several unclear or even mistaken claims in the paper. For example:

  1. Problem Setting: The claim "The goal of OOD generalization is to learn a predictor from training environments... domain shift and open-class scenarios. Domain shift arises when the data distribution in the test environment differs from that in the training environment, while open-class scenarios involve previously unseen classes appearing at test time." is confusing.

Domain shift is a broad category encompassing various settings, such as covariate shift, conditional shift, and label distribution shift. The authors should specify which type of domain shift their work addresses to avoid ambiguity. Open-class scenarios, where previously unseen classes appear at test time, present a significant challenge. The authors should clarify whether this setting is realistically addressable and, if so, whether there exists a theoretical solution for it.

  2. Conceptual Misuse: You have assumed a causal generative model, as shown on the left in Figure 1, where there is a clear causal relationship, e.g., y causes z, and z causes x. Causal mechanisms should be defined from cause to effect, rather than as p(y|do(x)) or p(y|do(z)), as claimed. From my understanding, a causal mechanism refers to the underlying process or system that explains how one variable influences another in a causal relationship. It describes how causes bring about effects, and is typically assumed to be invariant. Therefore, one cannot claim that the relationship from effect to cause constitutes a causal mechanism, as this relationship is generally variant and does not align with the principles of causal inference. Further, in Proposition 5.1, p(y|do(z)) or p(y|do(x)) (e.g., "cause given the effect") typically does not have a well-defined meaning in the standard framework. Please let me know if I have misunderstood this.

Methods And Evaluation Criteria

Since the identifiability theory is problematic (see below for concerns regarding the theory), I am not confident that the method's effectiveness is due to causality.

Theoretical Claims

Theorem 5.3 is central to supporting this work. However, the assumption in Condition 5.2 is quite peculiar. It broadly states, "There exist some samples such that the inference model can be equal to the generative model on these samples." This is strange, because the generative model is completely unknown. How can one enforce the inference model to match an unknown prior from the generative model? If this assumption holds, one could simply assume that the inference model equals the generative model, which would make the proof trivial. In fact, after reviewing the proof, I found that there is almost no technical challenge to the identifiability proof under the assumption in Condition 5.2.

Experimental Designs Or Analyses

For the experiment results based on CLIP, there is a significant concern regarding whether the training process of CLIP truly does not use the data in the experiments. Since CLIP is trained on a large number of image-text pairs, it’s important to question whether there is any potential data leakage. Specifically, it should be clarified whether the data used in the experiments overlaps with or has any connection to the data CLIP was trained on, as this could lead to biased or invalid results. Ensuring that no data leakage occurs is critical to maintaining the integrity of the experiment's findings.

Supplementary Material

Yes, I reviewed the proof for the theorem.

Relation To Broader Scientific Literature

N/A

Essential References Not Discussed

N/A

Other Strengths And Weaknesses

N/A

Other Comments Or Suggestions

Causality is a challenging concept to understand. I believe it is particularly effective in handling distribution shift tasks, as it not only provides a theoretical framework but also offers practical tools in certain cases. However, we must be cautious in how we apply it, and at the very least, it requires a deep understanding of causality.

Ethics Review Issues

N/A

Author Response

Re: Claims And Evidence & Q1

Regarding the definition of domain shift and open-class.

  1. In our context, domain shift primarily refers to covariate shift, where $p(x)$ differs between the training and testing phases while $p(y|x)$ remains unchanged. This scenario is widely adopted in standard domain generalization tasks [1].

  2. The open-class prediction problem is well defined [2–5]. The problem is addressable, and one of the solutions is CLIP [3]. Several studies [4,5] also provide theoretical support.

We will include detailed explanations regarding these scenarios in the final version.

Regarding the causal mechanism in Section 5.1.

We clarify as follows:

  1. The construction of an SCM depends on the type of task [10]. For example, whether the chicken causes the egg or the egg causes the chicken depends on the objective: if we study how chickens produce eggs, the chicken is the cause and the egg is the effect; if we study how eggs develop into chickens, then the egg is the cause. Therefore, when we study the task of prediction, the input image is the cause, and the predicted label is the effect.
  2. We study two SCMs in our paper: Figure 1(a) describes the generation process, while Figure 1(b) describes the prediction process. Since our work primarily focuses on prediction, the causal mechanisms $p(y|do(x))$ and $p(y|do(z_{inv}))$ in Proposition 5.1 are both defined with respect to the SCM in Figure 1(b) rather than Figure 1(a). In Figure 1(b), the model constructs a causal chain $X \to Z \to Y$, where $X$ is the cause and $Y$ is the effect. Therefore, it does not mean inferring the cause from the effect. In this SCM, the causal effect is: "how a change in image $X$ affects the prediction of output $Y$".
  3. As stated in lines 194–197, the prediction process can be viewed as the inverse of the data generation process. We emphasize that the term "inverse process" here is solely a mathematical construct used to derive the structural equations. These equations correspond to the edges in Figure 1(b), for which we provide a detailed explanation in lines 197–208. Therefore, Figure 1(b) is a valid SCM.

In summary, our proposed SCM does not contain "cause given the effect" scenarios, and the causal mechanism in Figure 1(b) is valid and well-defined.

Re: Theoretical Claims & Q2

  1. The reviewer may have misunderstood the logical connection between Condition 5.2 and Theorem 5.3. Theorem 5.3 aims to prove identifiability under certain conditions, and our work focuses on formulating such a condition—namely, Condition 5.2. Therefore, although the proof of Theorem 5.3 is relatively straightforward, its validity relies on the formulation of Condition 5.2.
  2. Prior work [6] proves that, without additional constraints, latent factors are unidentifiable. Motivated by [7,8,9], we identify and formalize this condition in this paper, thereby facilitating a clear understanding and a straightforward proof of Theorem 5.3.
  3. The advantage of this condition is that it does not require additional assumptions on the latent factors (prior distribution) or on the generative process. Instead, it relies solely on observable label data.
  4. The condition consists of two parts: consistency and diversity. The consistency part (Equation 12) only requires the output distribution $\hat{p}(y|x)$ to match the observable distribution $p(y|x)$, rather than an unknown prior.
  5. We demonstrate in lines 247–255 why CLIP can be considered to satisfy this condition.

In summary, Condition 5.2 and Theorem 5.3 are formulated within a standard theoretical framework, build on extensive prior work, and have practical implications.

Re: Experimental Designs Or Analyses & Q3

We understand the reviewer's concerns, but:

  1. To date, OpenAI has not released the training data used for CLIP, which makes it extremely challenging to verify whether there is any overlap between the experimental data and the data used to train CLIP.
  2. Our experimental design strictly adheres to the established community standards for fine-tuning CLIP in domain generalization tasks (including CLIP-Adapter, CLIPood, CoOp, CoCoOp, MIRO, and DPL).

Therefore, we believe that our experimental setup is both reasonable and widely accepted.

[1] Domain generalization: A survey.

[2] A survey of zero-shot learning: Settings, methods, and applications.

[3] Learning transferable visual models from natural language supervision.

[4] Zero-shot learning with semantic output codes.

[5] Attribute-based classification for zero-shot visual object categorization.

[6] Nonlinear independent component analysis: Existence and uniqueness results.

[7] Nonlinear ICA using auxiliary variables and generalized contrastive learning.

[8] On linear identifiability of learned representations.

[9] Contrastive learning inverts the data generating process.

[10] Toward causal representation learning.

Reviewer Comment

In our context, domain shift primarily refers to covariate shift...

--I respectfully disagree. Domain shift can generally be categorized into several specific settings, including covariate shift, target shift, conditional shift, and conditional-target shift [1,2].

The open-class prediction problem is well defined [2–5]. The problem is addressable, and one of the solution is CLIP [3]

--How do you ensure that the training data used for CLIP does not include previously unseen classes from the testing data?

We study two SCMs in our paper:

--For a given context, there should typically be only one causal model, as a causal model aims to represent a physical process. One cannot claim two models for the same context, as the corresponding physical process is determined and unique. You have defined data generation in Figure 1a. In this context, Figure 1b—which you acknowledge as a predictive model—should only be understood as an inference model.

Condition 5.2 and Theorem 5.3.

--Theorem 5.3 is based on Condition 5.2. If Condition 5.2 is not satisfied, Theorem 5.3 does not hold. From a high-level perspective, Condition 5.2 requires that the estimated $z$ (the left-hand side of Eq. 2, $\hat{z} = f_I(x) = f_I(g(z))$) matches the ground-truth $z$ (the left-hand side of Eq. 2, where $f_{I^*}(x) = g^{-1}(x) = g^{-1}(g(z)) = z$). Consequently, you assume that $\hat{z} = z$, which is the objective of identifiability. Moreover, one does not know the ground-truth $z$. Even if one were to assume it, how, then, could this condition be incorporated into the inference model?

[1] Zhang, Kun, et al. "Domain adaptation under target and conditional shift." International Conference on Machine Learning. PMLR, 2013.

[2] Stojanov, Petar, et al. "Domain adaptation with invariant representation learning: What transformations to learn?." Advances in Neural Information Processing Systems 34 (2021): 24791-24803.

Author Comment

Response to Comment 1:

Indeed, the understanding of domain shift is as the reviewer described. However, what we intended to express is that our submission focuses specifically on covariate shift, that is, the discrepancy between the training and testing data distributions.

Response to Comment 2:

Since the composition of CLIP's training dataset has not been publicly released, we are unable to directly verify its contents. To further investigate this issue, we propose an experimental approach. The basic idea is as follows: if the dataset used in our submission were included in CLIP’s training data, then testing CLIP directly on this dataset should yield strong performance.

The results of our test are shown in the table below. We observe that CLIP's performance is clearly suboptimal when tested directly. This supports our claim that the dataset is not included in CLIP's training data, and also validates the soundness of our experimental design.

| Method | ImageNet-S | ImageNet-A | Terra Incognita | iWildCam-WILDS 2020 |
|---|---|---|---|---|
| CLIP Zero-shot | 46.1 | 47.8 | 34.2 | 10.6 |
| Ours | 50.9 | 51.4 | 52.5 | 14.1 |
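For context, zero-shot numbers of this kind are obtained by comparing image embeddings against text embeddings of class prompts. A minimal sketch using the public openai/CLIP package is shown below; the class names, prompt template, and file path are placeholders rather than the exact evaluation pipeline used here.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder class names; in practice these are the dataset's label names.
class_names = ["lion", "zebra", "elephant"]
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    # Cosine similarities turned into a distribution over class prompts.
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(class_names[probs.argmax().item()])
```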

Response to Comment 3:

In our previous response, we provided an example: Which came first, the chicken or the egg? This example was intended to illustrate the following point: while it is true that the SCM remains invariant, the true SCM is also unknown. We can only infer it based on empirical observations and reasoning. As a result, different interpretations may lead to different SCMs.

In this paper, we present two such interpretations: one from the perspective of data generation, and the other from the perspective of data prediction. These two interpretations form a closed loop—they are mutually reversible. Building on this, the remainder of the paper develops the framework from the prediction-oriented perspective.

Response to Comment 4:

Condition 5.2 does not imply that $\hat{z} = z$. We provide a detailed explanation below.

Consider a training dataset

$$\mathcal{D} = \{(x_i, t_i)\}_{i=1}^N,$$

sampled from the joint distribution $p(x,t)$. Let $\mathcal{T}$ denote the set of all possible values of $t$.

Let $\theta$ denote the parameters of $f_I$ and $f_T$, and let $\theta^*$ denote the parameters of $f_{I^*}$ and $f_{T^*}$ (to which we have no access).

The ground-truth conditional probability can be regarded as produced by $f_{I^*}$ and $f_{T^*}$:

$$p_{\theta^*}(t\mid x,\mathcal{T}) = \frac{\exp(f_{I^*}(x)^\top f_{T^*}(t))}{\sum_{t'\in\mathcal{T}} \exp(f_{I^*}(x)^\top f_{T^*}(t'))} = \begin{cases} 1, & \text{if } (x,t)\in\mathcal{D},\\ 0, & \text{otherwise}. \end{cases}$$

Similarly, the CLIP model functions $f_I$ and $f_T$ produce the distribution

$$p_{\theta}(t\mid x,\mathcal{T}) = \frac{\exp(f_{I}(x)^\top f_{T}(t))}{\sum_{t'\in\mathcal{T}} \exp(f_{I}(x)^\top f_{T}(t'))}.$$

The training objective of CLIP is to minimize the KL divergence

$$\mathrm{KL}\bigl(p_{\theta}(t\mid x,\mathcal{T}) \,\Vert\, p_{\theta^*}(t\mid x,\mathcal{T})\bigr).$$

Ideally, after training, we have

$$p_{\theta}(t\mid x,\mathcal{T}) = p_{\theta^*}(t\mid x,\mathcal{T}),$$

that is,

$$\frac{\exp(f_{I}(x)^\top f_{T}(t))}{\sum_{t'\in\mathcal{T}} \exp(f_{I}(x)^\top f_{T}(t'))} = \frac{\exp(f_{I^*}(x)^\top f_{T^*}(t))}{\sum_{t'\in\mathcal{T}} \exp(f_{I^*}(x)^\top f_{T^*}(t'))}.$$

This equality captures the consistency part of Condition 5.2. Building on it, for any pair $t_a$ and $t_b$ the following ratio must hold:

$$\frac{p_{\theta}(t_a\mid x,\mathcal{T})}{p_{\theta}(t_b\mid x,\mathcal{T})} = \frac{p_{\theta^*}(t_a\mid x,\mathcal{T})}{p_{\theta^*}(t_b\mid x,\mathcal{T})},$$

which implies

$$\frac{\exp(f_{I}(x)^\top f_{T}(t_a))}{\exp(f_{I}(x)^\top f_{T}(t_b))} = \frac{\exp(f_{I^*}(x)^\top f_{T^*}(t_a))}{\exp(f_{I^*}(x)^\top f_{T^*}(t_b))}.$$

Taking logarithms on both sides, we obtain

$$\bigl(f_T(t_a) - f_T(t_b)\bigr)^\top f_I(x) = \bigl(f_{T^*}(t_a) - f_{T^*}(t_b)\bigr)^\top f_{I^*}(x).$$

Moreover, the diversity part of the condition requires that there exist at least $D+1$ pairs $(t_a, t_b)$ such that the differences $f_T(t_a) - f_T(t_b)$ form a basis, collected in a matrix $L$, and the differences $f_{T^*}(t_a) - f_{T^*}(t_b)$ form another basis, collected in a matrix $L'$. Consequently, we have

$$f_I(x) = \bigl(L' L^{-1}\bigr)^\top f_{I^*}(x) = A f_{I^*}(x),$$

indicating that $f_I(x)$ is a linear transformation of $f_{I^*}(x)$. Note that the matrix $A$ is unknown.

Thus, Condition 5.2 does not require any knowledge of $f_{I^*}$ or $f_{T^*}$, nor does it necessitate knowing the ground-truth $z$. Instead, we only assume that the observed data distribution is generated by these underlying functions.
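To make the argument above concrete, here is a minimal NumPy sketch for illustration (synthetic embeddings, not the actual CLIP encoders; all variable names and dimensions are assumptions): it builds a ground-truth encoder pair, constructs a second pair that induces exactly the same conditional distribution over texts, and recovers $A = (L' L^{-1})^\top$ from text-embedding differences, confirming $f_I(x) = A f_{I^*}(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                  # embedding dimension (illustrative)
n_texts, n_images = D + 1, 5

# Ground-truth encoders (f_I*, f_T*), represented as random embeddings.
fI_star = rng.normal(size=(n_images, D))      # rows: f_I*(x)
fT_star = rng.normal(size=(n_texts, D))       # rows: f_T*(t)

# A second model whose softmax p(t|x,T) matches the ground truth exactly:
# transform image embeddings by an invertible M and text embeddings by M^{-T},
# so every inner product f_I(x)^T f_T(t) is unchanged.
M = rng.normal(size=(D, D))
fI = fI_star @ M.T                            # f_I(x) = M f_I*(x)
fT = fT_star @ np.linalg.inv(M)               # f_T(t) = M^{-T} f_T*(t)

def softmax_rows(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Consistency: both models define the same conditional distribution over texts.
assert np.allclose(softmax_rows(fI @ fT.T), softmax_rows(fI_star @ fT_star.T))

# Diversity: D pairs (t_0, t_k) whose text-embedding differences span R^D.
L_mat  = np.stack([fT[k] - fT[0] for k in range(1, D + 1)], axis=1)
Lp_mat = np.stack([fT_star[k] - fT_star[0] for k in range(1, D + 1)], axis=1)

# Recover A = (L' L^{-1})^T and verify f_I(x) = A f_I*(x), i.e. A recovers M.
A = (Lp_mat @ np.linalg.inv(L_mat)).T
assert np.allclose(fI, fI_star @ A.T)
assert np.allclose(A, M)
print("f_I(x) is a linear transformation of f_I*(x); recovered A equals M.")
```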

Review
Rating: 4

The paper analyzes the OOD generalization of CLIP through the lens of causal/invariant predictor learning, where the goal is to make predictions via the invariant (causal) features for the downstream task. Motivated by the failure cases of naive finetuning of CLIP, the authors propose CLIP-ICM as a principled approach. The proposed approach relies on linear identifiability guarantees in CLIP's representation space, which is further disentangled into invariant and environment-specific features by leveraging interventional data. With the identified invariant features, CLIP-ICM trains a linear probe for making predictions in the downstream task. CLIP-ICM is benchmarked on widely used OOD generalization datasets, where it outperforms baselines (existing strategies for finetuning CLIP), especially in the case of open-class domain shifts.

Update after rebuttal

I have read the author rebuttal and the other reviews as well. I think the paper is very interesting and technically sound, and it makes good use of the latent identification literature in proposing its methodology. Hence, I retain my rating and vouch for acceptance.

Questions For Authors

  • The description of the domain shift setting in Section 4 is a bit confusing. The authors mention that with a linear probe the CLIP embeddings are kept frozen (line 153), but when analyzing the results they mention finetuning CLIP (line 168). Were the CLIP representations finetuned along with the linear probe, or was only a linear probe trained on frozen representations?

  • In Table 1, under the open-class domain shift scenario, why does finetuning improve performance for base classes but deteriorate performance for novel classes?

Claims And Evidence

Yes, the claims made in the submission are well supported with clear and convincing evidence.

Strong empirical evidence for the claims

  • The failure cases with naive finetuning strategies of CLIP are highlighted clearly with experiments on the Terra Incognita dataset (Table 1).

  • The main experiments in Table 2 test CLIP-ICM with a variety of baselines on multiple benchmarks, with CLIP-ICM providing improved performance in nearly all the cases. Further, experiments in Table 3 with results for the open classes domain shift strengthen the author's claim of superior OOD generalization.

  • Given the requirement of interventional data in CLIP-ICM, the authors generate interventional data to test by manipulating both the base images and captions. This helps to analyze CLIP-ICM's performance with access to diverse interventional data, and ablations CLIP-ICM^{\star} and CLIP-ICM^{\dagger} provide further details.

Methods And Evaluation Criteria

Yes, the proposed methods and evaluation criteria make sense for the problem at hand. All the benchmarks used in this paper are widely used for out-of-distribution generalization. Regarding baselines for finetuning CLIP, I am not the best judge if all the relevant baselines have been used, since I am not familiar with recent works on CLIP finetuning.

Theoretical Claims

Yes, I checked the correctness of the proof for all the theorems and I did not find any major issues.

Experimental Designs Or Analyses

Yes, I checked the soundness/validity of all the experiments in the paper, and the experiment design doesn't have any flaws. Further, the authors have done a good job of analyzing their findings; the analysis is coherent with the experimental results.

Supplementary Material

Yes, I checked all parts of the supplementary material.

Relation To Broader Scientific Literature

The paper utilizes the methodology of invariant predictor learning, a fairly common approach for tackling out-of-distribution generalization.
Specifically, identifiable invariant predictor learning approaches have been proposed in prior works [1, 2]. The key contribution of the paper is to apply these ideas in the framework of CLIP.

References

  • [1] Lu, Chaochao, Yuhuai Wu, José Miguel Hernández-Lobato, and Bernhard Schölkopf. "Invariant causal representation learning for out-of-distribution generalization." In International Conference on Learning Representations. 2021.

  • [2] Yao, Dingling, Dario Rancati, Riccardo Cadei, Marco Fumero, and Francesco Locatello. "Unifying Causal Representation Learning with the Invariance Principle." arXiv preprint arXiv:2409.02772 (2024).

Essential References Not Discussed

No, I believe all essential references have been discussed to the best of my knowledge. The authors have written a very detailed related-works section.

Other Strengths And Weaknesses

Strengths

  • The paper is well written, with details about the proposed method easy to follow and the empirical findings are clear and easy to follow.

Weaknesses

  • The core ideas behind CLIP-ICM are not very original; the theoretical results in the paper mostly build upon existing proof techniques in the literature. Even the methodology of extracting invariant features from a representation with linear identification guarantees is not very novel. However, I don't think this is a major concern, as the application of identifiable invariant feature learning specifically to the CLIP framework is novel to the best of my knowledge.

Other Comments Or Suggestions

  • Given that the theoretical results (Theorems 5.3 and 5.4) are mostly an application of existing theoretical results, I suggest the authors rename these theorems to propositions.

  • Just as the authors mention that Theorem 5.3 aligns with results in prior works, the same should be done for Theorem 5.4 with respect to the prior work by Ahuja et al. 2023 on interventional causal representation learning.

Author Response

We thank the reviewer for their thoughtful evaluation and positive feedback. We appreciate the acknowledgment that our approach offers a solid theoretical foundation and demonstrates clear empirical benefits for OOD generalization. We also value the reviewer’s recognition that our claims are well-supported by both theoretical analysis and practical experiments, as well as the confirmation that our references to existing literature provide sufficient context. Moreover, we are pleased that the reviewer finds our writing to be clear and our explanation of the proposed method to be thorough.

Below, we address the reviewer’s additional questions and suggestions in detail.

Response to Weaknesses

We appreciate the reviewer’s positive comments and would like to clarify our position. We acknowledge that Theorem 5.3 and Theorem 5.5 indeed build upon previous work in the literature. However, our primary interest lies in extending these interesting theoretical insights to practical applications.

In our manuscript, we carefully discuss the conditions under which Theorem 5.5 and Theorem 5.6 hold, and we leverage these conditions to propose our CLIP-ICM method, which is designed to guarantee lower OOD generalization error. One particularly surprising and encouraging observation is that by mapping both image and text embeddings into a shared invariant subspace, CLIP is able to maintain its original zero-shot performance even when confronted with domain shift—thus, ensuring that it continues to perform well on new classes after task-specific fine-tuning.

We are grateful for the reviewer’s recognition of our work and believe that integrating theoretical results with real-world application strategies represents a significant contribution to the field.

Response to Other Comments or Suggestion

  1. Given that the theoretical results (Theorem 5.3, 5.4) are mostly an application of existing theoretical results, I suggest the authors should rename the theorems to propositions.

Thank you for your suggestion. In the final version, we will change these two theorems into propositions.

  2. the same should be done for Theorem 5.4 with the prior work by Ahuja et al. 2023 on interventional causal representation learning.

We thank the reviewer for this suggestion. In the final version of the manuscript, we will explicitly highlight both the connections and distinctions between our Theorem 5.4 and the interventional causal representation learning work by Ahuja et al. (2023).

Response to Questions For Authors

  1. Were the CLIP representations finetuned with linear probe or only a linear probe was trained with frozen representations?

We apologize for the confusion caused by our description, and thank the reviewer for highlighting this issue.

To clarify, in the domain-shift scenario (line 153), the CLIP embeddings remain frozen, and only the linear probe is trainable. At line 168, when we mentioned "fine-tuning," we were referring specifically to the linear probe training process, rather than updating the original CLIP image encoder or text encoder.

We will carefully revise this description in the final manuscript to clearly differentiate these two settings and avoid further confusion.

  2. Table 1, open class domain shift scenario, why does finetuning improve the performance for base classes but it deteriorates the performance for novel classes?

We thank the reviewer for raising this insightful question. This phenomenon—where fine-tuning improves performance on the base classes but degrades performance on novel classes—has been widely acknowledged in studies on adapting CLIP, including CoOp, CoCoOp, CLIP-Adapter, and CLIPood. As noted by Wortsman et al. [1] and Shu et al. [2], naively fine-tuning CLIP often results in a loss of its inherent strong generalization ability, manifesting as improved performance on the specifically fine-tuned downstream task but significantly weakened robustness under distribution shift (including both covariate shift and label shift).

A likely explanation for this deterioration on novel classes is tied to catastrophic forgetting, a phenomenon wherein a model “forgets” previously learned information when trained on new data. In the context of Table 1, when adapting CLIP to a set of base classes, the fine-tuning procedure heavily optimizes for accurate classification of those base classes. Consequently, the original parameters—particularly those responsible for generalizing to unseen classes—are overwritten. As a result, the previously robust zero-shot capability of CLIP (which was central to its strong open-class performance) is compromised.

[1] Wortsman, Mitchell, et al. Robust fine-tuning of zero-shot models. CVPR 2022.

[2] Shu, Yang, et al. Clipood: Generalizing clip to out-of-distributions. ICML 2023.

Reviewer Comment

Thanks a lot for the rebuttal! I think the paper is very interesting and technically sound, and it makes good use of the latent identification literature in proposing its methodology. Hence, I retain my rating and vouch for acceptance.

Author Comment

Thank you for your thoughtful review and kind recognition. We truly appreciate the time and effort you dedicated—it means a lot to us and encourages our continued work!

Review
Rating: 3

This work is motivated by the OOD generalization issue in CLIP. It addresses this problem by learning an invariant causal mechanism and proposes the CLIP-ICM framework, which includes collecting interventional data, estimating a linear projection matrix, and predicting in the invariant subspace. The proposed CLIP-ICM shows improvements on OOD datasets.

update after rebuttal

I appreciate the authors for their response, which addresses most of my concerns. I intend to maintain my original score.

Questions For Authors

Could the authors share more details of the environment diversity and applicable scenarios of the proposed method?

Claims And Evidence

In general, the claims are well supported. The paper builds on a well-studied principle in invariant learning; the high-level idea is not new, but it still contributes to CLIP model generalization.

Methods And Evaluation Criteria

The pipeline of the method is generally clear, but some of the technical details may need further clarification. For example, it would be better to include a more detailed description of the interventional data generation process.

Theoretical Claims

The theoretical analysis looks good to me.

Experimental Designs Or Analyses

The experiment would be improved if further diverse environments and contexts are included.

Supplementary Material

Yes, I have reviewed the appendix.

Relation To Broader Scientific Literature

The paper is related to CLIP applications in different domains.

Essential References Not Discussed

N/A

Other Strengths And Weaknesses

In general, the paper would be further improved with a more comprehensive evaluation.

Other Comments Or Suggestions

There are a few instances of informal writing style, e.g., the footnote on page 4 takes up a single sentence in the main text.

Author Response

We thank the reviewer for the thoughtful comments and positive feedback. We are pleased that the reviewer recognizes our work as well-supported, highlighting our clear pipeline and sound theoretical analysis. Below, we provide detailed responses addressing the specific concerns raised by the reviewer.

Response to Methods and Evaluation Criteria

We appreciate the reviewer’s comments regarding the interventional data generation process. We would like to clarify the following points:

  1. In the original manuscript, lines 369–371 (left column) briefly outline the steps for collecting image-based interventional data, while lines 381–384 (left column) and line 330 (right column) briefly describe the process for collecting text-based interventional data. We also note that the detailed collection process is provided in Appendix H.1.
  2. In Appendix H.1, we present a detailed description of the collection procedures for both image-based and text-based interventional data:
    1. For the image-based interventional data, we explain that it is generated using eight data augmentation techniques, including ColorJitter, GrayScale, GaussianBlur, RandomInvert, RandomRotation, RandomPosterize, RandomSolarize, and RandomEqualize. These augmentations are implemented directly with the torchvision.transforms package (a minimal sketch of such a pipeline is given after this list).
    2. The text-based interventional data comprises two components: the text description model and the text intervention model. Both models are generated by invoking GPT-4o. The prompts used for these models are provided on page 21, lines 1118–1142, and Figure 5 presents an example of the text-based interventional data.
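For concreteness, the sketch below shows how such an image-based intervention pipeline could be composed with torchvision.transforms. The parameter values and the helper function are illustrative placeholders; the exact procedure is the one described in Appendix H.1.

```python
from PIL import Image
from torchvision import transforms

# One candidate intervention per factor we want to perturb (illustrative parameters).
interventions = {
    "color_jitter":  transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    "grayscale":     transforms.Grayscale(num_output_channels=3),
    "gaussian_blur": transforms.GaussianBlur(kernel_size=5),
    "invert":        transforms.RandomInvert(p=1.0),
    "rotation":      transforms.RandomRotation(degrees=30),
    "posterize":     transforms.RandomPosterize(bits=4, p=1.0),
    "solarize":      transforms.RandomSolarize(threshold=128, p=1.0),
    "equalize":      transforms.RandomEqualize(p=1.0),
}

def make_interventional_pairs(image: Image.Image):
    """Return (original image, augmentation name, intervened image) triples."""
    return [(image, name, t(image)) for name, t in interventions.items()]

# Usage (path is a placeholder):
# pairs = make_interventional_pairs(Image.open("example.jpg").convert("RGB"))
```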

Please let us know if you have any additional suggestions for further improvements regarding this aspect.

Response to Experimental Designs Or Analyses

To address your concerns, in addition to the datasets used in the original manuscript (PACS, VLCS, OfficeHome, Terra Incognita, DomainNet, ImageNet, ImageNet-V2, ImageNet-S, ImageNet-A, and ImageNet-R), we conducted additional experiments on the iWildCam-WILDS 2020 dataset.

iWildCam comprises 203,029 images of 182 different animal species, collected from 323 camera traps distributed across various locations. The images obtained from different locations exhibit variations in lighting, color, camera angle, background, vegetation, and relative animal frequencies.

We follow the setting of Koh et al. (2021) [1], using images from 243 locations as the training domain and those from 48 other locations as the test domain. We report the average macro F1 score of CLIP, CLIP-ICM^*, CLIP-ICM^\dagger, and CLIP-ICM under both ID and OOD conditions, as shown in the table below:

| Method | ID (48 Locations) | OOD (243 Locations) |
|---|---|---|
| CLIP | 14.2 | 10.6 |
| CLIP Linear-Probe | 54.6 | 41.4 |
| CLIP-ICM^* | 15.6 | 13.3 |
| CLIP-ICM^* Linear-Probe | 56.2 | 42.1 |
| CLIP-ICM^\dagger | 15.2 | 12.2 |
| CLIP-ICM^\dagger Linear-Probe | 55.6 | 44.3 |
| CLIP-ICM | 15.8 | 14.1 |
| CLIP-ICM Linear-Probe | 57.1 | 46.1 |

[1] Koh et al. Wilds: A benchmark of in-the-wild distribution shifts. ICML 2021.
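For reference, the macro F1 metric used above averages per-class F1 scores, so rare species contribute as much as common ones. A minimal scikit-learn sketch (with toy labels standing in for the actual predictions) is:

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical species labels; in practice these are per-image predictions on iWildCam.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 3])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 3])

# average="macro" computes F1 per class and then takes the unweighted mean.
print(f1_score(y_true, y_pred, average="macro"))
```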

Response to Other Strengths And Weaknesses

Thank you for suggesting a more comprehensive evaluation. We have extensively validated our method across multiple datasets, including PACS, VLCS, OfficeHome, Terra Incognita, DomainNet, ImageNet, ImageNet-V2, ImageNet-S, ImageNet-A, and ImageNet-R, along with detailed ablation studies.

Additionally, we have included experiments on the iWildCam-WILDS 2020 dataset in our previous response. Moreover, in our reply to Reviewer Pp7f, we added ablation studies concerning the role of $A_{inv}$. Please let us know if you have any further suggestions regarding other experiments.

Response to Other Comments Or Suggestions

We thank the reviewer for pointing out this issue. In the final version, we will check all formatting issues and make the necessary revisions.

Response to Questions For Authors

We thank the reviewer for the question and are happy to elaborate.

  1. The environmental diversity across our evaluation datasets comes in different ways. For datasets like PACS, Office-Home, DomainNet, ImageNet-Sketch, and ImageNet-R, diversity primarily stems from variations in visual styles. In VLCS and Terra Incognita, it is reflected in background complexity, lighting conditions, and camera viewpoints. For ImageNet V2 and ImageNet-A, diversity arises from changes in image sources and the inclusion of hard-to-classify samples, respectively.
  2. Our method is generally applicable to real-world scenarios where environment-induced distribution shifts occur. Potential applications include monitoring in wildlife habitats, perception systems in autonomous driving, and cross-domain image-text retrieval. In particular, tasks that require stable semantic understanding across diverse environments can benefit from the CLIP-ICM framework’s ability to isolate invariant semantic factors from CLIP representations.
Review
Rating: 3

This paper introduces CLIP-ICM, a framework that improves CLIP’s OOD robustness by leveraging a causal perspective to separate invariant and variant factors. By learning a linear mapping to the invariant subspace using interventional data, CLIP-ICM enhances performance across multiple OOD datasets.

Questions For Authors

see above

Claims And Evidence

To the best of my knowledge, the evidence supports the claims well.

Methods And Evaluation Criteria

To the best of my knowledge, the evaluation follows the community convention.

Theoretical Claims

I have checked Section 5.1 and did not find any issues.

Experimental Designs Or Analyses

I have checked section 7 and believe it follows the community standard

Supplementary Material

I have checked the supplementary material A/B/C

Relation To Broader Scientific Literature

The paper studies the OOD from a causal inference perspective, which bridges the gap between the two fields

Essential References Not Discussed

n/a

Other Strengths And Weaknesses

  1. I am curious about the role of A_inv and the interventional data. After checking the ablation study, it seems there is no ablation at this level. It would be beneficial to present an ablation of the three steps in Figure 3 to help readers understand the importance of each component.

  2. Where does the variance come from in the results reported in Tables 2 and 3? Is it from differences in the interventional data or from different initializations? It would be beneficial to have clarity on that.

  3. Overall, the paper's presentation is very clear and comprehensive, and it studies an important problem that is of interest to the community.

Other Comments Or Suggestions

see above

Author Response

We thank the reviewer for the constructive feedback and valuable suggestions. We sincerely appreciate the reviewer for their positive feedback, especially for finding our claims well-supported, recognizing the clarity and comprehensiveness of our paper's presentation, affirming that our evaluation methodology aligns with community standards, and highlighting our contribution in bridging causal inference and OOD generalization. Additionally, we provide detailed responses to address the two specific concerns raised by the reviewer as follows.

Response to Other Strengths And Weaknesses

W1:

We appreciate the reviewer's interest in understanding the contribution of the linear projection matrix $A_{inv}$ and the role of interventional data. Regarding the role of interventional data, we would like to first emphasize a few points:

  1. Interventional data and the linear projection matrix $A_{inv}$ are mutually dependent components of our method. According to Equation (9), without interventional data it is unlikely that $A_{inv}$ can be estimated in our framework.
  2. As shown in Tables 2, 3, 5, 6, and 10–14, we have conducted extensive comparisons between three variants of our method utilizing different types of interventional data. Specifically:
    • CLIP-ICM^*: using only image-based interventional data,
    • CLIP-ICM^\dagger: using only text-based interventional data,
    • CLIP-ICM: using both types of interventional data.
  3. We have provided an ablation study on the effect of different numbers of interventional data pairs in Appendix M, Figure 6 (d).

Regarding the role of $A_{inv}$, we agree with your suggestion and thus include an additional ablation experiment to further illustrate its importance. Specifically, we use our generated image-based interventional data to train a linear probe on the DomainBed benchmark. The experimental results are summarized as follows.

| Method | PACS | VLCS | OfficeHome | TerraInc | DomainNet | Avg. |
|---|---|---|---|---|---|---|
| Linear Probe | 96.4 | 78.7 | 81.9 | 60.2 | 55.0 | 74.4 |
| Linear Probe + Interventional data | 96.8 | 79.3 | 82.3 | 60.5 | 55.8 | 74.9 |
| CLIP-ICM^* + Linear Probe | 97.5 | 86.5 | 84.6 | 64.3 | 64.0 | 79.0 |

From the results, we can observe that:

  1. Incorporating image-based interventional data into linear-probe training only slightly improves the performance of the linear probe (by 0.5%).
  2. Although both settings incorporate image-based interventional data for training, the performance of CLIP-ICM^* + Linear Probe is significantly better than that of Linear Probe + Interventional data.

These findings demonstrate that our proposed $A_{inv}$ module (i.e., the projection to the invariant subspace) indeed improves the performance of CLIP in OOD scenarios.
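To illustrate the role of this projection step, the sketch below shows one hypothetical way such a pipeline could look: estimate environment-dependent directions from (original, intervened) embedding pairs, project them out, and fit a linear probe on the projected features. The estimation rule shown here is a simplified illustration only, not Equation (9) from the paper, and all names and data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_invariant_projection(z_orig, z_interv, k_var):
    """Illustrative projector onto an 'invariant' subspace (not the paper's Eq. (9)).

    z_orig, z_interv: (N, D) embeddings of original / intervened images.
    k_var: number of directions treated as environment-dependent.
    The directions that change most under intervention are projected out.
    """
    diffs = z_interv - z_orig                    # variation induced by interventions
    _, _, Vt = np.linalg.svd(diffs, full_matrices=False)
    variant_dirs = Vt[:k_var]                    # top singular directions of the change
    return np.eye(z_orig.shape[1]) - variant_dirs.T @ variant_dirs   # orthogonal complement

# Usage sketch with random stand-ins for CLIP features (all hypothetical):
rng = np.random.default_rng(0)
N, D = 200, 64
z_orig, z_interv = rng.normal(size=(N, D)), rng.normal(size=(N, D))
labels = rng.integers(0, 5, size=N)

A_inv = estimate_invariant_projection(z_orig, z_interv, k_var=8)
probe = LogisticRegression(max_iter=1000).fit(z_orig @ A_inv.T, labels)
```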

W2:

Regarding the source of variance in Tables 2 and 3: each reported value is the mean and standard deviation over 5 runs with different random seeds. Following the standard evaluation protocol of the DomainBed benchmark, we believe the primary source of variance is the different splits of the training, validation, and test data across runs.

Final Decision

This paper proposes a causal mechanism for addressing CLIP's out-of-distribution (OOD) problem. The paper is well written, and reviewers noted the strong empirical evidence demonstrating the effectiveness of the proposed method compared to naive finetuning strategies and other baseline models. However, the theoretical claims were not unanimously accepted by reviewers. In particular, one reviewer raised significant criticisms regarding the theoretical foundation of the work, its assumptions, and the proposed theorems. During discussions with reviewers, it became clear that while these theoretical uncertainties are important, they may not undermine the strength of the empirical results. Moreover, reviewers disagreed among themselves about the theoretical merits of the work. Given these considerations, I believe the paper represents an interesting contribution to the field and therefore recommend acceptance. However, the authors should carefully address the theoretical concerns raised by reviewers in their revised manuscript.