A Generalization Theory for Zero-Shot Prediction
We present a theoretical framework for zero-shot prediction by prompting, highlighting the conditional independence relationships supporting the success of this approach.
Abstract
Reviews and Discussion
The paper takes a theoretical approach towards understanding the key quantities driving zero-shot prediction. Introducing and deriving bounds for this problem setting, the paper analyzes translation between modalities and the effectiveness of prompt engineering strategies in a multi-modal learning setting.
Questions For Authors
NA
Claims And Evidence
To the level I managed to delve into, the claims are supported with solid and rigorous evidence and theoretical corroboration.
Methods And Evaluation Criteria
They make sense, even though they can be extended considerably.
Theoretical Claims
Verifying this completely is almost a full-time job! To the level I could dive into, the derivations follow smoothly and make sense. I reviewed most of Appendix B.
Experimental Designs Or Analyses
Yes, they make sense but again they are quite limited to classification in two simplistic settings.
Supplementary Material
Appendix B. The rest I skimmed through, and at a high level it follows.
Relation To Broader Scientific Literature
In my view, this is an extremely insightful and well written paper --- definitely of value to the community. I enjoyed reading this paper.
Essential References Not Discussed
Covers it reasonably well.
Other Strengths And Weaknesses
Strengths
- Well written paper, and extremely insightful narrative.
- Solid theoretical foundation.
Weaknesses:
- Heavy focus on theory and limited experimental results and demonstrations.
- Lots of empty space within the text; I would just move Appendix F to the main text to cover that, or, even better, expand the numerical results to other settings.
Other Comments Or Suggestions
NA
Thank you for your hard work in verifying the paper. The authors are happy to hear that you found the narrative insightful and well-written. We address your concerns below.
“[The Methods And Evaluation Criteria] make sense, even though they can be extended considerably… [The Experimental Designs Or Analyses] make sense but…are quite limited to classification… Heavy focus on theory and limited experimental results and demonstrations.”
We address this feedback by 1) motivating why this is the most relevant task for studying modern ZSP, 2) providing additional experimentation with simulations using CLIP/VICReg and linear probe baselines, and 3) further justifying the theoretical focus of the work. Details are given below.
Firstly, we would like to emphasize that this task is by far the canonical one for studying ZSP. This is largely due to the fact that prompting is inherently tied to natural language. Indeed, consider Gadre et al. (NeurIPS, 2023), one of the largest-scale studies on the design of multimodal pre-training data for foundation models. In over three hundred experiments, models are evaluated on either zero-shot image classification, zero-shot image/text retrieval, or linear probing. As in our case, only the encoder architectures, datasets, and prompting strategies are varied. In retrieval, there is no prompting, and in linear probing, training data is available.
Secondly, in response to the explicit requests of 8U7M, we also include linear probing comparisons in Figure 5 of the rebuttal. This gives an idea of the near-optimal performance of the ZSP methods in the case that there is no residual dependence and the prompting strategy is unbiased. How this performance gap depends on the residual dependence is studied experimentally in Figure 6 of the rebuttal as well.
Finally, while you correctly pointed out that there is a relatively heavy focus on theoretical work in the paper, we highlight that this is intentional, as the mathematical foundations of modern ZSP are still in their infancy. At the time of submission, the only directly related work we were aware of was Chen et al. (ICLR, 2024), which we comment on at the bottom of Page 2. We were also made aware of the very recent preprint Oko et al. (arXiv, Jan. 2025) by Reviewer fhW7. Thus, we feel that the theoretical focus of the work is apt given the current gaps in the scientific literature on this topic.
Once again, we are grateful for your recommendation of acceptance and are happy to address any additional comments or questions during the discussion period.
Thanks for further clarification, this addresses my remaining concerns.
This paper provides a formal modeling of the two-stage learning procedure, known as CuPL: (1) pretraining on multimodal labeled data and (2) zero-shot prediction (ZSP) on the pre-trained model with natural language prompts. The goal is to offer a theoretical explanation of the success of CuPL. To achieve this, the paper analyzes how ZSP optimality depends on the pretraining task distribution, the downstream ZSP task distribution, and the prompting strategy.
To me, the key to this model is Equation (6), where two encoders, on the input and latent variables respectively, are introduced. Via this modeling, the authors point out that the ideal prompt samples an unobservable conditional distribution of latent variables, given the label. Based on this construct, the authors then compare the informational dependence of unimodal contrastive learning, reconstructive learning, and multimodal contrastive learning. They point out that the last one has the dependence structure most compatible with ZSP.
On top of the selected dependence structure, Theorem 1 identifies the epistemic error, bounded in terms of the pre-training sample size and the number of prompts. Then, Theorem 2 identifies the aleatoric error bound. These bounds eventually lead to the variance-regularized covariance loss in Equation (12). Experiments on image classification compare default community-curated prompts (baselines) and CuPL. Results show an increasing trend in CuPL accuracy with more prompts, supporting the theorems' claims about how the number of prompts affects the error bound.
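For readers unfamiliar with the setup, the zero-shot inference step being analyzed can be sketched as follows (a schematic in NumPy; variable names are illustrative and this is not the paper's code): each class is represented by the average embedding of its prompts, and a test image is assigned to the class whose prompt prototype it is most similar to.

```python
import numpy as np

def zero_shot_predict(image_embs, prompt_embs_per_class):
    """Prompt-ensembled zero-shot classification (CLIP/CuPL-style).
    image_embs: (n, d) array of image embeddings.
    prompt_embs_per_class: list of (m_k, d) arrays, one per class, holding
        the embeddings of that class's prompts (e.g., LLM-generated captions).
    Returns the predicted class index for each image."""
    # Average each class's prompt embeddings to form one "text prototype".
    prototypes = np.stack([p.mean(axis=0) for p in prompt_embs_per_class])
    prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)
    images = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return (images @ prototypes.T).argmax(axis=1)   # cosine-similarity scores
```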
Questions For Authors
I would love to improve my rating if the authors can answer the following questions.
- In the experiment setup, how are prompts generated in the baseline method, i.e., community-curated prompts?
- Why does the performance of the baseline stay exactly the same as more prompts are provided? Are the baseline prompts provided all at once? If so, this does not seem to be an apples-to-apples comparison, as CuPL gradually generates prompts.
- Section 2 has discussed alternative approaches in SSL: unimodal contrastive FSL and reconstructive FSL. Why not compare CuPL/ZSP methods to these baselines? In order to demonstrate the success of CuPL/ZSP, I suppose that the paper needs to demonstrate the advantages of CuPL/ZSP over other SSL methods, not just over community-curated prompts.
Claims And Evidence
Overall, I see several important claims in this paper, and they are all well supported. First, the authors claim that the dependence structure of multimodal contrastive learning fits the best for ZSP, and it is well-supported by the analysis in Section 2. Next, another claim is the epistemic and aleatoric error bounds in ZSP, given pre-training data size and number of prompts (Theorem 1 and 2). The proofs of these claims are sound, and the experimental evaluation supports the influence of prompt numbers. Finally, a third claim is a generalized form of variance-regularized covariance loss in Equation (12), and state-of-the-art CuPL methods are identified to fit this generalized form.
Methods And Evaluation Criteria
The evaluation method is reasonable, comparing CuPL methods that fit Equation (12) to the default prompting strategy. The metric is standard top-k accuracy, and the experiments not only compare CuPL to the baseline, but also show how an increasing number of prompts affects the result. There are still several things unclear to me about the baseline methods, and I will point them out in the questions to authors.
Theoretical Claims
As stated above, the theoretical claims are supported by sound analysis.
Experimental Designs Or Analyses
As stated above, the experimental design looks reasonable to me, except for the selection of baseline. I will list my questions in the last part.
Supplementary Material
Appendices and links are provided to further support the paper’s claims.
Relation To Broader Scientific Literature
This paper is well-related to broader literature, as it aims to provide theoretical support for the success of CuPL methods.
Essential References Not Discussed
I hope the authors could discuss more on the relationship between FSL, ZSP and MAML [1].
[1] Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." International conference on machine learning. PMLR, 2017.
Other Strengths And Weaknesses
Strengths:
- The paper is very well motivated, as the goal is to explain the success of CuPL.
- The proofs are rigorous and sound.
Weaknesses
- Although Section 2 has demonstrated why ZSP is most compatible with multimodal contrastive learning, it does not necessarily discuss the advantage of ZSP over other SSL methods, i.e., unimodal contrastive learning and reconstructive learning. One reason could be that obtaining labels for FSL is hard, but I would like to see more evidence supporting this.
- Like the above bullet point, the experiment does not compare ZSP with FSL baselines. I hope the authors could provide a sound reason for not doing so. To me, demonstrating the success of ZSP requires analysis and/or experiments to show it defeats other SSL methods.
- The narrative structure could be improved. For example, Assumption 1 is crucial for both theorems, but it is in the appendix. Much of the discussion of Theorem 1 appears before the theorem itself. It would be better if Assumption 1 and Theorem 1 were first stated, with the analysis following.
Other Comments Or Suggestions
N/A
Thank you for your thorough review. We address your comments below.
“I hope the authors could discuss more on the relationship between FSL, ZSP and MAML.”
We discuss model-agnostic meta-learning (MAML) in its offline variant, i.e., as a tool for multi-task or meta-learning. The offline meta-learning method can be thought of as a simpler setting than FSL, in which the downstream evaluation tasks are given to the user upfront. Thus, pre-training an encoder and training a predictor for all of the evaluation tasks can be done in one end-to-end step. On the other hand, FSL is not (in general) given any downstream task upfront, so models cannot be learned in an end-to-end manner at once. ZSP is a "harder" setting still, as no downstream data is given at any point in the training-evaluation pipeline.
“Although Section 2 has demonstrated why ZSP is most compatible with multimodal contrastive learning, it does not necessarily discuss the advantage of ZSP over other SSL methods. One reason could be that obtaining labels for FSL is hard, but I would like to see more evidence supporting this.”
First, we clarify that we do not intend to “discuss the advantage of ZSP over other SSL methods”, because SSL is a precursor to ZSP; they are not comparable. However, for the comparison of ZSP to FSL, obtaining labeled data is precisely the bottleneck that motivates ZSP (as you correctly pointed out). Second, the authors request additional clarification on what “more evidence supporting this” refers to, so we may adequately address your points. Having no access to downstream training data is not a claim; it is a specific data availability regime that is now well-established as a modern machine learning setting (see Pourpanah (2022)). ZSP is the corresponding pipeline for this problem. Thus, we do not argue for any advantage of ZSP over FSL—they are simply different methods for different problems. Accordingly, different SSL methods accompany these problems, as alluded to in Section 2.
In fact, we generally expect FSL to perform better than ZSP with equal pre-training data, as FSL receives strictly more task information than ZSP. That being said, a question we do consider is 1) “how close can ZSP get to the performance of FSL (with a large amount of training data)?” and 2) “how do we quantify it?”. This exactly leads to our mathematical notions of prompt complexity and residual dependence. That is, if the prompting strategy approximates the conditional distribution of the text Z given the label Y, and X and Y are approximately conditionally independent given Z (see Figure 2 in the paper), then ZSP has no theoretical disadvantage over FSL. In response to your review, we verify this in simulation in Figure 6 of the rebuttal.
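To make the conditional independence claim concrete, here is a one-line identity in the paper's X/Y/Z notation (a sketch of the idea only; the paper's operator-theoretic statement may differ). If $X \perp Y \mid Z$, then

$$
\Pr(Y = y \mid X = x)
= \mathbb{E}\big[\Pr(Y = y \mid Z, X = x)\,\big|\, X = x\big]
= \mathbb{E}\big[\Pr(Y = y \mid Z)\,\big|\, X = x\big],
$$

where the first equality is the tower property and the second uses the conditional independence. The Bayes predictor thus factors entirely through the X–Z and Z–Y relationships, which are (up to the class prior) accessible from pre-training data and the prompting strategy, with no downstream (X, Y) pairs required.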
“[Q1 + Q2] ...how are prompts generated in the baseline method... Why does the performance of the baseline stay exactly the same as more prompts are provided? Are the baseline prompts provided all at once?”
The baseline prompts are not “generated”, as they are selected by humans as defaults in the CLIP benchmark package. Therefore, they cannot be created in arbitrarily large amounts as LLM-generated prompts can. Rather than matching the number of prompts between human baselines and LLMs, this illustration shows the scaling of the downstream classification accuracy as the user generates a large volume of prompts. Moreover, the goal of the experiment is not to market class-conditional prompting as a new method, but to experimentally verify the saturation point at which the prompt variance term in Theorem 1 becomes negligible.
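As a rough illustration of this saturation effect (not the exact constants of Theorem 1): if the ensemble for class $y$ averages the embeddings of $m$ i.i.d. prompts, $\hat{\mu}_y = \frac{1}{m}\sum_{j=1}^{m}\phi(Z_{y,j})$ with $\mu_y = \mathbb{E}[\phi(Z_{y,1})]$, then

$$
\mathbb{E}\,\big\|\hat{\mu}_y - \mu_y\big\|^2 \;=\; \frac{1}{m}\,\operatorname{tr}\operatorname{Cov}\!\big(\phi(Z_{y,1})\big),
$$

so the prompt-variance contribution decays as $1/m$, and additional prompts stop improving accuracy once this term is dominated by prompt bias and residual dependence.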
[Q3] “The experiment does not compare ZSP with FSL baselines. I hope the authors could provide a sound reason for not doing so. To me, demonstrating the success of ZSP requires analysis and/or experiments to show it defeats other SSL methods… Why not compare CuPL/ZSP methods to these baselines?... the paper needs to demonstrate the advantages of CuPL/ZSP over other SSL methods, but not just community-curated prompts.”
As mentioned above, demonstrating the success of ZSP as compared to FSL was not a goal of this paper. Similarly, SSL was not a baseline for ZSP to defeat, but rather a form of pre-training that may lead to the ZSP capability. However, we are interested in studying the performance gap and its dependence on the residual dependence (see Figure 6). We provide experiments that compare the ZSP performance to FSL baselines using linear probing on the evaluation datasets seen in the paper: FGVC Aircraft, DTD, Flowers 102, SUN397 (see Figure 5).
If we have addressed your concerns, we would appreciate it if you would consider raising your score; if not, we are happy to answer any questions or consider additional experiments!
This paper explores the theoretical foundations of zero-shot prediction (ZSP) in foundation models, establishing a formal statistical framework to analyze how pretraining on large-scale, multimodal, unlabeled datasets transitions into downstream zero-shot inference via prompting. The authors identify key factors influencing the success of ZSP and introduce a novel perspective by modeling multimodal data as a joint distribution over X (input images), Y (latent labels), and Z (image captions).
Building on this framework, they reformulate image classification as a text classification problem through prompting, leading to a perspective where zero-shot inference is viewed as a sample estimator within a two-stage regression problem. Leveraging concepts from reproducing kernel Hilbert spaces (RKHS), the authors derive closed-form estimators for their statistical framework. They further provide a theoretical analysis of sample complexity, examining the impact of both dataset size and the number of prompts used in estimation. Additionally, they propose a new loss function for ZSP, referred to as the Variance Regularized Covariance objective, and show its connection to existing self-supervision objectives. Finally, the authors conduct semi-synthetic experiments using CLIP models to empirically assess their theoretical results, particularly the relationship between prompt sample complexity and performance.
Questions For Authors
- Clarification on "Ideal Prompting": In the experiment section, what exactly does "ideal prompting" refer to? Does it mean prompts manually designed by humans, in contrast to those generated by LLMs as described in the next paragraph? Understanding this distinction is important because it affects how the results should be interpreted. If "ideal" refers to human-generated prompts, it would be useful to clarify how they were designed and why they are considered ideal. If they are derived from some theoretical criterion, explaining that explicitly would improve clarity.
- In the section "Learning via Variance-Regularized Covariance", are the authors proposing a new loss function for ZSP, or are they simply re-deriving existing self-supervised learning (SSL) objectives within the proposed framework? This distinction is important because if a new loss function is being introduced, its effectiveness should be empirically validated to demonstrate its advantages over existing approaches. If the section instead provides a theoretical reinterpretation of existing objectives, a more detailed discussion on how this perspective enhances our understanding of SSL objectives and whether it offers any practical benefits would improve the paper.
- Can the authors control the pretraining process to empirically validate the proposed framework? If direct control over pretraining is not feasible, would it be possible to construct a synthetic dataset that aligns with the theoretical assumptions? This would provide stronger empirical support for the framework.
- What are the implications of prompt bias and residual dependence? These concepts are introduced as key components of the framework, but their practical significance is not clearly articulated in the later sections of the paper. Clarifying their role—either through empirical validation or additional theoretical discussion—would strengthen the paper's contributions.
Claims And Evidence
The paper presents a statistical framework for analyzing zero-shot prediction (ZSP) in foundation models. It introduces key concepts such as prompt bias, residual dependence, and prompt sample complexity while relating a class of self-supervised learning (SSL) objectives to a variance-regularized covariance (VRC) form. While this framework offers an interesting theoretical perspective, the claims made in the paper are not sufficiently supported by empirical evidence. Below, I outline specific issues with the key claims:
- Prompt Bias and Residual Dependence:
- Claim: The authors introduce the notions of prompt bias and residual dependence to quantify deviations from optimal zero-shot inference.
- Issue: The paper does not provide empirical evidence demonstrating the practical usefulness of these concepts. While they are mathematically well-defined, their impact on real-world zero-shot tasks remains unverified.
- Connection Between SSL Objectives and Variance-Regularized Covariance (VRC):
- Claim: The paper relates a class of SSL objectives to a variance-regularized covariance formulation, suggesting theoretical connections to VICReg.
- Issue: Despite establishing these theoretical connections, the authors do not provide any experiments to validate the proposed loss function's effectiveness in the ZSP setting. Without empirical results, it remains unclear whether this formulation offers practical benefits over existing self-supervised learning objectives.
- Prompt Sample Complexity and LLM-Ensembled Prompts:
- Claim: The authors propose a notion of prompt sample complexity to support their theoretical analysis and argue that ensembling LLM-generated prompts improves zero-shot performance.
- Issue: The proposed prompt sample complexity does not substantively contribute to the theoretical analysis, as it does not establish new insights beyond existing work. Moreover, the claim that ensembling LLM-generated prompts improves performance is already well-documented in the community (e.g., [1][2][3]), making this result unsurprising rather than a novel contribution.
References: [1] Menon, Sachit, and Carl Vondrick. "Visual Classification via Description from Large Language Models." ICLR, 2023. [2] Yang, Yue, et al. "Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification." CVPR, 2023. [3] Esfandiarpoor, Reza, Cristina Menighini, and Stephen Bach. "If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions." EMNLP, 2024.
Methods And Evaluation Criteria
The authors use CLIP models and standard zero-shot image classification benchmarks, which are reasonable choices for proof-of-concept experiments on prompt complexity. However, this setup is insufficient to fully support the broader claims made in the paper. Notably, the analysis does not extend to other self-supervised learning (SSL) settings, such as image-to-image tasks, which could provide a more comprehensive evaluation.
Moreover, the paper explicitly poses the question: "By what composition of learning stages does ZSP achieve near-optimal statistical performance on a downstream task, and with what dependence on (1) the pre-training data distribution, (2) the downstream task distribution, and (3) the prompting strategy?" However, the empirical experiments are not structured to directly address this question, limiting the strength of the paper’s conclusions.
Theoretical Claims
I checked the correctness of the theoretical proofs in detail, but the analysis appears to be a straightforward combination of standard results in RKHS theory. However, I noticed a potential issue in lines 1621–1622, where the approximation is used without justification. Without further clarification, the validity of this assumption and the subsequent results remains unclear.
Experimental Designs Or Analyses
I reviewed the experimental design and analysis. The authors use CLIP models and standard zero-shot image classification benchmarks, which are reasonable choices for proof-of-concept experiments on prompt complexity. However, some aspects need further clarification.
First, it is unclear how the number of prompts is scaled in Ideal Prompting with observations from the caption distribution given the label. Specifically, do the authors sample captions from a predefined caption pool and simply combine them? More details on this process would help clarify their approach.
Second, a potential issue arises from the use of LLaMA 3 for generating prompts. The distribution in CLIP may not be well approximated by a language model, introducing an additional source of bias. This discrepancy could affect the validity of the results and should be addressed.
Supplementary Material
I reviewed the code in the supplementary material as well as the appendix. The code includes implementations of the experiments conducted in the paper, which appear to be well-structured and correctly implemented. The appendix provides a detailed theoretical analysis of the proposed framework, including:
- Derivations of closed-form estimators within the statistical framework,
- Connections to self-supervised learning (SSL) objectives and predictors, and
- Detailed descriptions of the experimental setup.
Overall, the supplementary material is comprehensive and aligns with the main paper.
Relation To Broader Scientific Literature
The main contribution of this paper is to provide a theoretical foundation for the common practice of zero-shot prediction (ZSP) using multiple prompts per class as an ensemble. While this approach is widely used in the community, its theoretical underpinnings have not been well studied. The authors attempt to formalize this practice by introducing a two-stage regression framework and analyzing prompt sample complexity. This theoretical perspective helps unify existing empirical findings and suggests promising future research directions.
Specifically, prior works have demonstrated the effectiveness of class-conditional prompt ensembling in improving zero-shot classification:
- [1] Uses large language models (LLMs) to generate class descriptors for classification prompts, averaging them to create an ensemble—a direct example of the class-conditional prompt ensemble approach.
- [2] Generates visual descriptions for each class and averages them, another instance of class-conditional prompt ensembling.
- [3] Utilizes multiple concepts to describe each class and applies them in concept bottleneck models.
- [4] Extracts detailed visual descriptions from LLMs for zero-shot classification and extends this technique to few-shot adaptation.
- [5] Investigates the visual features that are most effective for vision-language model (VLM) classification, which can be interpreted as a direct attempt to estimate Y instead of sampling Z.
The proposed two-stage regression framework provides a unified theoretical perspective on these empirical approaches, offering a mathematical basis for prompt ensembling and prompting strategies in ZSP. This connection highlights key aspects of prompt bias, residual dependence, and sample complexity, which could inspire future work on designing more theoretically grounded prompting techniques.
References
[1] Menon, Sachit, and Carl Vondrick. "Visual Classification via Description from Large Language Models." ICLR, 2023.
[2] Pratt, Sarah, et al. "What does a platypus look like? Generating customized prompts for zero-shot image classification." ICCV, 2023.
[3] Yang, Yue, et al. "Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification." CVPR, 2023.
[4] Maniparambil, Mayug, et al. "Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts." ICCV, 2023.
[5] Esfandiarpoor, Reza, Cristina Menighini, and Stephen Bach. "If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions." EMNLP, 2024.
Essential References Not Discussed
[1] analyzes the relationship between concept frequency in the pretraining dataset and its impact on downstream performance. Since the proposed framework discusses the role of pretraining data distribution in ZSP, this phenomenon seems directly relevant to their theory and could provide additional insights into sample complexity and residual dependence.
[2] proposes another statistical framework for ZSP and provides a theoretical analysis of multimodal generative AI, including CLIP. Given that the current paper also introduces a new theoretical framework, it is important to compare these approaches and clarify how they relate. This work seems particularly relevant to understanding the assumptions and limitations of the proposed framework.
References: [1] Udandarao, Vishaal, et al. "No 'Zero-Shot' Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance." NeurIPS, 2024. [2] Oko, Kazusato, et al. "A Statistical Theory of Contrastive Pre-training and Multimodal Generative AI." arXiv preprint arXiv:2501.04641, 2025.
Other Strengths And Weaknesses
Strengths
- The theoretical framework is interesting and could serve as a solid foundation for future work.
- The paper provides an integrated perspective on self-supervised learning (SSL) and zero-shot prediction (ZSP), which may be valuable for bridging these areas.
Weaknesses
- The paper lacks clear organization. The scope is quite broad, yet the theoretical analysis and empirical results are not well integrated to support the full range of topics covered.
- The main contribution is unclear, making it difficult to pinpoint the paper’s key takeaway.
- The experimental validation is not comprehensive enough to fully support the claims and theoretical results.
- The theoretical analysis is heavily reliant on RKHS theory but does not offer practical guidance on key aspects such as the choice of the number of prompts or effective prompting strategies.
Other Comments Or Suggestions
Please see questions.
Thank you for taking the time to read our manuscript critically. This is undoubtedly one of the most comprehensive reviews we have ever received. Please see your comments addressed below.
“What are the implications of prompt bias and residual dependence?... Clarifying their role... would strengthen the paper's contributions… would it be possible to construct a synthetic dataset that aligns with the theoretical assumptions?”
We provide a synthetic example (see Figure 6) to highlight these implications. We focus on residual dependence, which has only been previously alluded to in Oko et al. (arXiv, 2025), where it is simply assumed to be zero. We aim to clearly illustrate two claims: 1) the residual dependence of the image-caption-label triple governs the performance gap between the two-stage predictor (Eq. (5) / Eq. (6) of our paper) and the Bayes optimal predictor, and 2) the encoder-based zero-shot predictors used in practice (such as in CLIP) behave as the two-stage predictor given enough data. The mathematical details are given in the linked derivations. This experiment controllably interpolates between the "worst" setting, in which ZSP performs at near chance, and the zero-residual-dependence setting, in which ZSP performs optimally. While further investigation would be interesting, both claims are supported within the simulation.
In other words, the residual dependence is a simple distribution parameter that governs how close the best ZSP method can come to the optimal downstream predictor, which is of clear interest both in theory and in practice. While out of scope for this paper, we hypothesize that this quantity can be estimated in pre-training data selection methods.
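For readers who want a concrete picture, below is a toy NumPy construction (ours, and not necessarily the one used in the rebuttal's Figure 6) of an image-caption-label triple in which a single mixing parameter `eps` plays the role of residual dependence: at `eps = 0` the captions determine the label and prompt-based ZSP is near optimal, while at `eps = 1` the captions carry no label information and ZSP is at chance even though the images remain informative.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, n = 10, 32, 5000           # classes, feature dimension, samples per draw
E = rng.normal(size=(K, d))      # one "concept" vector per label

def sample(eps, n):
    """Toy (X, Z, Y) triple. eps = 0: Z determines Y, so X and Y are
    conditionally independent given Z. eps = 1: Z is pure noise, so the
    label information in X bypasses Z entirely (maximal residual dependence)."""
    Y = rng.integers(K, size=n)
    X = E[Y] + 0.5 * rng.normal(size=(n, d))              # images carry label signal
    Z = (1 - eps) * E[Y] + eps * rng.normal(size=(n, d))  # captions may or may not
    return X, Z, Y

def zsp_accuracy(eps, m_prompts=20):
    """Prompt-based ZSP: average a held-out caption pool per class as the
    class prototype, then classify test images by inner product."""
    _, Z_pool, Y_pool = sample(eps, n)
    prototypes = np.stack([Z_pool[Y_pool == k][:m_prompts].mean(axis=0)
                           for k in range(K)])
    X_test, _, Y_test = sample(eps, n)
    return (np.argmax(X_test @ prototypes.T, axis=1) == Y_test).mean()

for eps in (0.0, 0.9, 1.0):
    print(f"eps = {eps:.1f}:  ZSP accuracy = {zsp_accuracy(eps):.2f}")
```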
“… are the authors proposing a new loss function for ZSP…? … it remains unclear whether this [variance-regularized covariance] formulation offers practical benefits over existing self-supervised learning objectives.”
There is no intention in the paper to propose a new loss function for SSL, or to even promote one SSL method over another. On the contrary, we include Appendix E to embrace existing, complementary work on SSL and to provide intuition on the relationship between our normalized cross-covariance-based estimator and the common SSL procedures used in practice. This is verified experimentally at least for the motivating example of CLIP in the given simulation.
“zero-shot image classification benchmarks… are reasonable choices for proof-of-concept experiments on prompt complexity… the analysis does not extend to other self-supervised learning (SSL) settings.”
Due to character limits, please see our response to Reviewer GT2x, in which we describe the motivation for this task. In summary, because we specifically study zero-shot prediction through prompting, similar tasks such as image/text retrieval (which do not include a prompting component) are not meaningful in our setting.
“…what exactly does 'ideal prompting' refer to?”
The ideal prompting strategy is one in which the prompt bias term is zero, i.e., the user is able to draw from the conditional distribution of captions given the label (one of the implications of defining prompt bias in the first place). In the ImageNet-Captions dataset, we are able to compare to this ideal strategy because we have direct observations of caption-label pairs (as images have both captions and labels). Thus, we hold out a pool of pre-training examples and draw from these pairs to estimate the prompts.
“Moreover, the paper explicitly poses the question: "...does ZSP achieve near-optimal statistical performance on a downstream task, and with what dependence on (1) the pre-training data distribution, (2) the downstream task distribution, and (3) the prompting strategy?" However, the empirical experiments are not structured to directly address this question.”
Thank you for raising this point. On (1), our theoretical analysis suggests that this dependence is captured by the residual dependence quantity, and upon your suggestion, we designed synthetic experiments to address this claim. We can perform similar real data experiments if the reviewer finds it helpful. On (2), we did not dedicate experiments to showing how distribution shift may affect performance, as countless empirical studies (see Quiñonero-Candela et al. (2022)) exist on this topic. On (3), the experiments in Section 4 help determine the scaling behavior of the accuracy with the number of prompts, highlighting that the ideal range for reducing prompt variance can be as high as 50-100 prompts.
We also have addressed all other comments (references, clarifications) but reserve them for the discussion period due to space limitations. Given our efforts to improve the paper in experimentation and presentation in light of your recommendations, we hope you will consider raising your score to above the acceptance threshold. Please allow us to answer additional concerns you have!
Thank you for your detailed and thoughtful response — many of your clarifications effectively addressed my questions and concerns. I’m now more convinced that the paper makes meaningful contributions. While I still think the organization could be improved to more clearly highlight the theoretical insights and their practical implications, I am considering increasing my score to above the acceptance threshold.
Regarding your response
Thank you for raising this point. On (1), our theoretical analysis suggests that this dependence is captured by the residual dependence quantity, and upon your suggestion, we designed synthetic experiments to address this claim. We can perform similar real data experiments if the reviewer finds it helpful. On (2), we did not dedicate experiments to showing how distribution shift may affect performance, as countless empirical studies (see Quiñonero-Candela et al. (2022)) exist on this topic. On (3), the experiments in Section 4 help determine the scaling behavior of the accuracy with the number of prompts, highlighting that the ideal range for reducing prompt variance can be as high as 50-100 prompts.
On point (1), I believe that including a corresponding real-world experiment—though understandably difficult to add during the rebuttal period—would substantially strengthen the paper’s contributions. Currently, the real-data experiments primarily focus on prompting strategies, which may make it harder for readers to fully appreciate the broader theoretical implications.
On point (2), could you clarify or point to where in the manuscript your theoretical framework is connected to prior work on distribution shift? In particular, citing work where significant label shift leads to degraded performance in zero-shot prediction would help situate your theory more clearly within the existing literature.
Lastly, I’d be very interested in seeing the clarifications and references you mentioned were omitted due to space constraints.
Thank you again for your careful and thorough reply!
Thank you for engaging in the discussion!
On point (1), we agree that bringing ideas from the simulation into a corresponding real-world experiment is the ultimate goal and we will include such an experiment in the manuscript. We hope that the simulation can at least express our vision for the types of experiments that can be conducted to illustrate our theory. In particular, correlating accuracy gaps to (estimates of) the residual dependence on benchmark datasets will aid our case, as you pointed out.
On point (2), in our response, we alluded to two settings. The first concerns more classical results on distribution shift, in which the pre-training task may be supervised (i.e., ImageNet classification) and the downstream task has the same label space (so that no fine-tuning is necessary) or a fixed fine-tuning budget is permitted. These studies include Hendrycks & Dietterich (ICLR, 2019) and Recht et al. (ICML, 2019), where both natural and synthetic shifts are applied and both FSL and ZSP performance is measured. The second setting is the modern prompting-based ZSP that is studied in our paper. One central reference is Goyal et al. (CVPR, 2025). Their analyses are in a setting that lies between FSL and ZSP, in that one may access image-caption pairs from the downstream task, but may not necessarily have direct image-label pairs. Based on your feedback, we plan to conduct experiments similar to those in their Section 4.1 for the final version, measuring the correlation of ZSP performance with increasing severity of data corruption (in the sense of datasets such as ImageNet-C and CIFAR10-C).
Clarifications and References:
“…the approximation is used without justification.”
The approximation is meant to follow from a first-order Taylor expansion under the assumption that the expanded quantity is sufficiently small. To make it an exact equality, we may use the exact formula and carry the remainder term. We will make this edit in the final version.
“The distribution in CLIP may not be well approximated by a language model, introducing an additional source of bias. This discrepancy… should be addressed.”
The language model is the source of bias which is quantified in the theory, not an additional one.
“[1] analyzes the relationship between concept frequency in the pretraining dataset and its impact on downstream performance… this phenomenon seems directly relevant to [the authors'] theory and could provide additional insights into sample complexity and residual dependence.”
It is absolutely relevant to both sample complexity and residual dependence, and we will include the discussion in the final version. The near-exponential scaling of pre-training data with linear improvement to downstream classification shown in [1] is equivalent to the excess misclassification risk decaying only logarithmically in the sample size, which is slower than, but practically reflective of, our rate in Theorem 1. We hypothesize that their concept frequency notion reflects residual dependence; a conceptually rich caption can be predictive of the class label even without the image, indicating near conditional independence of the image and label given the caption.
“[2] proposes another statistical framework for ZSP and provides a theoretical analysis of multimodal generative AI, including CLIP… it is important to compare these approaches and clarify how they relate.”
Thank you for identifying this relevant reference. We will discuss [2] as concurrent work in adherence to the ICML 2025 Guidelines. This work, while operating in a similar framework, captures a complementary aspect of the problem: the ability of the encoders learned by the CLIP objective to capture relevant distributional information. This leads to their concept of approximate sufficiency, and the generalization bounds measure errors in the encoders in terms of this sufficiency term (as opposed to our use of sample complexity).
We feel the more interesting comparison is made when considering the analysis of downstream predictions. [2] makes two idealized assumptions in the case of ZSP: 1) they assume that at inference time, the user may sample prompts directly from the distribution of the caption given the label (see the setup before their Eq. (5)), and 2) they assume (see their Assumption 2) that the image and label are conditionally independent given the caption. The fact that neither of these holds in practice is precisely what gives rise to our notions of prompt bias and residual dependence; these two quantities exactly quantify the degree to which these assumptions are violated. Thank you once again for suggesting this comparison.
The paper proposes a theoretical framework for zero-shot prediction linking pre-training to prompting and also introduces residual dependence (information loss between modalities) and prompt complexity (sample/prompt trade-offs). Risk bounds show ZSP needs huge pre-training data but few prompts.
Update after rebuttal
Thanks for the effort, I decide to keep the score.
Questions For Authors
No
Claims And Evidence
Yes
Methods And Evaluation Criteria
Yes. The framework and experimental design rigorously explain why self-supervised pre-training + prompting works for zero-shot image tasks.
Theoretical Claims
Yes
Experimental Designs Or Analyses
Yes
Supplementary Material
Yes
Relation To Broader Scientific Literature
The work proposes a theoretical framework for zero-shot prediction linking pre-training to prompting.
Essential References Not Discussed
No
Other Strengths And Weaknesses
Strengths: Rigorous analysis; connects SSL objectives (CLIP, VICReg) to theory; explains the success of LLM-generated prompts.
Weaknesses: The theoretical analysis is limited to CLIP-like multi-modal models.
Other Comments Or Suggestions
No
Thank you for your review—we address your main comment below.
"The theoretical analysis is limited to CLIP-like multi-modal models."
Our analysis broadly describes multimodal encoder + prompting strategies, where the encoders could be learned by a variety of objectives (not only CLIP). Please see Appendix E for a review of such multimodal encoder strategies, including Multimodal InfoNCE/CLIP, BarlowTwins/Nonlinear CCA, the Spectral Contrastive Loss, and Multimodal VICReg. Crucially, we do not focus on the mechanics of how particular objectives or optimization algorithms result in particular encoders (see the related work in Section 1 and references therein). We focus on how the downstream performance is affected by 1) the dependence structure of the image-caption-label triple and 2) the nature and amount of prompting used.
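For reference, here is a minimal sketch (ours, in PyTorch) of the symmetric multimodal InfoNCE objective mentioned above; the normalization and temperature handling in Appendix E may differ, and the other objectives listed there (VICReg, Barlow Twins, spectral contrastive) would replace this function while leaving the encoder + prompting pipeline unchanged.

```python
import torch
import torch.nn.functional as F

def multimodal_infonce(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE (CLIP-style) loss over a batch of paired
    image/caption embeddings; matched pairs sit on the diagonal."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature               # (B, B) pairwise similarities
    targets = torch.arange(img.shape[0], device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```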
To improve based on your feedback, we have included a simulation in Figure 6 of the rebuttal wherein the two-stage predictor analyzed in our paper is compared to both CLIP and VICReg trained on the corresponding data. We find that the dependence of two-stage prediction on the residual dependence follows the same trend as that of both CLIP and VICReg.
If this point is addressed adequately, we would appreciate it if you would consider raising your score; if not, please let us know if you have further comments or questions, and thank you once again for this feedback!
This paper develops risk bounds for multi-modal foundation models in the zero-shot prediction (ZSP) setting. In such models, two modalities (such as text and images in the case of vision-language models like CLIP) are each embedded into a shared space. In the ZSP setting, test instances of one modality are classified by measuring their distance in the shared space to possible class labels represented by prompts in the other modality.
The first contribution of this paper is to present a nice formalization of this complex setup that relates the pre-training data of the foundation model to the data of the downstream task. Crucially, this setup relaxes the assumptions of prior work that require the marginal distributions of example descriptions to match between the pre-training and test data. Instead, the proposed framework allows for distribution shift between the two.
The second contribution is to formulate generalization guarantees in this setting. The guarantees are built in terms of reproducing kernel Hilbert spaces. A key piece of these guarantees is formalizing the notion of prompt bias with respect to a hypothetical ideal prompt.
The reviewers agreed that the paper was insightful and significant for an important setting that has been challenging to analyze theoretically. During the discussion period, several clarifications were developed that the authors are encouraged to add to the final version. In addition, Reviewer fhW7 noted the relationship between the paper's analysis of class-conditional prompts and a recent line of practical work on generating such prompts. The paper could also benefit from discussing this connection.