PaperHub
ICLR 2025 · Accept (Poster) · 3 reviewers
Overall rating: 6.3/10 (individual ratings: 6, 8, 5; min 5, max 8, std 1.2)
Confidence: 3.0 · Correctness: 3.0 · Contribution: 2.0 · Presentation: 3.3

Can Generative AI Solve Your In-Context Learning Problem? A Martingale Perspective

OpenReview · PDF
Submitted: 2024-09-25 · Updated: 2025-03-02

Abstract

This work is about estimating when a conditional generative model (CGM) can solve an in-context learning (ICL) problem. An in-context learning (ICL) problem comprises a CGM, a dataset, and a prediction task. The CGM could be a multi-modal foundation model; the dataset, a collection of patient histories, test results, and recorded diagnoses; and the prediction task to communicate a diagnosis to a new patient. A Bayesian interpretation of ICL assumes that the CGM computes a posterior predictive distribution over an unknown Bayesian model defining a joint distribution over latent explanations and observable data. From this perspective, Bayesian model criticism is a reasonable approach to assess the suitability of a given CGM for an ICL problem. However, such approaches---like posterior predictive checks (PPCs)---often assume that we can sample from the likelihood and posterior defined by the Bayesian model, which are not explicitly given for contemporary CGMs. To address this, we show when ancestral sampling from the predictive distribution of a CGM is equivalent to sampling datasets from the posterior predictive of the assumed Bayesian model. Then we develop the generative predictive $p$-value, which enables PPCs and their cousins for contemporary CGMs. The generative predictive $p$-value can be used in a statistical decision procedure to determine when the model is appropriate for an ICL problem. Our method only requires generating queries and responses from a CGM and evaluating its response log probability. Using large language models, we empirically evaluate our method on tasks involving tabular data, imaging data, and natural language data.
Keywords
generative models, Bayesian, in-context learning, generalization, model checking

Reviews and Discussion

Official Review
Rating: 6

This paper studies how the tools and terminology from Bayesian modelling can be applied to in-context learning with LLMs.

The paper has ten pages.

The first six pages introduce a lot of terminology from Bayesian modelling and ask how this is connected with the auto-regressive next-token objective of the LLM. The authors make it plausible that the terminology from Bayesian modelling offers a useful perspective on LLM in-context learning and derive a theorem that shows that "the martingale predictive p-value is equal to the posterior predictive p-value."

The last four pages of the paper centre around a simple empirical application. The authors evaluate Llama-2 7B and Gemma 9B on a synthetic regression task, as well as on SST2 and MedicalQP. They find that the small models handle SST2 well but perform no better than random on MedicalQP.

The paper is overall well-written and sound. However, its research contribution is not significant enough for ICLR.

Strengths

The paper's topic, how tools from Bayesian analysis can be useful to understand in-context learning with LLMs, is interesting.

The paper is well-written in the sense that its language and structure are clear. I also like the expressive section headings.

I did not check the details of the proofs, but both the theoretical arguments in the paper and the experimental results make sense.

Weaknesses

While the topic of the paper is interesting and the paper is well-written, neither the theoretical nor empirical results in this paper are strong enough to warrant acceptance at ICLR.

The insight that there are connections between in-context learning and Bayesian methods is not novel (https://arxiv.org/pdf/2111.02080). While the first 6 pages of this paper offer a nice methodological discussion, it remains unclear what the research contribution is (apart from Theorem 2, which seems to be a relatively straightforward result).

The experimental analysis is limited to a synthetic regression task and two real-world datasets. The performance difference between SST2 and MQP for the small models is so striking that I do not believe we would need any of the Bayesian tools offered in this paper to see this.

Given the motivation in the first paragraph of the paper, I think the Medprompt paper is a missing reference: https://arxiv.org/abs/2311.16452

Questions

Why are there no references in the first two paragraphs of the paper? In general, I find many points in this paper where references are required.

Figure 1: What is the meaning of the different model responses? Did you sample? If yes, at what temperature?

Why did you choose SST-2 and MQP? Why Llama-2 7B model?

Why not consider a bigger model (e.g. Llama-3 405B) and see if it can perform better than random on MQP?

"ancestral sampling": At the end of the day, is this just a fancy term to describe how you are doing standard sampling of the next token from the language model? You are introducing a lot of terminology in this paper, and it is not always clear whether this is required (also: "explanation", "library")

After discussion period:

After the discussion with the authors, I have decided to increase my score to 6.

The generative predictive p-value is an interesting concept. The authors believe that it has the potential to improve upon existing techniques, and I find their arguments broadly plausible. This concept might be of interest to the ICLR community.

Reason for why not higher score:

Even though the authors provided additional experiments as part of the rebuttal, this paper's experimental section is still a weak spot.

Concretely, I don't think that the current version of the experiments makes a sufficiently strong case for the usefulness of the generative predictive p-value, in the way in which the authors claimed that it can be useful during the rebuttal. The reader who is not genuinely interested in the Bayesian motivation and theory behind the generative predictive p-values is not sufficiently convinced by the presented experiments. The particular experiments in this paper are also not connected to the recent literature on in-context learning (at least, I don't see a strong connection). For example, there are, at this point, already several different papers that study in-context learning with LLMs and tabular data, and it is unclear how the experiments in this paper relate to the overall results in the literature.

It is a minor point, but some aspects of the experimental section are also not particularly well done, such as the descriptions of the figures.

Comment

Thank you for your thoughtful and detailed feedback on our paper. We are glad to hear that you found the topic interesting and appreciated the clarity of our writing and structure. Your comments have helped us improve the clarity of our work and we have updated the manuscript accordingly.

The insight that there are connections between in-context learning and Bayesian methods is not novel (https://arxiv.org/pdf/2111.02080).

Thank you for the opportunity to clarify. We note that the paper of Xie et al. is cited in the original manuscript (line 201). We hope our revised manuscript better conveys that our work builds off the Bayesian interpretation of ICL and addresses the underexplored question of whether the latent model implied by a CGM is appropriate for a given task. This is related to testing Assumption 4 (Well-specification) of Xie et al.. We have added Figure 2 to illustrate the consequences of using a misaligned model to make inferences about data generated according to a given ICL task. Figure 2d illustrates a particularly concerning case where the model is confident and wrong about its predictions. We design our methods to predict these cases and cases when there are not enough in-context examples for reliable task completion. For the latter cases, the utility of the NLL discrepancy function becomes apparent. We have added synthetic experiments in Section 7.2 to illustrate that while the NLL and NLML discrepancy functions lead to accurate predictors of model capability, only the NLL discrepancy provides information about whether there are enough in-context examples. We further demonstrate in Figure 10 that this information can reduce risk, a necessary consideration in safety-critical applications. Our contribution enables the estimation of discrepancies like the NLL using CGMs like large language models.

The experimental analysis is limited to a synthetic regression task and two real-world datasets.

We have added additional experiments to address this concern. For the language task, we add AG News as another in-capability task and RTE as another out-of-capability task. Figure 8 illustrates that the generative predictive $p$-value is still a robust predictor of out-of-distribution tasks. In applications of generative AI such as image or document completion, accuracy is either poorly defined or does not capture the complexity of the task. We devise an in-context generative fill experiment using the SVHN (in-distribution), MNIST (near OOD), and CIFAR-10 (far OOD) datasets. The model is prompted with several in-context image examples and asked to complete the missing half of the last example. The goal is to produce sensible completions, but not necessarily reconstruct the target image exactly. Figure 5 of the updated manuscript illustrates this task. Figure 5a demonstrates that when the model is trained on sequences of SVHN images, it provides plausible completions given the first half of missing images from the held-out test set. Figure 5b illustrates that when the model is asked to do the same task on OOD MNIST images, it produces plausible completions often, but also frequently hallucinates odd completions or artifacts as in rows 4 and 5. Finally, when the model is prompted to complete OOD CIFAR-10 images, it produces consistent completions but is also prone to clear hallucinations from the SVHN domain as in row 4. Figure 9 plots the OOD detection metrics and shows that the generative predictive $p$-value again yields an accurate predictor for a task where accuracy is not a suitable metric.

Why are there no references in the first two paragraphs of the paper?

We appreciate your concern and have added citations for in-context learning, posterior predictive checks, and Medprompt to the introduction.

Figure 1: Did you sample? If yes, at what temperature?

We sampled responses at a temperature of 0.95.

Why not consider a bigger model (e.g. Llama-3 405B) and see if it can perform better than random on MQP?

We are very keen to assess our methods using larger models, but are currently limited by compute resources to do so. More importantly, the tasks we evaluate become in-capability as model size increases, so they would not be sensible for out-of-capability detection.

"ancestral sampling":

Thank you for your question! At the end of the day you are right that ancestral sampling just ends up reducing to sampling next tokens. We chose the term, however, because we were referring to the higher-level "abstractions" of $x_i$, which can be made up of multiple tokens, and we sample "chunks" of $x_i$ at a time; we thought the term made a better allusion to this fact. Moreover, we thought this was okay as it is a commonly used term for autoregressive sampling.

Comment

Thanks to the authors for the revised paper and the detailed responses.

I read the other reviews, author responses and the revised paper.

Given that the authors have added significantly more experiments, the criticism from my original review no longer applies. It's also now clear to me that the authors have carefully considered their theoretical framework, which Reviewer 2r64 seems to think is interesting.

My main confusion about the current version of the paper is that I don't understand the practical advantages of the generative predictive p-value. Both other reviewers also raised this point ("Why not simply evaluate the accuracy?"). I read the authors' response in the global comment and looked at the image experiments in the paper, and I must admit that I still don't get the advantage of the proposed method.

For the particular case of conditional image generation, I understand that evaluating the quality of the generated samples is not straightforward. However, this problem is well-recognized in the image generation literature, and various metrics for the quality of generated images, such as Fréchet Inception Distance (FID), have been proposed. I don't understand how the approach in this paper would practically improve upon using these metrics.

In Algorithm 1, one choice is the discrepancy function. The paper seems to use mostly the NLL. Intuitively, it seems to me that the ability of the posterior predictive p-value to criticise the model in complex application scenarios depends on the fact that we have chosen a good discrepancy function. I wonder if this observation is correct and if we could choose metrics such as FID as the discrepancy function.

My current assessment of this paper after the rebuttal is that this paper clearly contains interesting ideas. Insofar as the theoretical novelty of the posterior predictive p-value is concerned, I find this hard to judge. From a more applied perspective of in-context learning that I understand well, this paper does not (yet) convince me that the proposed approach is practically relevant. One reason for this is that while the experiments are now much extended, I still don't see any application where I could not imagine a traditional approach (evaluating model responses with some appropriate metric) working reasonably well. This makes me still a bit uncertain how interesting the current version of the paper will be to the ICLR audience.

Based on these considerations, I have decided to increase my score to 5 and lower my confidence to 2. I'd be happy to further adapt my assessment after additional comments by the authors or other reviewers.

Comment

Thank you for your updated feedback and for carefully reviewing our revisions. We greatly appreciate the increased score and your thoughtful assessment. Your concerns about the practical advantages of the generative predictive p-value are insightful, and we'd like to elaborate further.

The generative predictive p-value provides a nuanced assessment of model reliability that generalizes beyond traditional accuracy metrics. While we acknowledge that metrics like accuracy or FID may be sufficient in many scenarios, our approach addresses several key challenges that standard metrics often overlook:

  1. Hard-to-Specify Evaluation Metrics: Consider the image completion experiment we included in the revised paper. The quality of a generated image is not solely about its similarity to a reference but also about whether the completion is contextually sensible and free of hallucinations. While FID captures sample diversity, it does not assess whether the output aligns with the context. The generative predictive p-value, by contrast, evaluates the consistency of generated content relative to the in-context examples, making it applicable even when evaluation metrics are subjective or vaguely defined.

  2. An Interpretable and General Decision Rule: A unique advantage of our approach is that it effectively provides an out-of-capability classifier using only the model and the provided examples. For example, while FID could serve as a feature to predict generalizability to unseen image completion tasks, it necessitates additional steps like training a classifier or finding the appropriate threshold. The generative predictive $p$-value, on the other hand, inherently identifies relatively high or low metric values. With the NLML discrepancy, we show that the significance level $\alpha$ approximates the empirical false positive rate, offering interpretability. Similarly, with the NLL discrepancy, a significance level of $\alpha=0.5$ achieves reliable out-of-capability prediction accuracy across the tasks we evaluate, perhaps providing a practical rule of thumb.

  3. Guidance for Model Selection: While we have focused on out-of-capability detection, our work provides the theoretical foundations for follow-up work exploring model selection. Anecdotally, application developers using LLMs (e.g., for AI assistants) often include in-context (few-shot) examples in system prompts to improve model responses to user inputs. These examples may serve varied purposes, from enhancing accuracy for specific tasks to guiding input formatting for system compatibility or aligning outputs with developer-defined criteria. When choosing between models (Claude, GPT, or Gemini) or model versions, developers typically have access to log probabilities and sampling. Our approach offers a practical tool to compare model capability for tasks specified in system prompts. This approach should be robust to differences in scale arising from inter-model differences such as vocabulary size, which would challenge direct comparison of the NLML. The generative predictive p-value could then allow designers to make informed decisions about which model to use that could complement existing evaluation protocols or serve as an efficient approximation.

Regarding your comment on using FID as a discrepancy function in Algorithm 1, we agree that exploring such task-specific discrepancies could enhance practical applications like image generation. While we focused on the NLML and NLL discrepancies in this work, future research could incorporate metrics like FID for specific generative models. We are eager to explore this direction further in subsequent studies, knowing that our contribution provides the theoretical foundations for such extensions.

We hope this additional context clarifies the practical relevance of our approach beyond standard metrics. We will add a section in the appendix outlining future directions and application areas. Your engagement with our work is greatly valued, and we welcome any further questions or suggestions you might have to refine our contribution. Thank you again for your constructive feedback.

Official Review
Rating: 8

The authors consider the problem of determining whether a CGM can solve an ICL task. Taking a Bayesian view of ICL, the authors use Bayesian model criticism techniques (e.g., PPCs) to tackle this problem. The primary technical challenge is that CGMs don’t expose the posterior and likelihood which are required for certain types of discrepancy functions. The authors then introduce a practical estimator for the posterior predictive p-value that involves repeated ancestral sampling from the predictive distribution of the CGM and show this is equivalent (asymptotically) to posterior predictive p-value. The authors evaluate their method of detecting whether LLMs can solve ICL tasks in synthetic and real-world settings.

Strengths

I think the framework is interesting. There are lots of papers that view ICL through a Bayesian lens, but this paper is pretty unique in introducing and leveraging this perspective for model criticism. I think the theoretical formulation of this problem through the lens of Bayesian model criticism helped me better understand ICL.

The theoretical results/perspective motivate a fairly intuitive, practical, and general estimator: drawing samples from the CGM and using those samples to form a null distribution over some discrepancy function and then computing a p-value using that empirical null distribution. It’s nice that this intuitive approach emerges naturally from a Bayesian perspective, although I'm curious if people have introduced a similar method before but motivated in a more ad-hoc way.
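
For concreteness, here is a minimal sketch of that procedure (the `cgm.sample_completion` and `discrepancy` interfaces are hypothetical placeholders for the CGM calls; this is not the authors' implementation, just the empirical-null recipe described above):

```python
import numpy as np

def generative_predictive_pvalue(cgm, context, x_test, discrepancy, n_samples=100):
    # Build an empirical null distribution by ancestrally sampling completions
    # from the CGM conditioned on the in-context examples.
    null_scores = np.array([
        discrepancy(cgm.sample_completion(context), context, cgm)
        for _ in range(n_samples)
    ])
    observed = discrepancy(x_test, context, cgm)
    # p-value: fraction of sampled discrepancies at least as extreme as the observed one.
    return float(np.mean(null_scores >= observed))

# Decision rule: flag the ICL problem as out-of-capability when the p-value
# falls below a chosen significance level alpha, e.g.
# is_capable = generative_predictive_pvalue(cgm, context, x_test, nll) >= alpha
```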

The writing was generally clear, and there was enough background to help me understand the technical details.

Weaknesses

My biggest concern is: could the authors elaborate more on what they think are compelling, practical use cases? I think right now, the empirical evaluation doesn’t quite do this line of work justice; I’ll elaborate more below, but that being said, I’m open to updating/revising my assessment, given the author’s response.

The method is presented as a way to estimate when a CGM can solve an ICL task. Can’t you often just directly evaluate this, though? For example, you could sample from a CGM and ask a domain expert or write some program to evaluate the responses. For example, in Experiment 3, you actually have ground truth labels for the queries, right? You already know that Llama-2 7B can’t solve the MQP task, so what additional value does this method provide?

I think the value of this method would be more apparent in settings (1) where ground truth labels are very hard to source and (2) there’s ambiguity about whether a given ICL task is “in-capability” or “out-of-capability.” Adding some additional evaluation on those sorts of tasks would strengthen the submission.

Another question I have is: to what extent do we need discrepancy functions that rely on having the model likelihood and model posterior explicitly? If you choose g to be the log marginal model likelihood (as I think you do in some of the experiments), you can estimate the posterior predictive p-value exactly, right (I think this is the "Lite" generative predictive p-value algorithm)? I guess it's not totally clear to me that the NLL method is reliably better. I don't think this significantly diminishes the importance of developing a practical estimator for situations where you need access to the likelihood, and this probably depends on the task, but I would appreciate a more thorough discussion of how to choose a good g().

In practice, the martingale p-value incurs estimation error. I would've liked to see some synthetic experiments that analyze this empirically. For example, how large does N need to be for the approximation to be good? I suspect there's a fairly simple experiment here where you consider some model where the likelihood and posterior are available.

There are a few places where the writing could be improved. For example, when you define the generative process involving f, it might help to give some concrete examples of what that actually means in a practical ICL task. Also, two different discrepancy functions were introduced in Section 5, but you only need the ancestral sampling-based estimator (introduced in Section 5) when you use the second discrepancy function. I think I was a bit confused when I read Section 6 after Section 5.

Questions

Some minor comments/questions

I’m not very familiar with the theory underlying PPCs, but I don’t think the posterior predictive p-values always share the theoretical properties of frequentist p-values (e.g., uniformly distributed under the null except maybe when the test statistic is ancillary). Is the same true for martingale p-values and does this have any practical implications?

The notation |(z_i, y_i)| was a bit confusing to me. I think this refers to the number of tokens comprising the pair of query and response?

In the abstract, “is equivalent sampling” -> “is equivalent to sampling".

Comment

I don’t think the posterior predictive p-values always share the theoretical properties of frequentist p-values

Thank you for this question. In our case, similar properties hold. For the sake of the argument, assume that $x^{\text{test}} \sim p_{\text{test}}(x)$.

We can then use as our null hypothesis the statement that $p(f \mid x^n)$ concentrates around values of $f$ such that $p_{\text{test}}(x) = p(x \mid f)$. That is, the null is that our model is concentrated around the distribution from which $x^{\text{test}}$ was sampled. Under this null, the $p$-values used in the paper are uniformly distributed with respect to the distribution of $p_{\text{test}}$.

It is easily shown why this is the case. First, note that if $f, f'$ are two values around which $p(f \mid x^n)$ concentrates, we have that $p(x \mid f) = p_{\text{test}}(x) = p(x \mid f')$. In turn, this also implies that $g(x, f) = g(x, f')$. Therefore, picking any such $f'$, we can rewrite

$$p_{\text{ppc}} := \int \int \mathbb{1}[g(x, f) \geq g(x^{\text{test}}, f)]\, dP(x \mid f)\, dP(f \mid x^n)$$

as

$$p_{\text{ppc}} = \int \mathbb{1}[g(x, f') \geq g(x^{\text{test}}, f')]\, dP(x \mid f').$$

Because $p(x \mid f) = p_{\text{test}}(x)$, this can be further rewritten as

$$p_{\text{ppc}} = \int \mathbb{1}[g(x, f') \geq g(x^{\text{test}}, f')]\, dP_{\text{test}}(x).$$

Assuming that the CDF of $g(x, f')$ is continuous under $p_{\text{test}}(x)$, this can be rewritten as

$$p_{\text{ppc}} = 1 - F_g(g(x^{\text{test}}, f')),$$

where $F_g(\cdot)$ is the CDF of $g(x^{\text{test}}, f')$ under $p_{\text{test}}(x)$. Finally, by the properties of the CDF, we know that $F_g(g(x^{\text{test}}, f'))$ is uniformly distributed, and so $p_{\text{ppc}}$ must also be uniformly distributed.

Practically, this has consequences similar to the frequentist case. In particular, we could use it in the traditional hypothesis-testing style to reject models with guarantees on Type 1 errors.
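
As a quick sanity check of this uniformity claim, the following small simulation (an illustrative setup, not from the paper: it takes $p_{\text{test}} = p(x \mid f') = N(0, 1)$ and $g(x, f') = -\log p(x \mid f')$) confirms that Monte Carlo $p$-values computed this way look uniform under the null:

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(0)

def ppc_pvalue(x_test, n_samples=2000):
    # Posterior collapsed on f' = 0, so sample replicates from p(x | f') = N(0, 1)
    # and compare discrepancies g(x, f') = -log p(x | f').
    x = rng.normal(0.0, 1.0, n_samples)
    return np.mean(-norm.logpdf(x) >= -norm.logpdf(x_test))

# Draw x_test from p_test = N(0, 1) (the null) and check that the p-values are uniform.
p_values = [ppc_pvalue(rng.normal()) for _ in range(500)]
print(kstest(p_values, "uniform"))  # large KS p-value -> consistent with uniformity
```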

The notation |(z_i, y_i)| was a bit confusing to me. I think this refers to the number of tokens comprising the pair of query and response?

We have added a clarification on line 383 of the updated manuscript.

In the abstract, “is equivalent sampling” -> “is equivalent to sampling"

Thank you, we have corrected this in the updated manuscript.

Comment

Thank you for the time and effort you dedicated to reviewing our work. Your thoughtful questions and feedback have led to significant improvements in the revision of our paper. We deeply appreciate your insights and hope our responses and the updated manuscript meet your expectations.

Comment

[T]o what extent do we need discrepancy functions that rely on having the model likelihood and model posterior explicitly?

The differences between $g$ become apparent when the model is uncertain because there are too few in-context examples to specify the task.

To illustrate this difference, imagine two models. The first model has posterior $p_1(f \mid x^n) = N(0, 1)$ and likelihood $p_1(x \mid f) = N(f, 0.0001)$. The second model has posterior $p_2(f \mid x^n) = N(0, 0.0001)$ and likelihood $p_2(x \mid f) = N(f, 1)$. In both models, the posterior predictive is essentially the same standard normal, $p(x \mid x^n) = N(0, 1)$. However, in the first model, the posterior predictive variance is due to epistemic uncertainty (we are unsure about the correct $f$), while in the second model, it is due to aleatoric uncertainty (we are confident about $f$, but the task is inherently stochastic). We illustrate this example in Figure 16 of appendix F in the updated manuscript.

Now assume that we have a test point $x^{\text{test}} = 0.5$. Ideally, we want to say that the second model does a good job of predicting this data point because the task is well specified, while the first one does not because we are still uncertain about the task. That is, the first model assigns high probability to many values of $f$ where $x^{\text{test}}$ is unlikely.

Depending on the discrepancy we choose, we may or may not be able to distinguish between the two scenarios and reject the correct model. For example, if we use

$$g(x, x^n) := -\sum_{z_i, y_i \in x} \log p(z_i, y_i \mid x^n),$$

$\log p(z_i, y_i \mid x^n)$ is the same for both models and we will have identical $p$-values, which will indicate that both models are suitable---an undesirable outcome.

On the other hand, if we use

$$g(x, f) := -\sum_{z_i, y_i \in x} \log p(z_i, y_i \mid f),$$

the $p$-values will be quite different. For the first model, many values of $f$ will fall far away from $x^{\text{test}}$ when computing

$$p_{\text{ppc}} := \int \int \mathbb{1}[g(x, f) \geq g(x^{\text{test}}, f)]\, dP(x \mid f)\, dP(f \mid x^n).$$

As a result, $g(x^{\text{test}}, f)$ will often be much higher than $g(x, f)$, and the PPC $p$-value will be quite low. However, in the second model, this will not happen, as all values of $f$ will be sampled around $0$, and $x^{\text{test}}$ is a reasonable observation for a normal distribution centered at $f \approx 0$ with standard deviation $1$. Therefore, using the $f$-dependent PPC provides the desired behavior because it informs us when the model is confused about the task.
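
To make the contrast concrete, here is a small Monte Carlo sketch of the two-Gaussian example above (an illustrative sketch only, not the paper's code; it reads the second argument of each $N(\cdot, \cdot)$ as a variance and uses arbitrary sample sizes):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x_test, M = 0.5, 200_000

def nll_ppc(post_sd, lik_sd):
    # p_ppc with the f-dependent NLL discrepancy g(x, f) = -log p(x | f).
    f = rng.normal(0.0, post_sd, M)                  # f ~ p(f | x^n)
    x = rng.normal(f, lik_sd)                        # x ~ p(x | f)
    g_rep = -norm.logpdf(x, loc=f, scale=lik_sd)
    g_obs = -norm.logpdf(x_test, loc=f, scale=lik_sd)
    return np.mean(g_rep >= g_obs)

def nlml_ppc(pred_sd=1.0):
    # p_ppc with the NLML discrepancy g(x, x^n) = -log p(x | x^n);
    # both models share the same posterior predictive N(0, 1).
    x = rng.normal(0.0, pred_sd, M)
    return np.mean(-norm.logpdf(x, scale=pred_sd) >= -norm.logpdf(x_test, scale=pred_sd))

print(nll_ppc(post_sd=1.0, lik_sd=0.01))  # model 1 (epistemic): small p-value -> rejected
print(nll_ppc(post_sd=0.01, lik_sd=1.0))  # model 2 (aleatoric): ~0.6 -> not rejected
print(nlml_ppc())                         # ~0.6 for both models: NLML cannot tell them apart
```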

We have added Section 7.2 to highlight that the NLL discrepancy function is indicative of whether there are enough in-context examples. This added information is useful in risk-sensitive applications where the cost of responding incorrectly is high. For example, consider a medical recommendation system that autonomously responds only if the $p$-value is greater than the significance level $\alpha$. We use prediction RMSE over task responses to measure reliability. Figures 10a and 10b plot the RMSE against the $p$-values computed under the NLL and NLML discrepancies for in-distribution polynomial tabular tasks. We see that lower $p$-values correlate with higher RMSE for the NLL discrepancy, but not for the NLML discrepancy. At $\alpha=0.1$, the NLL discrepancy reduces the generation of responses with higher error because it accounts for the number of examples provided. To quantify this intuition, we define risk as the sum of task RMSEs for in-capability predicted tasks. Figures 10c and 10d show that for equally accurate predictors the NLL discrepancy results in substantially reduced risk.

Since the $p$-value computed under either discrepancy yields accurate predictors of model capability, the choice between discrepancy functions ultimately comes down to a decision on whether the added computational cost of generating dataset completions is justified. If you need to know whether there are enough in-context examples to generate an accurate response---a necessity in risk-sensitive applications---then we recommend using the NLL discrepancy function. If computational efficiency or the cost of response deferral are primary concerns---practical user experience considerations---we suggest using the NLML discrepancy.

Comment

[H]ow large does N need to be for the approximation to be good?

This question interests us as well and we have added an experiment in Section 7.3 to address your concern. Inspection of Equations 1-3 makes clear that the dataset completion size $N-n$ should closely interpolate $p$-value estimates between $p_{\text{ppc}}$ computed with the NLML discrepancy and with the NLL discrepancy using the likelihood and posterior of a Bayesian model. To verify this, we use a reference Bayesian polynomial regression model to compute the $p_{\text{ppc}}$. We use our Llama-2 regression model fit to datasets generated from the reference model likelihood under different explanations to compute $p_{\text{ppc}}$. We let datasets generated by random ReLU-NNs serve as OOD tasks. Figure 11 demonstrates that our expectation is true. Specifically, the $p$-value estimates are distributionally close to those calculated under the NLML at $N-n = 2$, and they more closely approximate those calculated under the reference NLL discrepancy as we increase $N-n$ to $100$. The latter observation is also illustrated in Figure 12 of the appendix.

Ultimately the rates of convergence will depend on the model and the ICL task. For example, linear models and data will converge quite quickly and thus only require a small $N$. Using a relatively small $N-n$ value of $10$, Figure 13c shows quantifiable differences between the $p$-values calculated under the NLL and NLML discrepancies in our natural language experiments. This indicates that the rate of convergence to the martingale predictive $p$-value is much quicker in this setting than for our synthetic polynomial model.

Comment

Thank you for your thoughtful and constructive feedback on our paper. We are delighted that you found our Bayesian framework for model criticism in in-context learning (ICL) interesting and that it helped deepen your understanding of ICL. Your positive remarks on our theoretical formulation and the intuitive estimator we introduced are greatly appreciated. We address each of your concerns below and have updated our paper accordingly. Your feedback is sincerely appreciated and has significantly improved the quality and clarity of our work.

The method is presented as a way to estimate when a CGM can solve an ICL task. Can’t you often just directly evaluate this, though?

Thank you for raising these points. We agree that when ground truth labels are available and a clear metric is defined, directly evaluating the CGM’s performance on an ICL task is straightforward. However, as the reviewer noted, the strength of our proposed method lies in its applicability to situations where evaluation metrics are hard to specify or labels are hard to source. For instance, if the ICL task involves open-ended queries or responses, metrics like accuracy may be insufficient, but our proposed checks are still effective. This flexibility is especially valuable when the CGM operates as part of a larger system where in-context examples are highly heterogeneous, making it impractical to define a single evaluation metric (e.g., an API servicing diverse responses using ICL). Moreover, since our approach gives an interpretable measure of model capability, it could be an invaluable tool for the scalable oversight of models capable of discovering solutions that even domain experts have difficulty judging. This flexibility is essential for enabling the automated use of CGMs in diverse scenarios.

We address this concern directly by adding an experiment to the empirical evaluation. In applications of generative AI such as image or document completion, accuracy is either poorly defined or does not capture the complexity of the task. We devise an in-context generative fill experiment using the SVHN (in-distribution), MNIST (near OOD), and CIFAR-10 (far OOD) datasets. The model is prompted with several in-context image examples and asked to complete the missing half of the last example. The goal is to produce sensible completions, but not necessarily reconstruct the target image exactly. Figure 5 of the updated manuscript illustrates this task. Figure 5a demonstrates that when the model is trained on sequences of SVHN images, it provides plausible completions given the first half of missing images from the held-out test set. Figure 5b illustrates that when the model is asked to do the same task on OOD MNIST images, it produces plausible completions often, but also frequently hallucinates odd completions or artifacts as in rows 4 and 5. Finally, when the model is prompted to complete OOD CIFAR-10 images, it produces consistent completions but is also prone to clear hallucinations from the SVHN domain as in row 4. Figure 9 plots the OOD detection metrics and shows that the generative predictive $p$-value again yields an accurate predictor for a task where accuracy is not a suitable metric.

We clarify that the MQP example is used solely to validate the scalability of our method to natural language tasks using pre-trained LLMs. We acknowledge that Llama-2 7B is not well-suited for this task, which is precisely why we chose to test our method against it.

Official Review
Rating: 5

The current paper aims to quantify whether a conditional generative model (e.g., an LLM) can solve an in-context learning problem. To achieve this goal, the authors relate the problem to that of Bayesian model criticism, i.e., how decent a model are we currently using for solving the task at hand. Since the hypothesis used by, e.g., an LLM to solve an ICL task is bound to be unclear, the paper relies on use of the posterior predictive distribution for achieving its goal. Specifically, the authors use posterior predictive checks to perform model criticism. Results show the proposed tests correlate with model abilities to solve an ICL task.

Strengths

What I enjoyed the most about this paper is its really nice, pedagogical writing: even though no particularly new concepts were proposed in the technical sections, the slowly developing pace to contextualize ICL within a Bayesian framework made for a really enjoyable read. This did however take away a lot of space, leading to the experimental section becoming rather sparse on discussion.

Weaknesses

  • My main apprehension is that the targeted problem of "what ICL problem can a generative model solve" wasn't sufficiently clarified, leading to confusion in regards to how the results should be interpreted. For example, if the goal is to judge whether my model can solve a task in-context, can I not just evaluate on said task via a benchmark and get a performance estimate (e.g., accuracy)? What else do tests like posterior predictive check offer? In fact, results in the paper (e.g., see Figure 6) show that proposed metrics and accuracy follow an almost one-to-one mapping with respect to each other. Thus, I could have just accuracy as an estimate of how well the model performs the task (if at all). To address this comment, I would request authors to clarify what exactly is a good standard for an answer to the question what ICL problem can a generative model solve. If it's to make a predictive test, i.e., something I can use to preemptively say my model is incapable of performing some task, then I think the currently proposed measures don't fall under that definition.

  • At Line 231, the authors say that when the discrepancy of the data generated under the model is greater than that of the holdout data, then we can be confident that the model explains the holdout data well. This seems unintuitive: if the model generates the data, wouldn't it assign it a relatively high likelihood, especially compared to holdout data that it did not generate?

Questions

See limitations.

Comment

Thank you for your thoughtful and constructive feedback on our paper. We are delighted that you found our pedagogical writing and the gradual development of the Bayesian framework for in-context learning (ICL) engaging and enjoyable. Your appreciation of our effort to make the material accessible is truly encouraging. We address your concerns below and have updated our paper accordingly. We appreciate your feedback, which has greatly enhanced the quality and clarity of our work.

[I]f the goal is to judge whether my model can solve a task in-context, can I not just evaluate on said task via a benchmark and get a performance estimate (e.g., accuracy)?

Thank you for this question. We agree that if a good metric exists, then PPCs should follow this metric closely. However, the advantage of using PPCs is that they also work in situations where no such metric exists or where metric specification is hard. For example, if the in-context learning problem consists of open-ended questions, then simple metrics like accuracy or correctness would not suffice. This flexibility is especially valuable when the CGM operates as part of a larger system where in-context examples are highly heterogeneous, making it impractical to define a single evaluation metric (e.g., an API servicing diverse responses using ICL). Moreover, since our approach gives an interpretable measure of model capability, it could be an invaluable tool for the scalable oversight of models capable of discovering solutions that even domain experts have difficulty judging. This flexibility is essential for enabling the automated use of CGMs in diverse scenarios.

We address this concern directly by adding an experiment to the empirical evaluation. In applications of generative AI such as image or document completion, accuracy is either poorly defined or does not capture the complexity of the task. We devise an in-context generative fill experiment using the SVHN (in-distribution), MNIST (near OOD), and CIFAR-10 (far OOD) datasets. The model is prompted with several in-context image examples and asked to complete the missing half of the last example. The goal is to produce sensible completions, but not necessarily reconstruct the target image exactly. Figure 5 of the updated manuscript illustrates this task. Figure 5a demonstrates that when the model is trained on sequences of SVHN images, it provides plausible completions given the first half of missing images from the held-out test set. Figure 5b illustrates that when the model is asked to do the same task on OOD MNIST images, it produces plausible completions often, but also frequently hallucinates odd completions or artifacts as in rows 4 and 5. Finally, when the model is prompted to complete OOD CIFAR-10 images, it produces consistent completions but is also prone to clear hallucinations from the SVHN domain as in row 4. Figure 9 plots the OOD detection metrics and shows that the generative predictive $p$-value again yields an accurate predictor for a task where accuracy is not a suitable metric.

[I]f the model generates the data, wouldn't it have a relatively high likelihood it, especially compared to holdout data that it did not generate?

The concept that the generated data can have lower likelihood than the test data can seem counterintuitive. However, when the model accurately estimates the distribution of the task, it can generate data with higher or lower likelihood than the holdout data, which is sampled from the task distribution. If the discrepancy for generated data is often greater than or equal to that of the holdout data, it suggests that the model does not overfit and generalizes well to new data.

[E]xperimental section [is] rather sparse on discussion

We appreciate your feedback regarding the sparsity of discussion in the experimental section. We have streamlined the background sections to allocate more space for detailed analysis and discussion of our experimental results. Specifically, we have added a discussion section to compare the OOD detection metric results in Section 7.1. Further, we have added additional experiments and discussion in Sections 7.2 and 7.3 to give more insights into the choice of discrepancy function.

Comment

Thank you again for your thoughtful and constructive feedback on our paper. We hope you have found our response and updates to address your concerns. We have further considered your comment that "the targeted problem of 'what ICL problem can a generative model solve' wasn't sufficiently clarified." We have added the following formalization to the appendix of our draft to address this.

Defining model capability for an in-context learning problem

Here, we formalize what we mean by saying that "a model can solve an ICL problem." A formal definition must account for two things: (1) the model may generate undesired responses because the context does not adequately specify the task, and (2) the model may generate undesired responses despite the context precisely specifying the task. The definition below accounts for both.


Definition 1: An ICL problem comprises a model $\theta$, a dataset $x^n = \{z_i, y_i\}^n_{i=1} \sim p(x^n \mid f^*)$, and a task $f^*$. Assume that valid responses $y$ to user queries $z \sim p(z \mid f^*)$ are distributed as $p(y \mid z, f^*)$ under the task. Finally, let $A(z, f^*)$ denote any set of responses satisfying $P(Y \in A(z, f^*) \mid z, f^*) \geq 1-\epsilon$. The model $\theta$ is called capable of solving the ICL problem if
$$\lim_{n\to\infty} \int \mathbb{1}\left\{ y \in A(z, f^*) \right\} dP_{\theta}(y \mid z, x^n) \geq 1 - \epsilon.$$


The first equation defines a set of valid responses $y$ to any query $z$ given the task $f^*$. Jesson et al. 2024 call this a $(1-\epsilon)$-likely set. For example, the set could be a confidence interval in a regression task or a set of semantically equivalent ways to express positive sentiment in an open-ended sentiment analysis task. The $1-\epsilon$ set gives us a formal and general way to express the notion of a desirable response for a given query and task. The second equation says that a model is capable if the probability that a generated response belongs to the set of valid responses converges to be at least $1-\epsilon$ as the number of examples increases. That is, as the context more precisely specifies the task. This definition accounts for condition (1) through the limit, allowing for capable models with too few in-context examples to be called capable. The indicator accounts for both conditions by counting the number of times model-generated responses fall inside the $1-\epsilon$ set, a general measure of accuracy.

In addition to accounting for the above conditions, this definition allows the model predictive distribution to collapse to subsets of $A(z, f^*)$---even deterministic responses---and still be called capable. This attribute is preferable to a definition of capability that requires the model predictive distribution to converge to the reference distribution, which would exclude many practical models.

While this definition is general, there are still practical limitations to consider. For example, an infinitely deep and wide random transformer with a finite maximum sequence length might be capable of solving most problems under this definition; however, its data efficiency may be so poor that we fill the context window before it can generate accurate responses. Similarly, even if the model could accommodate infinitely long sequences, the available data may be exhausted before the model generates desirable responses. These scenarios are extreme examples of the over-parameterized case illustrated in Figure 2c.

This discussion also sheds light on the propensity for a classifier using the NLML $p$-value to produce false negatives (misclassify out-of-capability tasks as in-capability). In the $z > 0.5$ region of Figure 3b, the model will be predicted as in-capability because the model predictive distribution covers the data for small in-context learning dataset sizes. It is not until the model sees 100 examples that the misalignment between the predicted and reference distributions becomes apparent. The NLL discrepancy serves to mitigate this issue, as discussed.

Comment

We extend our warmest thanks to all reviewers for the time and effort you have committed to reviewing our work. We appreciate your patience while we have restructured our manuscript and ran additional experiments to address your concerns. Thanks to your feedback, we believe our paper is significantly improved.

We are encouraged that the reviewers found our paper insightful, practical, and theoretically sound. While our discussion on Bayesian models was generally well received, we recognize it occupies significant real estate. To address this, we have refined the discussion in Section 3 and moved the detailed description to Appendix A. We have also moved the details on Doob's theorem to Appendix B, as they distract from the flow of the paper. These changes have created space to clarify how our method fits within the landscape of Bayesian interpretations of ICL in Section 4. Further, we have added a generative image completion experiment, increased the number of in- and out-of-capability tasks for the natural language experiments, and enriched our synthetic experiments to provide further insights into the choice between using the NLL and NLML discrepancy functions.

Uncertainty about the novelty and contribution of our work is a common concern. We hope our revised manuscript better conveys that our work builds off the Bayesian interpretation of ICL and addresses the underexplored question of whether the latent model implied by a CGM is appropriate for a given task. We have added Figure 2 to illustrate the consequences of using a misaligned model to make inferences about data generated according to a given ICL task. Figure 2d illustrates a particularly concerning case where the model is confident and wrong about its predictions. We design our methods to predict these cases and cases when there are not enough in-context examples for reliable task completion. For the latter cases, the utility of the NLL discrepancy function becomes apparent. We have added synthetic experiments in Section 7.2 to illustrate that while the NLL and NLML discrepancy functions lead to accurate predictors of model capability, only the NLL discrepancy provides information about whether there are enough in-context examples. We further demonstrate in Figure 10 that this information can reduce risk, a necessary consideration in safety-critical applications. Our contribution enables the estimation of discrepancies like the NLL using CGMs like large language models.

A common question was, "Why not simply evaluate the accuracy to assess whether the model can solve a given task?" Our goal in providing the regression experiments was to demonstrate that our methods work without explicitly defining an evaluation metric. To complement this, we add an experiment to illustrate when an evaluation metric is hard to specify. Specifically, consider a document or image completion task where a completed document or image does not need to match a reference completion exactly. Instead, the completion should be sensible given the context and not contain hallucinations. This objective is difficult to specify.

To demonstrate that our methods work in this setting, we devise an in-context generative fill experiment using the SVHN (in-distribution), MNIST (near OOD), and CIFAR-10 (far OOD) datasets. The model is prompted with several in-context image examples and asked to complete the missing half of the last example autoregressively. The goal is to produce sensible completions but not necessarily reconstruct the target image. Figure 5 of the updated manuscript illustrates this task. Figure 5a demonstrates that when we fit the model to sequences of SVHN images, it provides plausible completions given the first half of missing examples from the test set. Figure 5b illustrates that when we prompt the model to do the same task on MNIST images, it often produces plausible completions but frequently hallucinates odd completions (row 4) or artifacts (row 5). Finally, when we prompt the model to complete CIFAR-10 images in Figure 5c, it produces consistent though arguably nonsensical completions and is prone to hallucinate content from the SVHN domain, as in row 4. Figure 9 plots the OOD detection metrics and shows that the generative predictive $p$-value again yields an accurate predictor of model capability for this hard-to-specify task.

Thank you again for your feedback and patience. We provide specific responses to each of your concerns below. We are eager to clarify any further questions in the remaining time.

AC Meta-Review

This paper provides a new theoretical framework, which quantifies whether a model can solve certain ICL tasks as a generative predictive p-value, analogous to the approaches in Bayesian model criticism. The estimator for this p-value does not require access to likelihoods and posteriors, which is the common case for commercialized LLMs. The authors also showed practical use cases where clear metrics are difficult to define, e.g., generative fill tasks ("goal is to produce sensible completions, but not necessarily reconstruct the target image exactly" -- authors).

When reviewing the original manuscript, the reviewers pointed out some weaknesses, including practicality/usefulness, experimental scopes and clarity. During rebuttal, the authors detailed the practicality and usefulness of the proposed approach by adding more experiments for more datasets and tasks. The authors also improved the clarity of the work. Some reviewers updated their scores.

During AC-reviewer discussion, Reviewer cuyE expressed that they have remaining concerns about "a lack of motivation for the problem and the experiments being in a huge contrast to what the paper's stated problem is". Reviewer 2r64 and Reviewer SYeb leaned towards acceptance after seeing new updates, which show reasonable motivation and "the proposed ideas in this paper seem interesting and also somewhat novel/unusual". But they also mentioned that "there is a bit of a discrepancy between the stated motivation and what the paper delivers but perhaps that's for future work," and "If the generative predictive p-value is as useful as the authors claim, then it must also be possible to demonstrate this in carefully designed experiments, as reviewer cuyE says, and this might then be a much more useful paper for the community." I think all reviewers raised valid points.

My take is, the paper is definitely not perfect and it'll be good to include more experiments to support the usefulness of their proposed framework. However, applying ideas in Bayesian model criticism to LLMs for deciding if a model is good enough is interesting and inspiring. The paper could be interesting to machine learning researchers who are interested in theoretical understanding of LLMs or a new metric for deciding the goodness of an LLM for a particular ICL task.

Final Decision

Accept (Poster)