PaperHub
COLM 2025 · Poster · 3 reviewers
Average rating: 6.3/10 (ratings: 6, 7, 6; min 6, max 7, std 0.5)
Average confidence: 3.0

Bayesian scaling laws for in-context learning

OpenReview · PDF
Submitted: 2025-03-20 · Updated: 2025-08-26
TL;DR

We test the claim that in-context learning in LLMs is Bayesian, leading to a new interpretable scaling law that accurately predicts when suppressed behaviors in both toy and real-world language models will reemerge.

Abstract

Keywords

in-context learning · bayesian inference · scaling laws

Reviews and Discussion

Review (Rating: 6)

This paper aims to gain a deeper understanding of a key capability of LLMs: ICL. Typically, the performance of LLMs on tasks improves as the number of examples provided increases, a relationship known as the ICL curve. However, the exact form of this relationship and the underlying mechanism are not fully understood. The authors hypothesize that the process of ICL can be modeled as a Bayesian learner: after seeing in-context examples, the LLM updates its posterior distribution over which task is to be performed based on this evidence. The Bayesian perspective provides a powerful theoretical framework for understanding ICL. The derived Bayesian scaling law not only accurately predicts ICL performance, but also provides interpretable parameters that help us understand model behavior, alignment effects, and their limitations, which is significant for improving the safety and reliability of LLMs.

Reasons to accept

  • This paper attempts to provide a new, theoretically grounded explanatory framework for ICL, a core capability of LLMs. It addresses the lack of explanatory power in existing scaling laws and, in particular, offers deep insight into model alignment.
  • Comprehensive validation across synthetic and real data, multiple models, and multiple tasks strengthens the credibility of the conclusions. The law has an advantage in predicting model behavior in unseen many-shot scenarios, which is particularly important in areas such as safety.
  • The analysis of the mechanism by which SFT/DPO affects ICL (mainly changing the prior rather than the task knowledge itself) is an important finding for alignment, although related results have been reported before.

Reasons to reject

  • W1: The simplification from theory to the practical formula may be excessive, and operations such as parameter tying may obscure the true mechanism. The practical version of the Bayesian law (Formula 3) is simplified for tractability (parameter tying, a single efficiency K, etc.), which may deviate from a strict Bayesian update.
  • W2: Although the paper shows behavioral compatibility, it does not establish from a mechanistic perspective that LLMs are Bayesian learners. Although the Bayesian law performs well, in some cases (especially interpolation) simple models such as the logistic law achieve similar or even better goodness of fit. So while the Bayesian explanation is compatible with the observations, it does not fully prove that LLMs are strict Bayesian learners.
  • W3: The paper mentions the need for a specific optimizer (L-BFGS) and numerical stability techniques, which may indicate that the fit is sensitive to initialization and the choice of optimization method.

Questions for the authors

I am willing to update my score if the authors respond well to my questions.

  • Q1: The practical Bayesian scaling law (Formula 3) simplifies the theoretical formula (Formula 2) for ease of fitting and to reduce parameter count, especially through parameter tying and a single ICL efficiency coefficient K. To what extent might these simplifications limit how accurately the law captures the real ICL process? Do different parameter-tying strategies (such as those mentioned in Appendix B) significantly affect the interpretability of the parameters or the predictive ability of the law (especially extrapolation)?
  • Q2: Experimental results show that the learned parameters (prior ρ, efficiency K, likelihoods P±) are well interpretable, especially when analyzing the effects of SFT and DPO. To what extent do you think these parameters, fitted from macroscopic behavioral data, reflect the actual neural computations or internal representations the LLM uses when performing ICL, rather than just being an effective description of the observed learning curve?
  • Q3: The study found that DPO not only changes the prior, but also affects the probability of in-distribution tasks (Figure 4b), and even reduces the probability of favorable tasks (HMM 0). In your Bayesian framework, beyond observing this phenomenon, can you further explain why DPO has this seemingly "damaging" effect on the model's core task knowledge? Does this mean that there is a more complex interaction between DPO's optimization objective and the simple Bayesian posterior update?
  • Q4: Figure 5 compares the Llama 3.1 8B Base and Instruct models, clearly showing that instruction fine-tuning mainly reduces the prior probability of unsafe behavior. In addition to the significant change in the prior ρ, are there some systematic, non-significant but meaningful differences between the Base and Instruct models in the likelihood parameter P± (i.e., the model's ability to distinguish tasks) and the ICL efficiency K? Do these subtle differences provide more clues about how instruction fine-tuning affects the model's "knowledge" rather than just "willingness"?
  • Q5: You mentioned that in real LLM experiments, the model performance was observed to decline near the end of the context window, so only the first 90% of the data was analyzed. How is this performance decline compatible with your Bayesian model? Does it mean that the way the model processes and updates information under extremely long sequences deviates from the idealized, constant-efficiency Bayesian update assumption? Can your model framework be extended to take into account this context length-dependent efficiency variation?
  • Q6: In addition to being a tool for understanding and analyzing ICL and alignment effects, what do you think is the most promising direction for this Bayesian scaling law framework in practical applications? For example, can it be used to: a) more accurately predict the critical point at which a model reaches dangerous behavior (jailbreak) when processing a large number of samples (such as multi-turn dialogues) after being given a small number of samples? b) guide how to design more effective ICL examples (prompt engineering) that can "teach" models specific tasks faster?
Comment

[continued from above]

  • Q5: What we observed was complete degradation at the end of the context window, i.e. task accuracy suddenly dropping to 0. We think this was an issue with the inference provider or the model’s advertised context window being slightly longer than its true context window, not a clean trend towards the end of the window. We don’t think this behaviour was systematic enough to be fit with our law; we can provide the full ICL curves we observed for all the models in a follow-up message soon.
  • Q6: We find the clearest practical application to be extrapolating the # of examples needed to achieve some desired (or undesired) task accuracy from a much shorter test window, with clear efficiency gains since attention has quadratic time complexity. (A toy sketch of this inversion appears after this list.) From a scientific angle, it seems that our Bayesian law is a better analysis tool for understanding how different post-training approaches or inference-time scaling techniques change ICL behaviour; ideally, we want methods to change task knowledge rather than just the prior, and this tool enables separating those two.
    Your second suggestion does not seem doable with our current method since it requires measuring K for each example, whereas we assume a fixed K averaged over many ICL curves. However, we find this idea intriguing, and perhaps in the future one could design a Bayesian law that estimates a per-example K. This could be useful to grade example difficulty (for curriculum learning or RL) or to come up with a short but highly effective set of examples for ICL prompting.
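As a toy illustration of that extrapolation use case (our sketch, not the paper's method): once a saturating ICL law has been fit on a short window, it can be inverted numerically for the smallest number of examples that reaches a target accuracy. The law below is a hypothetical placeholder.

```python
def shots_needed(fitted_law, target, n_max=10_000):
    """Smallest n with fitted_law(n) >= target, or None if the target
    exceeds the law's asymptote within n_max examples."""
    for n in range(n_max + 1):
        if fitted_law(n) >= target:
            return n
    return None

# Hypothetical saturating ICL law fitted elsewhere (placeholder numbers):
toy_law = lambda n: 0.95 * (1.0 - 0.8 ** (0.3 * n + 1.0))
print(shots_needed(toy_law, 0.90))  # -> 41 for this toy curve
```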

Thanks again for your very useful questions and comments! We will provide the materials mentioned in Q4 and Q5 in a follow-up comment within a day.

Comment

Thank you for your extensive review, we appreciate how comprehensive your comments are! We will address your concerns/questions point by point.

Re: reasons to reject:

  • W1: See our response to Q1 below.
  • W2: We agree; e.g. in section 6 we say “we do not claim to have proven that LLMs are Bayesian learners”. We wanted an interpretable law describing model behaviour, which previously did not exist (logistic/power law fits). However, we don’t necessarily find this a weakness; a behavioural law is highly practically useful since it can be easily fit to the behaviour of new models and does not require access to weights (only logprobs). Mechanistic analysis is complementary to, but not better than, our approach in this paper.
  • W3: We did refine our setup to ensure convergence and numerical stability in all laws we tested, not just the Bayesian one, in order to ensure a fair comparison. Likely due to the small number of learned parameters, quasi-Newton methods like L-BFGS are much more stable than e.g. AdamW in our tests. L-BFGS is a standard solver for fitting scaling laws in general, e.g. see section 3.3 of Hoffmann et al. (2022). (A minimal fitting sketch follows below.)
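For concreteness, here is a minimal sketch of the kind of L-BFGS fit described above, using toy data and a plain logistic law as a stand-in for the laws compared in the paper (not the authors' code):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy ICL curve: mean task accuracy vs. number of in-context examples.
n_shots = np.arange(1, 51, dtype=float)
acc = 0.9 / (1.0 + np.exp(-0.15 * (n_shots - 10.0))) + 0.01 * rng.standard_normal(50)

def logistic_law(params, n):
    # acc(n) = L / (1 + exp(-k * (n - n0))): one possible baseline law shape.
    L, k, n0 = params
    return L / (1.0 + np.exp(-k * (n - n0)))

def loss(params):
    return np.mean((logistic_law(params, n_shots) - acc) ** 2)

# L-BFGS-B is a standard quasi-Newton solver for scaling-law fits
# (cf. Hoffmann et al., 2022, section 3.3); bounds keep the fit stable.
fit = minimize(loss, x0=[0.5, 0.1, 5.0], method="L-BFGS-B",
               bounds=[(0.0, 1.0), (1e-4, 10.0), (0.0, 100.0)])
print(fit.x)  # fitted (L, k, n0)
```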

Re: questions:

  • Q1: The efficiency parameter K is indeed additional to the Bayesian law and makes the fit more expressive. We justify this in section 3.2 (essentially, this parameter lets us avoid assumptions about the frequency of Bayesian updates) and show that it is interpretable in Figure 2b. The efficiency parameter improves the fit substantially. As for parameter tying, this addition actually makes our law less expressive by reducing the number of learned parameters, i.e. we are not using the full expressivity of Bayes’ theorem. We are therefore matched in parameter count with the baseline laws. In fact, without tying, we find much lower interpolation error rates for our law, but it extrapolates much worse and the parameters are not interpretable (indicating overfitting, we believe); we can report these ablation results in an appendix. (A toy sketch of the tied parameterization appears after this list.)
  • Q2: This is an extremely interesting and relevant question. In our view, outside of this work there are a few pieces of evidence in prior work (listed below) suggesting that the shape of the ICL curve does reveal something about model internals. However, to our knowledge there is no work that clearly and directly establishes a link between the shape of the ICL curve and some mechanistic explanation for ICL. So we do believe our Bayesian scaling law is somehow founded in an underlying mechanism, but in this work we did not do the interpretability work that would be necessary to prove this to be true (and so we do not claim to have found a mechanism).
    • Schaeffer et al. (2025) show a relationship between the shape of the inference-time scaling curve and both model performance on individual examples in a dataset and the distribution of question difficulties in that dataset. An analogous relationship might be found for the ICL curve.
    • Anil et al. (2024), who proposed the (bounded) power law as a fit for the ICL curve in appendix I of their paper, also propose a mechanistic explanation for their law, related to the “function vector” mechanism in ICL reported in some other works.
  • Q3: In figures 16-19, we observed that DPO makes the ICL curve non-monotonic which renders it indescribable by all of the scaling laws we consider (which assume ICL accuracy is non-decreasing with more examples). Therefore, in the Bayesian framework it is not clear how to describe what is happening under DPO. Our best interpretation is that DPO makes the model no longer (approximately) a Bayesian learner, and that this is due to its contrastive objective which doesn’t simply match an expected distribution (unlike SFT, which in our experiments will converge to a model that only knows HMM 0) and thus has unpredictable effects on what the model ends up learning.
  • Q4: We have not closely examined the non-prior parameters between the base and instruct Llama models. We will look into this and provide the findings in a follow-up message soon.
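To make the tied parameterization and the efficiency parameter concrete, here is a hedged sketch of a Bayesian ICL law in the spirit of the responses above. The functional form, names, and numbers are our illustrative paraphrase, not necessarily the paper's exact Formula 3.

```python
import numpy as np

def bayesian_icl_law(n, rho, K, P_diag, P_off, task=0):
    """Expected next-example probability after n in-context examples of `task`.

    rho    : prior over the M candidate tasks (sums to 1)
    K      : single ICL efficiency shared across tasks
    P_diag : per-task in-distribution probabilities P_{m,m}
    P_off  : one tied off-diagonal probability P_{m,m'} for m != m'
    """
    M = len(rho)
    # Probability each candidate task m assigns to one example of `task`.
    like = np.full(M, P_off, dtype=float)
    like[task] = P_diag[task]
    # Posterior over tasks after n examples; K rescales the update rate.
    w = rho * like ** (K * n)
    w /= w.sum()
    # Posterior-weighted likelihood of the next example.
    return float(w @ like)

rho = np.array([0.1, 0.9])  # low prior on the task being demonstrated
for n in [0, 5, 20, 100]:
    print(n, bayesian_icl_law(n, rho, K=0.5,
                              P_diag=np.array([0.9, 0.8]), P_off=1e-4))
# The curve starts near rho[0] * P_diag[0] and converges to P_diag[0];
# tying means one shared P_off instead of M*(M-1) free off-diagonal values.
```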

[continued below]

Comment

Q4

There are some interesting differences in the Bayesian fits for the base and instruct Llama 3.1 8B models, besides the prior. For the efficiency parameter K we observe that the base model almost always (except on logiqa) has higher ICL efficiency than the instruct model. This suggests that post-training hurts ICL ability; we weren’t able to find any other work in the literature claiming this.

id  dataset                   Llama 3.1 8B  Llama 3.1 8B (base)
0   creak                     0.221660      0.399007
1   harmbench                 0.035062      0.014982
2   logiqa                    1.538721      0.033447
3   persona_machiavellianism  0.066707      0.160071
4   persona_narcissism        0.025853      0.108538
5   persona_psychopathy       0.063521      0.154794

For the in-distribution probabilities P_{i,i} we find that (except on harmbench) the instruct model consistently has higher values than the base model, indicating ICL converges to a higher task accuracy.

id  hmm  dataset                   Llama 3.1 8B  Llama 3.1 8B (base)
0   0    creak                     0.888260      0.526711
1   0    harmbench                 0.390944      0.845311
2   0    logiqa                    0.387798      0.231836
3   0    persona_machiavellianism  1.000000      0.617820
4   0    persona_narcissism        0.995416      0.623008
5   0    persona_psychopathy       1.000000      0.606817
6   1    persona_machiavellianism  0.992146      0.608913
7   1    persona_narcissism        0.982444      0.532016
8   1    persona_psychopathy       0.985022      0.632406

The off-diagonal probabilities (which are tied in our Bayesian law) don’t reveal much of interest; they are very low in all tasks for both models.

id  hmm  dataset                   Llama 3.1 8B  Llama 3.1 8B (base)
0   0    creak                     2.543009e-05  7.561555e-04
1   0    harmbench                 3.400257e-06  3.755968e-06
2   0    logiqa                    4.700065e-05  9.949695e-07
3   0    persona_machiavellianism  1.071971e-05  1.456856e-04
4   0    persona_narcissism        1.459376e-05  1.529750e-05
5   0    persona_psychopathy       1.099947e-05  1.862698e-06
6   1    persona_machiavellianism  1.376955e-13  1.460457e-08
7   1    persona_narcissism        6.595218e-15  1.764054e-07
8   1    persona_psychopathy       1.700813e-20  6.814592e-06

We will add this analysis to our appendix. Overall, it seems that post-training affected all parameters of our Bayesian fits; not only is the prior on harmful tasks lower, but overall ICL efficiency is decreased and ICL converges to a higher performance. These trends seem to hold up across tasks.

Q5

Below is the plot of the complete ICL curves. Note how on some datasets the performance suddenly craters at the very end. This is clearly some kind of outlier, so we excluded the end of the ICL curves from our analyses.

[Plot: complete ICL curves for all models]

Comment

Thanks for the reply; it resolved most of my questions. I have changed my rating.

Review (Rating: 7)

This paper presents a Bayesian scaling law for in-context learning (ICL) in large language models. The Bayesian perspective allows for interpretable parameters, such as task priors, learning efficiency, and per-example probabilities. Extensive experiments on GPT-2 models and real-world instruction-tuned large language models validate the proposed scaling laws, demonstrating their ability to accurately predict ICL performance. Experiments also provide insight into how SFT modifies the task priors, doing so faster in smaller models than in larger ones. Overall I find this a good paper -- the scaling laws are comparable with existing laws, and provide interpretable parameters.

Reasons to accept

  • Novel Bayesian Scaling Law
  • Interpretable
  • Interesting results with insights -- e.g. demonstrating how SFT affects task priors but not task knowledge

Reasons to reject

None

Questions for the authors

  1. To what extent does the architecture of a model influence the scaling law? There doesn't seem to be anything specific to a transformer in the modeling of the law -- so I guess this is a scaling law for in-context learning in auto-regressive models but there's no empirical assessment. Perhaps the authors could comment and also include a brief discussion in the paper.
Comment

Thank you for your questions and comments! Re: whether this law is useful/holds across architecture in general, you are correct that we did not study non-transformer autoregressive sequence modelling architectures in the paper. We are not sure if alternative architectures (e.g. SSMs, or hybrid SSM-Transformers) exhibit the same ICL curve. We can run our section 5 experiments on one of these pretrained non-Transformer models if time permits (e.g. the mamba/mamba2 models on HuggingFace). Let us know if this would be interesting in your view!

Comment

Thank you for the response; that would definitely make the paper stronger, but I won't consider it a must-have for the paper.

Review (Rating: 6)

This paper proposes a Bayesian scaling law for in-context learning, offering an interpretable framework that models task priors and learning efficiency. It matches or outperforms existing power-law baselines across synthetic and real-world settings, and provides insights into alignment failures such as many-shot jailbreaking. The approach is novel, rigorous, and highly relevant to LLM safety.

Reasons to accept

Introduces a Bayesian scaling law for ICL with interpretable parameters such as task priors, ICL efficiency, and example likelihoods, offering insights into how models update their beliefs.

Unlike prior black-box scaling laws, this approach provides meaningful dimensions that help explain the underlying learning dynamics of in-context learning.

The Bayesian law is evaluated on both controlled synthetic data (GINC, using GPT-2 variants) and real-world LLMs (e.g., LLaMA, Gemma), and it matches or outperforms existing laws, especially in extrapolation from few-shot to many-shot settings.

Reasons to reject

The paper primarily focuses on curve fitting and does not demonstrate improvements on downstream tasks such as translation, summarization, or question answering, limiting its practical applicability.

Many insights are derived from synthetic datasets (e.g., GINC and HMM), which may not capture the full complexity and ambiguity of real-world natural language tasks.

Comment

Thank you for your comments!

We want to clarify the aims of this paper: we present Bayesian scaling laws as an analysis tool for understanding the behaviour of language models. Having such a scaling law enables extrapolation of performance from few-shot ICL (e.g. section 4.1), allows for better understanding of model behaviour under post-training (section 4.2 and 4.3), and we showed these benefits hold in real-world LLMs as well (section 5). Our intention is not to “demonstrate improvements on downstream tasks” directly; instead, one could use this tool to e.g. figure out how many ICL examples to include in the context to achieve some desired performance. Additionally, our experiments in section 5 contain ample evidence of the practical applicability of our findings, and largely agree with our synthetic findings.

Are there particular experiments or pieces of evidence that would convince you of the generality of our findings, or other specific shortcomings that we could address? We appreciate your feedback.

Comment

Thanks. Can you elaborate more on the “desired performance”? For example, how does your Bayesian scaling law apply to complex, open-ended tasks like code generation or creative writing, where the output isn't binary? How would practitioners define the probability distributions and use the interpretable parameters for such tasks? I'm particularly interested in understanding whether the core motivation and framework extend beyond the classification-style tasks you've demonstrated.

Comment

This is a very sensible question. The Bayesian scaling law is fit to the negative log-likelihood of each in-context example, so as long as prompt and (ground-truth) answer pairs are provided, we are able to fit this scaling law. In the case of open-ended generation tasks, we define only 2 distributions to model: on-task and off-task. We can then fit the laws to the on-task ICL curve data, which yields priors over being on-task and off-task, the ICL efficiency parameter for this particular task, and the in/off-distribution probabilities, which tell us what performance ICL will converge to.

Our experiments on HarmBench in section 5, where the output is not just classification but rather open-ended generations, are evidence that the scaling law can be fit to such tasks. Our comparison of the base and instruct Llama 3.1 8B models in this setting shows that parameters (specifically the prior) are interpretable.

Also note that we fit our baselines to HarmBench in the same way; in general, since all the ICL curve laws model negative log-likelihood, they can all be fit to open-ended generation tasks (given ground truths). (A toy sketch of the two-distribution fit follows below.)
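To illustrate the two-distribution reduction described above, here is a hedged end-to-end sketch; the synthetic data stands in for measured per-example probabilities, and the binary law is our paraphrase rather than the authors' exact code.

```python
import numpy as np
from scipy.optimize import minimize

def two_task_law(params, n):
    # Tasks: on-task vs. off-task; a binary version of a tied Bayesian law.
    rho_on, K, p_on, p_off = params
    rho = np.array([rho_on, 1.0 - rho_on])
    like = np.array([p_on, p_off])  # prob. of on-task data under each task
    w = rho[:, None] * like[:, None] ** (K * n[None, :])
    w = w / w.sum(axis=0)
    return (w * like[:, None]).sum(axis=0)

n_shots = np.arange(50, dtype=float)
# Synthetic "observed" curve: exp(-NLL) per in-context example, plus noise.
rng = np.random.default_rng(0)
probs = two_task_law([0.2, 0.8, 0.9, 0.05], n_shots) + 0.01 * rng.standard_normal(50)

def loss(params):
    return np.mean((two_task_law(params, n_shots) - probs) ** 2)

fit = minimize(loss, x0=[0.5, 1.0, 0.8, 0.1], method="L-BFGS-B",
               bounds=[(1e-3, 1 - 1e-3), (1e-2, 10.0), (1e-3, 1.0), (1e-6, 1.0)])
rho_on, K, p_on, p_off = fit.x  # interpretable: prior, efficiency, asymptotes
print(rho_on, K, p_on, p_off)
```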

Comment

Thanks, it resolved my concern. I have raised my score.

Final Decision

This is a well-regarded paper that provides both insight and useful predictive theory. The reviews were detailed and constructive. In discussion the authors outlined reasonable options for final revision.