Towards Better Understanding of In-Context Learning Ability from In-Context Uncertainty Quantification

Shang Liu,Zhongze Cai,Guanting Chen,Xiaocheng Li

OpenReview PDF

提交: 2024-09-26更新: 2024-11-20

摘要

Predicting simple function classes has been widely used as a testbed for developing theory and understanding of the trained Transformer's in-context learning (ICL) ability. In this paper, we revisit the training of Transformers on linear regression tasks, and different from the existing literature, we consider a bi-objective prediction task of predicting both the conditional expectation $\mathbb{E}[Y|X]$ and the conditional variance Var$(Y|X)$. This additional uncertainty quantification objective provides a handle to (i) better design out-of-distribution experiments to distinguish ICL from in-weight learning (IWL) and (ii) make a better separation between the algorithms with and without using the prior information of the training distribution. Theoretically, we show that the trained Transformer reaches near Bayes optimum, suggesting the usage of the information of the training distribution. Our method can be extended to other cases. Specifically, with the Transformer's context window $S$, we prove a new generalization bound of $\tilde{\mathcal{O}}(\sqrt{\min\{S, T\}/(n T)})$ on $n$ tasks with sequences of length $T$, providing sharper analysis compared to previous results of $\tilde{\mathcal{O}}(\sqrt{1/n})$. Empirically, we illustrate that while the trained Transformer behaves as the Bayes-optimal solution as a natural consequence of supervised training in distribution, it does not necessarily perform a Bayesian inference when facing task shifts, in contrast to the equivalence between these two proposed in many existing literature. We also demonstrate the trained Transformer's ICL ability over covariates shift and prompt-length shift and interpret them as a generalization over a meta distribution.

关键词

in-context learningTransformeruncertainty quantificationBayes optimalsupervised learning

评审与讨论

审稿意见

评分: 5置信度: 32024-10-31

The paper explores the in-context learning (ICL) ability of transformers to learn both the weight vector and noise variance (uncertainty quantification) of linear regression tasks. For a class of multilayer transformers with bounded context window $S$ , it is shown that the empirical risk minimizer converges to the Bayes optimum with excess risk bounded as $\widetilde{O}(\sqrt{(S\wedge T)/nT})$ . Experiments show that transformers can perform uncertainty quantification under task and covariate shifts if task diversity during pretraining is sufficiently large. Nonetheless, it is also shown that transformers may deviate from the Bayes optimal predictor on out-of-distribution tasks, possibly due to taking statistical shortcuts, in contrast to previous works which postulate an equivalence with Bayesian averaging.

优点

The uncertainty quantification setup is quite interesting, allowing us to examine out-of-distribution prediction capability in a more controlled manner. The paper is overall well written and easy to follow. The experiments on distribution shifts are well-designed and extensive, and the results demonstrating deviation from the Bayes optimal prediction are quite surprising, contradicting previous works which equate ICL with Bayesian averaging. While the linear regression setup is quite stylized, the insights indicate new directions for future work to study ICL/OOD ability at larger scales and practical datasets.

缺点

My main issues lie with the theoretical parts of the paper.

Theorem 3.2 seems to be similar to existing analyses such as the cited work Zhang et al. (2023b), which derives a bound in terms of the Markov chain mixing time $\sqrt{\tau_{\mathrm{mix}}/nT}$ and under more general assumptions on data generation and loss function. While the authors compare to these works and claim their result is tighter, it is clear that considering a finite context window gives $\tau_{\mathrm{mix}} = S\wedge T$ and retrieves their rate. (In Zhang et al., there is also a TV-to-sqrt(KL) "conversion factor" incurred in the approximation error when considering the excess risk, so the given bound could be seen an improvement. However, Zhang et al. also show that the KL term decays exponentially with increasing depth, so it has little effect on the overall risk bound.)
The theory and experimental parts of the paper are disjointed and do not support each other. The focus of the theoretical contribution is to quantify the effect of the finite context window and the analysis does not really depend on the specific bi-objective task as the authors mention. In contrast, the experiments focus on the uncertainty quantification ability and do not study the effect of $S$ .
The discussion on ICL versus IWL ability (Section 2.1) is interesting and the constraint (3) seems reasonable, however it is not rigorously shown whether it is actually satisfied for the proposed ICL setup. Also, the proposed measurements of ICL and IWL ability are not evaluated theoretically or empirically. I believe a more through justification of the setup is important since the experiments are limited to linear regression tasks (this point should also be made clearer in the main text).
Minor: the definition of the Transformer model class needs to be given in the main text.

问题

Can the ICL and IWL ability measure defined in Section 2.1 be analyzed theoretically or empirically?
In many ICL setups, the prediction loss is not summed up over all $t=1,\cdots,T$ but only evaluated for the final query token $(t=T)$ . How would this affect the derived generalization error bound?
Appendix B.2 seems to study the approximation error in terms of the truncated Bayes-optimal error, but this does not tell us the behavior of the population minimizer. Can $R(TF_{\theta^*})$ be bounded, e.g. in terms of depth?

审稿意见

评分: 5置信度: 32024-11-03

This paper addresses the in-context uncertainty quantification for pre-trained transformers, focusing on predicting both the mean and variance of the conditional distribution of a target variable given input features. They derive a Bayes-optimal solution for the task and demonstrate that transformers are capable of performing in-context uncertainty quantification.

The authors rigorously analyze the uncertainty quantification problem when Transformers are constrained to a context window of size $S$ . They derive a generalization bound of $\tilde{O}(\sqrt{\min\{S, T\}/(nT)})$ for pre-training over $n$ tasks with sequence length $T$ , extending to other cases under the assumption of bounded and Lipschitz loss functions. They laim that this bound is tighter than existing analyses, particularly when $S < T$ , and is the first of its kind to utilize the context-window structure to model a Markov chain over the prompt sequence and construct an upper bound on mixing time. They also discuss the additional approximation error due to a finite context window and quantify the trained Transformer's risk convergence to the Bayes-optimal risk.
The paper provides an empirical evaluation of the in-context learning abilities of transformers for uncertainty quantification across three distribution shift scenarios: task shift, covariate shift, and prompt length shift. The results suggest that transformers can learn both mean and uncertainty predictions in-context, even under moderate task distribution shifts, when trained on data with substantial task diversity. Moreover, the authors show that increasing task diversity through meta-learning enhances robustness to covariate shifts. Finally, they observe that removing positional encoding from the embedding vector significantly improves generalization performance.

优点

This paper is fairly written even though I do not understand some parts. Their problem setup is easy to follow and their organization is good (except the section 2.1). However, I could not understand why they write transformers with capital T.
The authors develop a good framework for studying the uncertainty quantification. Their theoretical analysis of the generalization properties of the loss function with the uncertainty quantification is interesting. They empirically observe the performance of transformers under task shift, distribution shift, and length shift.
The observation that the positional embedding worsens the performance of length generalization is valuable.
The way they define the loss function with respect to $\hat{\sigma}$ and $\hat{y}$ is valuable and they provide the Bayesian optimal of these parameters.

缺点

I do not understand some parts of section 2.1 ( I think this section is written poorly). I think there is a gap between their definitions of IWL and ICL and the loss function the authors use. It would be better if the authors explain the relationship between their loss function and their definitions of ICL and IWL.

1.1. I am not able to follow the discussion of the pattern and sequence in section 2.1. I will ask some questions in the questions section.

1.2. The authors claim that the understanding of uncertainty quantification objective is important for understanding the contribution of in-context learning compared to the contribution of in-weight learning. This is fair. However, towards end of the subsection, they define IWL(M) and ICL(W) based on the worst and best case test distribution, which are not directly related to variance of the model. Therefore, the way of defining the loss function with the variance does not perfectly align with the definition of IWL(M) and ICL(M). Furthermore, they do not utilize (at least I did not notice) the definitions of IWL(M) and ICL(M) in their paper. When two functions are defined at the beginning of the paper (specifically in problem setup), I expect that the functions are frequently used, which is not the case in the paper.

Theorem 3.2 is suboptimal, and their comparison to previous work, specifically Li et al. (2023b), is not fair.

2.1. I think that Theorem 3.2 is suboptimal because I do not understand why the generalization performance of the transformers decreases with the context length. Consider the case when $S = T/4$ and $S = T/2$ . From Theorem 3.2, I observe that the generalization performance is worse when $S = T/2$ , but I think the generalization performance should improve when the context length increases. How can increasing the context length decreases the generalization performance. Please correct me if my reasoning fails.

2.2. The authors compare their bound with Li et al. (2023b) and Zhang et al. (2023b) in an unfair way. The bounds obtained in Li et al. (2023b) is better than the current paper when $S \geq T$ . Li et al. (2023b) does not cover the case where $S << T$ . In order to compare the results in the case of $S<T$ , the authors change the algorithmic stability assumption (in line 297-298) and compare their results. However, when the algorithmic stability term is $\mathcal{O}(1/T)$ , then I think the bounds in Li et al. (2023b) still better than the current paper.

问题

Is the loss function devised by the authors and is there any previous works that study the same loss function? Could the authors provide a brief discussion for this loss function whether a similar loss function has been utilized in the literature?
What is pattern and sequence? Could the authors define these terms rigorously? Is the pattern a distribution? The authors use $\mathcal{P}$ for distribution first, then use the same notation for a pattern.
How is the loss function defined based on a pattern? It is used in the definition of IWL and ICL, but I could not find the definition.
Could you provide a more rigorous argument about the definitions of IWL and ICL?

审稿意见

评分: 5置信度: 32024-11-06

This paper studies the in-context learning abilities of the trained transformer through the lens of uncertainty quantification. The authors train the transformer for the task of mean and uncertainty prediction and demonstrate that transformers achieve Bayes near-optimal estimation and uncertainty prediction. However, the transformer is only able to achieve near-optimal in-distribution risk, which doesn't imply it behaves the same as the Bayes-optimal predictor. When testing on out-of-distribution tasks, the performance of transformers heavily relies on the training method.

优点

The paper considered the in-context learning and in-context uncertainty prediction problem, a new problem not considered in the literature.
The theoretical result of Theorem 3.2 is stronger than prior results in the regime when S << T.
The paper analyzed the OOD generalization performance of trained transformers.

缺点

The paper just provided a generalization error bound and didn't upper bound the approximation error, and hence, it does not fully theoretically justify that transformer can be trained to perform in-context uncertainty prediction.
The limited OOD generalization performance has already been explored in the literature, e.g., [Garg et al., 2022], albeit with the in-context learning task.

问题

What is the purpose of analyzing the regime when S << T? In this regime, the generalization error is better, but the approximation error is worse.

撤稿通知

2024-11-20

Thanks for the valuable questions and advice. After careful consideration, we find it difficult to fully address the concerns of the reviewers without significant revision of the paper draft. We henceforth decide to withdraw our draft and prepare a thorough revision. We sincerely thank the reviewers for their time.