6.0

/10

Poster4 位审稿人

最低5最高7标准差0.7

2.8

置信度

正确性3.5

贡献度2.5

表达2.5

NeurIPS 2024

From Unstructured Data to In-Context Learning: Exploring What Tasks Can Be Learned and When

Kevin Christian Wibisono,Yixin Wang

OpenReview PDF

提交: 2024-05-16更新: 2024-11-06

TL;DR

We demonstrate the significance of co-occurrence, positional information, noise, and data structures for in-context learning from training on unstructured data.

摘要

关键词

in-context learninglarge language modelsunstructured datacontinuous bag of wordsco-occurrence

评审与讨论

审稿意见

评分: 6置信度: 22024-07-08

The paper basically presents three theoretical analyses related to ICL. The section 2 shows that we can use CBOW to do the (country)-(capital) kind of ICL. The section 3 shows that positional embeddings, multiple layers in autoregressive LM, and blocked noise structures are important for ICL. The section 4 shows that ICL could fail when there are systematic and consistent mismatches between the training sequence and testing sequence.

优点

I think this paper is easy to follow and most explanations are clear (One minor suggestion: it would be more clear to also illustrate the correct answer of each prompt and provide some brief explanations such as the prompt in section 3 tries to repeat the first letter of a word). I choose fair in the presentation rating because I feel that the paper oversells its contributions in the title and abstract.

All the claims are supported by both strong theoretical conclusions and empirical simulations. The theoretical contributions are novel to me but I am not a theoretical researcher. Since the situations/preconditions of the claims are extremely simple, I think its significancy is not high for practitioners, but the contributions might be significant for theoretical researchers and might inspire the follow-up work.

缺点

I think the main weakness of this paper is the mismatch between scope it seems to cover and its actual scope. The title and abstract suggests that this paper tries to study why the ICL works well given the unstructured training data in practice, but what the paper actually did is thoroughly studying 3 toy situations.

I understand that we often have to simplify the situations in order to get strong theoretical conclusions. I also think that, at least to me, it is difficult to derive those theoretical results in such simplified situations and all the toy situations are relevant to the ICL. Nevertheless, I think these situations are not very representative to most of the practical ICL settings. After all, most ICL is beyond just relying the co-occurrence statistics of the sentences like CBOW, finding the first letter of the word, and repeating some words in the context.

I understand that nowadays, one paper often needs to oversell its scope in the title and abstract to get attentions. Hence, although I suggest that the authors can revise the main storyline to reduce the overselling, I am fine with the current title and abstract if this paper is accepted at the end. I am also not a researcher who studies theory, so I do not know how significant or novel these theoretical results are. Therefore, I would like to let other researchers who have better theoretical background to rate this paper.

问题

I have no specific question to the authors, so I think the rebuttal won't change my opinion. I won't lower my score if the authors choose to skip the rebuttal to my comments.

局限性

The limitation discussion in the appendix K is fair.

作者回复

2024-08-07

Thanks for reviewing our paper. We are delighted that you found it both easy to follow and supported by strong theoretical conclusions and empirical simulations. Below, we address your comments.

It would be more clear to also illustrate the correct answer of each prompt and provide some brief explanations such as the prompt in section 3 tries to repeat the first letter of a word.

We appreciate your suggestion. We have modified Figure 1 to include a brief description of each ICL task and the correct answer for each prompt. Please refer to the PDF file in "Author Rebuttal".

I think the main weakness of this paper is the mismatch between scope it seems to cover and its actual scope. The title and abstract suggests that this paper tries to study why the ICL works well given the unstructured training data in practice, but what the paper actually did is thoroughly studying 3 toy situations.

After all, most ICL is beyond just relying on the co-occurrence statistics of the sentences like CBOW, finding the first letter of the word, and repeating some words in the context.

I suggest that the authors can revise the main storyline to reduce the overselling.

Thanks for your comments. We have revised the title of the paper to “In-Context Learning from Training on Unstructured Data: The Role of Co-Occurrence, Positional Information, and Training Data Structure.”

Additionally, we have updated the abstract to better reflect that this paper identifies three components necessary for ICL to occur from training on unstructured data and provides both theoretical and empirical justifications for each component. For instance, we included the sentence “To this end, we thoroughly examined three ICL tasks and identified three components that are crucial for ICL to occur from training on unstructured data: co-occurrence, positional information, and training data structure.” We hope these changes will alleviate your concern regarding the discrepancy between the paper’s title and abstract and its actual scope.

Regarding the content, we would like to emphasize that we did not claim that this paper provides a complete explanation for why ICL works well with unstructured training data. While we fully agree that ICL encompasses more scenarios than those covered in the paper—as noted in the Limitations section of the original submission—our goal is to offer insights into some components of unstructured data that are crucial for ICL to occur.

Furthermore, although the ICL tasks discussed in this paper are simple, they are firmly grounded in prior research. For example, ICL tasks involving known pairings—such as word translations and country-capital city pairs—have been analyzed in studies by Brown et al. (2020), Todd et al. (2024), and others. Similarly, ICL tasks like word-first letter pairings have been explored in works by Xu et al. (2024), Chen et al. (2024), and others. We believe that thoroughly examining ICL in these simplified settings offers an interesting perspective on ICL, particularly regarding the importance of co-occurrence, positional information, and the structure of the training data.

References

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Neural Information Processing Systems.

Chen, Y., Zhao, C., Yu, Z., McKeown, K., & He, H. (2024). Parallel structures in pre-training data yield in-context learning. arXiv preprint arXiv:2402.12530.

Todd, E., Li, M., Sharma, A., Mueller, A., Wallace, B. C., & Bau, D. (2024). Function vectors in large language models. International Conference on Learning Representations.

Xu, Z., Shi, Z., & Liang, Y. (2024). Do large language models have compositional ability? An investigation into limitations and scalability. ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models.

评论- Thank you for willing to revise the title and abstract

2024-08-08

I raise my presentation score from 2 to 3 and overall score from 5 to 6.

Nevertheless, I remain my confidence to 2 because I think the main contribution of this paper is its theoretical analyses but I do not have sufficient background to judgement the this main contribution.

审稿意见

评分: 5置信度: 42024-07-13

This paper studies the emergence of in-context learning (ICL) in both CBOW (Mikolov et al,. 2013) and Transformer models. The focus is on simple synthetic settings that can be studied both theoretically and through synthetic experiments with small models. The paper identifies co-occurence as a key ingredient for ICL to emerge in CBOW. Then the paper considers how positional information is critical for a Transformer (or any) model to identify certain order-dependent patterns. Finally, the paper presents two sythetic scenarios involving repetition in which ICL fails with a simple model.

优点

The paper begins by identifying synthetic scenarios in which co-occurrence within a sentence is sufficient for a continous bag-of-worlds (CBOW) model to be able to perform ICL. The paper proves two theorems identifying when co-occurrence statistics are sufficient to ensure that CBOW could perform ICL.
The paper then proves that positional information is required to perform certain sythetic tasks related to the ordering of tokens.
Finally, the paper identifies two sythetic settings in which one might expect ICL to work.
Sythetic experiments support each of the above claims.

缺点

The paper states that "ICL is achievable by only modeling co-occurrence information using [CBOW]". However, this seems to miss the generality with which the term ICL is used. That is, ICL is commonly used for generation tasks such as full-sentence machine translation (not just the simple token-level translation examples in this paper). So to say that "ICL is achievable" seems like a misuse of the terminology. Without a more careful definition of ICL, this statement is invalid.
After showing that Llama 2 is unable to use ICL to translate the word English words "soon" and "main" to Indonesian, the paper claims that "these models should be equally likely to produce the correct answer for any given [word], irrespective of its relevance to the in-context examples. However, our experiment demonstrates that this is not the case". This is a huge leap for a poorly designed experiment. Llama 2 was trained on 98.08% English data. The amount of Indonesian language data may have been miniscule. As such, co-occurence may offer an explanation for the result, but adjacency might be equally informative. To speak of co-occurrence without any discussion of adjacency seems a bit odd here. This same issue appears later in the paper's claim "This suggests ICL may arise from co-occurrence information", whereas a claim that it is informed by co-occurrence might be more apt.
It is not clear to this reader why one would expect the setting in Section 4.1 to succeed via ICL in the first place. For example, we also wouldn't expect these settings to suceeed if they were presented to a supervised learner either because of the mismatch between the training examples and the prompt example.
The paper relegates the entire 2.5 page related work section to the appendix. It would be better to include more in the main paper; at present only lines 25-32 in the Intro address prior work making it difficult to position this paper early on.

问题

In line 258, the paper claims that "each token in V should be present as the first token in both the training and test sets." But shouldn't we be interested in whether this is really required in the largest of LLMs? Is there any way to connect this result back to larger models?

局限性

Yes

作者回复

2024-08-07

Thanks for reviewing our paper. Below, we address your comments.

The paper states that "ICL is achievable by only modeling co-occurrence information using CBOW". However, this seems to miss the generality with which the term ICL is used. … So to say that "ICL is achievable" seems like a misuse of the terminology. Without a more careful definition of ICL, this statement is invalid.

We appreciate your comment. We have updated the paper to clarify that this statement applies to ICL tasks with known input-output pairings, like country-capital and word translations.

The paper claims that "these models should be equally likely to produce the correct answer for any given [word], irrespective of its relevance to the in-context examples. However, our experiment demonstrates that this is not the case". This is a huge leap for a poorly designed experiment. Llama 2 was trained on 98.08% English data. The amount of Indonesian language data may have been miniscule.

We thank you for your comment. Although LLaMA 2 and other LLMs are mostly trained on English corpus, we believe this does not invalidate our experiment, as the words involved in the experiments are common in both languages. If the models learn the English-to-Indonesian mapping, they should translate correctly regardless of context.

However, it is worth noting that the experiments in Section 2.4 do not involve any rare languages, and the same conclusion applies.

It is not clear to this reader why one would expect the setting in Section 4.1 to succeed via ICL in the first place.

Thanks for pointing this out. As per your suggestion, we have revised the wording in Section 4.1 to clarify that the failure of ICL in this scenario is not surprising and is instead related to the importance of tasks present in the training data, similar to findings of Raventós et al. (2023) and Yadlowsky et al. (2023).

The original phrasing that the setting in Section 4.1 should succeed via ICL was intended to differentiate ICL in LLMs from simple pattern recognition. In Section 4.1, the pre-training data and test prompts consist of sentences with different repeating patterns. A model (or human) that learns to recognize these patterns is expected to successfully perform ICL on the test prompts, even when the pattern is different. Similarly, in Section 4.2, a model (or human) that identifies consistent $(a_i, b_i)$ pairs at the start and end of each pre-training sentence should successfully perform ICL on the test prompts that contain in-context examples of the form $(a_i, b_i)$ .

It would be better to include more (related work) in the main paper; at present only lines 25-32 in the Intro address prior work making it difficult to position this paper early on.

We thank you for the comment. We have expanded the Introduction section to better relate our work to existing literature and clarify its position in the ICL research landscape. Specifically, we highlighted the following comparisons:

Numerous studies connected ICL with classical methods like gradient descent (e.g., Akyürek et al. (2022)), Bayesian inference (e.g., Xie et al. (2022)), and Newton’s method (e.g., Fu et al. (2023)). In contrast, our work links ICL to the continuous bag of words (CBOW) model, demonstrating that ICL involving known input-output pairings can be enabled through learning CBOW co-occurrence patterns.
While several studies examined the pre-training aspects of ICL, such as data distribution (e.g., Chan et al. (2022)) and task diversity (e.g., Raventós et al. (2023)), our work emphasizes the importance of co-occurrence, positional information, and training data structure for ICL.
Other research explored ICL in specific data-generating processes like discrete functions (Bhattamishra et al. (2023)) and autoregressive processes (Sander et al. (2024)). Our work focuses on data characterized by input-output pairs and repeating token patterns.

In line 258, the paper claims that "each token in V should be present as the first token in both the training and test sets." But shouldn't we be interested in whether this is really required in the largest of LLMs? Is there any way to connect this result back to larger models?

Thanks to your comment, we repeated the experiment with larger models by varying embedding dimensions (10, 50, 100, or 200) and transformer layers (1, 5, 10, or 20). We found that test accuracy remained zero when each token in $V$ appeared as the first token in exactly one of the training and test sets, regardless of model size. In practice, this condition is likely met due to the vast size of LLMs' pre-training data.

References

Akyürek, E. et al. (2022). What learning algorithm is in-context learning? Investigations with linear models. ICLR

Bhattamishra, S. et al. (2023). Understanding in-context learning in transformers and LLMs by learning to learn discrete functions. ICLR

Chan, S. et al. (2022). Data distributional properties drive emergent in-context learning in transformers. NeurIPS

Fu, D. et al. (2023). Transformers learn higher-order optimization methods for in-context learning: A study with linear models. NeurIPS M3L

Raventós, A. et al. (2024). Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression. NeurIPS

Sander, M. E. et al. (2024). How do transformers perform in-context autoregressive learning? ICML

Xie, S. M. et al. (2022). An explanation of in-context learning as implicit Bayesian inference. ICLR

Yadlowsky, S. et al. (2023). Pretraining data mixtures enable narrow model selection capabilities in transformer models. arXiv

2024-08-14

Thank you for your response! We completely agree with your point and will make the necessary revisions to lines 75-77.

评论- Reply to rebuttal

2024-08-14

Thanks for the thorough response. You've mostly addressed my concerns, but one of the small (but still important!) odd claims you make is not, and that is raised again below -- hopefully more clearly this time.

We have updated the paper to clarify that this statement applies to ICL tasks with known input-output pairings

Great. Qualifying definitions of ICL seems very important.

the experiments in Section 2.4 do not involve any rare languages, and the same conclusion applies.

Sorry -- I think you latched on to my rare languages point and missed the broader question about poor experimental design. The paper still claims in lines 75-77 that "If ICL stems from the ability of LLMs to recognize consistent mappings in test prompts, these models should be equally likely to produce the correct answer for any given [word], irrespective of its relevance to the in-context examples. However, our experiment demonstrates that this is not the case".

But why would anyone expect that a model should be equally likely to produce a correct answer for any given word irrespective of its relevance to the in-context examples? That is, your if/then statement does not seem like something most researchers would agree with. For example, decades of work on NLP and machine learning have shown that when learning through parameter estimation (instead of ICL), the relevance of the training examples to the test examples is of critical importance (i.e. the entire NLP literature on domain adaptation) (see [1] and [2]). Similarly, recent work on retrieval augmented ICL and the like has shown that large gains are possible when one retrieves a collection of in-context examples that are relevant (usually in an nearest-neighbor embedding sense) to the test instance (see [3] and [4]). In other words, the dominant view is that the when a model is learning (either via fine-tuning or ICL), whether the test instance is relevant to training examples is a critical factor. (Note that the citations here are just ones I could quickly find for MT. But I imagine you could find similar papers supporting this for, say, classification tasks closer to your experiment in Section 2.4 as well.)

In other words, it seems reasonable that one should counter your statement "our experiment demonstrates this is not the case" by pointing out that the LLM could simply be learning to recognize consistent mappings in test prompts /and/ that the mappings it infers may be domain specific. That would better align with the fields current understanding of how learning, be it through parameter estimation or ICL, happens in most models (LLMs and other varieties too).

This is a relatively minor issue with the paper, but the claim seems so outlandish that I'm returning back to it here.

[1] "A Survey of Domain Adaptation for Neural Machine Translation" [2] "Domain Adaptation and Multi-Domain Adaptation for Neural Machine Translation: A Survey" [3] "Towards Robust In-Context Learning for Machine Translation with Large Language Models" [4] "Efficient Machine Translation Domain Adaptation"

We have expanded the Introduction section to better relate our work to existing literature and clarify its position in the ICL research landscape.

Great!

In practice, this condition is likely met due to the vast size of LLMs' pre-training data.

That seems reasonable.

Under the assumption that lines 75-77 can be fixed easily, I'm bumping my score up to a 5.

审稿意见

评分: 7置信度: 32024-07-14

The paper studies the emergence of ICL using a synthetic setting. Particularly, it focuses on the importance of concurrence statistics to ICL, and shows that under some simplified conditions, a CBOW-styled model is proven to complete the correct completion for an ICL example. The paper additionally proves the importance of position encodings in the studied setting, showing that when the ICL task is inherently task dependent, position encodings is necessary for good performance.

优点

The paper studies an important problem. The approach---reconstructing LM behavior in much "shallower" models---is intriguing and can be applied to additional problems concerning LMs. The technical claims are well presented and the paper is overall very readable.

缺点

The main weakness is that the paper studies a very synthetic setting. I understand some simplification are needed for the derivation of theoretical results, and this is OK. But, for example, it would be interesting to try deriving results on cases where the input consists of valid grammatical sentences, rather than a concatenation of tuples. If that is not possible, the paper should clearly state the disparity between "real" ICL setting and this setting. While LMs can be presented with tuples in inference time, they are usually not trained on such tuples, but rather on free form language.

问题

The experimental results use cross entropy loss rather than the squared loss used in the theory. Why?

局限性

See above.

作者回复

2024-08-07

Thank you for reviewing our paper. We are pleased that you considered the problem we addressed important and found the paper well-presented and very readable. Below, we address your comments.

It would be interesting to try deriving results on cases where the input consists of valid grammatical sentences, rather than a concatenation of tuples. If that is not possible, the paper should clearly state the disparity between "real" ICL setting and this setting. While LMs can be presented with tuples in inference time, they are usually not trained on such tuples, but rather on free form language.

We appreciate your comment. We have expanded the Limitations section to clearly emphasize the distinctions between our setting and the general ICL setting. That being said, it is worth noting that the results related to co-occurrence are applicable to valid grammatical sentences, as the co-occurring pairs can appear anywhere within the sentence (e.g., "Beijing is the capital of China," "the city of Beijing is located in China").

The experimental results use cross entropy loss rather than the squared loss used in the theory. Why?

We thank you for this comment. While the cross-entropy loss is frequently used in practice, it does not yield a unique optimal set of parameters (due to the translation invariance of the softmax function), and obtaining closed-form expressions for the minimizers can be difficult. Therefore, we used the squared loss to make theoretical analyses more manageable. This simplification has been adopted in other theoretical works, such as in Li et al. (2023).

References

Li, Y., Li, Y., & Risteski, A. (2023). How do transformers learn topic structure: Towards a mechanistic understanding. International Conference on Machine Learning.

评论- Reponse

2024-08-13

Thanks for your response.

Please note I asked why the squared loss was not also used in the experiments.

2024-08-13

Thanks for the clarification.

We confirmed that the experimental results with the squared loss align with the theoretical findings. However, we did not include it in the paper because the squared loss is rarely used in practice.

Concretely, in the context of Theorem 1, an accuracy of $1$ is achieved when $(p_0, p_1, p_2) = (0, 1, 0)$ or $(0, 0, 1)$ under the squared loss in the clean scenario, consistent with our theoretical results.

The full results with the squared loss are as follows:

In the clean scenario, an accuracy of $1$ is achieved for each of the 6 $(p_0, p_1, p_2)$ tuples.
In the corrupted scenario, an accuracy of $0$ is achieved when $(p_0, p_1, p_2) = (0, 1, 0)$ , while an accuracy of $1$ is achieved for the remaining 5 tuples.

Under the setting of Theorem 2, the balanced accuracy when $(p_0, p_1, p_2) = (0, 0, 1)$ is $1$ under the squared loss, also confirming our theoretical results.

评论- Response

2024-08-14

Thanks for the clarification. I maintain my positive assessment.

审稿意见

评分: 6置信度: 22024-07-16

The paper investigates the emergence of ICL from training on unstructured data. It explores two types of ICL tasks: the first involves input-output pairings that frequently co-occur within sentences, and the second comprises recognizable patterns that do not commonly co-occur. The authors demonstrate that the first task can be addressed by modeling co-occurrence information, and highlight the importance of positional information and blocked noise structures through the second task. Additionally, the paper discusses scenarios where ICL fails. Both theoretical and experimental evidence are provided in the paper.

优点

It enhances understanding of how the structure of training data influences the emergence of ICL capabilities
the paper provides a mix of theoretical proofs and empirical validations to support its claims

缺点

There is a lack of experiment details in the paper, such as the number of training sentences used, the frequency of each input-output pair's repetitions within training sentences, and the methodology for generating training and evaluation data.
The scope of the experiments is limited, using small datasets and simplistic model architectures. Moreover, there is an absence of real-world data.
There is uncertainty about whether the findings would scale well to complex real-world data, larger models and higher embedding dimensions.

问题

Can you add examples of training sentences and prompts for experiments?

局限性

Limitations are discussed in the paper.

作者回复

2024-08-07

Thank you for reviewing our paper. We are pleased that you found our work valuable in enhancing the understanding of ICL on unstructured training data, and our theoretical and empirical results well-supported. Below, we address your comments.

lack of experiment details in the paper, ... number of training sentences used, the frequency of each input-output pair's repetitions ... methodology for generating training and evaluation data.

Can you add examples of training sentences and prompts for experiments?

We appreciate your comment. We have included more details about the experiments in the paper to strengthen our arguments. For convenience, we summarized them below:

For Table 1 exps, training data consists of 50K sentences. In the clean version, sentences are generated uniformly as described in lines 122-124. In the corrupted version, sentences are similarly generated, but each $(c_i, d_i)$ pair is replaced by $(c_i, r_j)$ or $(d_i, r_j)$ with a probability of $1/4$ each (lines 124-126). Test sentences are generated according to the setup in Theorem 1.
- Clean ex:
  - Training: $c_1 d_1 r_1 r_2 r_3 r_4 r_5 r_6$ or $r_1 r_2 r_3 r_4 r_5 r_6 r_7 r_8$
  - Prompt: $c_1 d_1 c_2 d_2 c_3 d_3 c_4 \dots$
- Corrupted ex:
  - Training: $c_1 r_1 r_2 r_3 r_4 r_5 r_6 r_7$ or $c_1 d_1 c_2 r_1 r_2 r_3 r_4 r_5$
  - Prompt: $c_1 d_1 c_2 d_2 c_3 d_3 c_4 \dots$
For Table 2 exps, training data consists of 50K sentences. In the clean version, sentences are generated uniformly as described in lines 166-169. In the imbalanced and extreme versions, the 60 other words are divided into three categories: 20 for $cd$ sentences ( $rcd_{\cdot}$ ), 20 for $ce$ sentences ( $rce_{\cdot}$ ), and 20 for both types ( $r_{\cdot}$ ). In the imbalanced version, $cd$ ( $ce$ ) sentences are 4 times more likely to sample a $cd$ ( $ce$ ) word than a $ce$ ( $cd$ ) word. In the extreme version, $cd$ ( $ce$ ) sentences cannot contain any $ce$ ( $cd$ ) words. Test sentences are generated according to the setup in Theorem 2.
- Clean ex:
  - Training: $c_1 d_1 r_1 r_2 r_3 r_4 r_5 r_6$ or $c_1 e_1 r_1 r_2 r_3 r_4 r_5 r_6$
  - Prompt: $c_1 d_1 c_2 d_2 c_3 d_3 c_4 \dots$ or $c_1 e_1 c_2 e_2 c_3 e_3 c_4 \dots$
- Imbalanced ex:
  - Training: $c_1 d_1 rcd_1 rcd_2 rcd_3 rce_4 r_5 r_6$ or $c_1 e_1 rcd_1 r_2 rce_3 rce_4 rce_5 r_6$
  - Prompt: $c_1 d_1 c_2 d_2 c_3 d_3 c_4 \dots$ or $c_1 e_1 c_2 e_2 c_3 e_3 c_4 \dots$
- Extreme ex:
  - Training: $c_1 d_1 rcd_1 rcd_2 rcd_3 r_4 r_5 r_6$ or $c_1 e_1 r_1 r_2 r_3 rce_4 rce_5 r_6$
  - Prompt: $c_1 d_1 c_2 d_2 c_3 d_3 c_4 \dots$ or $c_1 e_1 c_2 e_2 c_3 e_3 c_4 \dots$
Table 3 exps follow the setup of experiments in Table 2, except that the pairs are now of the form $(c_i, d_i)$ and $(e_i, f_i)$ instead of $(c_i, d_i)$ and $(c_i, e_i)$ .
For Section 2.5 exps, below is an ex. for each sentence type in Appendix D:
- Paramaribo is the vibrant heart of Suriname.
- Gabon (GAB) protects its diverse rainforests and wildlife.
- The banking sector is central to Liechtenstein's prosperity.
- Every country has its unique cultural identity and heritage.
- The city of Dushanbe reflects Tajikistan's vibrant spirit. Roseau is the cultural tapestry of Dominica.
- Mayotte (MAY) features lush landscapes and peaks. Turkmenistan (TKM) features the fiery Darvaza Crater.
The ICL prompts follow the form used in Section 2.4, with 1-5 examples.
For Table 4 exps, training and test data consist of all sentences in the form $abca$ , where $a$ , $b$ , and $c$ are distinct. Each test sentence is different from any training sentence. In the first scenario (both), the first tokens of the training sentences cover the entire vocabulary. In the second scenario (either), each token can be the first token in either the training or test data, but not both (lines 255-256).
For Table 5 exps, training data consists of 50K sentences generated uniformly as detailed in Section 3.1 (lines 289-296). The ICL prompt formats are also described in Section 3.1.
For Table 6 exps, training data consists of 50K sentences. In the clean scenario, training data are of the form $abcadefd$ and $abcbdefe$ , with ICL prompts as $\underline{abcadef} \dots$ and $\underline{abcbdef} \dots$ . In the block-noisy scenario, training data include sequences like $n_1 n_2 n_3 n_4 abcadefd$ and $abcb n_1 n_2 n_3 n_4 defe$ , with ICL prompts as $\underline{abcadefdghi} \dots$ and $\underline{abcbdefeghi} \dots$ .
For Table 7 exps, training data consists of 50K sentences generated uniformly according to the processes in Sections 4.1 and 4.2. The ICL prompt formats are also described in the same subsections.

The scope of the experiments is limited, using small datasets and simplistic model architectures. … absence of real-world data.

uncertainty about whether the findings would scale well to complex real-world data, larger models and higher embedding dimensions.

Thanks for your comment. We recognize these issues as shortcomings of our paper, as already noted in the Limitations section. However, we believe that our thorough analyses with small data sets and simple model architectures provide valuable insights into some components of unstructured data that are crucial for ICL. Also, even though we did not train LLMs on real-world data sets in our paper, we empirically validated our co-occurrence arguments by probing LLaMA 2, which was trained on real-world data sets (see Section 2.4).

Regarding larger models with higher embedding dimensions, our findings remain consistent with the theoretical conclusions. For example, even in larger models (embedding dimension up to 200 and number of layers up to 20), the test accuracies in Table 4 remain zero when each token in $V$ appears as the first token in exactly one of the training and test sets. Similarly, under the same models, the test accuracies in Table 7 also remain zero, in line with Theorems 5 and 6.

评论- Response to the Rebuttal

2024-08-14

I appreciate the authors' responses during the rebuttal. The experimental details are helpful for the comprehension of the work. I would like to maintain my original score as I am not very confident in my assessment over the theoretical contributions of this work.

作者回复

2024-08-07

We would like to thank all reviewers for providing constructive and insightful reviews. Please find our response to each reviewer in the "Rebuttal" section. Also, the updated Figure 1 (as requested by Reviewer xShA) is attached here.

最终决定Accept (poster)

2024-09-25

This paper examines how in-context learning ability can emerge from training on data containing particular types of simple patterns and some distributional assumptions. The authors explore a progression of increasing complex settings, finding that even a CBOW-style bag-of-words model can exhibit ICL-like behavior on some, while others require models to have access to positional information, multiple layers, or the presence of specific “blocked” noise structures in the data.

Reviewers agreed that this paper made a strong theoretical contribution to our understanding of how ICL can emerge, and appreciated that it was empirically validated in synthetic, controlled settings. They also highlighted how the analysis identified particular components that were necessary for ICL in these settings.

As weaknesses, reviewers nuMz, cRrU, and xShA all noted the limited scope of experiments, being limited to small models and simple architectures. For a theoretical contribution, this shouldn’t be a deal-breaker, although (as suggested e.g. by reviewer cRrU) the paper could benefit from more discussion that connects this to more natural settings, such as grammatical sentences.

While the substance of the contributions is strong, reviewer xShA also noted that it oversells the scope a bit; with the simplified settings studied, it could perhaps benefit from a more narrow framing. The authors noted in the response that they plan to update the title, and I would encourage them to highlight in the discussion how the setting either departs from, or represents a distillation of patterns that are likely to be found in real-world data.