General-Purpose In-Context Learning by Meta-Learning Transformers
Transformers and other black-box models can exhibit in-context learning that generalizes to significantly different datasets while undergoing multiple transitions in terms of their learning behavior.
Abstract
Reviews and Discussion
The paper describes a "General-Purpose In-Context Learning" algorithm. It is basically a transformer that is trained to predict the label of a specific sample given a context (the training dataset). The authors run experiments on datasets such as MNIST and SVHN, and compare to some baselines to demonstrate that their method achieves some degree of generalization.
Strengths
- The paper addresses a very important problem in the community: learning-to-learn, in order to leverage information from previous tasks.
- The authors invest some effort in demonstrating that the model generalizes.
Weaknesses
- Lack of related work discussion: there is a tremendous amount of related work aiming to perform meta-learning or to adapt transformers for in-context learning. However, the authors do not discuss any of it: for instance, Meta-Transformer [1], OptFormer [2], or PFNs [3].
- The contribution is limited: the authors propose an approach very similar to previous work [2][3], only introducing a data augmentation step.
- The data augmentation step is not well-founded. By performing random projections, it is likely to introduce noise. According to the authors, it enables generalization, but they do not perform any ablation to test this.
- The experiments are poor at demonstrating the validity and superiority of the method. They do not use strong baselines or relevant datasets (most experiments are limited to MNIST).
[1] Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., & Yue, X. (2023). Meta-transformer: A unified framework for multimodal learning.
[2] Li, X., Wu, K., Zhang, X., Wang, H., & Liu, J. (2022). Optformer: Beyond Transformer for Black-box Optimization.
[3] Müller, S., Hollmann, N., Arango, S. P., Grabocka, J., & Hutter, F. (2021). Transformers can do Bayesian inference.
Questions
- Are the authors planning to release the code?
Dear Reviewer,
Thank you for your constructive feedback. We appreciate the opportunity to clarify and strengthen the aspects of our paper that you have highlighted.
Related Work Discussion
We have included a detailed discussion of related work in the appendix due to space constraints in the main paper. However, we understand the importance of this section and are happy to extend it to include works such as Meta-Transformer [1], OptFormer [2], and PFNs [3]. Our paper indeed makes distinct contributions compared to these works, which we outline below:
- Meta-Transformer [1]: Compared to our focus on general-purpose in-context learning, this paper has a different focus: encoding various modalities into a joint representation to be used for various multi-modal downstream tasks. We believe this is a good citation to add to the conclusion/limitations section: “A current limitation is the applicability of the discovered learning algorithms to arbitrary input and output sizes beyond random projections. Appropriate tokenization to unified representations may solve this (Chowdhery et al., 2022; Zhang et al., 2023).”
- OptFormer [2]: This paper meta-learns evolutionary-algorithm-like (EA) learning algorithms by meta-learning a population update rule based on crossover, mutation, and selection neural networks. We are happy to mention this work in the paper. We also want to highlight that, unlike [2], our approach is not EA-like: it uses a single Transformer without a population, and we analyze how the data distribution, neural network size, and memory capacity lead to in-context learning and generalization.
- PFNs [3]: As cited in the related work in the appendix, [3] demonstrated learning to learn on small tabular datasets when meta-training on synthetically generated problems; experiments on more complex classification settings such as Omniglot relied on fine-tuning. In comparison, our method investigates meta-generalization of learning algorithms directly to datasets such as MNIST, Fashion MNIST, and CIFAR10 via data augmentation, while studying fundamental questions about the conditions necessary for such generalization. We characterize transitions between algorithms that generalize, algorithms that memorize, and algorithms that fail to meta-train at all, induced by changes in model size, number of tasks used during meta-training, and meta-optimization hyperparameters. We further show that the capabilities of meta-trained algorithms are bottlenecked by the accessible state size that determines the next prediction, unlike standard models, which are thought to be bottlenecked by parameter count. Finally, we propose practical interventions, such as biasing the training distribution, that improve the meta-training and meta-generalization of general-purpose learning algorithms.
Ablations of the Data Augmentation Step
Concerning the data augmentation step, we conducted an ablation study depicted in Figure 2. This study demonstrates that while a single projection (y-axis, tasks = 2^0) leads to perfect accuracy on the test set with the same projection/task (2b), it fails to generalize to unseen tasks (2c). Generalization to unseen tasks emerges only after a sufficient number of projections (roughly 2^14), suggesting the effectiveness of our augmentation method in promoting generalization. We want to highlight that this augmentation technique only serves to scientifically analyze the effect of increasing the breadth of a task distribution on the algorithmic solutions implemented by Transformers. We are not suggesting that this specific augmentation must be the foundation of future models, but rather that it is an effective method for asking fundamental questions about general-purpose in-context learning. That said, contemporary work has also looked at related randomization methods in the context of language-based tasks (https://arxiv.org/abs/2305.16349).
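For concreteness, a minimal sketch of this projection-based task augmentation (dimensions, the Gaussian scaling, and variable names are our own assumptions, not the exact setup from the paper):

```python
import numpy as np

def make_task(x, y, num_classes, rng):
    """Create one augmented task: a random linear projection of the inputs plus a
    random permutation of class labels, applied to a base dataset (x: [N, D], y: [N])."""
    d = x.shape[1]
    proj = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d))  # fixed per task
    perm = rng.permutation(num_classes)                     # fixed per task
    return x @ proj, perm[y]

rng = np.random.default_rng(0)
x_base = rng.normal(size=(128, 784))            # stand-in for flattened images
y_base = rng.integers(0, 10, size=128)
x_task, y_task = make_task(x_base, y_base, num_classes=10, rng=rng)
```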
Experiments and Baselines
The objective of the paper is to investigate under what conditions black-box models such as Transformers learn to implement general-purpose learning algorithms. Strong performance on standard benchmarks was not the objective of this work. Based on this, we believe we have chosen appropriate datasets and baselines.
In terms of baselines, we selected algorithms like SGD and VSML, which are most relevant for evaluating general-purpose learning algorithms: SGD as a strong, general, human-engineered learning algorithm, and VSML as a meta-learner that has been shown to produce quite general learning algorithms. We also included comparisons across architectures such as LSTMs and fast weight programmers/outer-product LSTMs (Table 2 and Figure 5).
The choice of datasets was driven by our objective to meta-learn general-purpose learning algorithms. Here, the learning algorithm needs to learn a lot from scratch and extract a lot of information from the dataset at meta-test time. This naturally leads to more complex datasets requiring significantly more examples in-context (e.g., SGD requires at least hundreds of thousands of examples). Due to sequence length limitations in current Transformer architectures, using more complex datasets poses significant challenges. Thus, we focused on the basic scientific question of when such general-purpose in-context learning emerges, using simpler datasets that are currently tractable in this setting. Nonetheless, we recognize the importance of testing our approach on more complex datasets to further validate its generalizability. As a response, in Section 4.4 we demonstrate that if useful domain-relevant pre-trained networks (here from ImageNet) are available, these domain biases can be leveraged for faster in-context learning while still generalizing well across datasets (Figure 9). We also demonstrate similar algorithmic transitions with these embeddings in the appendix, Figure 12.
In response to your review and to strengthen this aspect of our research, we have also conducted experiments with CLIP embeddings and mini-ImageNet. In these experiments (see this figure), we first project inputs into a latent space with a pre-trained CLIP model and then proceed as before, randomly projecting these features and training a GPICL Transformer on top. We add the mini-ImageNet dataset in these experiments and use a 10-way 10-shot setting to ensure the same number of classes across datasets and a sequence length similar to previous experiments. We observe strong and generalizable in-context learning when leveraging these pre-trained embeddings, without meta-training on the unseen datasets.
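As a rough illustration of this pipeline (the `embed_images` stub stands in for a frozen pre-trained encoder such as CLIP; dimensions and names are assumptions, not our implementation):

```python
import numpy as np

def embed_images(images, rng):
    # Placeholder for a frozen pre-trained encoder (e.g. CLIP); here we just
    # return random 512-d features of the right shape.
    return rng.normal(size=(len(images), 512))

def make_embedded_task(images, labels, num_classes, rng):
    z = embed_images(images, rng)                            # frozen, shared embedding
    d = z.shape[1]
    proj = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d))    # task-specific projection
    perm = rng.permutation(num_classes)                      # task-specific relabeling
    return z @ proj, perm[labels]

rng = np.random.default_rng(0)
images = [object()] * 100                                    # stand-in for 100 images
labels = rng.integers(0, 10, size=100)
x_task, y_task = make_embedded_task(images, labels, num_classes=10, rng=rng)
```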
Code Release
We have released the code associated with our paper to facilitate reproducibility and further research in the field. It is currently not anonymized, so we cannot link it here.
We hope these responses address your concerns and highlight the novelty and rigor of our work. We remain open to further suggestions and are dedicated to enhancing the clarity and impact of our research.
This paper investigates how transformer-based models can meta-learn general-purpose in-context learning algorithms (that take in training data and produce test-set predictions without any explicit definition of an inference model, training loss, or optimization algorithm) with minimal inductive bias. The authors propose using black-box sequence models like LSTMs and Transformers as meta-learners, since they can learn concepts from demonstrations without an explicit definition of the learning algorithm.
The authors go to great lengths to introduce and classify different in-context learning algorithms and introduce a Transformer-based model (GPICL) and an associated meta-training task distribution. They discuss how they generate different tasks for the meta-learning algorithm by taking existing supervised datasets, randomly projecting the inputs, and permuting classes to generate many datasets from a small seed. Based on that, they define the General-Purpose In-Context Learner (GPICL): a Transformer model that is fed sequences of input-output data and asked to predict the next output based on previous inputs. During training of GPICL, each iteration uses Adam to optimize the loss on a random batch of training data sampled from a random task. The authors run many ablation studies analyzing meta-learning with Transformers. Among others, they show results indicating that the Transformer starts to learn rather than memorize when given enough memory during training, and that simple data augmentations during meta-training lead to the emergence of learning-to-learn behavior.
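For concreteness, the meta-training setup described above can be sketched roughly as follows (a hedged sketch with assumed dimensions, hyperparameters, and placeholder task data, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPICLSketch(nn.Module):
    def __init__(self, in_dim, num_classes, d_model=256, layers=4, heads=4):
        super().__init__()
        self.embed = nn.Linear(in_dim + num_classes, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x_seq, y_prev):
        # x_seq: [B, T, in_dim]; y_prev: [B, T, num_classes] one-hot of the
        # previous label (zeros at the first position).
        h = self.embed(torch.cat([x_seq, y_prev], dim=-1))
        T = x_seq.shape[1]
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        return self.head(self.encoder(h, mask=causal))

model = GPICLSketch(in_dim=32, num_classes=10)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for step in range(3):
    # Each iteration: sample a random task, then a random batch of sequences.
    x = torch.randn(8, 20, 32)                   # placeholder projected inputs
    y = torch.randint(0, 10, (8, 20))            # placeholder permuted labels
    y_prev = F.one_hot(torch.roll(y, 1, dims=1), 10).float()
    y_prev[:, 0] = 0.0                           # no label precedes the first input
    logits = model(x, y_prev)
    loss = F.cross_entropy(logits.reshape(-1, 10), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```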
Strengths
- Presents a simple baseline model (GPICL) for meta-learning general-purpose learners with minimal inductive bias. Shows competitive performance compared to models with stronger inductive biases.
- Provides interesting insights into the transitions from memorization to task identification to general learning as model size and number of tasks increase during meta-training.
- Identifies the accessible state/memory size as a key bottleneck for meta-learning capabilities, rather than just model parameter count.
- Well-written and easy to follow presentation of methods and results.
Weaknesses
- The authors use CIFAR10, MNIST, FashionMNIST, and SVHN as their datasets. These are rather simple datasets, and it would be good to see whether the findings generalize well to harder and larger datasets. Most importantly, it would be interesting to show that the method performs well due to its inherent ability to learn rather than because the datasets are easy.
- The authors do not make it explicitly clear which elements of their setup are their contribution and which are already present in other papers. I understand the random projection strategy and the coding for the transformer are the main modeling novelties, while the ablation studies have a significant impact on the understanding of the field. Still, the presentation would be improved with a clear contributions section that highlights this.
Questions
The authors claim that the performance of the model improves not with the number of parameters but with the state size. I am wondering whether this is because the datasets considered, such as MNIST, are simple enough that having more parameters is no longer helpful, rather than this reflecting a general trend.
Dear Reviewer,
Thank you for your valuable feedback on our manuscript. We appreciate your positive remarks on the strengths of our work and would like to address the concerns you've raised to clarify our contributions and findings.
Choice of Datasets (CIFAR10, MNIST, FashionMNIST, SVHN)
The choice of these datasets was driven by our objective to meta-learn general-purpose learning algorithms. Here, the learning algorithm needs to learn a lot from scratch and extract a lot of information from the dataset at meta-test time. This naturally leads to more complex datasets requiring significantly more examples in-context (e.g., SGD requires at least hundreds of thousands of examples). Due to sequence length limitations in current Transformer architectures, using more complex datasets poses significant challenges. Thus, we focused on the basic scientific question of when such general-purpose in-context learning emerges, using simpler datasets that are currently tractable in this setting. Nonetheless, we recognize the importance of testing our approach on more complex datasets to further validate its generalizability. As a response, in Section 4.4 we demonstrate that if useful domain-relevant pre-trained networks (here from ImageNet) are available, these domain biases can be leveraged for faster in-context learning while still generalizing well across datasets (Figure 9). We also demonstrate similar algorithmic transitions with these embeddings in the appendix, Figure 12.
In response to your review and to strengthen this aspect of our research, we have also conducted experiments with CLIP embeddings and mini-ImageNet. In these experiments (see this figure), we first project inputs into a latent space with a pre-trained CLIP model and then proceed as before, randomly projecting these features and training a GPICL Transformer on top. We add the mini-ImageNet dataset in these experiments and use a 10-way 10-shot setting to ensure the same number of classes across datasets and a sequence length similar to previous experiments. We observe strong and generalizable in-context learning when leveraging these pre-trained embeddings, without meta-training on the unseen datasets.
GPICL’s inherent ability to learn
It's important to note that even simpler datasets require learning when they differ significantly from the training distribution. This learning behavior of GPICL is visible in Figure 3 (where the performance increases with larger context) and Figure 4 (where we show learning as y-axis values > 0). Our method’s success on a variety of datasets, as shown in Table 2 and Figure 9, is indicative of its inherent learning ability, as the various datasets are too different from each other to facilitate generalization without learning.
Clarification of Contributions
We acknowledge the need for a clearer delineation of our contributions. Apart from the innovations in task generation and the ablation studies, our paper makes several significant contributions. We introduce a methodology for meta-training Transformers and other black-box models for general-purpose ICL, focusing on generalization to significantly different meta-test tasks. We also provide a comprehensive analysis of the factors affecting this process, such as model size, memory size, sample size, and optimization strategies. Furthermore, we reveal how these models transition between different algorithmic solutions: task memorization, task identification, and general learning-to-learn. Finally, in Section 4.4 we demonstrate how one can leverage pre-trained domain-relevant neural networks while maintaining broad generalizability. We will highlight these contributions more distinctly and are open to moving our related work from the appendix into the main paper as space allows.
Model Performance Linked to Memory/State Size vs. Number of Parameters
While we acknowledge that parameter count might play a larger role in other datasets, our findings highlight that state/memory size is a crucial, often overlooked factor. In scenarios where learning relies more on general-purpose learning rather than task similarity, parameters might indeed become less relevant. This is analogous to learning algorithms like gradient descent, which, despite having a concise symbolic description (here encoded in the parameters of the Transformer), operate on a larger number of weights (akin to the state/memory size in our model).
We hope these responses address your concerns and further highlight the novel insights and contributions of our study. We remain committed to enhancing the clarity and impact of our work.
Dear Authors,
I appreciate your detailed answer and the addition of the extra experiment using ImageNet-Mini; I think those results strengthen the paper. However, after reading the other reviews, which highlight some weaknesses of the paper, I have decided to maintain my original score.
Dear Reviewer,
Thank you for acknowledging the value added by our detailed answer and the additional experiment. We are pleased to hear that these results have contributed positively to your understanding of our work.
We understand from your comment that, despite the strengths you've noted in our paper, your original score remains unchanged, influenced by concerns raised by other reviewers. To further improve our paper and address any remaining reservations effectively, we would greatly appreciate more specific feedback.
Could you kindly elaborate on which aspects of the paper, as highlighted by other reviewers, are still a concern for you? This would help us target our rebuttal more precisely and ensure that we address all significant points.
Thank you.
This submission is a resubmission from another machine learning venue, and the paper has undergone 0 modifications since its previous rejection. While it is permissible to resubmit the work, in this case, the authors have not addressed the points raised in the earlier review process. I believe these points are crucial for the paper's improvement, and it would be counterproductive to overlook the feedback provided in the previous reviews.
If the Area Chair still deems it appropriate to consider this submission, I recommend using all reviews so far.
Strengths
N/A
Weaknesses
N/A
Questions
N/A
Details of Ethics Concerns
This submission is a resubmission from another machine learning venue, and the paper has undergone 0 modifications since its previous rejection. While it is permissible to resubmit the work, in this case, the authors have not addressed the points raised in the earlier review process. I believe these points are crucial for the paper's improvement, and it would be counterproductive to overlook the feedback provided in the previous reviews.
If the Area Chair still deems it appropriate to consider this submission, I recommend using all reviews so far.
Dear Reviewer,
Thank you for your feedback. We understand your concern regarding the resubmission of our paper. We want to clarify that since the previous submission, we have made several adjustments in our writing, modified certain claims, and added new experiments to the appendix. These changes were made in response to the feedback received, although they might not have been adequately highlighted in the revised manuscript. We appreciate your input and are open to any specific comments or suggestions you may have to further improve our paper.
Moreover, we have also added new content in this rebuttal, including new experiments with CLIP embeddings and mini-ImageNet. In these experiments (see this figure), we first project inputs into a latent space with a pre-trained CLIP model and then proceed as before, randomly projecting these features and training a GPICL Transformer on top. We add the mini-ImageNet dataset in these experiments and use a 10-way 10-shot setting to ensure the same number of classes across datasets and a sequence length similar to previous experiments. We observe strong and generalizable in-context learning when leveraging these pre-trained embeddings, without meta-training on the unseen datasets.
I acknowledge reading the comment. I appreciate the authors' reply.
This paper demonstrated that transformers can be meta-trained to act as general-purpose in-context learners. This paper also characterizes transitions between algorithms that generalize, algorithms that memorize, and algorithms that fail to meta-train at all, induced by changes in model size, number of tasks, and meta-optimization. This paper proposes practical interventions such as biasing the training distribution to improve the meta-training.
Strengths
- This paper performed experiments on image classification datasets to demonstrate that transformers can be meta-trained to perform in-context learning.
- Figure 2 gives convincing evidence of a transition from memorization to generalization induced by model capacity and sample size.
- This paper provides practical interventions to improve meta-training.
Weaknesses
- The writing is not completely clear. For example, "general-purpose in-context learning" is a vague term without a rigorous mathematical definition. This makes the paper a bit hard to read.
- The memory or state in Section 4.2 is quite heuristic, without a concrete mathematical definition. Beyond LSTMs and Transformers, it is not clear how the state is defined. The insight that "large state is more crucial than parameter count" is thus not fully grounded.
The contributions of this paper were significant one year ago when this paper first came out. However, the main message that "one can meta-train a transformer to perform ICL" is well-known nowadays. I am not sure how to evaluate this paper given this situation.
Questions
Could the authors address the comments on the weakness?
Dear Reviewer,
Thank you for your insightful comments on our paper and for recognizing the significant contributions of our paper when it was first completed. We appreciate the opportunity to clarify and strengthen our arguments in light of your feedback.
Clarity of "General-Purpose In-Context Learning"
We acknowledge that "general-purpose in-context learning" (GPICL) may seem like a broad term. However, we believe that most practitioners in the field have a strong intuition about the general-purpose applicability of human-engineered learning algorithms (such as gradient descent) across a wide range of problems. In the context of our work, it is intended to capture the wide-ranging applicability of the meta-learned learning algorithms. Our goal is to move towards this broad applicability, and while we do not claim to have fully achieved this ambitious objective, our work lays down foundational steps towards it.
Definition of Memory or State
We understand the concerns regarding the heuristic nature of the memory or state definition in Section 4.2. To provide more clarity, we conceptualize the state/memory capacity as the amount of information (quantifiable in bits) accessible about the sequence for subsequent predictions. We approximate this in our experiments through the size of the activation state vector. This approximation serves as a practical approach to explore and quantify the role of memory capacity in in-context learning, particularly in relation to model size and parameter count.
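As a toy illustration of this distinction (our own example with made-up sizes, not a result from the paper), consider an LSTM, whose accessible state is its hidden and cell vectors: the number of state entries carried between steps is far smaller than the parameter count.

```python
import torch.nn as nn

# Illustrative LSTM: the parameters encode the (fixed) update rule, while the
# hidden and cell vectors are the per-step accessible state/memory.
lstm = nn.LSTM(input_size=64, hidden_size=256, num_layers=1)
param_count = sum(p.numel() for p in lstm.parameters())
state_entries = 2 * 256   # hidden + cell entries carried from step to step
print(f"parameters: {param_count:,}  accessible state entries: {state_entries}")
```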
Significance of Contributions
Although the idea of meta-training a Transformer for ICL might be more recognized now than it was a year ago when this paper first came out, we believe our paper still offers substantial contributions. Firstly, we extend the understanding of how Transformers and other black-box models can be effectively meta-trained for generalizing ICL. Secondly, we provide a detailed analysis of the conditions under which these models succeed, focusing on aspects like model size, memory size, sample size, and optimization strategies. Lastly, our work uncovers the transitions between different algorithmic solutions that a Transformer can implement, which is a novel insight not fully explored in existing literature.
In light of our paper’s initial contributions and the above-mentioned, still-relevant distinguishing features, we argue that this work warrants publication.
We hope these clarifications address your concerns and illustrate the novel contributions and implications of our work. We remain open to further suggestions and are committed to enhancing the clarity and impact of our research.
The paper shows how transformers (and other models) can be meta-trained for in-context learning, taking into account losses, optimization algorithms, architectures, etc. The general topic is clearly very interesting and timely. The reviewers were not convinced that the revised form, compared to a previous version of this paper, has addressed all the initial criticisms. In any further revisions, it would be important for the authors to clearly highlight the improvements over the TMLR submission at the time of initial submission, rather than discussing them only in the rebuttal phase.
Why Not a Higher Score
None of the reviewers are very positive. Some are very negative.
Why Not a Lower Score
NA
Reject