Task Diversity Shortens the ICL Plateau
Abstract
Reviews and Discussion
This paper examines the phenomenon of in-context learning (ICL) in language models, where a model learns from example inputs and outputs within a context to respond accurately to subsequent queries. Researchers have found that during ICL, models often undergo periods of minimal learning progress (“loss plateaus”) before rapidly improving. In this study, the authors show that training on a variety of diverse ICL tasks concurrently reduces these loss plateaus, making each task quicker and easier to learn. This result challenges the expectation that increased task complexity would slow down the learning process. The authors suggest that the enhanced performance of large-scale language models could be due to both the richness of large datasets and the facilitation of learning through task diversity, which helps streamline optimization.
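For concreteness, here is a minimal sketch of what a single synthetic ICL training sequence looks like in this line of work, assuming a Garg et al.-style in-context linear-regression setup; the dimensions and function class are illustrative, not the paper's exact configuration.

```python
import torch

def make_icl_sequence(d=20, n_points=40):
    """One training sequence: a fresh task instance (here a linear map) and
    its in-context demonstrations (x_1, f(x_1), ..., x_n, f(x_n))."""
    w = torch.randn(d)             # task instance, resampled per sequence
    xs = torch.randn(n_points, d)  # in-context inputs
    ys = xs @ w                    # labels y_i = <w, x_i>
    return xs, ys

# The model is trained to predict each y_i from the preceding demonstrations,
# so "learning" the task must happen in-context, within a single forward pass.
xs, ys = make_icl_sequence()
```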
Strengths
This paper is clearly written. The figures and captions are easy to follow, and the motivation is clear.
Weaknesses
The primary conclusion of this paper, “Task diversity shortens the ICL plateau,” may appear somewhat trivial. The study demonstrates that training models with multiple tasks simultaneously involves using significantly more data (i.e., training with n tasks simultaneously implies utilizing n times the amount of training data). Though these data sets pertain to different in-context learning (ICL) tasks, the observed improvements in learning efficiency could stem from the increased data volume rather than solely from “task diversity.” In this sense, the results align with well-established concepts of “knowledge transfer” in multitask learning, where it is well-documented that similar tasks can support each other’s learning. This paper essentially reinforces that ICL tasks share underlying similarities, which is trivial.
Additionally, it’s unclear how this finding can be practically applied to large language model (LLM) fine-tuning. If task diversity is indeed advantageous, does this imply a need to incorporate task-irrelevant data when fine-tuning a task-specific LLM? Clarifying this connection could greatly enhance the practical relevance of the paper’s conclusions.
Questions
Please refer to my weaknesses section above.
We thank the reviewer for bringing the prior views on multi-task learning into the discussion.
We firmly disagree that the conclusion of our work is trivial. As we discuss in the common response, we maintain that our contribution, an optimization perspective, differs from the prior statistical perspective of multi-task learning.
It appears that the authors did not address my reviews comprehensively. Upon reviewing feedback from other reviewers, I notice that many of my concerns are shared. Below are my comments:
- Reviewer aaqM also highlighted the issue of training on a larger batch, and I appreciate that the authors reran the experiments in response. This adjustment makes the setting fairer, and I believe the experimental results in the paper should be updated accordingly.
- One of my key concerns remains unaddressed: “If task diversity is indeed advantageous, does this imply a need to incorporate task-irrelevant data when fine-tuning a task-specific LLM?” I find this question crucial, and I am unclear why the authors have chosen to ignore it.
- Regarding the optimization argument presented in the paper, I agree with Reviewer UZEF that there is likely related literature supporting the benefits of multi-task training. Although I am not an expert in this domain, I used GPT-4o to identify relevant references. For instance, one conclusion from the literature is that multi-task training often causes different tasks to converge at the same rate. Consequently, tasks that converge slowly when trained independently may benefit from accelerated convergence during multi-task training. This phenomenon is illustrated in Appendix C, Figure 1 of [1]. Therefore, I find the optimization-based justification in the paper inadequate.
Given these comments, I will maintain my current score.
[1] Asynchronous Convergence in Multi-Task Learning via Knowledge Distillation from Converged Tasks.
We kindly ask the reviewer for further clarity on the following points.
1. Reviewer 6hyd references the concerns of Reviewer aaqM. Can the reviewer comment on whether they find our rebuttal to Reviewer aaqM convincing or not? If convincing, perhaps the initial criticism is no longer relevant. If not, we would like to understand why.
We also remark that looking to other reviewers to find additional points of criticism that the reviewer did not originally raise may not be informative, especially when our rebuttal to the said criticism is not addressed. If the reviewer has new questions or new points of criticism, we will happily address them. (Also, as stated in our response to Reviewer aaqM, the main experimental results have been updated accordingly.)
2. We apologize for not addressing this question. We initially misinterpreted the remark as a rhetorical point. We now know that the reviewer considered it a crucial question in need of a response. Our answer is yes. One of the key messages of our paper is: Even if you are only interested in ICL task A, it is beneficial to jointly train ICL tasks B and C because doing so speeds up training. Within the context of ICL, our answer is a very direct and unequivocal yes, and we provide the empirical evidence to back up the claim.
3. First of all, we point out that the reviewer's claim that "The primary conclusion of this paper, ... appear somewhat trivial" is a very strong claim that should be substantiated with evidence or an argument. The reviewer does provide an argument, but we point out that the reviewer's argument is statistical, while our main claim is in the sense of optimization. As we elaborate in the common response, this distinction between the statistical and optimization claims is crucial. We wonder if the reviewer agrees that this distinction is a meaningful one.
The reviewer acknowledges using "GPT-4o to identify relevant references" due to being "not an expert in this domain". While leveraging search tools—whether based on traditional IR methods or LLMs—is undoubtedly useful for identifying relevant prior work, we hope the reviewer understands that this comment makes us feel a little disheartened. Ideally, one expects reviews to reflect expert opinions (informed by search results).
Substantively, the claim that the prior conclusions of [1] (in particular, Figure 1 in Appendix C of [1]) overlap with our main findings seems to be a wholly hallucinated one.
Specifically, [1] does not consider ICL, but does consider multi-task training, and starts from the claim that naive multi-task training has the problem of overfitting. The goal of that paper is to remedy this overfitting, but, in any case, the work does not make any comparisons with single-task training, so its findings have no bearing on whether multi-task training is easier or harder than single-task training.
This paper studies how multi-task learning improves in-context learning on individual tasks. Through experiments on synthetic tasks, it shows that multi-task training with in-context examples shortens the loss plateaus of single in-context learning tasks, making learning easier. The phenomenon also appears for other model architectures (Mamba) and tasks (simple natural language tasks).
Strengths
- The paper studies the learning dynamics of the in-context learning ability of language models, which is an important direction for understanding how language models obtain their capability.
- "Task Diversity Shortens the ICL Plateau" is an interesting and valuable insight for designing new methods to accelerate the learning of LMs.
- The experiments are sufficient to verify the claims in the paper (at least in the toy settings).
Weaknesses
- An important property of ICL training is its generalization to unseen tasks [1,2]. This paper only shows that increasing the task diversity improves the learning of tasks that are also in the training set. It would be better to examine how the diversity of ICL training tasks affects ICL on unseen tasks.
- The training tasks seem to have equal weights in the experiments (by sampling the same number of instances and balancing the losses). However, tasks usually follow a long-tail distribution in real-world pre-training corpora. A discussion of the setting where the tasks are unbalanced would bring the conclusions closer to practical scenarios.
[1] MetaICL: Learning to Learn In Context. 2022. In NAACL.
[2] Pre-Training to Learn in Context. 2023. In ACL.
Questions
N/A
We are very happy to hear that the reviewer found our work interesting and valuable. The individual comments are addressed in the following.
Weakness 1. As our paper concentrates on the optimization itself rather than the out-of-distribution (OOD) generalization capability, we did not perform evaluations on unseen ICL tasks. Nevertheless, although this is beyond the scope of our work, we strongly anticipate that the multi-task trained model would do better on the OOD ICL tasks.
Weakness 2. The reviewer makes an interesting point about mixing tasks with uneven probability, so we conducted experiments on this. We selected five ICL tasks and mixed them using probabilities of , resulting in five different combinations. Under these uneven mixing conditions, we consistently observed a shortened plateau, further indicating that our findings hold in such cases.
Plateau reduction with uneven sampling
| Number of tasks | Sparse Parity(2) | Sparse Parity(3) | Linear Regression | Quadratic Regression | Sparse Linear Regression |
|---|---|---|---|---|---|
| 1 | 2.8k (4.2k) | 10.2k (11.5k) | 2.3k (3.3k) | ||
| 5 | 2.4k (2.9k) | 3.0k (6.8k) | 2.0k (2.5k) | 2.4k (4.1k) | 2.0k (2.5k) |
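For reference, a sketch of how such uneven task mixing can be implemented; the mixture weights below are hypothetical placeholders, since the exact probabilities used in this experiment are not listed above.

```python
import numpy as np

TASKS = ["sparse_parity_2", "sparse_parity_3", "linear_reg",
         "quadratic_reg", "sparse_linear_reg"]
MIX_PROBS = np.array([0.40, 0.25, 0.15, 0.12, 0.08])  # hypothetical long-tail weights

def sample_batch_tasks(batch_size, rng):
    """Assign each sequence in a batch to one of the ICL tasks
    according to the uneven mixture."""
    return rng.choice(TASKS, size=batch_size, p=MIX_PROBS)

rng = np.random.default_rng(0)
print(sample_batch_tasks(8, rng))
```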
Thank you for your response. I think my current score is appropriate.
In this work, the authors explore how training models with ICL examples from multiple tasks simultaneously can improve the rate at which models learn the individual tasks, even compared to training on each task alone. In particular, the authors characterize this through the phenomenon of the loss plateau, a scenario where the model's loss stagnates for a significant number of steps before a sudden large decrease, and show that multi-task ICL training significantly shortens this plateau duration. They use this to conclude that task diversity appears to aid optimization, making ICL training more efficient.
Strengths
The manuscript is very well written; the division of the paper makes it very intuitive to follow, and the plots/tables are informative. The main findings are also quite interesting in how they present a setting where (if all holds) learning various tasks at once can potentially improve the efficiency of learning a marginally or completely unrelated task. The notion of measuring how the loss plateau contracts is a rather novel way of investigation, and I see avenues where it could be useful for exploring other types of emergent properties in language models.
Weaknesses
Some points made by the authors do not have completely solid grounding in the results or lack some support.
For example, on L16 and L199-202, the authors state that "multi-task ICL is easier to learn than single-task ICL is surprising as it contradicts the natural intuition that the combined complexity of multiple ICL tasks would lengthen the learning process". I'm not particularly convinced of this notion, at least in the regime within which the authors present their results. The sample is broken up into tokens, therefore the explicit rules of the functions that are being sampled for the in-context examples are likely not being directly captured in the sequence being presented. Additionally, since samples consist of examples from a fixed function (that differs between samples), learning multiple tasks together can serve to learn more robust representations that won't as quickly over-fit to individual samples [1].
I also believe that to arrive at the authors' conclusions, further baselines need to be included. More explicit control over the exact sampling of tasks from , as well as the sampling of the parameters associated with the specific instance of , should be necessary. For example, tasks can be sampled with uneven probability; if the shortening is consistent among different task classes but with the same sampling probabilities, then this can serve as a more explicit signal that it is directly due to the use of multiple tasks rather than to a set of very few pairs of tasks that exhibit greater transfer compared to others.
Another particular concern of mine is that the paper lacks any specific formal analysis of the optimization landscape between the two settings. From my perspective, this causes a significant portion of the results, while interesting, to appear speculative at best, which, combined with the lack of explicit diversity among the different ICL tasks (either regression or boolean), suggests a possible difficulty in extrapolating these observations to more general scenarios. I appreciate the examples provided in Appendix C on other natural language tasks, but given that the results are somewhat inconclusive with regard to the settings in which multi-task ICL can help, I believe it is quite essential to investigate the differences from an optimization perspective.
In summary: I did enjoy quite a few points about this paper, but the authors need to reformulate some critical points (or provide details to demonstrate that those are not critical). If the necessary clarifications can be provided, I am open to changing my opinion of the work, but at the moment I believe there still remains some work to be done to solidify the details in it.
[1] Sebastian Ruder. An Overview of Multi-Task Learning in Deep Neural Networks. arXiv preprint, 2017.
Questions
- Perhaps I am overthinking, but I want to just make sure my understanding of the training mechanism is correct. I understand that in the multi-task setting, from which is sampled. Then is sampled from the domain of the sampled . Therefore the training (for any given sample) is on any arbitrary task in the set of tasks that were selected. If this understanding is correct, please ignore the next two comments.
- Is (the number of samples from each function class) fixed at the same value for all values of , or do they scale together; I apologize but I couldn't find this exact detail either in the main text or in the Appendix.
- I believe it would be useful to attempt an experiment where you can sample different functions from the same function class for constructing the tasks and see how this affects learning.
- L157: Why this complexity measure was chosen isn't particularly justified. Importantly, although you measure based on the number of learning steps, there are a number of different
- For Section 4.2, I believe that the authors should be able to use synthetically generated data where an explicit common structure exists to better demonstrate the claims.
We sincerely thank the reviewer for their insightful comments. We address the concerns in the following.
Regarding the point on the difference between single-task and multi-task training. As we understand it, the reviewer commented that our argument is not convincing because the example sequence is broken into different tokens. In our setup, however, since the in-context demonstrations are already real numbers, we use a single linear embedding layer without a tokenizer, following the experimental setup of prior works [1,2]. Therefore, the explicit function rule is accessible to the model through the sequence.
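To make this explicit, here is a minimal sketch of such a tokenizer-free read-in; the dimensions are illustrative, and placing the scalar label in the first coordinate is one common convention in this line of work, not necessarily the paper's exact choice.

```python
import torch
import torch.nn as nn

d, d_model = 20, 256
read_in = nn.Linear(d, d_model)  # a single linear embedding layer, no tokenizer

def embed_prompt(xs, ys):
    """Interleave x_1, y_1, ..., x_n, y_n and embed each as one token,
    so the function's input-output rule is presented to the model directly."""
    ys_padded = torch.zeros_like(xs)
    ys_padded[:, 0] = ys                          # scalar label in coordinate 0
    tokens = torch.stack([xs, ys_padded], dim=1)  # (n, 2, d)
    tokens = tokens.reshape(-1, d)                # (2n, d), interleaved
    return read_in(tokens)                        # (2n, d_model)
```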
Further baselines. The reviewer made an excellent point that a further baseline comparison would be valuable, as the plateau reduction might be due to very few pairs of tasks that exhibit greater transfer compared to others. We also thought about this question, which led to Table 1 (line 247). For Table 1, we evaluated all possible combinations of different tasks and averaged the number of iterations needed to escape the plateau. Table 1 reveals that all ICL tasks, not just a selected few, tend to benefit from each other.
Additionally, as suggested by the reviewer, we conducted experiments on multi-task training with uneven sampling probabilities. To construct a multi-task ICL, we selected five ICL tasks. These tasks were then mixed using probabilities of , resulting in five different combinations. Under these uneven mixing conditions, we consistently observed a shortened plateau, further indicating that our findings hold in such cases.
Plateau reduction with uneven sampling.
| Number of tasks | Sparse Parity(2) | Sparse Parity(3) | Linear Regression | Quadratic Regression | Sparse Linear Regression |
|---|---|---|---|---|---|
| 1 | 2.8k (4.2k) | 10.2k (11.5k) | 2.3k (3.3k) | ||
| 5 | 2.4k (2.9k) | 3.0k (6.8k) | 2.0k (2.5k) | 2.4k (4.1k) | 2.0k (2.5k) |
Optimization landscape. While we do not provide a formal/rigorous analysis of the optimization landscape, Section 4.2 provides plausible insights into this phenomenon. Section 4.2.2 demonstrates a common structure across ICL tasks and hypothesizes that task diversity plays a critical role in learning this common structure, particularly from an optimization perspective. Section 4.2.3 verifies the generality of this hypothesis through experiments in the supervised learning setup.
Question 1. Yes! To generate the prompt, for each , we sample from a designated distribution s. (These are described in Section 5).
Therefore, as noted, changes across the task, and also across the prompt, thus a model needs to `in-context learn' through demonstrations .
Question 4: The choice of complexity measures. We chose the iteration number as the notion of complexity because it is a directly observable and practically relevant quantity. There are certainly other options, but we would like the reviewer to consider this part of the paper as a very rough intuitive discourse surrounding the newly observed phenomenon.
Question 5: Common structure experiment. As the reviewer suggested, we ran the same checkpoint experiments (Section 4.2.2) for the data with an explicit common structure. We consider a linear in-context regression, , but is sampled from three different distributions: , , and . We can still observe the shortened plateau, compared to single-task baselines.
Common Structure
| Number of tasks | Linear Regression () | Linear Regression () | Linear Regression () |
|---|---|---|---|
| 1 | 4.3k (4.8k) | 10.2k (11.5k) | 4.5k (5.5k) |
| 2 | 2.9k (3.3k) | 2.1k (2.5k) | 3.0k (3.3k) |
| 3 | 1.9k (2.3k) | 1.9k (2.3k) | 1.9k (2.4k) |
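A sketch of the data-generating process for this common-structure experiment is below; the three distributions over the weight vector are hypothetical Gaussian stand-ins, since the exact distributions are not reproduced above.

```python
import torch

def sample_w(sub_task, d=20):
    """Draw the weight vector from one of three distributions (placeholders)."""
    if sub_task == 0:
        return torch.randn(d)            # e.g., N(0, I)
    if sub_task == 1:
        return 2.0 + torch.randn(d)      # e.g., a mean-shifted Gaussian
    return 0.5 * torch.randn(d)          # e.g., a rescaled Gaussian

def make_sequence(sub_task, d=20, n_points=40):
    """All three sub-tasks share the same linear structure y = <w, x>."""
    w = sample_w(sub_task, d)
    xs = torch.randn(n_points, d)
    return xs, xs @ w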
We believe these answers clarify the reviewer's concerns. If the reviewer feels that our results are valuable and novel, we kindly ask the reviewer to increase the score.
[1] Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? A case study of simple function classes. NeurIPS, 2022.
[2] Satwik Bhattamishra, Arkil Patel, Phil Blunsom, and Varun Kanade. Understanding in-context learning in transformers and LLMs by learning to learn discrete functions. ICLR, 2024.
Thank you to the authors for providing these additional results. I detail my comments and response to their rebuttal below.
Regarding the point on the difference between single-task and multi-task training.
Thank you for clarifying this part (it might be useful to make this very explicit in your experimental details). I still believe, however, that there are some missing points that are not completely discussed in terms of previous work that has also suggested that multi-task learning is beneficial for training (as the authors are working in this regime).
Further baselines.
I thank the reviewer for these results.
Optimization landscape
I appreciate the hypothesis the authors provide in Section 4.2. However, I feel the current results in that section don't fully justify that the claim being made is what is actually learned by the model in practice. I do think further experimentation is necessary to make this argument more convincing.
Question 1
Thank you for the clarification.
The choice of complexity measures
Thank you for the note. I admit this wasn't a particularly significant point, but it might be worth further measuring this phenomenon from other perspectives that might already exist in the literature.
Common Structure
Thank you for these results; they do appear to follow the previous results. Although my point was more about using this to make the claims in Section 4.2 more concrete (for example, if explicit structure exists, can the level of shared structure be integrated into a better understanding of how the ICL plateau is shortened?), this does provide some additional empirical evidence.
In general, I'm grateful for the response from the authors and the effort they have put in given the short time period for the rebuttal. I'm still a little hesitant to say that I'm convinced by the authors' claims and arguments, especially since they are based almost exclusively on empirical evidence in what remains a somewhat constrained setting which might not be fully representative of the underlying structure of other problems.
Nevertheless, I'm willing to slightly increase my score, but at the same time I feel a need to lower my confidence, as there remains a gap between the scope of the analysis and the justification, which remain somewhat preliminary and do not feel completely bridged yet from my perspective. I hope the authors are understanding of these concerns and do not interpret this in a negative manner, but rather as a signal of possible improvements that can be looked at in the near term.
The paper explores the impact of multi-task learning ("task diversity" in the title) on the speed of training on individual tasks in the context of In-Context Learning (ICL) across various language-modeling architectures (Transformer, Mamba, Hyena). Quite surprisingly, the results suggest that the multi-task setup leads to faster training/convergence on individual tasks, compared to training on those individual tasks alone.
The experiments primarily use synthetic tasks, but language tasks are also included as validation. The paper also investigates different explanations for its findings, e.g., finding that the creation of Induction Heads on its own doesn't explain the results.
Strengths
The paper showcases a good number of experiments with solid results. I also appreciate the explanation of the experimental setup, and detailed experiments with multiple tasks and models. The authors also present a good analysis of some possible explanations for the results achieved. The presentation is great, and so is the writing.
Weaknesses
The most significant weakness of the work is the construction of the single-task baselines. In my opinion, this issue alone should justify the rejection of the paper. In short - the experiments compare single-task models trained with batch size B with multi-task models (with k tasks) trained with batch size kB - a k times larger batch size, and k times the training data/compute cost! Then, the authors use the total number of iterations (batches) to compare models. While this is mentioned in the paper (line 177 and Appendix A), I think results should be shown with a more standard x-axis showing the number of tokens and not the number of iterations.
While the results, as initially presented, sound surprising, the fact that multi-task models train on several times as much data as the baseline makes the results fairly trivial, or at least makes it extremely easy to achieve positive results. Imagine this experiment: instead of training a multi-task model on different tasks F1 and F2, let's set F1 to be equal to F2 (the same task twice). Such a "multi-task" model is then trained with simply a 2x bigger batch size than the baseline, and so, obviously, it converges faster in terms of iterations (but probably not faster in terms of tokens!). Of course, the authors set different F1 and F2, but they are fairly similar, so I expect the authors to get these results by default (if the model is not undersized wrt. problem complexity).
The paper shows a model trained on 6 tasks jointly (using a 6x bigger batch size). Then, convergence between 1.4x and 9.2x faster in terms of iterations (see Figure 1) is no longer surprising! In terms of tokens, that would mean a relative speed between 0.23x and 1.53x, which means multi-task learning is between 4x slower and 1.5x faster. (Although parity tasks break this pattern; with single-task training, the model ends up at a bad local minimum. I'd expect the issue to be with training stability and insufficient baseline tuning, however.)
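For transparency, the arithmetic behind the 0.23x-1.53x figures can be spelled out; it simply divides the iteration speedup by the number of tasks, under the batch-size accounting described above.

```python
# Worked check of the token-normalized speedup quoted above (k = 6 tasks,
# so the multi-task batch consumes 6x the tokens per iteration).
k = 6
for iter_speedup in (1.4, 9.2):
    token_speedup = iter_speedup / k  # tokens = iterations * (k * B) * seq_len
    print(f"{iter_speedup}x in iterations -> {token_speedup:.2f}x in tokens")
# 1.4x -> 0.23x (i.e., ~4x slower in tokens); 9.2x -> 1.53x faster
```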
I'd expect a comparison on the same number of tokens - preferably with the same total batch sizes for both single-task and multi-task settings, as training with different batch sizes and numbers of batches makes the setup prone to mistuning the baseline by accident.
While there is a possibility that I don't understand something fundamental here, I'm inclined to reject the paper for the above reason. With proper comparison metrics, I believe the results are fairly trivial. What is worse, the paper in its current state is a bit misleading.
Questions
As stated in the weakness section.
- Why was a comparison on iterations, not total tokens, chosen?
- Were baselines tuned for their smaller batch sizes?
We thank the reviewer for raising very precise points regarding (1) batch size and (2) hyperparameter tuning. We recognize that these are central issues to the evaluation of our work. So, we would like to quickly address Reviewer aaqM’s concerns while we prepare a more comprehensive response for the full rebuttal.
Batch size: We would like to clarify that our intention behind the original batch size choice was to track the sample complexity for the single task in consideration. However, the reviewer raises an excellent point that this comparison would be misleading when two tasks are very similar. We therefore re-ran the experiments with a uniform batch size of instead of for the multi-task ICL. The results are shown in Figure (https://drive.google.com/file/d/1ABDNowiZuhXSEz-jCdyvmiEGamyxWO6u/view?usp=sharing), and we see that the plateau-shortening effect persists, except from the Conjunction setup.
(We also note that running iterations with batch size is not necessarily the same as running iterations with batch size , especially when training is stuck in a plateau. So taking the plot of our original submission and elongating the multi-task ICL curve by a factor of k to match the batch size is not necessarily a fair comparison. The only fair comparison is a direct experiment, and it shows that the plateau-shortening effect persists.)
Hyperparameter tuning: In our original experiments, we did no parameter tuning and blindly followed the setup from [1]. We thought that this would be fair, or perhaps even favorable to the baseline, since the single-task batch size of 64 is inherited from [1], and the learning rates of [1] were tuned with the batch size of and not the batch size of of the multi-task ICL setups.
In any case, we now see that it would be better to tune all setups, especially to double-check the validity of our new results. The results are shown in Figure (https://drive.google.com/file/d/1GRuMlpSEf7cHqucbv-CJy-367Fvp2eh_/view?usp=sharing), and they are qualitatively the same. The absence of tuning did not lead to confounding or erroneous conclusions.
Parity task and local minimum: The reviewer made the comment: "Although parity tasks break this pattern, with single-task, the model ends up at a bad local minima. I’d expect the issue to be with training stability and insufficient baseline tuning, however." We clarify that the difficulty of training ICL parity tasks has also been reported in prior works [2,3], and this provides us with, in a sense, "social proof" that the training failure is not due to poor implementation.
In fact, while we did not emphasize this point in the writing, we consider ourselves to have provided a solution to the open problem of training ICL parity. Some may have thought ICL parity to be too combinatorial and challenging for transformers to learn, but we show the positive result that transformers can learn to solve parity, and our multi-task training scheme elegantly resolves the training difficulty.
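For readers unfamiliar with the task, here is a sketch of a sparse-parity ICL sequence of the kind discussed here; the dimensions and subset sizes are illustrative, not the paper's exact settings.

```python
import torch

def make_parity_sequence(d=20, k=2, n_points=60):
    """Inputs are +/-1 vectors; the label is the product of a hidden subset of
    k coordinates, fixed per sequence, which the model must infer in-context."""
    subset = torch.randperm(d)[:k]                    # hidden support
    xs = torch.randint(0, 2, (n_points, d)) * 2 - 1   # entries in {-1, +1}
    ys = xs[:, subset].prod(dim=1)                    # parity of the selected bits
    return xs.float(), ys.float()
```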
Conclusion: In about a week, we will provide a full rebuttal addressing all comments of Reviewer aaqM and the other reviewers. In particular, we will re-run all experiments with equitable batch sizes. We wanted to quickly respond to the main concerns of Reviewer aaqM, as they are central to the evaluation of our work. We maintain the main claim of our paper, and we will make our complete case with our updated full set of experiments. In the meantime, we welcome Reviewer aaqM to provide any comments or questions if anything is unclear about the experiments of this intermediate rebuttal.
[1] Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes. NeurIPS, 2022.
[2] Jongho Park, Jaeseung Park, Zheyang Xiong, Nayoung Lee, Jaewoong Cho, Samet Oymak, Kangwook Lee, and Dimitris Papailiopoulos. Can mamba learn how to learn? a comparative study on in-context learning tasks. ICML, 2024.
[3] Satwik Bhattamishra, Arkil Patel, Phil Blunsom, and Varun Kanade. Understanding in-context learning in transformers and llms by learning to learn discrete functions. ICLR, 2024.
We provide further experiments addressing the reviewer's concerns.
Table 1: For Table 1, we trained all possible combinations of different tasks and averaged the number of iterations needed to escape the plateau. Under the same combinations of tasks, we re-ran the experiments with a uniform batch size of . Again, we observe that the plateau-shortening effect persists.
| Number of tasks | Sparse Parity(2) | Sparse Parity(3) | Linear Regression | Quadratic Regression | Sparse Linear Regression |
|---|---|---|---|---|---|
| 1 | 9.0k (12.1k) | 13.1k (14.2k) | 16.5k (18.4k) | ||
| 2 | 70.1k (89.7k) | 93.3k (110.0k) | 8.3k (10.7k) | 11.1k (21.3k) | 9.6k (12.3k) |
| 3 | 10.0k (10.9k) | 12.2k (20.8k) | 5.9k (7.8k) | 6.3k (6.5k) | 6.7k (8.6k) |
| 4 | 4.9k (5.7k) | 6.1k (15.4k) | 4.0k (4.7k) | 4.3k (6.5k) | 4.3k (5.0k) |
| 5 | 4.0k (4.9k) | 5.1k (11.7k) | 3.5k (4.3k) | 4.0k (6.6k) | 3.5k (4.4k) |
Caption: We use batch size . If tasks are mixed, each task has samples.
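As a note on measurement, one possible way to operationalize "iterations to escape the plateau" from a logged loss curve is sketched below; this is an illustrative heuristic only, not necessarily the criterion used in the paper or in the tables above.

```python
import numpy as np

def plateau_escape_iteration(losses, window=100, drop_frac=0.5):
    """Return the first (smoothed) step at which the loss falls below a fixed
    fraction of the estimated plateau level; None if it never escapes."""
    losses = np.asarray(losses, dtype=float)
    smoothed = np.convolve(losses, np.ones(window) / window, mode="valid")
    plateau_level = np.median(smoothed[: 5 * window])   # early-training level
    below = np.nonzero(smoothed < drop_frac * plateau_level)[0]
    return int(below[0]) + window - 1 if below.size else None
```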
We sincerely thank all the reviewers for their invaluable feedback. We were happy to see that the reviewers overall agreed that our presented main observation is interesting. We have addressed the individual concerns in the individual responses.
In this common response, we reiterate the main value of our paper. Our discovery shows that training on diverse ICL tasks is more effective than training on a single ICL task, from an optimization perspective. This contrasts with prior notions such as "knowledge transfer" in multi-task learning, which is a statistical perspective. In other words, it is known that multi-task training leads to better/more robust features and therefore better generalization after training has been completed. However, our results concern the dynamics during training. We believe that the observation that multi-task training reduces training time is novel.
In relation to LLMs, pre-training via next-token prediction on a diverse corpus can be regarded as a highly diverse multi-task training. In light of our observations, we argue that the success of LLM training may be attributed not only to the statistical richness of the multi-task training data but also to the improved optimization dynamics induced by the diversity of the pre-training corpus.
We ask the reviewers and the area chair to consider whether our results are novel, valuable, and convincing. We argue that our presented results do have such qualities.
Summary:
The paper investigates the effects of multi-task learning on the training speed of individual tasks when training models with In-Context Learning (ICL) tasks (synthetic tasks in the paper). Surprisingly, the results indicate that training on multiple diverse ICL tasks concurrently can accelerate the learning process for individual tasks compared to training on them individually. This approach shortens the loss plateaus typically observed during training, contrary to the expectation that increased task complexity would prolong learning. The study suggests that task diversity in training not only aids optimization but also contributes to the efficiency of training Transformer models by facilitating faster convergence on individual tasks. However, the experiments in this paper are not conducted on language model tasks but are limited to a few categories of synthetic tasks.
Strengths:
- The study provides valuable insights into the learning dynamics of language models when trained with a single ICL task vs. multiple ICL tasks.
- The paper presents extensive experiments on toy and synthetic tasks, whose detailed setup and analysis of multiple tasks and models enhance the paper's credibility.
- The presentation and writing are of high quality, making the content easy to follow.
Weaknesses:
- The original draft applied a much larger (K times, for K tasks) batch size for multi-task learning than for single-task learning, while the loss curves merely show how the loss changes with the number of batches (iterations). This makes the shortened loss plateau a questionable and misleading claim, since multi-task learning consumes many more training tokens than single-task learning at the same number of iterations. The authors corrected this issue in the rebuttal by providing updated curves of loss vs. number of samples, which show smaller speedups from multi-task learning on fewer categories of tasks (a 4-5x speedup on 3/6 task categories and an unbounded speedup on the Parity tasks). While the authors' quick action in correcting the issue within a limited time frame is greatly appreciated, the new result weakens the generalizability of the major claim to some extent.
- A formal analysis of the loss landscape from the optimization perspective is not provided, though a brief discussion of the common structure is provided. The paper lacks a theoretical explanation of the observed phenomenon.
- The analysis in this paper is limited to a few specific categories of synthetic or toy tasks. It is not clear whether the observation can be generalized to other categories of tasks such as the ones on LLMs.
- Solely increasing the task diversity may introduce negative transfer in multi-task learning. It is necessary to discuss the transferability between selected tasks and why/when a negative transfer is not harmful in the proposed setting.
- The paper only studies the shortened loss plateau on the training tasks rather than on unseen tasks, which would be more relevant and better aligned with the interests of foundation model or LLM researchers.
- The experiments applied even weights to multiple tasks. The authors' rebuttal provided experiments using uneven weights on which the conclusion still holds. More comprehensive experiments of different uneven weights can strengthen the conclusion further.
Decision:
The authors provided detailed clarifications and various additional experimental results in the rebuttal. Three out of the four reviewers responded to the rebuttal and confirmed that some concerns had been addressed promisingly. However, they also have some remaining concerns that cannot be fully addressed by the authors' responses. One reviewer raised the rating but also lowered the confidence, leading to two 3s and two 6s. The meta-reviewer carefully read the paper and all the discussions. While the observation that multi-task learning speeds up learning is interesting and relevant to foundation models such as LLMs, the limitation of experimental settings and tasks in this paper still leads to gaps in achieving a general conclusion. Moreover, a more rigorous and in-depth theory behind the observation is preferred, especially in explaining the difference between existing theories of multi-task learning and transferability. The experiments and analysis provided by the authors in the rebuttal and discussion can greatly strengthen the paper if a more complete version can be included. Although the current draft is still not ready for publication due to the above reasons, the meta-reviewer believes an improved version can present an important contribution to the community.
Reject