Sloth: scaling laws for LLM skills to predict multi-benchmark performance across families
We propose a scaling law for LLM skills (e.g., reasoning, instruction following, etc.).
Abstract
Reviews and Discussion
This paper focuses on predicting the performance of downstream tasks across LLM families and benchmarks by leveraging their intrinsic interactive structures. Specifically, it applies factor analysis models from Economics to explore low-dimensional latent skills (e.g., reasoning and instruction following) as key predictors of performance. Extensive experiments demonstrate the effectiveness of the proposed model compared with a set of baselines.
Strengths
- The idea of adopting factor analysis models in Economics to analyze multi-benchmark performance across model families is intriguing.
- The authors provide the theoretical justification to consolidate the proposed method.
- The authors conduct experiments across several downstream tasks to validate the effectiveness of the proposed method.
Weaknesses
The paper is well-structured and effectively conveys its main points. How to understand and predict multi-benchmark performance is an important problem within the domain of Large Language Models (LLMs), and the authors propose a corresponding method to address it. I have the following suggestions to further improve the manuscript:
- The authors conduct experiments using only three open-source dense LLMs, showing competitive performance on several benchmarks. Recently, sparse Mixture-of-Experts (MoE) LLMs have achieved impressive results across various tasks and represent an essential model family in the LLM landscape. The authors should address how the proposed method could be adapted for MoE architectures, given their distinct structural characteristics compared to dense models. Additionally, the authors should perform experiments with sparse MoE LLMs (e.g., Mixtral [1], DeepSeekMoE [2]) to comprehensively assess the predictive abilities of multi-benchmark performance across LLM families.
- As illustrated in Figure 5, reasoning ability is primarily influenced by model size rather than the number of training tokens. I recommend that the authors discuss how their model might be extended or modified to account for the effects of instruction tuning separately from pre-training. Recent studies, including Skywork-Math [3] and OpenMathInstruct-2 [4], demonstrate a clear scaling law with instruction data, showing that model performance significantly improves as the amount of supervised fine-tuning (SFT) data increases. This finding appears to contradict the observations reported in the current experiments. Therefore, I suggest that the authors address any limitations in their current approach that might explain this discrepancy in findings compared to recent studies on instruction data scaling.
[1] Jiang A Q, Sablayrolles A, Roux A, et al. Mixtral of Experts. arXiv preprint arXiv:2401.04088, 2024.
[2] Dai D, Deng C, Zhao C, et al. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. arXiv preprint arXiv:2401.06066, 2024.
[3] Zeng L, Zhong L, Zhao L, et al. Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models--The Story Goes On. arXiv preprint arXiv:2407.08348, 2024.
[4] Toshniwal S, Du W, Moshkov I, et al. OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data. arXiv preprint arXiv:2410.01560, 2024.
Questions
- Could you involve a broader range of LLMs, specifically sparse MoE LLMs like Mixtral and DeepSeekMoE, to provide a more holistic demonstration of the proposed method's effectiveness?
- See the second point in the Weaknesses section.
Thank you for your time and dedication to our paper! We have responded to your concerns below and have updated the paper to accommodate suggestions by reviewers. Please let us know if you have further questions.
"The authors conduct experiments using only three open-source dense LLMs": We use 30 LLM families in our paper; please check Section 4.1 for more details.
Inclusion of MoE models: We do not include MoE models because of data limitations. For example, training tokens for Mixtral models are unknown and DeepSeekMoE is only available at one size on HuggingFace.
Post-training factors (SFT, RLHF, etc.): Our findings do not contradict the referenced papers, for the following reason. In our analysis, the number of tokens and parameters are only related to pre-training. All the post-training factors, including more training tokens during SFT, have their effect taken into account by the family-specific intercepts, i.e., the $\alpha_i$'s. For example, if a certain family went through a post-training procedure that made these models strong in reasoning, their values of $\alpha_i$ in reasoning will be relatively large (recall that we consider base and instruct models to be from different families). We made this point clear in Section 3.1.
With the discussion period concluding in a few days, we want to ensure we address any remaining questions or concerns you might have about our paper. During the rebuttal process, we have updated the manuscript to (i) clarify how our work differs from Ruan et al.'s and highlight that they do not explore performance prediction from compute (see lines 134-139 and 471-475 in the revised paper), (ii) include a list of the models and families used (Appendix F), and (iii) emphasize that the intercept accounts for hidden factors such as post-training adjustments (lines 184–187). For a detailed explanation of the rationale behind our family-specific intercept, please refer to our response to reviewer DNEX.
Thanks for the authors' response. I have no further questions. After reading the comments from other reviewers and considering the current version of the paper, I will maintain my evaluation score.
The paper presents a novel approach to model LLM performance on different benchmarks starting from the number of training tokens and model size. This "scaling law" can be fitted using existing benchmark results across various LLMs of different families. They assume some of the parameters of the scaling law are shared across model families, while others are family-specific. Previous work assumed all parameters are either shared or family-specific, which can be respectively too unrestrictive or lead to too many parameters, and thus impossible to fit with limited observation data. Moreover, in contrast to previous work, they assume that model size and number of training tokens independently affect performance, and they also have the option of learning the shape of the sigmoid function used to transform the latent skill scores into benchmark performance (modelled via a neural network). The experiments included in the paper confirm the strong predictive power of the obtained scaling law.
Strengths
originality
- The introduced scaling law has elements of novelty over previously presented ones, in particular those in Ruan et al 2024 and Owen 2024.
- The latent skills interpretation is interesting.
quality
- The experiments presented in Section 4 are extensive.
clarity
- The text is well-written
significance
- Leveraging information across model families is indeed beneficial to predict performance for families for which a few observations are available.
- Removing the assumption that model size and training tokens matter only through their product is meaningful.
Weaknesses
- I don't think the paper gives a proper characterisation of Ruan et al 2024. In particular, the introduction and abstract seem to claim that assuming "LLM performance is driven by low-dimensional latent skills [...] influenced by computational resources" is a significant novelty of the paper, while that was actually the key value proposition of Ruan et al. 2024. Moreover, Section 2.2 claims that Ruan et al only uses two parameters per model family directly connecting compute to observed performance; however, my understanding of the method in Ruan et al. 2024 is that it is closer to what is presented in this paper in 3.1, i.e., Ruan et al. also learns a set of low-dimensional capabilities for each LLM, which is then transformed into model performance using "loading factors" specific to each benchmark.
- The paper could provide more clarifications on the choices made to come up with the model in Sec 3.1, see questions below.
- The notation in Sec 3.2 is confusing (see questions below). Moreover, it is unclear how much assumption 3.1 is realistic or verifiable.
- Some of the choices related to how the results are presented in Sec 4 seem arbitrary, as they are not explained (see questions below)
- Finally, I don't understand where the name "Sloth" comes from: I am not a native speaker, but it seems weird to pronounce "SSLaws" as Sloth, even though I see the joke.
Questions
Related to Sec 3.1:
- why was that specific model used in economics taken as inspiration?
- why is the skill slope shared across families and the intercept family-specific, and not the converse? The choice by the authors seems counterintuitive to me, as it says that increases in model size and number of training tokens give the same improvement, and the only thing that changes is the starting value of each skill.
- why does x(s,t) not include log(st)?
Related to Sec 3.2:
- how is the design matrix defined?
- what is the dimension p?
Related to Sec 3.3:
- What is the Huber loss?
Related to Sec 4:
- why does Fig 2 show the average over benchmarks while Figure 1 does not?
- What is the substantial difference that makes their method so that (as stated in Sec 4.4): "Unlike Ruan et al. (2024)’s observational scaling law, Sloth can be used to estimate the latent skills of hypothetical LLMs and then used to predict the performance of those LLMs in downstream tasks"? I thought that was also possible for Ruan et al, as long as the model belongs to a family for which a few other models were already observed, which seems to be a necessary condition for the method in this paper too.
Thank you for your work on our paper! We have responded to your concerns below and have updated the paper with some of your suggestions. Please let us know if you have any questions.
Regarding Ruan et al (2024): The ideas proposed by Ruan et al (2024) cannot be used to predict skills or performance from compute, and we consider that to be the main difference between our papers; see below for a detailed explanation. As we comment in our Section 3.1, LLMs' low-dimensional latent skills have been studied before by many papers, but we are the first to propose and successfully fit a scaling law for those skills. The focus of Ruan et al (2024), for example, is instead building a scaling law for complex downstream tasks from latent skills.
In this paragraph, we explain why Ruan et al (2024)'s scaling law cannot be used to predict skills or performance from compute, implying a major difference from our work. The observational scaling law proposed by Ruan et al (2024) is used to predict the performance of LLMs in complex downstream tasks from their PC scores (extracted from observed benchmark data); please check their Section 3.4 for more details. This implies that, for example, to predict the performance of LLaMa-3-70B on a certain downstream task, they first have to observe the scores of LLaMa-3-70B on a set of benchmarks. That is, it is not possible to predict the performance of LLaMa-3-70B on downstream tasks using their approach before the LLM is released (i.e., making predictions from compute); this is one of the gaps we attack in our paper. This limitation was confirmed by Tatsunori Hashimoto in a conversation we had at ICML 2024 (T. Hashimoto, personal communication, July 2024). Moreover, based on Hashimoto's suggestion, we tried to extend their method to make performance prediction from compute possible; as we show in Figure 1, that approach ("PCA+FLOPs") did not work well.
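For concreteness, here is a minimal sketch of how the "PCA+FLOPs" extension can be set up (all variable names, shapes, and the toy data below are ours, purely for illustration; this is not the code used in the paper):

```python
# Sketch of the "PCA+FLOPs" baseline: extract low-dimensional PC scores
# from observed benchmark results (as in Ruan et al 2024), then regress
# them on log-compute so a new model's PCs -- and hence its benchmark
# performance -- can be predicted before the model is evaluated.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
scores = rng.uniform(0.2, 0.9, size=(40, 12))   # 40 LLMs x 12 benchmarks (toy data)
log_flops = rng.uniform(20, 25, size=(40, 1))   # log10 training FLOPs per LLM (toy)

pca = PCA(n_components=3)
pcs = pca.fit_transform(scores)                 # low-dimensional "capabilities"

# Linear map from compute to PC scores: the step that makes
# prediction-from-compute possible at all.
compute_to_pcs = LinearRegression().fit(log_flops, pcs)

# Predict benchmark scores of an unreleased model from its planned compute.
new_pcs = compute_to_pcs.predict(np.array([[24.5]]))
pred_scores = pca.inverse_transform(new_pcs)    # shape (1, 12)
```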
Section 3.1:
- Economics model: We use that specific formulation for two main reasons. First, it is a generalization of models previously used in the scaling law literature (e.g., the model used by Owen (2024)). Second, it accounts for the interaction term $\log(s)\log(t)$, which we found to work well (check our Appendix D), and incorporates family effects (intercepts) in a parsimonious manner.
- Family-specific intercept: We use family-specific intercepts and not slopes mainly for a practical reason: our model has three covariates in $x(s,t)$, so it would be impossible to fit a scaling law with family-specific slopes without observing many models from the same family in our training set. The family-specific intercept ends up being a parsimonious way of dealing with models produced with better technology (e.g., algorithms, data formatting/quality) and has good predictive power, as we see from Figures 1 and 2. A minimal sketch of this parameterization appears after this list.
- "why does x(s,t) not include log(st)?": it does, implicitly, since $\log(st) = \log(s) + \log(t)$.
Sections 3.2 and 3.3:
- Assumption 3.1: This assumption is only used to describe to what degree the model's parameters are identifiable. This type of assumption is widely used and accepted in the field of Psychometrics (please see [1], for example).
- Design matrix and $p$: the design matrix is the matrix of covariates specified in Eq 3.2 concatenated with a matrix of zeros and ones (to account for the family intercepts). Each row of the design matrix represents a different LLM, and $p = 3 + \#\text{families}$. We made this explicit in Section 3.2.
- Huber loss: this loss is a smoothed version of the $\ell_1$ loss; please check https://pytorch.org/docs/stable/generated/torch.nn.HuberLoss.html. We reproduce its definition below for reference.
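Concretely, for a residual $r$ and threshold $\delta$ (1 by default in PyTorch), the Huber loss is
$$
\ell_\delta(r) = \begin{cases} \frac{1}{2} r^2, & |r| \le \delta,\\ \delta\left(|r| - \frac{1}{2}\delta\right), & |r| > \delta, \end{cases}
$$
which is quadratic for small residuals and linear (like the $\ell_1$ loss) for large ones, making the fit robust to outlying benchmark scores.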
Section 4:
- Figure 1 vs Figure 2: Both figures show the average column. We did not display the full Figure 2 in the main text because we thought it was redundant and we lacked space. However, its full version can be found in Appendix F.
- Ruan et al (2024): Please check the response "Regarding Ruan et al (2024)" above.
References
[1] Yunxiao Chen, Xiaoou Li, and Siliang Zhang. Joint maximum likelihood estimation for high-dimensional exploratory item factor analysis. Psychometrika, 84:124–146, 2019.
I thank the authors for their response. A couple of further comments on my side below.
Comparison to Ruan et al 2024: I am unconvinced by the explanation as to why Ruan et al cannot be used to predict performance from compute. Indeed, in their section 3.4, after Eq 7, they discuss how, from the inferred values of capabilities for a few models in a considered family, they can learn coefficients of a linear relation mapping those to a training compute metric. Thus, that relation can be used to predict the coefficients for a new model of the same family, starting from its training compute. Even if Ruan et al did not do this in their paper, I believe it is incorrect to say that their method cannot do so. Indeed, I believe what I described is extremely similar to the PCA+FLOPs approach which the authors implemented following Ruan et al.
This point does not impact the strengths of the approach which I identified, chiefly that the proposed method is more general than those previously proposed.
Family-specific intercept: I understand the technical point, but the resulting model still seems unintuitive to me. If one has more data, I believe having family-specific slopes makes more intuitive sense, even though that would make the scaling law closer to that of Ruan et al. It would be interesting to see how such a scaling law would perform in practice for an increasing number of data points (i.e., performance of more LLMs).
Dear reviewer DNEX, thank you for your prompt response! We elaborate more on some points below and hope to reach an agreement. Please let us know what you think or if you have more questions/points to discuss.
About Ruan et al (2024): We agree that the approach by Ruan et al (2024) could be adapted to make performance predictions from compute. The approach “PCA+FLOPs” represents exactly that but, as our experiments in Figures 1 and 2 show, it does not work well. Following your suggestion, we have rephrased references to Ruan et al to acknowledge that their method can be applied to predicting the performance of larger LLMs, but it was not explored in their paper (see lines 134-139 and 471-475 of the updated paper).
Intuition behind our formulation with family-specific intercepts: As discussed in the text, this type of model has long been utilized in Economics to assess the efficiency of firms and their production outputs, so our interpretation should directly connect with that. One way to make the model more intuitive is to interpret the intercept as a measure of efficiency, which directly influences how the input levels $s$ and $t$ translate into outputs. Specifically, if a particular family has a high $\alpha_i$, the inputs $s$ and $t$ will yield a bigger impact on performance compared to a model with a lower $\alpha_i$. To illustrate this, let us consider the performance of LLMs on a specific benchmark (let us forget about the latent-skill structure for now) and assume that the interaction term $\log(s)\log(t)$ is omitted from the model. In this simplified scenario, the model can be expressed as:
$$
\sigma(\alpha_i + \beta^\top x(s, t)) = \frac{e^{\alpha_i + \beta_1 \log s + \beta_2 \log t}}{1 + e^{\alpha_i + \beta_1 \log s + \beta_2 \log t}} = \frac{e^{\alpha_i}s^{\beta_1}t^{\beta_2}}{1 + e^{\alpha_i}s^{\beta_1}t^{\beta_2}} = \frac{A_i s^{\beta_1}t^{\beta_2}}{1 + A_i s^{\beta_1}t^{\beta_2}},
$$
where $A_i = e^{\alpha_i}$ represents the efficiency of family $i$. This formulation highlights how the impact of the inputs $s$ and $t$ on performance depends on the value of $\alpha_i$.
Including family-dependent slopes: We present a comparison of our method to prior scaling laws (Ruan et al and Owen) that utilize family-specific slopes in Figure 2 (see also lines 366-375 for a discussion). Our results demonstrate that family-specific slopes are detrimental to performance. In the case of our scaling law, to be able to fit a good model with family-specific slopes, the number of models of different sizes and training tokens per family needs to be high (at least 4 in our case, which holds for only a single family). As one of our goals is to allow practitioners to make decisions regarding investing resources into training large LLMs based on their results with a smaller LLM, a scaling law requiring 4 models per family would be impractical. We emphasize that an important advantage of our scaling law is the ability to utilize evaluation data across both LLM families and benchmarks, which benefits from parameter sharing. It is possible to fit our scaling law with family-specific slopes and some regularization for identifiability, but we do not expect it to perform well. However, if such an experiment would help to address your question, we are happy to try it. Please let us know.
With the discussion period concluding in a few days, we want to ensure we address any remaining questions or concerns you might have about our paper. During the rebuttal process, we have updated the manuscript to (i) clarify how our work differs from Ruan et al.'s and highlight that they do not explore performance prediction from compute (see lines 134-139 and 471-475 in the revised paper), (ii) include a list of the models and families used (Appendix F), and (iii) emphasize that the intercept accounts for hidden factors such as post-training adjustments (lines 184-187).
In this paper, the authors propose fitting scaling laws on existing benchmark data for various LLM families, utilizing results from OpenLLM v1 and v2. Unlike prior work on LLM scaling laws, which generally explores scaling across parameters and data, this paper primarily focuses on latent skills, such as reasoning abilities, that can be simultaneously evaluated by multiple benchmarks like GSM8K and MATH. In contrast to previous works on benchmark scaling laws (e.g., Owen, 2024; Ruan et al., 2024), which fit Equation 2.2 using the number of training tokens and parameters, this study introduces a lower-dimensional assumption to reduce the number of parameters required for fitting. The final objective function is presented in Section 3.3. Empirical results indicate that the proposed method, SLOTH, demonstrates improved predictive accuracy in terms of prediction error.
Strengths
Overall, the paper is well-structured and easy to follow. The authors propose a model mapping s (LLM size) and t (number of training tokens) to benchmark performance. Additionally, they employ several techniques to reduce the model’s parameter count while maintaining flexibility. The experimental results appear promising.
Weaknesses
I have several concerns:
- If an LLM's skill is knowledge-based, I agree that the proposed model would likely offer good predictive accuracy, as such skills depend on training tokens and model size. However, when it comes to reasoning skills (focusing specifically on mathematical reasoning), my experience suggests that these skills rely heavily on post-training factors, such as the amount of reinforcement learning compute during fine-tuning. Such hidden factors are difficult to capture using the model's inputs alone, and to my knowledge, they are not commonly reported by open-source models.
- It appears that some LLM benchmarks may be contaminated, whether intentionally or not. This raises the question: can we fully trust benchmark values as indicators of skill?
- Finally, how can this scaling law be practically applied? Does it offer guidance for pre-training or post-training phases?
Questions
See my comments in the Weaknesses section.
Thank you for your dedication to our paper! We have responded to your concerns below and have updated the paper to accommodate suggestions by reviewers. Please let us know if you have any questions.
Post-training factors: We agree that post-training factors are not captured by our measures of compute, $s$ and $t$. However, our formulation guarantees that such hidden factors are taken into account by the family-specific intercepts, i.e., the $\alpha_i$'s. For example, if a certain family went through a post-training procedure that made these models strong in reasoning, their values of $\alpha_i$ in reasoning will be relatively large (recall that we consider base and instruct models to be from different families). Your observations are reflected in Figures 5 and 6. For example, from our text: "Contrasting with Reasoning, Figure 5 shows that Knowledge is highly influenced by both model size and number of training tokens. Moreover, we can see that the range of standard deviations in the middle plot is much greater than in the other two plots, giving us evidence that this skill might be more sensitive to increases in compute resources and less dependent on the LLM families themselves" (verbatim). In Figure 6, we show that instruction following is highly influenced by instruction tuning when compared to reasoning and knowledge. Thank you for your observation! We have included some discussion regarding post-training in Section 3.1.
Contamination: You are right that some LLMs can have their training contaminated with benchmark data. Fortunately, the Open LLM Leaderboards (v1/v2) have an indicator of contamination, which can be used to filter contaminated models. Like all papers in the scaling law literature for LLM benchmarks, we are assuming no contamination.
Practical applications of our model: Our scaling law can guide practitioners in many instances. For example: (i) choosing the right size of the next (to-be-released) pre-trained models for a desired level of performance on benchmarks of interest, (ii) allocating resources between training tokens and number of parameters (e.g., from Fig 5, if a practitioner wants a model with better reasoning, it might be better to invest in bigger models with less training data vs smaller models with more training data), and (iii) predicting the performance of hypothetical LLMs on complex downstream tasks, such as coding. We added some extra sentences at the beginning of Section 4 regarding applications explored in the experiments.
With the discussion period concluding in a few days, we want to ensure we address any remaining questions or concerns you might have about our paper. During the rebuttal process, we have updated the manuscript to (i) clarify how our work differs from Ruan et al.'s and highlight that they do not explore performance prediction from compute (see lines 134-139 and 471-475 in the revised paper), (ii) include a list of the models and families used (Appendix F), and (iii) emphasize that the intercept accounts for hidden factors such as post-training adjustments (lines 184-187). For a detailed explanation of the rationale behind our family-specific intercept, please refer to our response to reviewer DNEX.
The paper proposes a novel scaling law to predict the performance of various LLM families on multiple benchmarks. The authors borrow methodologies from Exploratory Factor Analysis to improve a previously proposed scaling law. Factor Analysis allows the model to find interpretable latent factors to help the scaling model generalize. Furthermore, a learnable activation function is used instead of the sigmoid, leading to an improved fit.
The experimental results reported in the article show:
- The proposed scaling law is as performant as previously proposed approaches.
- The proposed scaling law is able to extract (possibly) explanatory factors for the various benchmark results.
Strengths
The proposed method uses an interesting mixture of scaling laws and interpretability to build an improved benchmark prediction model. The use of a latent subspace "identifying" key skills for solving tasks is similar to what was proposed in Ruan et al, but the usage of exploratory factor analysis allows for a more insightful understanding.
Weaknesses
- The paper presentation can sometimes be confusing and/or unpolished. Some terms are used without a proper introduction (e.g., interaction term), hindering the overall article. In general, I feel that the overall presentation could be improved.
- To my knowledge, the model families seem not to be clearly stated in the main paper; they can only be inferred from looking at Figure 17 in the appendix. How are models from the leaderboard categorized into families? Furthermore, where are model size and training token information taken from?
- The use of a neural network to model the activation function σ can lead to overfitting and poor generalization, especially considering the few data points used for training (to my understanding, it is trained on benchmark evaluations). I would have preferred a more focused experiment showing whether using a trainable activation can lead to worse generalization.
- I find the tables, plots, and in general the experiments to be somewhat lacking. In previous work, to show the capabilities of their respective methods, Ruan et al and Owen make use of plots showcasing the extrapolation capabilities of their models (e.g., Figure 4 in Ruan et al). I think the proposed work would benefit from more of these visualizations (some are shown in the appendix, but are otherwise not included in the main discussion).
Questions
- Why was logistic regression used for Figure 7 instead of the learned activation σ? Is there something that I am missing?
- In line 137 and 138 you state that "(Ruan et al) is not well-suited for performance prediction from compute" why is that the case? can you expand on this?
- How are the level curves in Figure 5 computed? Can you give an intuitive explanation?
Thank you for your work on our paper! We have responded to your concerns below and have updated the paper with some of your suggestions. Please let us know if you have any questions.
Comments about your summary of our paper: We believe two statements in the summary of our work are inaccurate and would like to clarify them below:
- "The proposed scaling law is as performant as previously proposed approaches": this sentence understates the performance prediction of our method; as we see in Figure 1, our methods offer much better prediction results when compared with baselines.
- "(...) to improve a previously proposed scaling law": we do not only improve previous scaling laws but propose a model that can model the scaling of abstract LLM skills using data from multiple benchmarks and LLMs, which is something no one proposed before.''
Concerns about presentation: We worked (and will work) hard to make our paper as polished and readable as possible, which seems to be the impression that other reviewers got from our work. Please let us know which points you think are not polished enough and we will update the paper with your suggestions. We note that "interaction term", for example, is an expression frequently used in regression analysis within machine learning and statistics; in spite of that, we included more explanation in the text.
Concerns about model families: We have included Table 1 in Appendix F detailing all families and models used in our paper. We consider two models to be in the same family if the (main) difference between them is the number of parameters; please check Table 1 for precise categorization. Model sizes and training token information were taken from the original papers that released the models, their Hugging Face model cards, and data collected by Ruan et al (2024); our data collection effort resulted in a more comprehensive study when compared to previous works.
Concerns about overfitting and generalization: In our paper, we show that training the activation function does not lead to poor generalization. In fact, from Figure 1 we see that the neural net model ("Sloth") performs better than any other option in terms of lower test error, including the sigmoid model ("Sloth basic"). In this experiment, the test error is computed using a cross-validation approach; that is, for each test LLM family (e.g., Qwen 2) we: (i) include only the smallest model in the training set (e.g., Qwen 2 0.5B), (ii) fit the scaling laws using all the LLMs in the training set, (iii) predict the performance of larger models from that family (e.g., Qwen 2 1.5/7/72B) on all benchmarks, and (iv) compute the prediction error by comparing the predictions with the ground truth. We repeat these steps for all test families and average the prediction errors, which are the numbers reported in Figure 1. To make our assessment even more realistic, we do not test older versions of recent families if they are available in the training set, e.g., we do not test on LLaMa 2 models when LLaMa 3 is present in the training set.
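In pseudocode, this protocol looks roughly as follows (a sketch only; `fit_scaling_law`, the model objects, and `error` are placeholders, not the actual implementation):

```python
# Leave-one-family-out cross-validation over LLM families, following
# steps (i)-(iv) above. Each family holds a list of models with .size,
# .tokens, and .scores attributes (hypothetical data structures).
def cv_error(families, fit_scaling_law, error):
    errs = []
    for test_family in families:
        smallest, *larger = sorted(test_family.models, key=lambda m: m.size)
        train = [m for f in families if f is not test_family for m in f.models]
        train.append(smallest)                   # (i) keep only the smallest test model
        law = fit_scaling_law(train)             # (ii) fit on all training LLMs
        preds = [law.predict(m.size, m.tokens) for m in larger]    # (iii) predict larger models
        errs.append(error(preds, [m.scores for m in larger]))      # (iv) compare to ground truth
    return sum(errs) / len(errs)                 # average over test families
```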
Concerns regarding plots and tables: In the main text, we include two plots like the ones you are suggesting in Figure 7. For the other results, we do not think this type of plot is the most suitable; the reason is that we run a higher number of experiments than other papers, and the heatmaps help us summarize our results better. For example, to show the predictive power of our method (8 variations) compared to the other 7 baselines across 12 benchmarks, a heatmap is more suitable; if we chose to go with the "extrapolation" plots, at least 15x12 = 180 plots would be needed, which is not informative for the reader. If we consider that each one of the test families needs a different plot (due to cross-validation), 19x180 = 3420 plots are needed just for Open LLM Leaderboard v1.
"(Ruan et al) is not well-suited for performance prediction from compute": The observational scaling law proposed by Ruan et al (2024) is used to predict the performance of LLMs in complex downstream tasks from their PC scores (extracted from observed benchmark data); please check their Section 3.4 for more details. This implies that, for example, to predict the performance of LLaMa-3-70B in a certain downstream task, they first have to observe the scores of LLaMa-3-70B in a bunch of benchmarks. That is, it is not possible to predict the performance of LLaMa-3-70B in downstream tasks using their approach before the LLM is released (i.e., making predictions from compute); this is one of the gaps we attack in our paper. This limitation was confirmed by Tatsunori Hashimoto in a conversation we had at ICML 2024 (T. Hashimoto, personal communication, July 2024). Moreover, based on Hashimoto's suggestion, we tried to extend their method to make performance prediction from computing possible; as we show in Figure 1, that approach ("PCA+FLOPs") did not work well.
I thank the authors for the detailed response and apologize for my late answer.
About my summary. The sentence "(...) to improve a previously proposed scaling law" was due to an understanding that your work builds on the work of Ruan et al. If this is not accurate, it could be useful to rewrite part of Section 3 to better express how you differentiate from them.
About my weakness n.4: I probably should have been more specific: in general, I would have appreciated an increased number of plots showing the extrapolation capabilities of your scaling law. I understand the difficulties in expressing all the content from Figure 1, but I was curious about the behavior of your activation function. This could be done by fixing a specific task and subset of families, thus providing a more qualitative intuition of your approach. Indeed, these plots might then be used to extract useful insights and/or a more intuitive visualization of your scaling laws. Another important factor to consider regarding your plots is that figures tend to have captions that are sometimes too short and not descriptive. In my opinion, Figures 5, 6, and 7 suffer from this problem.
About comparing with different scaling laws: A useful experiment for the paper is a comparison with previous scaling laws (Ruan et al. and Owen) in their respective training settings. For example, with Ruan et al., it would be interesting to see how much their approach differs from yours when compute is available. I understand that this experiment might be difficult to perform due to the necessary computational requirements; however, it could strengthen the article.
As a final comment, I would like to say that the proposed scaling law might be a useful contribution to the LLM community, for example in predicting which hyper-parameters and/or families to utilize for a given task. However, as it currently stands, I believe that the paper still needs major improvements in its presentation.
Dear reviewer, thank you for your suggestions! We have updated the paper accordingly and included details in the general response below. We hope to solve the issues you brought up and explain the changes in the paper, but please do not hesitate to contact us for further discussion!
Presentation: Thank you for your comments. We have updated the captions of Figures 5, 6, and 7, making them more informative. We are open to more suggestions and promise we will continue polishing the presentation in the next version of the paper based on them.
About your summary: Besides the different model formulations (ours vs previous work), the main difference when compared with Owen (2024) and Ruan et al (2024) is that these works consider different setups; we have included a discussion about this at the end of Section 2. In summary, Owen (2024) does not use family information at prediction time, making their scaling law less accurate but more generalizable, while Ruan et al (2024) assume families are important at prediction time but consider that the target model has already been trained, making their scaling law less applicable in practice and more interesting from an interpretability point of view. In our work, we instead wish to predict the performance of a larger LLM without having to train it while taking family information into account, thus allowing practitioners to make decisions regarding investing resources into scaling their training recipes to larger LLMs based on the performance of smaller LLMs. Moreover, our formulation also allows interpretable insights from the data. Despite the different setups, we made comparisons with Owen (2024) and Ruan et al (2024) throughout our work by considering adaptations of their methods as baselines in our setting and comparing our method to theirs in their settings. We also reformulated our writing in lines 134-139 and 471-475 to better position our work in the context of previous research.
About your weakness n.4: We have included an extra analysis in Appendix J exploring the behavior of our two link functions and comparing our approaches with two main baselines. In summary, we see that: (i) training the link function can produce a much more flexible scaling law that predicts performance saturation better than other methods (e.g., see results for Yi-1.5 in ARC, HellaSwag), (ii) training no family-specific parameters at all ("FLOPs (shared intercept)" from Owen (2024)) usually produces poor prediction results, and (iii) PCA+FLOPs (the method derived from Ruan et al (2024)) often produces flatter curves that underestimate the performance of bigger models, e.g., see results for Yi-1.5 in TruthfulQA, GSM8k, and MMLU.
About comparing with different scaling laws in their own setup: In our original version of the paper, we compared our scaling law with Owen (2024)'s in their own setting by not including any family-specific parameter in our model, which lets the scaling laws generalize to arbitrary families. Please check Figures 3 and 11. In summary, we show that our method excels in their setting when compared to the baseline. In the new version of the paper, we have included a comparison with Ruan et al (2024) in their setting in Appendix K. We have adapted our initial downstream task performance prediction experiment to Ruan et al (2024)'s setup by including the target model's (LLaMa-3-70B) performance on the leaderboard benchmarks in the training data. In summary, both methods behaved similarly regarding out-of-sample prediction, despite our method fitting better to the data overall. We will include more comparisons for the camera-ready version of the paper.
Concerns regarding Figure 7: We use the logistic link for downstream performance prediction because the intersection of our main dataset and the code completion/emotional intelligence datasets is quite small, i.e., there is not enough data (model performances) to fit link/activation functions for these benchmarks (recall that our scaling law allows each benchmark to use a different link function for predicting performance from the latent abilities).
Level curves: For each one of the skills, the level curves are obtained from the function $\beta^\top x(s,t)$ from Eq 3.2. Here, $\beta^\top x(s,t)$ represents the level of the skill for a certain LLM with covariates $x(s,t)$, discounted by the family-specific intercept term.
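As an illustration of how such curves can be produced (with a made-up slope vector, not the fitted values from the paper):

```python
# Level curves for one skill: contour lines of beta^T x(s, t) over the
# (log s, log t) plane. beta is an illustrative placeholder for the
# fitted slope vector of a single skill.
import numpy as np
import matplotlib.pyplot as plt

beta = np.array([0.8, 0.3, 0.05])                # illustrative fitted slopes
log_s, log_t = np.meshgrid(np.linspace(-1, 5, 100), np.linspace(-1, 4, 100))
skill = beta[0] * log_s + beta[1] * log_t + beta[2] * log_s * log_t

plt.contour(log_s, log_t, skill, levels=10)
plt.xlabel("log model size (s)"); plt.ylabel("log training tokens (t)")
plt.title("Level curves of the skill score")
plt.show()
```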
With the discussion period concluding in a few days, we want to ensure we address any remaining questions or concerns you might have about our paper. During the rebuttal process, we have updated the manuscript to (i) clarify how our work differs from Ruan et al.'s and highlight that they do not explore performance prediction from compute (see lines 134-139 and 471-475 in the revised paper), (ii) include a list of the models and families used (Appendix F), and (iii) emphasize that the intercept accounts for hidden factors such as post-training adjustments (lines 184–187). For a detailed explanation of the rationale behind our family-specific intercept, please refer to our response to reviewer DNEX.
We want to thank you all for the discussion and the opportunity to polish our paper.
Several reviewers asked for additional explanations and comparisons to the prior scaling laws, i.e., Owen (2024) and Ruan et al (2024), and we have revised the paper to make this point clearer. Besides the different model formulations (ours vs previous work), the main difference when compared with Owen (2024) and Ruan et al (2024) is that these works consider different setups; we have included a discussion about this at the end of Section 2. In summary, Owen (2024) does not use family information at prediction time, making their scaling law less accurate but more generalizable, and Ruan et al (2024) assume families are important at prediction time but consider that the target model has already been trained, making their scaling law less applicable in practice and more interesting from an interpretability point of view. In our work, we wish to instead predict the performance of a larger LLM without having to train it but taking family information into account, thus allowing practitioners to make decisions regarding investing resources into scaling their training recipes to larger LLMs based on the performance of smaller LLMs. Moreover, our formulation also allows interpretable insights from the data. Despite different setups, we made comparisons with Owen (2024) and Ruan et al (2024) throughout our work by considering adaptations of their methods as baselines in our setting and comparing our method to theirs in their settings. We also reformulated our writing in lines 134-139 and 471-475 to better position our work in the context of previous research.
Moreover, we made other main improvements in the paper (blue text) based on your suggestions:
- We have updated the captions of Figures 5, 6, and 7, making them more informative.
- We included more details about our family-specific formulation and interaction term in lines 184-187;
- We have included an extra analysis in Appendix J exploring the behavior of our two link functions and comparing our approaches with two main baselines. In summary, we see that: (i) training the link function can produce a much more flexible scaling law that predicts performance saturation better than other methods (e.g., see results for Yi-1.5 in ARC, HellaSwag), (ii) training no family-specific parameters at all ("FLOPs (shared intercept)" from Owen (2024)) usually produces poor prediction results, and (iii) PCA+FLOPs (the method derived from Ruan et al (2024)) often produces flatter curves that underestimate the performance of bigger models, e.g., see results for Yi-1.5 in TruthfulQA, GSM8k, and MMLU.
- We have included a comparison with Ruan et al (2024) in their setting in our Appendix K. We have adapted our initial downstream task performance prediction experiment to Ruan et al (2024)'s setup by including the target model's (LLaMa-3-70B) performance on the leaderboard benchmarks in the training data. In summary, both methods behaved similarly regarding out-of-sample prediction, despite our method fitting better to the data overall. We will include more comparisons for the camera-ready version of the paper.
We believe the revisions and additional results presented here significantly strengthen our work and adequately address the reviewers’ concerns. We hope our responses and updates are satisfactory, and we kindly request the reviewers to reconsider their ratings. Should there be any further questions or concerns, we would be more than happy to provide additional clarification.
This paper presents an interesting idea of a unified scaling law across model families. Such a scaling law is able to predict the performance of downstream tasks across LLM families and benchmarks by leveraging their intrinsic interactive structures. Specifically, this paper applies Exploratory Factor Analysis to explore "low-dimensional latent skills" (e.g., reasoning and instruction following) of trained LLMs as key predictors of their performance on downstream tasks. Empirical results show that the proposed method demonstrates improved predictive accuracy.
Overall the reviewers think this is a very promising direction, and the authors' insights are generally confirmed by empirical data. However, concerns were raised regarding its difference and claimed novelty relative to Ruan et al. 2024, and whether the trained activation function overfits to produce the performance demonstrated. More thorough analyses are needed before we can be sure of the significance of such a scaling law. We also encourage the authors to work on improving the presentation of the work.
Additional Comments on Reviewer Discussion
A generally healthy discussion. No reviewer was convinced to update their scores.
Reject