EnsemW2S: Can an Ensemble of LLMs be Leveraged to Obtain a Stronger LLM?
Abstract
Reviews and Discussion
The paper contributes to the field of weak-to-strong (w2s) generalization by introducing an EnsemW2S-AdaBoost algorithm that utilizes ensembles of weak supervisors to enhance the performance of stronger language models. It emphasizes the concept of easy-to-hard (e2h) data learning, which is crucial for addressing the challenges posed by difficult-to-label data. Experimental results are provided to illustrate the effectiveness in tasks including binary classification and question-answering.
Strengths
Overall, I believe this research makes significant contributions in terms of idea, design, and community impact. The supplementary materials also provide extensive code, which I hope will be open-sourced for broader review and to enhance reproducibility.
Originality: The authors reasonably enhance the work of Burns et al. by expanding the concept from a single weak supervisor to multiple weak supervisors, thereby increasing both generality and practicality. The adaptation of the AdaBoost algorithm takes into account the autoregressive nature of language models and effectively addresses the challenges of complex token sequence generation tasks, demonstrating diligent design efforts.
Significance: The introduction of the easy-to-hard (e2h) framework, along with the adaptation of AdaBoost, offers valuable insights for practitioners in related fields. These contributions may promote research on super-alignment.
Weaknesses
While the overall idea of this work appears reasonable and promising, some sections lack sufficient clarity and quality. I have a basic understanding of weakly supervised learning and ensemble learning, which makes the explanations regarding the easy-to-hard framework (Section 2) and the experimental demonstrations (Section 5) quite understandable. However, the core section on the adaptation of AdaBoost (Section 3) is somewhat obscure and lacks clear background context.
Furthermore, the experiments are only conducted on simple binary classification and limited QA tasks (ARC and Quartz), which I believe do not provide sufficient support for the claims made. Given that the paper's foundation is super-alignment, it should ideally validate its approach on more challenging benchmarks (like in DataComp-LM or FineWeb). I am not necessarily expecting state-of-the-art performance, but rather a demonstration of how the method performs across a broader range of tasks.
Questions
- In lines 222-224, the prior term and the final equation for $\alpha$ are presented in a specific form. However, I did not find a clear derivation for this formulation. Could you clarify the reasoning behind choosing this particular expression?
- As noted in lines 410-411, "PGR is not very informative, as it can produce extremely large or even negative values." Why, then, does the paper still utilize PGR as a performance metric for binary classification tasks? This seems somewhat contradictory.
- In line 358, the authors state, "We pick the best w2s performing round for our plots." I believe this approach can be misleading. Although Figures 2 and 4 show an apparent improvement in performance from weak-to-strong models compared to weak model performance, Tables 1 and 2 in the appendix indicate that this is not always the case. When AdaBoost employs different values of $T$, the weak-to-strong model sometimes shows improved performance and sometimes declines. This unstable performance variation raises doubts about the true effectiveness of the proposed method. In other words, achieving good performance with a specific combination of weak and strong models requires careful tuning across different values of $T$, and this $T$ is not universal. This reliance on hyperparameter selection suggests that the method may lack generalizability, which contradicts the pursuit of weak-to-strong generalization stated in the paper.
We thank Reviewer vqfu for the great summary and for identifying our vision of easy-to-hard generalization (e2hg) and our AdaBoost adaptation as crucial for practitioners in related fields and for promoting research on super-alignment. We are highly encouraged that the reviewer finds our contributions significant in terms of idea, design, and community impact. We are happy that the reviewer called our design efforts diligent for combining multiple weak supervisors into the autoregressive LLM while addressing the challenges of token sequence generation tasks. We are grateful for the detailed feedback for making our paper better. Below, we address and clarify all of the reviewer's concerns and questions in detail.
Weaknesses: While the overall idea of this work appears reasonable and promising, some sections lack sufficient clarity and quality. I have a basic understanding of weakly supervised learning and ensemble learning, which makes the explanations regarding the easy-to-hard framework (Section 2) and the experimental demonstrations (Section 5) quite understandable. However, the core section on the adaptation of AdaBoost (Section 3) is somewhat obscure and lacks clear background context.
We appreciate your feedback regarding the clarity and quality of our manuscript.
- In response to your comments, we have added additional background information in Section 3 and Appendix Section C to enhance understanding.
- We have made efforts to clarify and add information in Section 3 wherever needed.
- To further improve our manuscript, could you please specify the areas within Section 3 where you feel additional clarity is needed?
We are committed to addressing your comments and enhancing the overall quality of our work. Thank you for your valuable insights.
Furthermore, the experiments are only conducted on simple binary classification and limited QA tasks (ARC and Quartz), which I believe do not provide sufficient support for the claims made. Given that the paper's foundation is super-alignment, it should ideally validate its approach on more challenging benchmarks (like in DataComp-LM or FineWeb). I am not necessarily expecting state-of-the-art performance, but rather a demonstration of how the method performs across a broader range of tasks.
We recognize the importance of evaluating our method on a wider array of tasks to strengthen our claims. We are in the process of conducting additional experiments.
Questions: In lines 222-224, the prior term and the final equation for $\alpha$ are presented in a specific form. However, I did not find a clear derivation for this formulation. Could you clarify the reasoning behind choosing this particular expression?
We have added the intuition behind this term in Appendix Section C4. Please let us know if the reasoning is not clear enough.
The calculation of $\alpha$ cannot rely solely on the error $\epsilon$, as the traditional AdaBoost method is valid only when $\epsilon < 1/2$. Applying the same equation in our context could yield negative values. We introduce a prior term, inspired by multi-class classification AdaBoost works [1], to address this issue.
Existing works on multi-class classification AdaBoost [1] suggest adding $\log(c-1)$ (where $c$ is the number of classes) as the prior term, since $1/c$ represents the random-guess performance of the model. However, when $c$ (the number of classes) becomes very large, the $\log(c-1)$ term also grows significantly, causing the $\alpha$ parameters of AdaBoost to become nearly identical and, consequently, less useful. To address this, we introduce a pre-training error term, $\epsilon_{pre}$, which represents an upper bound on the sample error. We then use $1-\epsilon_{pre}$ (a lower bound on accuracy) as a replacement for the $1/c$ term, as our model's lowest possible accuracy is $1-\epsilon_{pre}$, not $1/c$.
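For concreteness, one way to write the two updates (the notation here is approximate; Appendix C4 gives the exact form). Since $\log(c-1) = \log\frac{1-1/c}{1/c}$, replacing the chance accuracy $1/c$ with the lower-bound accuracy $1-\epsilon_{pre}$ gives

$$\alpha_t^{\text{SAMME}} = \log\frac{1-\epsilon_t}{\epsilon_t} + \log(c-1) \quad\longrightarrow\quad \alpha_t = \log\frac{1-\epsilon_t}{\epsilon_t} + \log\frac{\epsilon_{pre}}{1-\epsilon_{pre}},$$

which is positive exactly when $\epsilon_t < \epsilon_{pre}$, i.e., when the fine-tuned weak learner beats the pre-trained model rather than a $1/c$ random guesser.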
As noted in lines 410-411, "PGR is not very informative, as it can produce extremely large or even negative values." Why, then, does the paper still utilize PGR as a performance metric for binary classification tasks? This seems somewhat contradictory.
We acknowledge the use of Performance Gap Recovery (PGR) as a metric for performance comparison in our study, while also addressing some of its limitations.
- In Section 5.1 (lines 410-411, "PGR is not very informative, as it can produce extremely large or even negative values."), we highlight PGR's sensitivity to extreme scenarios, which can result in significantly large negative or positive values. However, this sensitivity does not invalidate PGR as a reasonable metric for studying w2sg. We believe it is important to share these demerits to inform the community and guide future research in w2sg.
- Additionally, PGR serves as a common standard for comparing our results with previous work in the w2s domain. Given that PGR is currently the only metric established in the literature for this purpose, we have chosen to utilize it in our analysis.
Thus, discussing limitations of PGR in Section 5.1 does not detract from its overall utility as a metric in our research.
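For reference, PGR is defined (following Burns et al.) as the fraction of the weak-to-strong performance gap that is recovered:

$$\text{PGR} = \frac{\text{perf}(\text{w2s}) - \text{perf}(\text{weak})}{\text{perf}(\text{strong ceiling}) - \text{perf}(\text{weak})}.$$

When the strong ceiling is close to the weak performance, the denominator approaches zero, which is precisely what produces the extremely large or negative values noted above.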
In line 358, the authors state, "We pick the best w2s performing round for our plots." I believe this approach can be misleading. Although Figures 2 and 4 show an apparent improvement in performance from weak-to-strong models compared to weak model performance, Tables 1 and 2 in the appendix indicate that this is not always the case. When AdaBoost employs different values of T, the weak-to-strong model sometimes shows improved performance and sometimes declines. This unstable performance variation raises doubts about the true effectiveness of the proposed method. In other words, achieving good performance with a specific combination of weak and strong models requires careful tuning across different values of T, and this T is not universal. This reliance on hyperparameter selection suggests that the method may lack generalizability, which contradicts the pursuit of weak-to-strong generalization stated in the paper.
We acknowledge that careful selection of $T$ can enhance performance. However, our w2s model consistently outperforms the baseline for any $T$ under the random data split. For the easy-hard split, however, performance at a given $T$ may not exceed that of the baseline. The reasons for this observation are as follows:
- Impact of the Easy-Hard Setup on Weak Model Capability: The weak model is trained on easy data but is required to generate labels for hard data intended for the strong model. If the weak model is inherently less capable—due to factors such as model size—training it solely on easy data can exacerbate its limitations. For instance, when using GPT-2 as the weak model (as shown in Table 2), we do not observe significant improvements in weak performance with increasing $T$, resulting in performance levels similar to the baseline for w2s training as well. This limitation arises because the weak models lack the ability to generate informative labels for the hard data that would facilitate effective learning for the strong model. To address this issue, or to pick the best $T$ for the random split, one can employ the following methods:
- Cross-Validation: Utilize a small subset of ground-truth data as validation data to identify the best-performing $T$ (see the sketch after this list).
- Use a Larger Range of $T$: We note that a larger value of $T$ ($T > 8$) may be beneficial, particularly when dealing with less capable models or when there is a significant disparity between the easy and hard data splits.
- Challenges of Easy-Hard Splitting: Another contributing factor to the poor e2hg is the substantial difference between easy and hard instances within the SciQ dataset (Fig. 8). While weak models may improve accuracy on the easy split, they often struggle to generalize effectively to the harder data and thus fail to improve w2sg with the addition of a few models. Since random splits do not have this disparity, we see better performance in that case.
- Diverse Model Investigations: One of our primary motivations for conducting a diverse range of experiments was to explore the relationship between model capabilities—affected by model size and training data—and their influence on w2sg.
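As a concrete illustration of the cross-validation option above, here is a minimal sketch (the `train_round`/`evaluate` callables are illustrative stand-ins for our actual training and evaluation loops, not an exact API):

```python
from typing import Callable, Tuple

def select_best_T(
    train_round: Callable[[int], None],  # trains and adds the t-th weak learner
    evaluate: Callable[[], float],       # ensemble accuracy on held-out ground truth
    max_T: int = 10,
) -> Tuple[int, float]:
    """Pick the AdaBoost round count T on a small ground-truth validation set."""
    best_T, best_acc = 0, float("-inf")
    for t in range(1, max_T + 1):
        train_round(t)    # grow the ensemble by one weak learner
        acc = evaluate()  # validate the current ensemble
        if acc > best_acc:
            best_T, best_acc = t, acc
    return best_T, best_acc
```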
Thank you again for your time and effort in reviewing our paper! We will try to add more clarity and improve the presentation of the manuscript in the next few days, along with the requested experiments. If our responses above do not fully address your concerns, please let us know. We are happy to address them and incorporate relevant additions to enhance our manuscript.
REFERENCES
[1] Trevor Hastie, Saharon Rosset, Ji Zhu, and Hui Zou. Multi-class AdaBoost. Statistics and its Interface, 2(3):349–360, 2009.
[Experiments for Weakness 2]
Furthermore, the experiments are only conducted on simple binary classification and limited QA tasks (ARC and Quartz), which I believe do not provide sufficient support for the claims made. Given that the paper's foundation is super-alignment, it should ideally validate its approach on more challenging benchmarks (like in DataComp-LM or FineWeb). I am not necessarily expecting state-of-the-art performance, but rather a demonstration of how the method performs across a broader range of tasks.
- We acknowledge the importance of large datasets such as DataComp-LM and FineWeb; however, testing our approach on these datasets is beyond the scope of academic resources. Instead, we have selected the math_mc dataset, which offers a challenging task specifically focused on mathematics.
- The paper's goal is to evaluate the approach for solving W2SG using a prototype with smaller models and comparable datasets, similar to the methodology of Burns et al. Exactly replicating the superalignment framework is not feasible. Below, we present experiments conducted on a more challenging and larger dataset, math_mc. (We are running experiments on more model pairs and will update the paper directly to improve its quality.)
Table: Random Split; math_mc dataset; Weak Model - Qwen-1.5B; Strong model - Qwen-3B
| Method | Weak Model (Option Acc) | Strong Model (Option Acc) | Weak-to-Strong Model (Option Acc) |
|---|---|---|---|
| Baseline | 0.478 | 0.525 | 0.46 |
| EnsemW2S | 0.502 | 0.525 | 0.49 |
Table: Easy-Hard Split; math_mc dataset; Weak Model - Qwen-1.5B; Strong model - Qwen-3B
| Method | Weak Model (Option Acc) | Strong Model (Option Acc) | Weak-to-Strong Model (Option Acc) |
|---|---|---|---|
| Baseline | 0.48 | 0.49 | 0.445 |
| EnsemW2S | 0.48 | 0.49 | 0.450 |
Please let us know if you have any other questions. We are happy to address them. Thank you for the valuable feedback.
Thanks to the authors for their explanations and results. I still maintain my judgment and score.
I understand the challenges posed by the scarcity of computational resources in academic research. I would like to sincerely express my suggestion that, since the paper focuses on super-alignment, it inevitably raises readers' expectations for the performance of large-scale models on real-world challenging tasks, rather than just simple prototypes. The harsh reality is that findings from weaker models may not necessarily generalize to more challenging scales. If the paper had pragmatically positioned its intent or context within traditional machine learning topics, such as weakly supervised learning or ensemble learning, perhaps the presentation of the existing results would be more satisfying for readers.
Dear Reviewer,
Thank you for your thoughtful feedback and the time you have dedicated to reviewing our manuscript. We truly value your insights.
Based on your suggestion to test our EnsemW2S method on large-scale models and real-world challenging tasks, we have conducted experiments using 7B-sized models on the complex Math_mc dataset and the simpler ARC dataset.
We believe 7B-sized models are sufficiently large to validate our approach. Additionally, Math_mc, with its diverse range of math questions covering arithmetic, algebra, geometry, statistics, and reasoning, represents a challenging real-world task.
Table 1. The table below shows accuracy (%). The weak model is Qwen2.5-3B, and the strong model is Qwen2.5-7B. The data is from ARC and math-mc.
| Split | W2S Performance Baseline | W2S Performance Baseline-Ours | Percentage Improvement (%) |
|---|---|---|---|
| Easy-Hard (ARC) | 38.82 | 40.79 | 5.1 % |
| Random (ARC) | 53.07 | 54 | 1.7 % |
| Easy-Hard (Math-mc) | 53.2 | 53.6 | 0.8% |
| Random (Math-mc) | 52.8 | 55 | 4.2% |
Please let us know if these new experiments address your concerns. If so, we would greatly appreciate it if you would consider increasing the score.
Thanks a lot again for your valuable feedback and time.
Best Regards,
Authors
Dear Reviewer,
Thanks for your reply and for providing useful suggestions for making our manuscript better. We have added results for Qwen-3B and Qwen-7B on ARC data as part of our response to Reviewer b44k. Would similar-sized models (3B and 7B) on challenging math data resolve your concern?
The paper proposes a framework that enhances weak models through an iterative AdaBoost-based sampling and reweighting approach, enabling them to generate pseudo-labels for hard data as an improved ensemble. These pseudo-labels are then used to train a transfer model, effectively narrowing the performance gap with a strong model trained on fully labelled hard data. This framework combines established techniques—including AdaBoost, ensembling and pseudo-labeling—to address the trending weak-to-strong (w2s) generalization problem.
Strengths
- The paper is well-motivated by addressing critical challenges in prior work: the Single Weak Supervisor Limitation, Lack of Focus on Weak Model Enhancement, and Overlooking Task Complexity.
- It introduces an ensemble-based easy-to-hard (e2h) framework that extends weak-to-strong (w2s) generalization by structuring supervision as a progression from simpler tasks to more complex ones.
- The successful adaptation of AdaBoost from traditional machine learning to the w2s setting, particularly in iteratively enhancing weak models, demonstrates improved supervision for strong models tackling complex tasks.
Weaknesses
My main concern is the method's generalizability to real-world settings, where weak supervision datasets may not fully cover, or may have distributional shifts from, challenging test sets.
Specifically, the current evaluation relies on internal correlations, with weak and strong models trained on splits of the same dataset. Cross-dataset performance could provide a more robust test: for example, can weak models trained on easy tasks from Dataset A effectively bridge the gap for hard tasks in Dataset B? Comparing this with setups where both weak and strong models are trained within either Dataset A or Dataset B would further clarify the method’s generalizability.
Questions
Questions (to be addressed)
- How does this approach compare to a direct ensemble?
- How sensitive is the method to the distribution shift cross-dataset?
- Could voting from weak models introduce error to the stronger transfer model, for example, if multiple weak models agree on an incorrect answer?
- How is the AdaBoost round T determined, and how does this choice impact results?
- Line 5 in Algorithm 2 is unclear—the parameter update via the minimization objective needs clarification and is not well-represented in Figure 7.
- Would be nice to have some cost analysis provided.
Additional feedback (minor comments) to improve the paper and not necessarily part of my decision assessment.
- In Figure 2 (bottom), the colour map and x-axis denote model parameters, which seems redundant.
- Figure 7 and Algorithm 2 are essential elements but are placed in the appendix, requiring frequent back-and-forth reference. Moving them upfront or referencing them after Algorithm 1 would be helpful.
- In Figure 1 (right), the last arrow in the second line needs correction.
- There is a naming inconsistency in Figure 1—while the left plot uses “strong/weak teacher/student,” the caption refers to “strong/weak model.”
- On line 174, T is introduced in the main text but is not defined.
- In Algorithm 2, m is not defined.
- For line 10 in Algorithm 1, it would be helpful to provide intuition for the weight update, such as assigning higher weight to samples with higher error rates.
- Figures 10-13 in the appendix lack a legend.
Would be nice to have some cost analysis provided.
Training Cost of Weak Learners:
- Each weak learner is trained sequentially, as its performance is contingent upon the outputs of the preceding weak learner. Consequently, while the GPU load may be lower, the overall training time is directly proportional to the number of weak learners utilized.
- This is because the input and output token count for each weak learner during training remains approximately constant, as in standard AdaBoost: only the sampling frequency is adjusted based on the weights. In EnsemW2S we sample tokens by their token weights but then combine the sampled tokens while masking the ones not sampled, keeping the total token count approximately constant and the training time for each weak learner independent of the tokens sampled (see the sketch after this list).
- In the practical superalignment case, pre-trained weak learners will be used, which may mitigate concerns regarding training time.
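A minimal sketch of the weight-based token sampling with masking described above (illustrative NumPy code; the function name and shapes are our simplification, not the paper's implementation):

```python
import numpy as np

def resample_token_mask(token_weights: np.ndarray, seed: int = 0) -> np.ndarray:
    """Draw tokens with probability proportional to their AdaBoost weights and
    mask (rather than drop) the unsampled ones, so the sequence length and the
    per-round training cost stay approximately constant."""
    rng = np.random.default_rng(seed)
    n = token_weights.size
    probs = token_weights / token_weights.sum()
    sampled = rng.choice(n, size=n, replace=True, p=probs)  # resample n tokens
    mask = np.zeros(n, dtype=bool)
    mask[np.unique(sampled)] = True  # True = token contributes to the loss
    return mask
```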
Inference Cost of Weak Learners:
- The generation process can be executed either in parallel or sequentially, so that the GPU load (parallel) or the wall-clock generation time (sequential) scales linearly with the number of weak learners.
- For decoding, once the token-level distributions generated by the weak learners are combined using the EnsemW2S algorithm (illustrated in the sketch below), efficient decoding algorithms can be employed to produce the final response. However, this is not the focus of this work.
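To illustrate the combination step, a hedged sketch of weighting and renormalizing the weak learners' next-token distributions (the log-linear pooling form and names here are our assumption; the paper gives the exact rule):

```python
import numpy as np

def combine_next_token_probs(probs_list, alphas):
    """Combine per-learner next-token distributions with AdaBoost-style weights.

    probs_list: list of (vocab_size,) probability arrays, one per weak learner.
    alphas:     per-learner ensemble weights."""
    alphas = np.asarray(alphas, dtype=float)
    stacked = np.stack(probs_list)  # shape (num_learners, vocab_size)
    # log-linear pooling: weight each learner's log-probs by its alpha, sum,
    # exponentiate, then renormalize into a proper distribution
    combined = np.exp((alphas[:, None] * np.log(stacked + 1e-12)).sum(axis=0))
    return combined / combined.sum()
```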
Strong Model Training and Inference:
- The strong model is trained using labels generated by the weak learners and is evaluated on standard datasets. Therefore, the training and inference costs associated with the strong model remain unchanged.
Additional feedback (minor comments) to improve the paper and not necessarily part of my decision assessment.
- We sincerely appreciate the detailed feedback provided, which has been invaluable in enhancing our paper.
- We have carefully considered your suggestions and have made relevant additions to address the points raised.
Thank you again for your time and effort in reviewing our paper! If the above responses do not fully address your concerns, please let us know. We are happy to address them and incorporate relevant additions to enhance our manuscript.
REFERENCES
[1] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[2] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In ICML, 1996.
Thanks to the authors for providing more details and results.
There are no further questions from my side.
Dear Reviewer,
Thank you for your response. We are delighted to see that our revisions have addressed your initial concerns.
We are eager to provide any additional experiments and engage in more discussion, since the review period has been extended. Could you let us know if you have more questions? If not, we would be happy if you could reconsider your score.
Thanks a lot for giving your valuable time to make our manuscript better.
How is the AdaBoost round T determined, and how does this choice impact results?
- Theoretically, AdaBoost is designed to avoid overfitting, suggesting that a larger number of iterations ($T$) can lead to improved performance. However, it is important to note that increasing $T$ also results in higher computational costs.
- As additional models are incorporated into the ensemble, the performance gains tend to diminish, leading to a saturation point. Therefore, it may be more advantageous to limit $T$ to optimize both performance and computational efficiency.
- However, in scenarios involving smaller models in the easy-hard setup, we suggest selecting $T$ greater than 8 for binary classification, or a larger $T$ in general. This recommendation is based on the observation that such models may struggle to generate meaningful outputs on more challenging data because their capability is restricted by both size and easy-data training. So even though they may improve performance on easy data, they may find it difficult to generalize to harder data. By increasing $T$, we leverage the cumulative effect and observe improved performance.
Line 5 in Algorithm 2 is unclear—the parameter update via the minimization objective needs clarification and is not well-represented in Figure 7.
- Line 5 of the AdaBoost algorithm emphasizes that the weak learner must outperform random guessing. For binary classification tasks, the baseline performance of random guessing is 50%.
- Therefore, the weighted error on the training set must be less than 50%; if the weak learner does not achieve this threshold, the AdaBoost algorithm will fail to function effectively (see the sketch below).
- To address these points in detail, we have included a comprehensive background on the AdaBoost algorithm in Appendix Section C3, which elaborates on the necessity for the weak learner's performance.
- We have also made improvements to Figure 7.
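A minimal sketch of one binary AdaBoost round in the classic Freund–Schapire style, showing where the weighted-error check enters and how misclassified samples are up-weighted (illustrative code, not our paper's implementation):

```python
import numpy as np

def adaboost_round(preds: np.ndarray, labels: np.ndarray, w: np.ndarray):
    """One binary AdaBoost round. preds, labels in {0, 1}; w sums to 1."""
    miss = (preds != labels).astype(float)
    eps = float((w * miss).sum())           # weighted training error
    if eps >= 0.5:                          # the Line-5 condition: beat chance
        raise ValueError("weak learner is no better than random guessing")
    alpha = 0.5 * np.log((1 - eps) / eps)   # positive iff eps < 1/2
    w = w * np.exp(alpha * (2 * miss - 1))  # up-weight misclassified samples
    return alpha, w / w.sum()               # renormalize the weights
```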
How does this approach compare to a direct ensemble?
We utilize AdaBoost for the binary classification task and develop EnsemW2S for generative tasks. Below we explain, using existing literature [1,2], why AdaBoost is preferable to other ensemble methods, and then present numerical results showing that EnsemW2S outperforms other methods on the generation task.
- Binary Classification Task: AdaBoost is well studied and known to outperform common ensemble methods like bagging, stacking, and random forests because:
- Focus on Hard-to-Classify Samples: AdaBoost adaptively adjusts weights to emphasize misclassified examples, allowing it to learn more effectively from challenging parts of the dataset.
- Bias-Variance Tradeoff: Unlike bagging (e.g., Random Forests), which primarily reduces variance, AdaBoost reduces both bias and variance, leading to better generalization.
- Robustness with Weak Learners: AdaBoost excels with simple, weak learners (e.g., decision stumps), making it computationally efficient and less prone to overfitting compared to Gradient Boosting, which often requires careful tuning.
- Theoretical Guarantees: AdaBoost minimizes an exponential loss function and has strong theoretical connections to margin maximization, offering insights into its generalization performance.
- Efficiency and Simplicity: Compared to methods like stacking or Gradient Boosting, AdaBoost is simpler to implement, computationally efficient, and avoids the complexity of meta-model training.
- EnsemW2S for the generation task: Below we show results where, instead of adaptive token weighting, we use equal weighting, i.e., we set $\alpha_t = 1$ for all models, and compare this with EnsemW2S's adaptive choice of $\alpha_t$. We observe that this simple way of combining the models does not really help; a more intelligent combination is needed.
Table: Random Split, ARC. Weak Model - Pythia-1.4B. Strong model - Pythia-2.8B
| Method | Weak Model (Option Acc) | Strong Model (Option Acc) | Weak-to-Strong Model (Option Acc) |
|---|---|---|---|
| Baseline | 0.337 | 0.3532 | 0.319 |
| EnsemW2S with $\alpha_t = 1$ (Round=2) | 0.337 | 0.3532 | 0.32 |
| EnsemW2S with $\alpha_t = 1$ (Round=3) | 0.337 | 0.3532 | 0.319 |
| EnsemW2S with $\alpha_t = 1$ (Round=4) | 0.337 | 0.3532 | 0.32 |
Table: Easy-Hard Split, ARC. Weak Model - Pythia-1.4B. Strong model - Pythia-2.8B
| Method | Weak Model (Option Acc) | Strong Model (Option Acc) | Weak-to-Strong Model (Option Acc) |
|---|---|---|---|
| Baseline | 0.227 | 0.2535 | 0.2423 |
| EnsemW2S with $\alpha_t = 1$ (Round=2) | 0.231 | 0.2535 | 0.2338 |
| EnsemW2S with $\alpha_t = 1$ (Round=3) | 0.231 | 0.2535 | 0.233 |
Could voting from weak models introduce error to the stronger transfer model, for example, if multiple weak models agree on an incorrect answer?
Yes, it is indeed possible for weak models to introduce errors into a stronger transfer model.
In the random-split setting, this is because the weak models are less capable due to their smaller size. (Capability can also be affected by training only on easy data, which we discuss in the second point.) Thus, a less capable model is prone to producing erroneous output.
- To address this issue, we employ an Adaptive Boosting/EnsemW2S approach, which aims to reduce training error while enhancing generalization to test data. For the random split, we see the performance of multiple weak models to be better than that of a single weak model.
In the easy-hard split setting, models may in some cases improve accuracy on the easy split but not on the hard split on which they are evaluated, and thus may not show improvement over the baseline. However, we observe that with larger values of T (T>8), even less capable models start to generalize better on the hard task, showing promise for this approach.
- Thus, it is recommended not to have a significant difference between the difficulty levels of the easy and hard data. If a substantial difference in difficulty exists, employing strategies such as curriculum learning—where models are trained incrementally on progressively harder tasks—can be an effective approach to enhance generalization and performance.
We thank Reviewer ogPp for the excellent summary and for identifying our method of ensemble learning for generative tasks in w2sg as useful and novel. We are encouraged by the strengths listed by the reviewer. Below, we provide answers to the reviewer's questions and address the concerns in detail.
Weaknesses: My main concern is the method’s generalizability to real-world settings, where weak supervision datasets may not fully cover or have distributional shifts from, challenging test sets. Specifically, the current evaluation relies on internal correlations, with weak and strong models trained on splits of the same dataset. Cross-dataset performance could provide a more robust test: for example, can weak models trained on easy tasks from Dataset A effectively bridge the gap for hard tasks in Dataset B? Comparing this with setups where both weak and strong models are trained within either Dataset A or Dataset B would further clarify the method’s generalizability.
To test cross-data performance, we train on the math_mc dataset and test on mmlu-high_school_mathematics and mmlu-college_mathematics.
Table: Train on math_mc and test on math_mc, mmlu-high_school_mathematics and mmlu-college_mathematics; Weak Model - Qwen-1.5B; Strong model - Qwen-3B
| Method | Weak Model (Option Acc) | Strong Model (Option Acc) | Weak-to-Strong Model (Option Acc) |
|---|---|---|---|
| Baseline (Evaluated on math_mc) | 0.478 | 0.525 | 0.46 |
| EnsemW2S (Evaluated on math_mc) | 0.502 | 0.525 | 0.49 |
| -------- | -------- | -------- | -------- |
| Baseline (Evaluated on mmlu-school) | 0.467 | 0.56 | 0.474 |
| EnsemW2S-Cross Data (Evaluated on mmlu-school) | 0.47 | 0.56 | 0.486 |
| -------- | -------- | -------- | -------- |
| Baseline (Evaluated on mmlu-college) | 0.4 | 0.33 | 0.36 |
| EnsemW2S-Cross Data (Evaluated on mmlu-college) | 0.4 | 0.33 | 0.36 |
Observations:
- Our method demonstrated effective generalization when trained on the math_mc dataset and tested on the MMLU high school mathematics questions. This can be attributed to the relatively lower difficulty level of high school mathematics compared to college mathematics.
- Conversely, when evaluating on the MMLU college mathematics questions, our method struggled to generalize effectively. This difficulty can be linked to both the differences in datasets and the significantly higher difficulty level of the college mathematics questions.
- Note: Our easy-hard setup also has a significant distribution shift between training data and inference data, as shown in Figs. 8, 9, and 10 of the appendix. Thus, it also serves as a setup for studying the generalization of our method.
In summary, while generalizing from easy data to harder data with W2S supervision is challenging in principle, our method demonstrates promising potential in this area.
Questions:
How sensitive is the method to the distribution shift cross-dataset?
From the answer to Weakness 2 above, we observe that the EnsemW2S method is less sensitive to different data within the same domain but more sensitive to differences in difficulty/hardness levels.
This paper explores the problem of weak-to-strong (w2s) generalization in language models, particularly for complex tasks where human-labeled data is limited or unavailable. It introduces an easy-to-hard (e2h) approach, where weak models trained on simple tasks guide stronger models to tackle more difficult ones. Inspired by AdaBoost, the proposed method integrates multiple weak models to generate robust pseudo-labels for challenging tasks, simulating human supervision.
Strengths
- The authors propose an AdaBoost-inspired ensemble method that combines multiple weak LLMs to provide stronger supervision for training a more powerful model.
- The paper introduces a new algorithm that combines multiple weak LLMs by adjusting token probabilities through a voting mechanism. In some cases, a strong model trained with pseudo-labels from weak models outperforms the same model trained with real labels on complex tasks.
Weaknesses
- The concept of weak-to-strong (w2s) was first introduced by Burns et al. (2023), and this paper is seen to be only an improvement of the Burns et al. (2023) approach by extending single weak model supervision to multiple weak model supervision. At the same time, the integration of multiple weak models is seen as a direct use of the AdaBoost algorithm.
- Applying the concept of AdaBoost to ensemble learning with large language models (LLMs) could lead to significant computational overhead in practical applications. However, the paper fails to provide a detailed analysis of the computational cost or propose strategies for efficient implementation.
- The paper title and abstract emphasize the notion of creating a stronger model through the ensemble of weak models, but this idea is not sufficiently highlighted in the main text. It is only mentioned in Appendix A that the performance improves after AdaBoost training compared to a single weak model. In contrast, the experimental results in the main text primarily show that the ensemble of weak models serves to generate pseudo-labels to address the lack of labeled data. That is, compared to using real labels, weak model supervision enables the strong model to maintain performance on complex tasks without significant degradation.
- The ablation study lacks experiments treating each sample, rather than each token, as an independent unit. Although Appendix Figures 12 and 13 show the minimal impact of different window lengths on token processing, the claim that sample-level weight updating performs worse than token-level updating is unsubstantiated.
- The results compare only random splits and easy-hard data separation. However, performance gains in weak-to-strong models are not evident in the easy-hard split and are noticeably inferior to the random split. This raises questions about whether weak models also need training on hard tasks. Additionally, random split analysis does not account for potential gains attributed to a progression from easy to hard data.
- The method shows favorable performance on models with smaller parameters, as indicated in the scaling analysis. However, it lacks demonstration on larger-scale models, such as 7B, or evidence of broad applicability across different model families, like LLaMA or Qwen.
Questions
- The three parts in Figure 1 lack clear distinction. The placement of the "Test Data" label in red font in the middle is somewhat unclear, and the process for generating pseudo-labels for the hard data is insufficiently detailed. Additional labeling would improve clarity.
The paper title and abstract emphasize the notion of creating a stronger model through the ensemble of weak models, but this idea is not sufficiently highlighted in the main text. It is only mentioned in Appendix A that the performance improves after AdaBoost training compared to a single weak model. In contrast, the experimental results in the main text primarily show that the ensemble of weak models serves to generate pseudo-labels to address the lack of labeled data. That is, compared to using real labels, weak model supervision enables the strong model to maintain performance on complex tasks without significant degradation.
Thank you for your insightful question regarding our focus on model improvement through ensemble methods.
- We indeed emphasize the creation of a stronger model by combining multiple weak models, which outperforms any single weak model.
- To illustrate this point, we present results in Section 5.2.1, where we specifically compare the performance of individual weak models and demonstrate the significant improvements achieved through the EnsemW2S ensemble method across various datasets and data-splitting strategies.
- We appreciate your feedback and will consider highlighting this aspect more prominently in the main text to ensure clarity regarding the benefits of our ensemble approach.
The ablation study lacks experiments treating each sample, rather than each token, as an independent unit. Although Appendix Figures 12 and 13 show the minimal impact of different window lengths on token processing, the claim that sample-level weight updating performs worse than token-level updating is unsubstantiated.
Thank you for your insightful feedback. We acknowledge your point regarding the minimal impact of different window lengths on performance. However, we would like to clarify that increasing the window length does not bring token-based weighting closer to a sample-based approach. Below are our key explanations:
- Distinct Weighting Steps for Sample- and Token-Based Strategies: EnsemW2S operates in two distinct phases: (1) weighting tokens based on their associated errors, and (2) sampling based on these weights. The variations in window length primarily influence the sampling stage. During the weighting stage, each token is evaluated independently without being affected by window length. Thus, even as the window length approaches infinity, it does not equate to a sample-based approach.
- Differences in Calculation for the Two Strategies: The calculation of $\alpha$ is influenced by the weighted-error term, which differs between token-based and sample-based strategies. In the token-based approach, errors are weighted at the token level, while in the sample-based approach, errors are aggregated at the sample level. This fundamental difference results in distinct values for $\alpha$, independent of window lengths (see the sketch below).
We have conducted ablation studies comparing token-based and sample-based weighing, which we will include as bar plots in the appendix of our paper.
Table 1: Ablation for Random Data Splitting. The following table shows accuracy values.
| Data | Weak Model | Strong Model | Single Model Baseline | EnsemW2S (Token-Level Weighting) | Sample-Level Weighting |
|---|---|---|---|---|---|
| ARC | Pythia-160m | Pythia-410m | 0.282 | 0.285 | 0.256 |
| Quartz | Pythia-160m | Pythia-410m | 0.500 | 0.528 | 0.518 |
| ARC | Pythia-410m | Pythia-1b | 0.288 | 0.293 | 0.288 |
| Quartz | Pythia-410m | Pythia-1b | 0.551 | 0.551 | 0.539 |
Table 2: Ablation for Easy-Hard Data Splitting. The following table shows accuracy values.
| Data | Weak Model | Strong Model | Single Model Baseline | EnsemW2S (Token-Level Weighting) | Sample-Level Weighting |
|---|---|---|---|---|---|
| ARC | Pythia-160m | Pythia-410m | 0.179 | 0.183 | 0.176 |
| Quartz | Pythia-160m | Pythia-410m | 0.476 | 0.483 | 0.476 |
| ARC | Pythia-410m | Pythia-1b | 0.201 | 0.203 | 0.203 |
| Quartz | Pythia-410m | Pythia-1b | 0.492 | 0.518 | 0.501 |
The results clearly indicate that, for both random and easy-hard data splitting strategies, sample-level weighting consistently performs worse than token-level weighting.
We appreciate your suggestion and look forward to any other further questions on this topic.
We thank Reviewer b44k for the great summary of our work. We are highly encouraged that the reviewer found our method of combining weak LLMs for super-alignment and the corresponding weak-to-strong (w2s) results important and useful for the community. We are grateful for the detailed feedback for making our paper better. Below, we address and clarify all of the reviewer's concerns and questions in detail.
Weaknesses: The concept of weak-to-strong (w2s) was first introduced by Burns et al. (2023), and this paper is seen to be only an improvement of the Burns et al. (2023) approach by extending single weak model supervision to multiple weak model supervision. At the same time, the integration of multiple weak models is seen as a direct use of the AdaBoost algorithm.
We appreciate the reviewer's comments and would like to clarify that the EnsemW2S method is not a direct application of AdaBoost across multiple weak models. While AdaBoost is fundamentally designed for classification tasks, EnsemW2S is tailored for generative tasks by adjusting token distributions, similar to controlled decoding [2]. Additionally, there is limited literature on extending AdaBoost to generative tasks and to classification tasks with a very large number of classes (comparable to the vocabulary size), making such an extension non-trivial. We will first discuss the differences between Burns et al. and our method, followed by the differences between our method and AdaBoost.
Differences Between Burns et al. [1] and EnsemW2S:
- Use of Multiple Models:
- Burns et al. [1] utilize a single weak model for w2s tasks in binary classification, chess puzzles, and reward modeling.
- In contrast, we employ an ensemble of weak models, demonstrating our approach on the same binary classification task for direct comparison. Additionally, we introduce a novel ensemble method for generative tasks, validated on an SFT Q/A dataset.
- EnsemW2S for Changing Token Distribution:
- Our approach generates a new token distribution by combining multiple large language models (LLMs) for generative tasks, akin to controlled decoding methods [2], where token distributions are adjusted during inference. EnsemW2S is specifically developed for SFT tasks.
- Approach to Superalignment:
- While Burns et al. [1] discuss various perspectives on the superalignment problem, they do not address the necessity of integrating easy-to-hard generalization (e2hg) into the weak-to-strong generalization (w2sg) framework. Our work emphasizes the importance of e2hg for w2sg, highlighting that in a practical superalignment setting, labeled hard data is scarce while labeled easy data is more readily available.
- Performance Comparison:
- Our method, which combines two or more weak models, demonstrates improved performance over the baseline (single model) in the w2s training used by Burns et al. [1] for the binary classification task.
- For the SFT task, Burns et al. [1] do not provide experimental results, making direct comparison impossible. However, we follow their approach of using a single weak model with a single strong model for our baseline. We observe an improvement of up to 14% over existing baselines and average improvements of ~5% and ~4% for binary classification and generative tasks, respectively.
Differences Between AdaBoost and EnsemW2S:
- AdaBoost Not Directly Applicable to Generative Settings:
- AdaBoost is primarily an ensemble method for classification tasks. Extensive research [1] has explored its generalization, convergence properties, and overfitting avoidance. However, the dynamics of classification differ significantly from generative tasks, where the next-token distribution is derived from preceding tokens. Thus, applying AdaBoost directly to generative settings is not feasible.
- EnsemW2S vs. AdaBoost Approach:
- While EnsemW2S is inspired by AdaBoost principles, it fundamentally differs by integrating outputs from multiple LLMs at the token level. This method combines token distributions from various LLMs to generate a new token distribution, aligning more closely with controlled decoding [2] than with traditional AdaBoost methodologies.
- Challenges with Multi-Class Classification AdaBoost:
- EnsemW2S attempts to utilize multi-class classification AdaBoost; however, there is a notable lack of literature on it. Existing studies do not adequately address tasks involving a large number of classes, comparable to the vocabulary sizes encountered in NLP tasks [2]. Here we would like to emphasize that EnsemW2S is not just multi-class classification: in addition to aggregating the token-level output of multiple LLMs, our method also introduces a novel decoding/sampling technique.
Answer continued in next comment...
Answer continued from previous comment.
Non-Trivial Enhancements of EnsemW2S Beyond AdaBoost:
- Prior Term Adjustment: The calculation of $\alpha$ cannot rely solely on the error $\epsilon$, as the traditional AdaBoost method is valid only when $\epsilon < 1/2$. Applying the same equation in our context could yield negative values. We introduce a prior term, inspired by multi-class classification AdaBoost works [3], to address this issue. The intuition behind this term is given in Appendix Section C4.
- Handling Large Vocabulary Sizes: Existing works on multi-class classification AdaBoost [3] suggest adding $\log(c-1)$ (where $c$ is the number of classes) as the prior term, since $1/c$ represents the random-guess performance of the model. However, when $c$ becomes very large, the $\log(c-1)$ term also grows significantly, causing the $\alpha$ parameters of AdaBoost to become nearly identical and, consequently, less useful. To address this, we introduce a pre-training error term, $\epsilon_{pre}$, which represents an upper bound on the sample error. We then use $1-\epsilon_{pre}$ (a lower bound on accuracy) as a replacement for the $1/c$ term, as our model's lowest possible accuracy is $1-\epsilon_{pre}$, not $1/c$.
- Error Calculation Methodology: Determining an appropriate error metric is challenging. Sentence-based error assessments do not effectively enhance quality through the combination of multiple LLMs. In contrast, token-based error calculations improve token distributions by leveraging the collective knowledge of the models, fostering intrinsic collaboration among multiple LLMs. This contrasts with existing multi-LLM frameworks, such as LLM Debate, which focus on individual model capabilities without promoting mutual learning.
In summary, we emphasize that the distinctions between EnsemW2S and both Burns et al. [1] and AdaBoost are significant in methodology, performance, and principle, making the contributions of this work non-trivial. Thank you for your attention to these important points.
Applying the concept of AdaBoost to ensemble learning with large language models (LLMs) could lead to significant computational overhead in practical applications. However, the paper fails to provide a detailed analysis of the computational cost or propose strategies for efficient implementation.
We recognize that incorporating additional models will inherently increase computational demands; however, this increase primarily occurs during the training phase of the weak models. There are no additional computational costs associated with the training or testing of the strong model. We have now added a detailed cost analysis of each step of our method in Appendix Section F. However, we want to emphasize the following:
- Focus on Weak-to-Strong Generalization: The primary objective of our work is a thought experiment centered on the (im)possibility of weak-to-strong generalization (w2sg). We aim to explore how to leverage the "wisdom of the crowd" by combining multiple weak LLMs to enhance w2sg. It is crucial to highlight that many studies on w2sg do not emphasize computational cost, as doing so may distract from the core problem.
- Cost Implications During Training: The extra computational costs arise during the training of the weak models and inference from multiple weak models. Training a strong model follows a similar computational pattern to existing LLM training. However, it is important to note that the computational cost of the strong model itself is considerably higher than that of the weak models. While weak models can be trained or tested in parallel, they can also be handled sequentially, mitigating some computational concerns.
- No Additional Cost During Testing: After the w2s model has been trained, the inference process incurs no additional computational costs.
The results compare only random splits and easy-hard data separation. However, performance gains in weak-to-strong models are not evident in the easy-hard split and are noticeably inferior to the random split. This raises questions about whether weak models also need training on hard tasks. Additionally, random split analysis does not account for potential gains attributed to a progression from easy to hard data.
Clarification on Easy-to-Hard generalization in W2S Settings
- In the context of e2hg for w2s learning, it is important to note that there is no training on hard data. This is because, in this framework, we assume that labels for the harder data are not accessible, a key oversight in previous research, as highlighted in our paper.
- The harder data is inherently more challenging to label, as it requires more effort from human annotators compared to the easier data, which can be labeled with relative ease.
- To simulate the e2hg-for-w2s learning scenario, we categorize the data into easy and hard subsets. In a practical superalignment setting, the easy data represents readily available instances, while the hard data consists of instances that are difficult to label, even for weak teachers (i.e., humans).
- Our goal is to enable the strong model to learn the hard data, utilizing labels from weak teachers that are only trained on easy data. This encapsulates the essence of e2hg for w2sg.
Performance Expectations in Weak-to-Strong Settings
- It is indeed expected that improvements in the weak-to-strong setting, when using an easy-to-hard split, will be less pronounced compared to those observed in random split experiments.
- We would like to emphasize that e2hg does not align with curriculum learning. In our methodology, weak teachers are trained solely on the easy data (or one half of a random split) and are then employed to assist the stronger model in learning from the hard data or the other half of the random split.
- Given that learning hard data from easy data is inherently more challenging than learning one random sample from another random sample, we anticipate that performance gains will be more substantial in the random-split scenario.
The method shows favorable performance on models with smaller parameters, as indicated in the scaling analysis. However, it lacks demonstration on larger-scale models, such as 7B, or evidence of broad applicability across different model families, like LLaMA or Qwen.
Thank you for this feedback. Please find our answers below regarding the larger-sized Qwen model for the generation task. Please note that we have already included results for Qwen on the binary task in our paper. Additional results for larger model sizes will be added to the appendix of our paper.
Table 1. The table below shows accuracy (%). The weak model is Qwen2.5-3B, and the strong model is Qwen2.5-7B. The data is from ARC and math-mc.
| Split | W2S Performance Baseline | W2S Performance Baseline-Ours | Percentage Improvement (%) |
|---|---|---|---|
| Easy-Hard (ARC) | 38.82 | 40.79 | 5.1 % |
| Random (ARC) | 53.07 | 54 | 1.7 % |
| Easy-Hard (Math-mc) | 53.2 | 53.6 | 0.8% |
| Random (Math-mc) | 52.8 | 55 | 4.2% |
Questions: The three parts in Figure 1 lack clear distinction. The placement of the "Test Data" label in red font in the middle is somewhat unclear, and the process for generating pseudo-labels for the hard data is insufficiently detailed. Additional labeling would improve clarity.
Thank you for your detailed feedback regarding Figure 1.
- We have made updates to the figure to address your concerns, including enhancing the distinction between the three parts.
- The placement of the "Test Data" label has been adjusted for improved clarity.
- We have also provided additional details on the process for generating pseudo-labels for the hard data.
- We appreciate your insights and would welcome any further feedback to enhance the quality of our paper.
Thank you again for your time and effort in reviewing our paper! If our responses above do not fully address your concerns or if additional questions arise, please let us know. We are more than happy to address them and incorporate relevant additions to further enhance our manuscript.
REFERENCES
[1] Burns, Collin, et al. "Weak-to-strong generalization: Eliciting strong capabilities with weak supervision." arXiv preprint arXiv:2312.09390 (2023).
[2] Mudgal, Sidharth, et al. "Controlled decoding from language models." arXiv preprint arXiv:2310.17022 (2023).
[3] Trevor Hastie, et al. Multi-class adaboost. Statistics and its Interface, 2(3):349–360, 2009.
Dear Reviewer,
This is a gentle reminder regarding your initial concerns.
We have addressed them here and also updated the manuscript with an ablation comparing sample-based and token-based weighting, Qwen-7B results, more details in Figure 1, and clarifications in the writing.
We kindly invite you to review our response. If you have any remaining concerns or suggestions, we would be delighted to engage in a final discussion to ensure a fair review process.
If we have answered all your questions here, we would be happy if you could reconsider your score.
We greatly appreciate your time and effort and look forward to hearing from you.
Thank You
Dear Reviewer,
We have edited the above response and added additional results on 7B-sized models for ARC and a challenging math dataset (math_mc). We have also made relevant additions to the paper based on your suggestions. If you find the above answers insufficient to resolve your concerns, we would be happy to discuss further and resolve any other concerns you may have. If your concerns are resolved, it would be great if you could reconsider your score.
Thanks a lot for your thoughtful feedback and valuable suggestions to improve our paper. We truly appreciate the time and effort you have invested in reviewing our work.
Best Regards,
Authors
Dear Reviewer,
We hope this message finds you well. We sincerely apologize for reaching out again, but we truly value your thoughtful feedback and the time you’ve taken to review our paper. Your suggestions have been instrumental in improving our work, and we have worked diligently to address your concerns.
In response, we have incorporated several updates, including experiments with 7B-sized Qwen models (on simple ARC and challenging Math_mc dataset), ablation studies, and improvements to the clarity of our contributions and manuscript.
With the rebuttal deadline approaching, we would greatly appreciate it if you could review our responses and let us know if we have successfully resolved your concerns. If our answers address your major concerns, we would be truly grateful if you could reconsider your score.
Thank you once again for your invaluable time and effort. Your feedback has greatly enriched our work.
Best regards, Authors
Dear Reviewer,
As the deadline is just a few hours away, we wanted to check in to see if you found our rebuttal complete and thorough to answer your questions and if you had any other questions.
Thanks a lot for your time and effort in reviewing our paper.
Regards,
Authors
ENSEMW2S explores weak-to-strong (w2s) generalization by combining the capabilities of multiple LLMs to enhance model performance. It presents an easy-to-hard (e2h) approach, inspired by AdaBoost, to use weaker models on simpler tasks to supervise and train stronger models for more complex tasks. This ensemble-based approach is validated through experiments on binary classification and generative tasks.
Strengths
- The paper addresses a highly relevant and timely topic within the machine learning community. Many efforts are being made to leverage smaller models for most practical applications.
- The authors have conducted extensive experiments, exploring the performance of multiple LLMs.
Weaknesses
- The manuscript lacks sufficient clarity, making it challenging for readers to follow the paper.
- The novelty of the proposed approach appears limited, as it closely resembles the traditional AdaBoost method.
- The analysis throughout the paper, such as the statement on line 345 ("we aim to recover the performance gap (PGR) and elicit the full capability of the strong model using an ensemble of weak models"), frequently emphasizes the importance of the Performance Gap Recovery (PGR). However, in Section 5.1, Better Metric, the authors appear to question the adequacy of PGR as a metric.
- Typo: "choice" is written as "choise" a couple of times in the paper.
Questions
- Can the authors please clearly state the difference between their approach and the baseline, Burns et al.?
- Can the authors report std for different runs in the experiments sections?
- Have the authors experimented with additional values for the rounds, beyond the examples of 5 and 10 for binary and multi-choice scenarios?
- In Section 5, are the data splitting strategies essentially random vs hard splits?
- Can we please have clear quantitative reports between different models instead of just the bar plots?
- From the bar plots, it appears that the performance gap narrows under the easy-hard settings. Have the authors explored this observation?
Have the authors experimented with additional values for the rounds, beyond the examples of 5 and 10 for binary and multi-choice scenarios?
- Computational Efficiency: While adding more models may seem beneficial, it can lead to a significant increase in computational load without a corresponding improvement in performance (see the rough cost note after this list). Our experiments indicate substantial performance gains within the first five rounds for both binary and SFT tasks. Although we extended our binary classification experiments beyond five rounds, we observed a saturation effect after this point. Consequently, we did not find it compelling to investigate larger round counts further.
- Motivation for Ensemble: Our primary objective is to advocate for an ensemble approach that leverages the wisdom of the crowd. We aim to inspire the community to explore more intrinsic combinations of multiple LLMs. Notably, many existing studies on multi-LLM frameworks, such as LLM debate, do not emphasize the intrinsic combination of these models, which is a central focus of our work.
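As a rough illustration of this trade-off (symbols here are illustrative): with $T$ ensemble members queried at every decoding step, per-token inference cost grows linearly in $T$,

```latex
\mathrm{cost}_{\mathrm{decode}} \;\approx\; T \cdot L \cdot C_{\mathrm{fwd}},
```

where $L$ is the number of generated tokens and $C_{\mathrm{fwd}}$ is the cost of one weak-model forward pass; saturation after about five rounds therefore means paying roughly five times the single-model inference cost with no further gain.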
In Section 5, are the data splitting strategies essentially random vs hard splits?
- Thank you for your question regarding our data splitting strategy. We provide a detailed explanation of this strategy in Section 2.2 of our paper.
- To clarify, could you please specify what additional information you are seeking so that we can address your inquiry about the nature of our data splitting strategies in Section 5?
Can we please have clear quantitative reports between different models instead of just the bar plots?
- In Appendix Section F.4 of the original submission, we have already provided detailed numerical results and improvement percentages that support our findings.
- Please let us know if there are particular aspects or comparisons you would like us to elaborate on further.
From the bar plots, it appears that the performance gap narrows under the easy-hard settings. Have the authors explored this observation?
- Yes, the performance gap recovery (PGR) is indeed lower in the easy-hard settings. This observation is intuitive, since a model trained on easy data will find it hard to generalize to hard data.
- Consequently, investigating the easy-to-hard generalization (e2hg) is crucial as it forms an integral part of the wider understanding of the weak-to-strong generalization (w2sg) framework.
- We have addressed this observation in detail in our original manuscript, specifically in Sections 5.1 and 5.2.2.
Thank you again for your time and effort in reviewing our paper! We will work to add more clarity and improve the presentation of the manuscript over the next few days. If our responses above do not fully address your concerns or if additional questions arise, please let us know. We are more than happy to address them and incorporate relevant additions to further enhance our manuscript.
REFERENCES
[1] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
[2] Trevor Hastie, Saharon Rosset, Ji Zhu, and Hui Zou. Multi-class adaboost. Statistics and its Interface, 2(3):349–360, 2009.
[3] Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023.
[4] Sidharth Mudgal, Jong Lee, Harish Ganapathy, et al. Controlled decoding from language models. arXiv preprint arXiv:2310.17022, 2023.
Dear Reviewer,
This is a gentle reminder regarding your initial concerns.
We have addressed them here and also updated the relevant answers in the manuscript. We kindly invite you to review our response. If you have any remaining concerns or suggestions, we would be delighted to engage in a final discussion to ensure a fair review process.
If we have answered all your questions here and any additional queries as part of other responses, we would be happy if you could reconsider your score.
We greatly appreciate your time and effort and look forward to hearing from you.
Thank You
I would like to thank the authors for their response. Most of my concerns are addressed after reading the replies.
I will increase my score.
We thank Reviewer 8aG6 for the detailed feedback, which helps make our paper better. We are encouraged that the reviewer finds our paper's motivation, method, and extensive experiments important for the community and for better weak-to-strong generalization (w2sg). Below, we address and clarify all of the reviewer's concerns and questions in detail.
The novelty of the proposed approach appears limited, as it closely resembles the traditional Adaboost method
We emphasize that our EnsemW2S, which focuses on generation rather than classification, is notably different from AdaBoost.
- AdaBoost not directly applicable to the generative setting: AdaBoost is an ensemble method primarily designed for binary classification tasks. Extensive research [1] has explored its generalization to test data, convergence properties, and its ability to avoid overfitting. However, the dynamics of binary classification differ significantly from those of generative tasks, where the next-token distribution is derived from preceding tokens, culminating in a decoding strategy that produces the final output. Therefore, applying AdaBoost directly to generative settings is not feasible.
- EnsemW2S vs. AdaBoost approach: While EnsemW2S draws inspiration from the principles of AdaBoost, it fundamentally diverges by integrating outputs from multiple large language models at the token level. This approach combines the token distributions from various LLMs to generate a new token distribution (see the sketch after this list). To effectively accommodate generation tasks, our algorithm significantly deviates from traditional AdaBoost methodologies, aligning more closely with controlled decoding [4], where token distributions are adjusted during inference.
- Lack of rigorous multi-class AdaBoost: EnsemW2S attempts to utilize multi-class classification AdaBoost. However, there is a notable lack of literature addressing the application of AdaBoost to multi-class classification scenarios. Existing studies lack rigor and fail to sufficiently address tasks involving a large number of classes, akin to the vocabulary sizes in NLP tasks [2].
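To make the token-level combination concrete, here is a minimal sketch of the inference-time mixture described in the bullets above. The helper names (`ensemble_next_token_dist`, `greedy_decode`) and the probability-space averaging are illustrative assumptions, not our exact implementation.

```python
import numpy as np

def ensemble_next_token_dist(models, alphas, prefix_ids):
    """Combine per-model next-token distributions with boosting-style weights.

    `models` is a list of callables mapping a token prefix to a probability
    vector over a shared vocabulary; `alphas` are illustrative per-model
    weights (e.g., the round weights learned during boosting).
    """
    dists = np.stack([m(prefix_ids) for m in models])  # (n_models, vocab_size)
    weights = np.asarray(alphas, dtype=float)
    weights = weights / weights.sum()                  # normalize model weights
    combined = weights @ dists                         # weighted mixture over vocab
    return combined / combined.sum()                   # renormalize for safety

def greedy_decode(models, alphas, prompt_ids, eos_id, max_new_tokens=32):
    """Greedily decode from the combined distribution, one token at a time."""
    out = list(prompt_ids)
    for _ in range(max_new_tokens):
        dist = ensemble_next_token_dist(models, alphas, out)
        nxt = int(np.argmax(dist))                     # one simple decoding strategy
        out.append(nxt)
        if nxt == eos_id:
            break
    return out
```

Whether the mixture is taken in probability space (as here) or in log space, and which decoding strategy is applied on top, are design choices; the key point is that the ensemble acts on token distributions during decoding rather than on finished outputs.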
We would like to summarize EnsemW2S's innovative contributions here for the convenience of the reviewing committee. Our algorithm introduces several insightful enhancements that extend beyond mere modifications of AdaBoost:
- Handling Large Vocabulary Sizes: Existing multi-class approaches [2] suggest using $1/c$, where $c$ is the number of classes, as a lower bound for error; however, this does not empirically hold for tasks with a large number of classes. To address this, we introduce the model's pre-training error term $\epsilon_{\text{pt}}$, based on the intuition that it bounds the error a trained model should incur: performance should improve after training, so the trained error satisfies $\epsilon \le \epsilon_{\text{pt}}$ (see the schematic formulas after this list).
- Prior Term Adjustment: The model weight $\alpha$ cannot be computed solely from the error $\epsilon$, as the traditional AdaBoost formula $\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$ is valid only when $\epsilon < 1/2$, a condition derived from binary classification. Applying the same equation in our context could yield negative values of $\alpha$. Consequently, we introduce a prior term to rectify this issue (see the schematic formulas after this list).
- Error Calculation Methodology: Determining an appropriate error metric was challenging. Sentence-level error assessments do not effectively enhance quality when combining multiple LLMs. In contrast, token-level error calculations allow the token distributions to improve by leveraging the collective knowledge of the models. This approach encourages intrinsic collaboration among multiple LLMs, in contrast with existing multi-LLM frameworks, such as LLM Debate, which primarily exploit individual model capabilities without fostering mutual learning.
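For reference, the classical round-weight formulas we build on are shown below, together with a purely schematic rendering of our adjustment; the prior term $p$ and pre-training error $\epsilon_{\mathrm{pt}}$ are illustrative symbols, not necessarily the paper's exact notation.

```latex
% Binary AdaBoost round weight (Freund & Schapire [1]); requires eps_t < 1/2:
\alpha_t = \tfrac{1}{2} \ln\frac{1 - \epsilon_t}{\epsilon_t}

% SAMME multi-class extension (Hastie et al. [2]); requires eps_t < 1 - 1/c:
\alpha_t = \ln\frac{1 - \epsilon_t}{\epsilon_t} + \ln(c - 1)

% Schematic of our adjustment: replace the 1/c-style reference with a prior
% term p, so that alpha_t stays positive whenever the weak learner beats its
% pre-training error eps_pt rather than the binary threshold 1/2:
\alpha_t = \ln\frac{1 - \epsilon_t}{\epsilon_t} + \ln p,
\qquad \text{required: } \epsilon_t \le \epsilon_{\mathrm{pt}}
```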
- Clarifications on the Summary: Below we clarify certain points in the review summary to enhance the understanding of our work's novelty.
- Easy-to-Hard (E2H) Approach: The e2h strategy is not inspired by AdaBoost. Instead, it is a data splitting strategy, analogous to random splitting (see the sketch after this list). We believe that combining easy-to-hard generalization (e2hg) with w2sg creates a more realistic framework for addressing the superalignment challenge. Random splitting was initially employed to investigate the superalignment problem.
- Methodology: To tackle the e2hg and w2sg problems, we first apply a simple AdaBoost approach to a binary classification task to benchmark our method against that of Burns et al. Subsequently, we develop a more complex algorithm, EnsemW2S, which modifies the vocabulary distribution at the token level to integrate multiple large language models (LLMs). This algorithm draws inspiration from AdaBoost [1] and controlled decoding [4].
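To illustrate the contrast between the two splitting strategies, here is a minimal sketch; the `difficulty` scoring function is an assumed stand-in for whatever per-example hardness signal is used (the paper's actual criterion is described in Section 2.2).

```python
import random

def split_random(examples, frac=0.5, seed=0):
    """Random split: both halves share the same difficulty distribution."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    k = int(len(shuffled) * frac)
    return shuffled[:k], shuffled[k:]   # (weak-model train set, held-out set)

def split_easy_hard(examples, difficulty, frac=0.5):
    """Easy-to-hard split: weak supervisors see only the easiest fraction."""
    ranked = sorted(examples, key=difficulty)
    k = int(len(ranked) * frac)
    return ranked[:k], ranked[k:]       # (easy set for weak models, hard held out)
```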
The analysis throughout the paper, such as the statement on line 345 ("we aim to recover the performance gap (PGR) and elicit the full capability of the strong model using an ensemble of weak models"), frequently emphasizes the importance of the Performance Gap Recovery (PGR). However, in Section 5.1, Better Metric, the authors appear to question the adequacy of PGR as a metric.
We acknowledge the use of Performance Gap Recovery (PGR) as a metric for performance comparison in our study, while also addressing some of its limitations.
- In Section 5.1, titled "Better Metric," we highlight PGR's sensitivity to extreme scenarios, which can result in significantly large negative or positive values (see the worked example after this list). However, this sensitivity does not invalidate PGR as a reasonable metric for studying weak-to-strong generalization (w2sg). We believe it is important to share these demerits to inform the community and guide future research in w2sg.
- Additionally, PGR serves as a common standard for comparing our results with previous work in the weak-to-strong (w2s) domain. Given that PGR is currently the only metric established in the literature for this purpose, we have chosen to utilize it in our analysis.
- Throughout the paper, including the statement on line 345 ("we aim to recover the performance gap (PGR) and elicit the full capability of the strong model using an ensemble of weak models"), we consistently emphasize the significance of PGR. While we discuss its limitations in Section 5.1, this does not detract from its overall utility as a metric in our research.
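For completeness, PGR as defined in Burns et al. [3] is the fraction of the weak-to-strong performance gap that is recovered; the worked example below (with illustrative numbers) shows how a small denominator produces extreme or negative values.

```latex
\mathrm{PGR} = \frac{\mathrm{acc}_{\mathrm{w2s}} - \mathrm{acc}_{\mathrm{weak}}}
                    {\mathrm{acc}_{\mathrm{strong}} - \mathrm{acc}_{\mathrm{weak}}}

% Illustrative numbers: acc_weak = 0.60, acc_strong = 0.61 (a tiny ceiling gap).
% If acc_w2s = 0.65:  PGR = (0.65 - 0.60) / (0.61 - 0.60) =  5.0  (i.e., 500%)
% If acc_w2s = 0.58:  PGR = (0.58 - 0.60) / (0.61 - 0.60) = -2.0  (negative)
```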
The manuscript lacks sufficient clarity, making it challenging for readers to follow the paper. typo: choice is written as choise a couple of times in the paper.
Thank you for your valuable suggestions to enhance the presentation of our paper.
- We have addressed the identified typo, correcting "choise" to "choice" throughout the manuscript.
- Additionally, we have made improvements to the writing for greater clarity and coherence.
- We welcome any further specific feedback you may have regarding areas that could benefit from additional refinement. Your insights are greatly appreciated, and we are committed to making our paper as clear and accessible as possible.
Questions:
Can the authors please clearly state the difference between their approach and the baseline, Burns et al.?
Our baseline for binary classification is directly taken from Burns et al. (2023) [3]. Here are the key differences and comparisons:
- Use of Models:
- [3] use a single weak model for w2s in binary classification, chess puzzles, and reward modeling tasks.
- We use an ensemble of weak models, demonstrating our approach on the same binary classification task for direct comparison. Additionally, we develop a new ensemble approach to combine models at the token level for other generative tasks. Our general ensemble learning approach is tested on an SFT Q/A dataset.
- Approach to Superalignment:
- [3] introduced multiple perspectives on the superalignment problem but did not discuss how the weak-to-strong generalization (w2sg) problem is incomplete without easy-to-hard generalization (e2hg). In this work we emphasize e2hg for w2sg. Easy data is straightforward for humans to label and is abundant, but labeling the hard data needed to push strong models with superhuman performance even further is beyond the scope of human teachers.
- Performance Comparison:
- Our method, which combines two or more weak models, performs better than or on par with the baseline (single model) in the w2s training used by [3] for the binary classification task.
- For the SFT task, [3] do not provide experiments, so a direct comparison is not possible. However, we follow their setup of a single weak model supervising a single strong model as the baseline.
Can the authors report std for different runs in the experiments sections?
- We appreciate your suggestion regarding the reporting of standard deviations for different runs in the experiments section.
- We would like to point out that this information is already provided in the appendix, specifically in the tables of Section F.4.
- We encourage readers to refer to this section for a comprehensive overview of the results.
Dear Reviewer,
Thank you for your thoughtful feedback and valuable suggestions to improve our paper. We truly appreciate the time and effort you have invested in reviewing our work.
In response to your feedback, we have now added aggregated bar plots to the appendix section G as we found your suggestion particularly insightful and valuable.
Since the rebuttal period is not yet over, we would be happy to address any additional concerns or questions you might have; please feel free to let us know. If not, we would greatly appreciate it if you could kindly share the reasons behind your decision to reject the paper, as your insights have been instrumental in refining our work.
Thank you once again for your valuable suggestions and time.
Best regards, The Authors
Dear Reviewer,
Thank you for your last response and for increasing the score. Following your response, we added aggregated bar plots to appendix section G. We would be really grateful if you could let us know whether this resolves your concerns. If not, we would be glad to look into any additional concerns you may have.
We truly value your thoughtful feedback and the time you’ve taken to review our paper. Your suggestions have been instrumental in improving our work, and we have worked diligently to address your concerns.
Thank You
Best regards,
Authors
We thank all reviewers for their valuable feedback. We are pleased that the reviewers found our approach of combining multiple LLMs novel (b44k, ogPp, vqfu, 8aG6) and recognized its importance to the community for addressing a timely topic like weak-to-strong generalization (w2sg) (8aG6, ogPp, vqfu). They recognized the importance of easy-to-hard generalization (e2hg) for w2sg (b44k, ogPp, vqfu) and found the token distribution update algorithm for autoregressive LLMs a significant contribution in terms of idea and design (vqfu).
They also appreciated our comprehensive experiments across a large range of weak models (8aG6), the improvement in PGR over Burns et al. (vqfu), and the successful w2sg achieved in some cases using our approach (b44k). Additionally, they commended the clarity of our motivation and comparison with previous works: the Single Weak Supervisor Limitation, the Lack of Focus on Weak Model Enhancement, and Overlooking Task Complexity (ogPp).
Below, we address common concerns and misunderstandings and provide additional experimental results. We also outline the updates made to the paper.
Experimental Updates on Larger Models and Conclusions for Cross-Data Performance (b44k, ogPp, vqfu)
We conducted experiments on a challenging task (math_mc) and larger 7B-sized Qwen models (b44k, vqfu), presented below. Other data values are added in the paper.
| Split | W2S Performance (Baseline) | W2S Performance (Ours) | Improvement (%) |
|---|---|---|---|
| Easy-Hard | 53.2 | 53.6 | 0.8% |
| Random | 52.8 | 55 | 4.2% |
To study cross-data performance, we trained on the math_mc dataset and tested on MMLU (Easy, Harder, and Hardest) for the easy-hard and random data splits. Detailed results are in appendix E6. We show some results below; our observation is that cross-data generalization with our method outperforms or matches the baseline for MMLU-Easy and MMLU-Harder.
However, as the distribution gap between easy and hard data increases, generalization becomes more difficult. Nevertheless, our method still performs better than or equal to the baseline, as can also be seen in the bar plots comparing random vs. easy-hard splits (Fig. 2, 4, 5). This observation aligns with intuition: weak models trained on easy data without exposure to hard data find it challenging to generalize to harder data. While these models improve accuracy on easy data, they may not enhance accuracy on harder data. This makes e2hg a more ambitious problem than simple w2sg. This work aims to highlight e2hg as a more practical setting for w2sg and to encourage future w2sg research to focus on this scenario.
Table: Weak Model - Qwen-1.5B; Strong model - Qwen-3B
| Method | Weak Model (Option Acc) | Weak-to-Strong Model (Option Acc) |
|---|---|---|
| Baseline (Evaluated on mmlu-school; easy) | 0.467 | 0.474 |
| EnsemW2S-Cross Data (Evaluated on mmlu-school; easy) | 0.47 | 0.486 |
| Baseline (Evaluated on mmlu-college; harder) | 0.4 | 0.36 |
| EnsemW2S-Cross Data (Evaluated on mmlu-college; harder) | 0.4 | 0.36 |
Clarification on easy-hard vs. random splits, paper motivation, and differences from Burns et al. (8aG6, b44k, ogPp, vqfu)
The difference from prior works and their limitations is motivated in the paper's introduction: the Single Weak Supervisor Limitation, the Lack of Focus on Weak Model Enhancement, and Overlooking Task Complexity (e2hg as a more practical setting for w2sg). Our binary classification baseline is the same as Burns et al. For the generation task there is no direct comparison, so we use a single weak model, in line with Burns et al.
The main contribution of our work is the design of a method to combine multiple autoregressive LLMs by adjusting their token probability distributions. We compare Burns et al. and our approach (EnsemW2S) in two settings: simple w2sg (random split) and a more ambitious setting (e2hg combined with w2sg).
We do not aim to achieve complete w2sg or e2hg, as this is beyond the scope of current literature. However, our combined multi-LLM setup shows promise by not only improving generalization in most cases across both settings but also achieving complete w2sg in certain cases with random splits.
Paper updates: All the updates in the paper are in blue.
(8aG6) Added AdaBoost background in Appendix Section C3.
(vqfu) Intuition behind the prior term addition in Appendix Section C4.
(b44k) Experiment on 7B Qwen models.
(8aG6, b44k) Section 3 and Figure 1 clarity.
(b44k) Ablation on sample-level weighting vs token-level weighting.
(vqfu) Clarification on using PGR in section 5.1.
(b44k, ogPp) Computational cost analysis added in appx Sec F.
(8aG6) Aggregated bar plots in appx Sec G.
We thank all reviewers again for the time and effort in reviewing our paper! Please let us know if the above explanations do not address your concerns. We are happy to answer any further questions.
The paper proposes an approach to weak-to-strong generalization that leverages an ensemble of weak models to enhance the performance of a stronger LLM. The research shows promising results. However, the reviewers raised several concerns regarding the clarity of the manuscript, the novelty of the approach, and the computational overhead involved in applying the method to large-scale models. Although some concerns were addressed, the final scores are still mixed. In particular, the paper would benefit from clearer explanations of the method's novelty, better alignment between the abstract and experimental results, and a discussion of the computational challenges. The authors are encouraged to consider submitting a revised version of the paper to a future conference.
Additional Comments from Reviewer Discussion
Reviewers raised questions about the clarity of the explanation of the method's novelty, particularly how it differentiates itself from existing works such as Burns et al. (2023) and the AdaBoost algorithm. There are concerns about the computational cost of using an ensemble of weak models, which the paper does not address in sufficient detail. Some reviewers also questioned the method's generalizability, specifically whether it would work effectively across datasets and how it handles distributional shifts. Although some concerns were addressed, the final scores are still mixed.
[1] Collin Burns, et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023.
Reject