Test-Time Alignment via Hypothesis Reweighting
Inference-time alignment by reweighting prediction heads in one neural network
Abstract
Reviews and Discussion
The paper proposes Hypothesis Reweighting (HYRE), a framework for test-time model adaptation to address task underspecification in large pretrained models. The authors introduce an ensemble method that dynamically reweights individual ensemble heads at test time based on a small number of labeled examples from the target distribution. The authors state that the method outperforms traditional methods such as fine-tuning in low-data scenarios and adapts quickly without modifying the model parameters.
Strengths
- The paper is content-rich and centers on test-time adaptation, a practical problem setting.
- The paper features extensive analytical experiments supported by a variety of figures and tables, which enhance the clarity and depth of the presented results.
- The interpretation of the method from a Bayesian inference perspective is insightful.
Weaknesses
- Your motivation, the claim that "the best single model (A) can substantially outperform the ensemble average (B)" does not directly lead to the conclusion that "it is more advantageous to view the ensemble as representing a set of candidate models (C) rather than aiming for a single 'best' function through uniform averaging (D)". The relationship between A, B, C, and D needs clearer justification. Are B and D describing the same approach? This logical connection requires more elaboration to be convincing.
- Your method is based on strong assumptions. It relies on training multiple heads as the basis for test-time adaptation, which implies:
- You assume that test tasks can be represented by a limited set of basis functions, which may not hold true in many real-world applications.
- You also assume that test tasks are linear combinations of these basis functions, another strong assumption that is often unrealistic.
- Your method depends on labeled test data. The requirement for a small set of labeled examples from the target distribution is a significant limitation, whereas standard test-time adaptation scenarios typically allow access only to unlabeled test data.
- This constraint makes your method unusable in conventional zero-shot scenarios.
- The labeled examples need to be independent and identically distributed (i.i.d.) from the test distribution, which limits applicability in non-i.i.d. environments.
- Your method is unsuitable for continuous or single-sample adaptation settings where labeled data may not be readily available.
- Even if we assume your theoretical framework holds, practical implementation poses challenges. For instance, how do you acquire data from different domains to train the basis heads? The quality and distinctiveness of this data directly impact the method’s effectiveness in real-world test scenarios.
- You need to compare your method with more alignment baselines such as CPO, KTO, SimPO, etc. Additionally, using established alignment evaluation benchmarks like the Open LLM leaderboard would strengthen your results and demonstrate broader applicability.
Questions
See Weaknesses.
Details of Ethics Concerns
No
Thank you for your constructive feedback. We address each of your concerns below.
Comparisons with alignment baselines
Thank you for this suggestion. We have modified our original evaluation of RewardBench datasets to be directly comparable to the methods on the official leaderboard. Our evaluation of HyRe builds on the GRM-Gemma-2B reward model, which was state-of-the-art at the 2B scale at the time of submission.
The table below shows representative results, comparing with frontier generative models and open-source models trained via alignment methods such as DPO. Notably, HyRe achieves consistent improvements in overall score, with strong gains in the Chat, Safety, and Reasoning splits. We note a slight performance drop on the Chat Hard split, which exhibits less task ambiguity, as the preference datasets used to train these models focus primarily on challenging chat responses.
| Model | Model Type | Score | Chat | Chat Hard | Safety | Reasoning |
|---|---|---|---|---|---|---|
| mistralai/Mixtral-8x7B-Instruct-v0.1 | DPO | 77.6 | 95.0 | 64.0 | 72.6 | 78.7 |
| allenai/tulu-2-dpo-13b | DPO | 76.7 | 95.8 | 58.3 | 79.5 | 73.2 |
| allenai/tulu-2-dpo-70b | DPO | 79.1 | 97.5 | 60.5 | 84.5 | 74.1 |
| allenai/llama-3-tulu-2-dpo-70b | DPO | 77.2 | 96.4 | 57.5 | 74.9 | 80.2 |
| stabilityai/stablelm-2-12b-chat | DPO | 79.9 | 96.6 | 55.5 | 78.1 | 89.4 |
| Anthropic/claude-3-5-sonnet-20240620 | Generative | 84.2 | 96.4 | 74.0 | 81.6 | 84.7 |
| openai/gpt-4o-2024-05-13 | Generative | 84.6 | 96.6 | 70.4 | 86.5 | 84.9 |
| openai/gpt-4o-2024-08-06 | Generative | 86.7 | 96.1 | 76.1 | 88.1 | 86.6 |
| Ray2333/GRM-Gemma-2B-rewardmodel-ft | Seq. Classifier | 84.5 | 89.4 | 75.2 | 84.5 | 88.8 |
| Ours (uniform ensemble) | Seq. Classifier | 84.5 | 88.6 | 72.9 | 83.7 | 89.8 |
| Ours (N=1) | Seq + HyRe | 85.3 | 88.5 | 72.7 | 85.5 | 91.4 |
| Ours (N=5) | Seq + HyRe | 86.4 | 90.3 | 72.6 | 89.1 | 91.4 |
| Ours (N=10) | Seq + HyRe | 87.2 | 90.4 | 72.5 | 90.0 | 92.3 |
| Ours (best head oracle) | HyRe upper bound | 90.0 | 92.3 | 81.8 | 92.5 | 93.1 |
We note that this is not a direct head-to-head comparison, as HyRe leverages labeled data not utilized by the other methods. The primary purpose of this experiment is to demonstrate the potential performance gains achievable through test-time alignment.
Your motivation, the claim that "the best single model (A) can substantially outperform the ensemble average (B)" does not directly lead to the conclusion that "it is more advantageous to view the ensemble as representing a set of candidate models (C) rather than aiming for a single 'best' function through uniform averaging (D)". The relationship between A, B, C, and D needs clearer justification. Are B and D describing the same approach? This logical connection requires more elaboration to be convincing.
We apologize for any confusion. B and D are indeed the same. Both describe a uniform average of ensemble members. We observe empirically that A (the best single model) can substantially outperform B (the uniform ensemble) in underspecified tasks (see Fig 1). This motivates C: dynamically selecting or weighting the models at test time to better align with the target task. HyRe is an instantiation of C. To improve clarity, we have made a major revision to section 3.1 (most important changes in red).
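For concreteness, below is a minimal sketch of this reweighting view in NumPy. The variable names are hypothetical, and the exact update in the paper follows our generalized Bayesian formulation, which may differ in detail:

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(-1, keepdims=True)
    return x - np.log(np.exp(x).sum(-1, keepdims=True))

def reweight_heads(head_logits, labels, eta=1.0):
    """Tempered-posterior reweighting of K ensemble heads from a few
    labeled examples (a sketch, not the paper's exact update).

    head_logits: (K, N, C) per-head class logits on N labeled examples
    labels:      (N,) integer labels
    eta:         temperature; larger values concentrate the weights
    """
    K, N, _ = head_logits.shape
    log_p = log_softmax(head_logits)
    nll = -log_p[:, np.arange(N), labels].sum(axis=1)  # per-head loss, (K,)
    log_w = -eta * nll                                 # uniform prior x exp(-eta * loss)
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

def weighted_prediction(head_logits, w):
    """Case C: a w-weighted combination of head predictions; recovers the
    uniform ensemble (cases B/D) when w is uniform."""
    probs = np.exp(log_softmax(head_logits))           # (K, N, C)
    return np.tensordot(w, probs, axes=1)              # (N, C)
```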
Your method is based on strong assumptions. It relies on training multiple heads as the basis for test-time adaptation, which implies: you assume that test tasks can be represented by a limited set of basis functions, which may not hold true in many real-world applications. You also assume that test tasks are linear combinations of these basis functions, another strong assumption that is often unrealistic.
We acknowledge that modeling target functions as linear combinations of basis functions is a simplifying assumption. However, this is a practical modeling assumption to strike a good bias-variance tradeoff and does not need to hold exactly for HyRe to be effective. As HyRe operates in a low-data regime (5-50 labeled examples), mitigating the risk of overfitting is important, even at the cost of constraining the hypothesis space. Despite the constrained hypothesis space, HyRe demonstrates strong performance across diverse real datasets and scales effectively to many basis functions (100), showing its practicality for real-world problems.
Your method depends on labeled test data. The requirement for a small set of labeled examples from the target distribution is a significant limitation, whereas standard test-time adaptation scenarios typically allow access only to unlabeled test data. This constraint makes your method unusable in conventional zero-shot scenarios.
We think that our use of terms like “test-time alignment” and “test-time ensemble recalibration” may have caused some confusion. Our setting has little relation to the (unsupervised) test-time adaptation setting, which we now realize is often associated with “test-time.” We intended to highlight that our ensemble reweighting happens at inference time rather than training time. We are willing to revise the title and key phrases to avoid this misunderstanding if you believe it would enhance clarity, e.g., “Inference-time alignment” or “post-hoc alignment”.
Our problem setting is closer to fine-tuning and domain adaptation, which leverages labeled target data. HyRe is designed to be highly data-efficient, needing as few as five labeled samples for ensemble reweighting. This is much less than conventional fine-tuning methods. We agree that zero-shot settings are outside the scope of our paper.
The labeled examples need to be independent and identically distributed (i.i.d.) from the test distribution, which limits applicability in non-i.i.d. environments.
Thank you for pointing this out. We agree that in real-world scenarios, we cannot always assume that the target distribution is i.i.d. with the few-shot adaptation data. We conducted an additional set of experiments simulating various non-i.i.d. environments. Specifically, we created skewed distributions by mixing two datasets from RewardBench: math-prm and xstest-should-respond. We varied the ratio of these two datasets in the ensemble reweighting phase and measured the resulting weighted ensemble accuracy on the two datasets.
| Ratio A:B | A (math-prm) Acc | B (xstest-should-respond) Acc |
|---|---|---|
| 0.0:1.0 | 72.57% | 88.33% |
| 0.1:0.9 | 98.94% | 86.64% |
| 0.2:0.8 | 96.73% | 88.15% |
| 0.5:0.5 | 98.52% | 87.22% |
| 0.8:0.2 | 99.38% | 86.66% |
| 0.9:0.1 | 99.52% | 85.86% |
| 1.0:0.0 | 99.72% | 84.18% |
As expected, adapting exclusively on data from one distribution yields the best performance on that specific dataset. However, even when using mixed distributions, HyRe still achieves high accuracy. For instance, even when adjusting ensemble weights on a mixture with only 10% A, we recover ((98.94 - 72.57) / (99.72 - 72.57) ≈) 97% of the accuracy gain from adapting on A alone. This demonstrates that HyRe can effectively leverage small datasets even in non-i.i.d. environments.
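For clarity, the recovery fraction above is computed as:

```python
gain_mixture = 98.94 - 72.57   # accuracy gain on A from the 0.1:0.9 mixture
gain_full = 99.72 - 72.57      # accuracy gain on A from pure-A adaptation
print(f"{gain_mixture / gain_full:.1%}")   # -> 97.1%
```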
Even if we assume your theoretical framework holds, practical implementation poses challenges. For instance, how do you acquire data from different domains to train the basis heads? The quality and distinctiveness of this data directly impact the method’s effectiveness in real-world test scenarios.
We would like to clarify that we do not use domain labels during training. The basis heads are trained on the same dataset, with diversity arising from initialization and regularization alone. While we agree that the coverage of training data plays an important role in diversity, high-coverage datasets are readily available in many real-world scenarios. For example, the datasets we use for training [1,2,3,4] inherently provide broad coverage across various distributions, even without explicit domain annotations.
Our use of data from different domains is purely for evaluation purposes. Ensemble reweighting uses a small target dataset from a single distribution, and we evaluate on multiple distributions to assess the generalization and robustness of our method. At no point does our pipeline require domain labels, making it significantly less restrictive than typical domain adaptation methods [5-7]. Our evaluations use publicly available, off-the-shelf datasets without modification. Several public datasets include data with multiple “target distributions,” for example, inputs testing different capabilities [8], different topics [9], different hospitals [1,10], or different regions [1,11].
[1] Koh, Pang Wei, et al. "Wilds: A benchmark of in-the-wild distribution shifts." International conference on machine learning. PMLR, 2021.
[2] Cui, Ganqu, et al. "Ultrafeedback: Boosting language models with high-quality feedback." (2023).
[3] Ethayarajh, K., Choi, Y. & Swayamdipta, S. (2022). Understanding Dataset Difficulty with V-Usable Information. ICML.
[4] Wang, Zhilin, et al. "HelpSteer2: Open-source dataset for training top-performing reward models." arXiv preprint arXiv:2406.08673 (2024).
[5] Sun, Baochen, Jiashi Feng, and Kate Saenko. "Correlation alignment for unsupervised domain adaptation." Domain adaptation in computer vision applications (2017): 153-171.
[6] Sagawa, Shiori, et al. "Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization." arXiv preprint arXiv:1911.08731 (2019).
[7] Yao, Huaxiu, et al. "Improving out-of-distribution robustness via selective augmentation." International Conference on Machine Learning. PMLR, 2022.
[8] Lambert, Nathan, et al. "Rewardbench: Evaluating reward models for language modeling." arXiv preprint arXiv:2403.13787 (2024).
[9] Budzianowski, Paweł, et al. "Multiwoz--a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling." arXiv preprint arXiv:1810.00278 (2018).
[10] P. Bándi et al., "From Detection of Individual Metastases to Classification of Lymph Node Status at the Patient Level: The CAMELYON17 Challenge," in IEEE Transactions on Medical Imaging, vol. 38, no. 2, pp. 550-560, Feb. 2019
[11] Christie, Gordon, et al. "Functional map of the world." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
Thank you for your review! Please let us know if further detail is needed or if the new experiments address your concerns.
Thank you again for your valuable feedback! With the discussion period ending today, we kindly ask if the additional experiments and clarifications provided address your concerns or if there are any remaining points we can clarify before the deadline.
Thank you to the authors for addressing some of the issues raised. However, I remain unconvinced about the necessity of labeled data in the proposed method.
Specifically, the additional experiments mixing two datasets from RewardBench only demonstrate improvements under mixed distributions. This partially addresses the second point of my Weakness 3: "The labeled examples need to be independent and identically distributed (i.i.d.) from the test distribution, which limits applicability in non-i.i.d. environments". However, it does not address scenarios involving "continuous or single-sample adaptation settings", where only one sample arrives in a streaming manner. In such cases, your method appears inapplicable.
Nonetheless, since part of my concerns has been addressed, I am willing to raise my score to 5.
We sincerely thank you for your constructive feedback and for acknowledging the improvements in our revised manuscript. Below, we address your remaining concerns.
I remain unconvinced about the necessity of labeled data in the proposed method.
Our experiments demonstrate that ensemble reweighting methods that do not utilize target data perform substantially worse. The table below compares methods with and without labeled data, highlighting the performance gap:
| Method/Samples | Uses Labeled Data | Accuracy |
|---|---|---|
| Average Single Model | No | 0.5903 |
| Confidence Weighted (DAN) | No | 0.6832 |
| Entropy Weighted | No | 0.6838 |
| Logit Ensemble (BEM) | No | 0.8344 |
| Prob Ensemble | No | 0.8365 |
| Majority Vote | No | 0.8371 |
| Convex Optimization (GEM, N=40) | Yes | 0.8449 |
| GEM Overfitting Oracle | Yes | 0.9035 |
| HyRe (N=1) | Yes | 0.8388 |
| HyRe (N=5) | Yes | 0.8573 |
| HyRe (N=10) | Yes | 0.8626 |
| HyRe (N=20) | Yes | 0.8711 |
| HyRe (N=40) | Yes | 0.8774 |
The results clearly demonstrate that methods leveraging labeled data (e.g., GEM and HyRe) significantly outperform those without it. If you have a specific point of comparison that you expect to outperform HyRe without using labeled data, we would be happy to consider it and include it in the next version.
...However, it does not address scenarios involving "continuous or single-sample adaptation settings", where only one sample arrives in a streaming manner. In such cases, your method appears inapplicable.
We acknowledge that our method does not directly address continuous or single-sample adaptation settings. While these are indeed important directions for future research, they are outside the intended scope of this paper. Our work focuses on scenarios with a reasonable number of labeled target examples, which is a common and practical assumption in many applications, such as offline reinforcement learning and multi-task learning.
Thank you again for your valuable feedback. We look forward to any further suggestions you may have.
- Motivated by the observation that a single model can, under some circumstances, achieve better performance than a naive ensemble, the paper proposes to combine an efficient ensemble learning algorithm (from previous work) with a fast ensemble weight learning algorithm to dynamically weight the members of the ensemble. They demonstrate on a range of tasks: regression on UCI datasets, vision datasets with distribution shifts like WILDS, and preference modeling with LLMs. Although the improvements are less noticeable in the regression problems, they illustrate consistent gains from their method in the vision and language domains.
- Despite these results, a few important analyses are missing
- Comparisons to prior work on dynamically learning weights for ensembles (e.g., [1], [2], [3])
- Analysis of how much improvement comes from the efficient ensemble learning vs. weight learning
- How the improvement in the rewards for the LLM preference modeling tasks actually translates to performance in head-to-head comparisons of the generations
[1] Ruan, Yangjun, et al. "Weighted ensemble self-supervised learning." arXiv preprint arXiv:2211.09981 (2022).
[2] Jiménez, Daniel. "Dynamically weighted ensemble neural networks for classification." 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No. 98CH36227). Vol. 1. IEEE, 1998.
[3] Shahhosseini, Mohsen, Guiping Hu, and Hieu Pham. "Optimizing ensemble weights and hyperparameters of machine learning models for regression problems." Machine Learning with Applications 7 (2022): 100251.
Strengths
- The paper begins with a crisp motivation that a single model can at times outperform a naive ensemble of models. Overall, it is very clearly written and easy to read, and the contribution is very well-scoped.
- The paper presents comprehensive results across a range of tasks, from regression tasks in UCI, distribution shifts in vision, and preference modeling in language.
Weaknesses
- The preference model experiments with LLMs show consistent results across the different benchmarks, but ideally it would be good to see GPT-4-based evaluations [1] to determine whether the 0.03 increase in reward on Anthropic HH is a meaningful difference at all.
- The novelty of this paper lies in the insight that in task underspecification settings, a single model can outperform an ensemble of models and that HyRE’s fast ensembling reweighting mechanism can indeed learn a good weighting. However, the paper lacks comparisons against other basic baselines of works that have proposed methods [2, 3, 4] to re-weight ensembles at inference-time. Despite the time constraint, some preliminary experiments comparing against these would be very helpful and I would be more keen to raise my score.
- Some further analysis seems warranted to see how much improvement is coming from the efficient ensemble learning method vs the ensemble weighting learning algorithm. The delta from the ensemble weighting learning algorithm is at least clear from the different results/tables.
[1] Zheng, Lianmin, et al. "Judging llm-as-a-judge with mt-bench and chatbot arena." Advances in Neural Information Processing Systems 36 (2023): 46595-46623.
[2] Ruan, Yangjun, et al. "Weighted ensemble self-supervised learning." arXiv preprint arXiv:2211.09981 (2022).
[3] Jiménez, Daniel. "Dynamically weighted ensemble neural networks for classification." 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No. 98CH36227). Vol. 1. IEEE, 1998.
[4] Shahhosseini, Mohsen, Guiping Hu, and Hieu Pham. "Optimizing ensemble weights and hyperparameters of machine learning models for regression problems." Machine Learning with Applications 7 (2022): 100251.
Questions
- The authors’ main argument / hypothesis is that given task underspecification, a single model can outperform a naive ensemble. Can the authors also provide how the performance of a single model fares compared to an ensemble + HyRE — beyond the toy experiment in Figure 3? Essentially, it would be good to quantify whether HyRE is able to learn the “optimal” weighting. It would help if the authors could provide some additional results on the language model or vision (WILDS) experiments.
- Also, what is the entropy of the learned weights over the members of the ensemble? Do the authors observe that it collapses onto a single model, if the hypothesis is indeed true that a single model can outperform all others in task underspecification settings?
- As for the fast ensemble reweighting method, for the personalization of LLMs, is it possible to directly leverage the reward models’ scores in order to learn the weighting instead of using negative log likelihood?
- The reviewer acknowledges that running scaling experiments may be difficult given the time constraint, but an obvious argument against this ensemble approach would be that it then takes up to K times the amount of training compute to train K models. How does this fare against training 1 larger model in the paper’s experiments?
- In Table 2, what are the numbers in the parentheses?
- How were the samples for tuning the ensemble weights selected? Random? If so, can the authors report the standard deviation across using different sets of random samples?
Thank you for your constructive feedback. We address each of your concerns below.
Comparison with reweighting baselines.
Thank you for pointing out this relevant literature. We have added new experiments comparing HyRe against these methods for ensemble reweighting. The table below shows the average accuracy across 15 RewardBench datasets. HyRe demonstrates substantial improvements over prior methods, with most benefits coming from the first few examples.
| Method/Samples | Accuracy |
|---|---|
| Single Model | 0.5903 |
| Confidence Weighted (DAN, [1]) | 0.6832 |
| Entropy Weighted | 0.6838 |
| Logit Ensemble (BEM, [1]) | 0.8344 |
| Prob Ensemble | 0.8365 |
| Majority Vote | 0.8371 |
| Convex Optimization (GEM, N=40, [2]) | 0.8449 |
| HyRe (N=1) | 0.8388 |
| HyRe (N=5) | 0.8573 |
| HyRe (N=10) | 0.8626 |
| HyRe (N=20) | 0.8711 |
| HyRe (N=40) | 0.8774 |
[1] D. Jimenez, "Dynamically weighted ensemble neural networks for classification," 1998 IEEE International Joint Conference on Neural Networks Proceedings.
[2] Shahhosseini, Mohsen, Guiping Hu, and Hieu Pham. "Optimizing ensemble weights and hyperparameters of machine learning models for regression problems." Machine Learning with Applications 7 (2022).
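For reference, the label-free baselines above aggregate head predictions without any target data. A minimal sketch of these aggregation rules (assumed implementations; the exact baseline code may differ):

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(-1, keepdims=True)
    return x - np.log(np.exp(x).sum(-1, keepdims=True))

def aggregate(head_logits, rule="logit"):
    """Label-free ensemble aggregation over K heads.

    head_logits: (K, N, C) per-head logits on N inputs
    returns:     (N,) predicted labels
    """
    if rule == "logit":                        # logit ensemble (BEM-style)
        return head_logits.mean(axis=0).argmax(-1)
    probs = np.exp(log_softmax(head_logits))   # (K, N, C)
    if rule == "prob":                         # average probabilities
        return probs.mean(axis=0).argmax(-1)
    if rule == "majority":                     # majority vote over heads
        onehot = np.eye(head_logits.shape[-1])[probs.argmax(-1)]  # (K, N, C)
        return onehot.sum(axis=0).argmax(-1)
    raise ValueError(rule)
```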
Isolating the effects of ensemble learning vs. weight learning.
The authors’ main argument / hypothesis is that given task underspecification, a single model can outperform a naive ensemble. Can the authors also provide how the performance of a single model fares compared to an ensemble + HyRE — beyond the toy experiment in Figure 3? Essentially, it would be good to quantify if HyRE is able to learn the “optimal” weighting? If the authors could provide some additional results on perhaps the language model or vision (WILDS) experiments.
In the table above, all methods other than GEM and HyRe are based on ensemble learning only, without task-specific weight learning. We see that learning task-specific weights (GEM and HyRe) provides a substantial boost over methods for aggregating the ensemble alone.
As an upper bound for the achievable performance of ensemble reweighting, we also evaluate GEM [2] with access to the entire test set (no held-out examples). We note that this is not a fair point of comparison as it directly uses the test set, and should rather be seen as the highest achievable accuracy from “overfitting” during the weight learning stage.
| Method/Samples | Accuracy |
|---|---|
| HyRe (N=1) | 0.8388 |
| HyRe (N=5) | 0.8573 |
| HyRe (N=10) | 0.8626 |
| HyRe (N=20) | 0.8711 |
| HyRe (N=40) | 0.8774 |
| GEM Overfitting Oracle | 0.9035 |
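For transparency, a minimal sketch of simplex-constrained weight optimization in the spirit of GEM [2] (an assumed setup, not the exact published implementation; the oracle row corresponds to fitting these weights directly on the full test set):

```python
import numpy as np
from scipy.optimize import minimize

def fit_simplex_weights(correct_probs):
    """Optimize ensemble weights on the probability simplex by minimizing
    the mixture's negative log-likelihood (convex in the weights).

    correct_probs: (K, N) probability each head assigns to the true label
    """
    K = correct_probs.shape[0]

    def nll(w):
        p = w @ correct_probs                  # (N,) mixture probabilities
        return -np.log(np.clip(p, 1e-12, None)).mean()

    res = minimize(
        nll, np.full(K, 1.0 / K), method="SLSQP",
        bounds=[(0.0, 1.0)] * K,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
    )
    return res.x
```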
Also, what is the entropy of the learned weights over the members of the ensemble? Do the authors observe that it collapses onto a single model, if the hypothesis is indeed true that a single model can outperform all others in task underspecification settings?
Following your suggestion, we measured two metrics over all 15 RewardBench datasets, after 40 examples each: (1) the normalized entropy of the learned weights, scaled to [0, 1] (1.0 for a uniform ensemble), and (2) the maximum weight assigned to any single model (0.01 for a uniform ensemble of 100 members). The average normalized entropy was 0.4466, and the average maximum weight was 0.4395, indicating that the weights are far from uniform. The learned ensemble does not completely collapse onto a single model with finite data, reflecting the appropriate behavior to avoid overfitting. With infinite data, the method would necessarily converge to the best-performing model(s).
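A sketch of how these two diagnostics can be computed (illustrative code):

```python
import numpy as np

def collapse_metrics(w):
    """Normalized entropy (1.0 for uniform weights) and maximum weight
    (1/K for a uniform K-member ensemble) of an ensemble weight vector."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    norm_entropy = -(w * np.log(np.clip(w, 1e-12, None))).sum() / np.log(len(w))
    return norm_entropy, w.max()

print(collapse_metrics(np.full(100, 1.0)))   # uniform 100-member: (1.0, 0.01)
```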
The reviewer acknowledges that running scaling experiments may be difficult given the time constraint, but an obvious argument against this ensemble approach would be that it then takes up to K times the amount of training compute to train K models. How does this fare against training 1 larger model in the paper’s experiments?
We clarify that HyRe does not require training K separate models. We use a single pre-trained backbone and K prediction heads, i.e., K small MLPs that take backbone embeddings as input. The computational overhead is negligible; for example, in our reward model experiments, the 100 ensemble heads (5.5e5 parameters in total) add less than 0.03% to the parameter count of the gemma-2b backbone (2.0e9 parameters).
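A minimal sketch of this architecture (illustrative dimensions and names; our actual heads are small MLPs rather than single linear layers):

```python
import torch
import torch.nn as nn

class MultiHeadRewardModel(nn.Module):
    """A shared backbone with K lightweight scoring heads."""
    def __init__(self, backbone, hidden_dim, num_heads=100):
        super().__init__()
        self.backbone = backbone   # e.g., a ~2e9-parameter language model
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, 1) for _ in range(num_heads)
        )

    def forward(self, inputs):
        z = self.backbone(inputs)  # (B, hidden_dim) pooled embeddings
        return torch.cat([h(z) for h in self.heads], dim=-1)  # (B, K) scores

# Stand-in backbone for illustration: 100 linear heads on 2048-dim
# embeddings add only ~2e5 parameters.
model = MultiHeadRewardModel(nn.Linear(512, 2048), hidden_dim=2048)
scores = model(torch.randn(4, 512))   # shape (4, 100)
```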
As for the fast ensemble reweighting method, for the personalization of LLMs, is it possible to directly leverage the reward models’ scores in order to learn the weighting instead of using negative log likelihood?
We initially considered weighting schemes directly based on reward scores but found that these often underperform due to their sensitivity to outliers, which disproportionately affected the resulting weights. However, we agree that reward scores hold useful information and could be used more effectively. While the paper evaluates weighting schemes using accuracy rather than negative log-likelihood (NLL), your suggestion aligns with our broader aim of developing effective reweighting strategies. We view this as a promising direction for future work.
In Table 2, what are the numbers in the parentheses? How were the samples for tuning the ensemble weights selected? Random? If so, can the authors report the standard deviation across using different sets of random samples?
We randomly select test splits within each evaluation distribution in addition to the samples used for learning ensemble weights. The numbers in the parentheses are standard deviations across 20 random selections, which was sufficient to get a stable estimate.
Thank you for your review! Please let us know if further detail is needed or if the new experiments address your concerns.
Thank you for the clarifications and detailed comparisons against other ensembling baselines that show the improvement from efficient ensemble learning method and the ensemble weighting learning algorithm.
It is still a bit unclear how much improvement is coming from each aspect of the learning algorithm (ensemble learning and weighting learning algorithms) -- a clear experiment for this would have been, training an ensemble of models using different random seeds [1] and then just applying the weighting learning algorithm on top.
Some remaining questions:
- What is the size of the ensemble (# of members) in the baseline ensemble methods, such as logit ensemble?
- The single model performance is at 59%, and the motivation is that a single model can outperform a naive ensemble in many cases. However, this result rather indicates the opposite: the single model is drastically worse in performance. Could the authors please clarify the hypothesis/motivation in relation to this result?
There have been many works, both old and recent (with LLMs) [2,3], showing that ensembles are more robust against reward hacking/overoptimization and improve performance. Despite the more comprehensive comparison against the baselines, the paper still lacks a precise analysis of why we are seeing the improvement (ensemble learning vs. weight learning), which is critical in assessing the novelty of the proposed method. Thus, I currently stand with my original score.
[1] Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. "Simple and scalable predictive uncertainty estimation using deep ensembles." Advances in Neural Information Processing Systems 30 (2017).
[2] Coste, Thomas, et al. "Reward model ensembles help mitigate overoptimization." arXiv preprint arXiv:2310.02743 (2023).
[3] Lu, Keming, et al. "Routing to the expert: Efficient reward-guided ensemble of large language models." arXiv preprint arXiv:2311.08692 (2023).
Thank you for your valuable feedback and for highlighting key areas where additional clarification and analysis could strengthen our work.
It is still a bit unclear how much improvement is coming from each aspect of the learning algorithm (ensemble learning and weighting learning algorithms) -- a clear experiment for this would have been, training an ensemble of models using different random seeds [1] and then just applying the weighting learning algorithm on top.
Thank you for the clarification on the suggested experiment. To more directly address your point on disentangling the improvement gains, we conducted an additional experiment using a vanilla ensemble with 100 members trained on different random seeds. The results of applying our weighting algorithm on top of this ensemble are shown below:
| Method/Samples | EpiNet | Vanilla Ensemble |
|---|---|---|
| Average Single Model | 0.5903 | 0.6397 |
| Confidence Weighted (DAN, [1]) | 0.6832 | 0.7865 |
| Entropy Weighted | 0.6838 | 0.7865 |
| Logit Ensemble (BEM, [1]) | 0.8344 | 0.8336 |
| Prob Ensemble | 0.8365 | 0.8318 |
| Majority Vote | 0.8371 | 0.8313 |
| Convex Optimization (GEM, N=40, [2]) | 0.8449 | 0.8477 |
| GEM Overfitting Oracle | 0.9035 | 0.8708 |
| Best Single Model | 0.8951 | 0.8790 |
We see that vanilla ensembles achieve higher Average Single Model performance, whereas the EpiNet achieves higher GEM Overfitting Oracle and Best Single Model performance. The explicit diversification of the EpiNet architectures improves the performance after adjusting ensemble weights. We observed similar tendencies in the other datasets early on in the project as well.
What is the size of the ensemble (# of members) in the baseline ensemble methods, such as logit ensemble?
The ensemble size is 100 for all methods in the table above and the one in our original rebuttal.
The single model performance is at 59%, and the motivation is that a single model can outperform a naive ensemble in many cases. However, this result rather indicates the opposite: the single model is drastically worse in performance. Could the authors please clarify the hypothesis/motivation in relation to this result?
The “single model” in our table represents the performance of a randomly selected model from the ensemble, serving as a proxy for naive single-model performance. To reduce confusion, we added a row for the “Best Single Model.”
The motivation behind our work is not that all single models outperform an ensemble but that the best single model in an ensemble often outperforms the naive averaging or majority-vote ensemble. This motivates adaptive weighting strategies like the one we propose.
We appreciate your detailed feedback and suggestions, which have helped us refine our analysis. We wanted to follow up to see if you have any further comments or suggestions based on our latest response.
The paper proposes HYRE, which dynamically reweights ensemble members at test time based on a few labeled examples from the target distribution, allowing the model to better align with specific user intent or task requirements. HYRE applies generalized Bayesian inference, updating ensemble member weights using non-differentiable performance metrics. Empirical results show HYRE’s robustness across multiple distribution shifts, personalization tasks, and preference alignment scenarios, achieving improved accuracy with minimal additional data.
Strengths
- The use of ensemble reweighting for test-time alignment is a novel solution to the underspecified task problem, offering quick adaptation without additional training.
- HYRE leverages scalable ensemble architectures, making it feasible to apply this approach to large-scale, pretrained models.
- The method is validated across varied tasks, including preference personalization, distribution shifts, and safety benchmarks, showing consistent improvements.
- HYRE’s adaptation requires only a few labeled examples, reducing computational costs compared to conventional fine-tuning and aligning with practical constraints.
Weaknesses
- Although the active learning setup is mentioned, the paper lacks a detailed analysis of how different active learning criteria (entropy, BALD, variance) affect performance across tasks.
- The empirical studies are concentrated on well-known datasets, but the paper could benefit from evaluating HYRE on additional real-world datasets, especially those with more nuanced or complex underspecification.
- HYRE is compared against fine-tuning and other ensemble-based models but lacks direct comparisons with recent advances in task alignment or ensemble calibration methods. The results show that HYRE outperforms conventional ensemble approaches. Could the authors clarify how HYRE compares with models explicitly trained for task alignment, particularly in settings where task ambiguity is less pronounced?
Questions
The paper suggests that HYRE performs well with only a few adaptation samples. Could the authors elaborate on how performance scales as the number of adaptation examples increases, and how the results compare with methods like fine-tuning under such conditions?
Thank you for your constructive feedback. We address each of your concerns below.
HYRE is compared against fine-tuning and other ensemble-based models but lacks direct comparisons with recent advances in task alignment or ensemble calibration methods. The results show that HYRE outperforms conventional ensemble approaches. Could the authors clarify how HYRE compares with models explicitly trained for task alignment, particularly in settings where task ambiguity is less pronounced?
Thank you for this suggestion. We have modified our original evaluation of RewardBench datasets to be directly comparable to the methods on the official leaderboard. Our evaluation of HyRe builds on the GRM-Gemma-2B reward model, which was state-of-the-art at the 2B scale at the time of submission.
The table below shows representative results, comparing with frontier generative models and open-source models trained via alignment methods such as DPO. Notably, HyRe achieves consistent improvements in overall score, with strong gains in the Chat, Safety, and Reasoning splits. We note a slight performance drop on the Chat Hard split, which exhibits less task ambiguity, as the preference datasets used to train these models focus primarily on challenging chat responses.
| Model | Model Type | Score | Chat | Chat Hard | Safety | Reasoning |
|---|---|---|---|---|---|---|
| mistralai/Mixtral-8x7B-Instruct-v0.1 | DPO | 77.6 | 95.0 | 64.0 | 72.6 | 78.7 |
| allenai/tulu-2-dpo-13b | DPO | 76.7 | 95.8 | 58.3 | 79.5 | 73.2 |
| allenai/tulu-2-dpo-70b | DPO | 79.1 | 97.5 | 60.5 | 84.5 | 74.1 |
| allenai/llama-3-tulu-2-dpo-70b | DPO | 77.2 | 96.4 | 57.5 | 74.9 | 80.2 |
| stabilityai/stablelm-2-12b-chat | DPO | 79.9 | 96.6 | 55.5 | 78.1 | 89.4 |
| Anthropic/claude-3-5-sonnet-20240620 | Generative | 84.2 | 96.4 | 74.0 | 81.6 | 84.7 |
| openai/gpt-4o-2024-05-13 | Generative | 84.6 | 96.6 | 70.4 | 86.5 | 84.9 |
| openai/gpt-4o-2024-08-06 | Generative | 86.7 | 96.1 | 76.1 | 88.1 | 86.6 |
| Ray2333/GRM-Gemma-2B-rewardmodel-ft | Seq. Classifier | 84.5 | 89.4 | 75.2 | 84.5 | 88.8 |
| Ours (uniform ensemble) | Seq. Classifier | 84.5 | 88.6 | 72.9 | 83.7 | 89.8 |
| Ours (N=1) | Seq + HyRe | 85.3 | 88.5 | 72.7 | 85.5 | 91.4 |
| Ours (N=5) | Seq + HyRe | 86.4 | 90.3 | 72.6 | 89.1 | 91.4 |
| Ours (N=10) | Seq + HyRe | 87.2 | 90.4 | 72.5 | 90.0 | 92.3 |
| Ours (best head oracle) | HyRe upper bound | 90.0 | 92.3 | 81.8 | 92.5 | 93.1 |
We note that this is not a direct head-to-head comparison, as HyRe leverages labeled data not utilized by the other methods. The primary purpose of this experiment is to demonstrate the potential performance gains achievable through test-time alignment.
Empirical studies are concentrated on well-known datasets, but the paper could benefit from evaluating HYRE on additional real-world datasets, especially those with more nuanced or complex underspecification.
Thank you for the suggestion. We have additionally evaluated our ensemble reward model on the PERSONA dataset [1], which emphasizes underspecification by curating inputs designed to provoke disagreement. We test across ten personas with 200 preference pairs each. Details on each persona are in our appendix. As shown below, HyRe shows a significant improvement, achieving 83.0% accuracy with N=40 examples per persona compared to 14.8% for the base model.
| Method | Accuracy |
|---|---|
| GRM-Gemma-2B | 14.8% |
| Ours (uniform ensemble) | 21.6% |
| Ours (N=1) | 25.2% |
| Ours (N=5) | 40.1% |
| Ours (N=10) | 47.3% |
| Ours (N=20) | 66.3% |
| Ours (N=40) | 83.0% |
[1] Castricato, Louis, et al. "PERSONA: A Reproducible Testbed for Pluralistic Alignment." arXiv preprint arXiv:2407.17387 (2024).
The paper lacks detailed analysis on how different active learning criteria (entropy, BALD, variance) affect performance across tasks.
We have added a new experiment comparing the effect of different active learning criteria for HyRe in our RewardBench preference alignment tasks. We consider random sampling, BALD, and entropy, measuring their performance over 0 to 40 target examples. Across the acquisition of 40 examples, active learning methods (BALD and entropy) demonstrated slightly better performance compared to random sampling. We note that even random sampling consistently improved performance, suggesting that our reweighting process is robust to datapoint selection strategy.
| Method | N=0 | N=1 | N=5 | N=10 | N=20 | N=40 |
|---|---|---|---|---|---|---|
| Random | 84.40 | 85.33 | 86.97 | 87.34 | 88.01 | 88.83 |
| BALD | 84.06 | 84.28 | 87.13 | 87.78 | 88.60 | 88.99 |
| Entropy | 84.38 | 84.25 | 86.73 | 87.54 | 88.60 | 89.76 |
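For reference, a sketch of how the entropy and BALD acquisition scores can be computed from the weighted ensemble's predictions (illustrative code; details may differ from our implementation):

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(-1, keepdims=True)
    return x - np.log(np.exp(x).sum(-1, keepdims=True))

def acquisition_scores(head_logits, w, criterion="bald"):
    """Scores for selecting the next example from an unlabeled pool.

    head_logits: (K, N, C) per-head logits on the pool
    w:           (K,) current ensemble weights
    """
    probs = np.exp(log_softmax(head_logits))            # (K, N, C)
    mean_p = np.tensordot(w, probs, axes=1)             # (N, C)
    pred_entropy = -(mean_p * np.log(mean_p + 1e-12)).sum(-1)   # (N,)
    if criterion == "entropy":
        return pred_entropy
    # BALD: predictive entropy minus expected per-head entropy
    head_entropy = -(probs * np.log(probs + 1e-12)).sum(-1)     # (K, N)
    return pred_entropy - np.tensordot(w, head_entropy, axes=1)
```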
The paper suggests that HYRE performs well with only a few adaptation samples. Could the authors elaborate on how performance scales as the number of adaptation examples increases, and how the results compare with methods like fine-tuning under such conditions?
As shown in the additional experiments above, HyRe demonstrates strong performance with few adaptation examples, with performance gains tending to plateau at higher numbers of adaptation examples. Figure 6 in our paper shows this plateauing behavior across 27 target distributions. The original submission compared against fine-tuning for the Camelyon-WILDS dataset (Figure 5). We observe that HyRe outperforms fine-tuning in the low-data regime.
Thank you for your review! Please let us know if further detail is needed or if the new experiments address your concerns.
Thank you again for your valuable feedback! With the discussion period ending today, we kindly ask if the additional experiments and clarifications provided address your concerns or if there are any remaining points we can clarify before the deadline.
Thanks to the authors for addressing some of the issues. I'll keep my score unchanged.
Thank you for your follow-up and for taking the time to review our revisions. If there are additional experiments or clarifications you believe would strengthen our approach, we would welcome your suggestions.
We thank the reviewers for their constructive feedback and insightful suggestions. These have significantly improved our paper. We summarize major changes below and address specific comments in individual responses. Major changes in the manuscript are highlighted in blue text.
Our main changes are:
- Comparisons with alignment methods (X7mQ, MVFb): added points of comparison for our RewardBench experiment, including DPO-trained models.
- Evaluation on harder distributions (X7mQ): extended our reward model experiment to PERSONA, a large-scale dataset for pluralistic alignment.
- Weighted ensemble baselines (X7mQ, 85x2): new experiments comparing HyRe against ensemble reweighting baselines.
- Additional analyses (X7mQ, MVFb, 85x2): stress-testing the i.i.d. assumption, active learning criteria comparison, oracle reweighting performance, and measuring ensemble collapse.
- Clarity improvements (X7mQ, MVFb, 85x2): edited the manuscript for clarity in motivation and presentation.
The paper proposes Hypothesis Reweighting (HYRE), an architecture for test-time model adaptation to address distribution shifts. The main idea is to use a single backbone with multiple prediction heads, and then during test time adaptively ensemble these heads with weights that are also estimated. Reviewers had mixed opinions about the paper: they were positive about the effectiveness of the proposed method, but it is not clear whether the improvement comes from prior work (Osband et al., 2023) or the proposed reweighting strategy. The authors' responses during the rebuttal helped resolve some of the questions, but the reviewers remain unconvinced about the novelty of the method.
I read the paper myself as well, and I share the reviewers' concern about the novelty of the proposed architecture. In particular, the multi-head architecture used in the paper is a typical model in the multi-task learning literature, and in fact, the particular way of ensembling the multiple heads (Line 5, Algorithm 1) is also a standard technique in multi-task learning/multi-objective optimization; see for example [1-2]. This line of work is completely missing from the discussion in the current version of the paper. One can also view the proposed method as a special case of a mixture-of-experts network. Although the authors have provided some discussion of the connection, it is necessary to empirically compare against MoE-based models as well.
[1] Robust Multi-Task Learning with Excess Risks.
[2] Smooth Tchebycheff Scalarization for Multi-Objective Optimization.
Overall, this submission needs further work before publication, and I strongly encourage the authors to incorporate all the reviews when preparing the next iteration of the paper.
Additional Comments on Reviewer Discussion
Borderline paper, so I read the paper myself. I have delineated my reasoning justifying the decision in my meta-review.
Reject