Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
Reward models are surprisingly sensitive to random seed; ensembling helps but does not eliminate qualitatively obvious reward hacks.
Abstract
Reviews and Discussion
The commonly used preference alignment approach involves training a reward model and optimizing the LLM's outputs against this reward model, so the effectiveness of the reward model directly impacts the quality of preference alignment. Recently, some studies have attempted to improve reward models through reward model ensembles, which can help mitigate reward hacking. This paper further explores the use of reward model ensembles in preference alignment and presents some insightful conclusions based on experiments on summarization and dialogue tasks.
Reasons to Accept
- This paper is easy to read, and the conclusions are clear. It explores RM ensembles through both reinforcement learning and re-ranking, and evaluates the effects of data distribution and reward model type.
- Building high-quality reward signals for preference alignment training is crucial. This paper helps in understanding reward model behavior and in better constructing reward ensembles, thus producing higher-quality reward signals for preference alignment.
Reasons to Reject
- As shown in Table 1, although T5-XXL can outperform the best T5-XL model on each task, using the win rate computed by T5-XXL as the evaluation metric for each task is inadequate. One concern is that T5-XXL is susceptible to much of the same reward hacking behavior as the previously used reward models, such as T5-Large and T5-XL. Therefore, it is recommended to also use other metrics (e.g., BARTScore and GPT-4 win rate) to further evaluate the policy model.
- Given OpenAI's reward scaling laws, one might wonder whether a larger reward model could effectively mitigate reward hacking. This raises a natural question: is it more beneficial to use a larger reward model, or an ensemble of smaller reward models? Furthermore, what would be the impact of introducing larger reward models into the collection used for ensembling? However, these questions are not discussed in this paper.
- The reward model ensemble discussed in this paper overlooks a significant issue: differences in score margins between the reward models being ensembled. These differences can lead to erroneous results for the ensemble. For instance, consider two samples, a and b. The first reward model assigns them scores of -5 and -6, respectively, while the second assigns them scores of 3 and 20. When applying the reward ensemble for re-ranking, the lower score the first reward model assigns to sample b can easily be overlooked (e.g., with simple mean aggregation, a receives (-5+3)/2 = -1 while b receives (-6+20)/2 = 7, so b is ranked higher even though the first reward model prefers a). However, we cannot guarantee that the margin scale is consistent across reward models: the first reward model may consider sample b significantly inferior to sample a, despite only a one-point difference in their scores.
Questions for the Authors
- There seems to be a typo in the sentence: “PALM-2 on the two possible orderings....”
- Have you tried using list-wise data to train the reward models? If so, is this phenomenon the same as with pair-wise data?
- Has the reward score of each reward model been normalized or undergone other preprocessing steps?
Thanks for the careful consideration of our submission!
T5-XXL rewards the same hacks as the smaller scale models: we agree, which is why we also evaluate the win rate with a prompted auto-eval model in Figure 6. This autoeval model is not trained on any preference data. The issue is discussed on the bottom of page 7.
Comparison of a larger reward model to an ensemble of smaller RMs: To a limited extent, a comparison can be extracted from our results because the T5-XL models are roughly four times larger than the T5-large models (3B vs 770M parameters), making the T5-large ensemble comparable in parameter count to T5-XL. According to the fine-tuned autoeval model, performance of the two setups is similar. For example in BoN, a T5-large pretrain ensemble gets win rates of .85 (tldr) and .79 (helpfulness), vs .85 and .78 for a single T5-XL model. However, our results demonstrate that ensembling can further improve the performance of a policy trained with a T5-XL reward model. Ensembles across multiple RM scales is an intriguing idea for future work.
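For concreteness, a minimal sketch of ensemble-based best-of-n reranking (illustrative only, not code from the paper; the mean aggregation and all names here are assumptions):

```python
from typing import Callable, List

import numpy as np


def best_of_n(prompt: str,
              candidates: List[str],
              reward_models: List[Callable[[str, str], float]],
              aggregate=np.mean) -> str:
    """Score every candidate with each ensemble member, aggregate the member
    scores per candidate, and return the highest-scoring candidate."""
    scores = np.array([[rm(prompt, c) for rm in reward_models] for c in candidates])
    return candidates[int(np.argmax(aggregate(scores, axis=1)))]


# Toy usage with two hypothetical reward models standing in for ensemble members.
rm1 = lambda p, c: -float(len(c))
rm2 = lambda p, c: float(c.count("e"))
print(best_of_n("prompt", ["short reply", "a longer candidate reply"], [rm1, rm2]))
```

More conservative aggregators (e.g., np.min over members) penalize candidates on which the ensemble disagrees.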
Potential "difference in score margins between the [RMs] used for ensembling." There are two potential issues: scale and offset. As discussed in section 2.1, the Bradley-Terry model, which defines the RM training objective, is undetermined with regard to offset (r=[-5, -7] is equivalent to r=[107, 105]). This is why we add an additional regularization term to "center" the rewards around zero (equation 2). However, the BT model is not undetermined with regard to scale: an assignment of r=[-1, 1] is not equivalent to r=[-10, 10] from the perspective of the reward model training objective. Therefore we do not expect RMs trained on the same data to produce rewards of significantly different magnitudes.
Questions:
- Thanks for catching the typo in the final paragraph of section 2, we will correct it.
- The idea of training the RMs on listwise data is an interesting proposed extension of the RLHF methodology, but we do not anticipate significant differences in terms of reward hacking. Listwise preferences are only available in limited settings.
- Regarding reward score normalization, please see the discussion above about section 2.1 of the paper. The scores are not directly postprocessed, but the objective is changed to remove the shift-invariance of the BT model.
Thank you for your response. Some of my concerns have been addressed. I'd like to see this submission accepted.
This paper investigates the problem of reward hacking in language model alignment using reward models. The authors show that reward models trained on the same data but with different random seeds can disagree significantly on out-of-distribution data generated by the policy during the alignment process. This underspecification of reward models propagates to the aligned policy, making it highly tuned to the specific reward model used during training. To mitigate this issue, the authors propose using reward model ensembles, particularly "pretrain ensembles" where each member is pretrained with a different random seed. Experiments on several language tasks demonstrate that pretrain ensembles are more robust than individual reward models or "finetune ensembles" where members only differ in the finetuning seed. However, the authors find that even pretrain ensembles do not eliminate reward hacking, as they can fail to capture uncertainty and penalize certain undesirable behaviors that are incentivized by all ensemble members.
Reasons to Accept
- The paper analyzes the problem of underspecification in reward modeling, providing empirical evidence of its prevalence and downstream effects on the aligned policy.
- The proposed solution of pretrain ensembles is well-motivated and evaluated on multiple tasks with different alignment strategies (best-of-n and RLHF). The results convincingly demonstrate the advantages of pretrain diversity.
- The authors identify important failure modes of reward ensembles through qualitative analysis, showing that they don't fully solve reward hacking. This provides a balanced and insightful perspective on the limitations of the proposed approach.
Reasons to Reject
- Pretraining multiple reward models is computationally expensive, especially at larger model scales. The paper doesn't discuss the trade-off between ensemble diversity and training cost.
- I think the RMs might not be significantly different from each other, even though they were trained with different seeds. To promote more diversity among the RMs, better training strategies could be explored, e.g., training each RM on a fraction of the preference dataset (as in K-fold cross-validation) or ensembling RMs trained on different alignment datasets.
- They did not compare the ensemble of small RMs and a single larger RM in terms of reward overoptimization. In some cases, training a single large RM might be more affordable.
Thanks for the careful consideration of our submission! We were happy that you found the results convincing and the analysis insightful.
Computational expense of multiple pretrains: we agree, with two caveats. First, the effort to create a pretrain ensemble could be amortized across many reward models (as we have done in our own experiments), and might also be useful for instruction tuning. Second, our work motivates the search for more sophisticated strategies that may yield similar benefits to full-scale ensembling at lower cost, such as LoRA ensembles [1, 2] and weight-space averaging [3] after projection into a shared weight space [4].
Ensemble diversity: as shown in Figure 3, the ensemble makes diverse predictions even before RLHF. Fine-tune ensembles have an average rank correlation between .45 and .62 on five outputs from the same reference model, and the average rank correlation for pretrain ensembles is significantly lower. That said, we agree that more could be done to encourage diversity in finetune ensembles, and the reviewer's suggestion of a bootstrap-style ensemble seems particularly promising. Our focus on random-seed variability aligns with the emphasis on underspecification of deep networks trained on in-distribution data [5], but we agree that other sources of variability are of interest.
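For reference, a minimal sketch of one way such an agreement measure can be computed (an illustrative assumption; the paper's exact procedure may differ): the average Spearman rank correlation over all pairs of ensemble members, on the same candidate outputs for a prompt.

```python
import itertools

import numpy as np
from scipy.stats import spearmanr


def mean_pairwise_rank_correlation(member_rewards: np.ndarray) -> float:
    """member_rewards has shape (num_members, num_candidates): each row holds one
    ensemble member's rewards for the same candidate outputs of a single prompt."""
    corrs = []
    for a, b in itertools.combinations(member_rewards, 2):
        rho, _ = spearmanr(a, b)  # rank correlation between two members
        corrs.append(rho)
    return float(np.mean(corrs))


# Hypothetical rewards from a 3-member ensemble over 5 sampled outputs.
rewards = np.array([
    [0.1, 0.4, 0.2, 0.9, 0.3],
    [0.2, 0.3, 0.1, 0.8, 0.5],
    [0.9, 0.1, 0.4, 0.2, 0.3],
])
print(mean_pairwise_rank_correlation(rewards))
```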
Comparison of an ensemble of small RMs vs a single larger RM: to a limited extent, this comparison can be extracted from our results because the T5-XL models are roughly four times larger than the T5-large models (3B vs 770M parameters), making the T5-large ensemble comparable in parameter count to T5-XL. According to the fine-tuned autoeval model, performance of the two setups is similar. For example in BoN, a T5-large pretrain ensemble gets win rates of .85 (tldr) and .79 (helpfulness), vs .85 and .78 for a single T5-XL model. Moreover, our results demonstrate that ensembling can further improve the performance of a policy trained with a T5-XL reward model. As for affordability, a reward ensemble of size 5xN may enable faster inference during RLHF than a single reward model of size 5N, since inference over the reward models can be easily parallelized; similarly, parallelization may make it easier to pretrain five LMs of size N than a single LM of size 5N.
[1] https://arxiv.org/abs/2310.00035 [2] https://arxiv.org/abs/2405.14438v1 [3] https://arxiv.org/abs/2306.04488 [4] https://arxiv.org/abs/2209.04836 [5] https://arxiv.org/abs/2011.03395
Thank you for the response. I'll keep my positive score.
This paper studies the use of reward models in aligning language models with human preferences, with a focus on addressing reward hacking using reward model ensembles. In RLHF, the policy may exploit errors in the imperfect reward model to achieve good performance as measured by that model, which is usually called reward hacking or reward overoptimization. The authors investigate how well reward model ensembles can mitigate this issue during training and inference. Further, the authors also study why reward model ensembles sometimes fail. The key findings in this paper are:
- Reward models are underspecified, which leads to different rewards or rankings when tested on out-of-distribution data.
- Overoptimization occurs when alignment to one reward model does not improve reward as measured by another.
- Reward ensembles, especially pretrained ensembles (with different random seeds), help mitigate overoptimization and improve generalization.
- Reward hacking is not completely eliminated even with reward model ensembles, because the ensemble underestimates uncertainty on data that is far from the training distribution.
Reasons to Accept
This paper investigates a critical issue: aligning language models with human preferences. It provides a thorough examination of how reward model ensembles can mitigate over-optimization. The finding that ensembles composed of members pre-trained using different seeds generalize better offers valuable insights for practitioners, albeit unsurprising.
Overall, the paper is well-organized, and the experimental results solidly support each argument, making it an enjoyable read.
Reasons to Reject
The new insights from the paper may seem limited after reading it. Using ensembles for uncertainty quantification to mitigate reward hacking is not novel in the literature; many reinforcement learning papers, especially those on offline RL, adopt a similar approach to address the reward hacking problem.
The authors hypothesize that the same phenomenon will occur in other approaches for uncertainty estimation that are not aware of distance. I wonder if the authors have tried using SNGP on top of the reward model to make it distance-aware when quantifying uncertainty. If the hypothesis holds true, it should perform better than an ensemble in mitigating reward hacking.
Additionally, to validate the above conjecture, it would be interesting to see how increasing the ensemble size affects performance. If the issue is largely due to being "distance unaware," adding more ensemble members should not help beyond a certain point.
Implementing this may require significant resources, and it might be impractical to obtain several large pretrained models with different seeds.
Thanks for the feedback on our submission. We're glad that you found reading the paper to be enjoyable!
Regarding insights: Our work is indeed inspired by past work on using ensembles in offline RL, typically in the dynamics model. However, applying ensembles for LLM alignment is more nuanced due to the interaction between the ensemble members and the process of pretraining. A key result of our work is that pretrain ensembles are more effective than finetune ensembles, a new finding that is important for better understanding the persistently difficult problem of combatting reward hacking during LLM alignment.
Distance-aware uncertainty quantification: We agree that distance-aware uncertainty quantification could be a promising next step for uncertainty-aware reward modeling and appreciate the suggestion of SNGP. To our knowledge, however, SNGP has only been applied at the scale of T5-base [1] or with a small number of training examples [6, concurrent with our work]. We would certainly be interested in investigating how to leverage SNGP with modern LLM and practical training sets.
Size of ensemble: In pilot studies in the best-of-n framework, we did not find a large effect of the ensemble size, although we agree that more comprehensive evaluation is needed. As the reviewer notes, experiments on ensemble size are computationally intensive, particularly for pretrain ensembles. However, we can explore the effect of ensemble size for fine-tune ensembles in the best-of-n setting, and will evaluate 10-member ensembles there for the next version of the paper.
Computational expense of multiple pretrains: we agree, with two caveats. First, the effort to create a pretrain ensemble could be amortized across many reward models (as we have done in our own experiments on summarization and dialogue), and might also be useful for instruction tuning. Second, our work motivates the search for more sophisticated strategies that may yield similar benefits to full-scale ensembling at lower cost, such as LoRA ensembles [2, 3] and weight-space averaging [4] after projection into a shared weight space [5].
[1] https://arxiv.org/abs/2207.07411 [2] https://arxiv.org/abs/2310.00035 [3] https://arxiv.org/abs/2405.14438v1 [4] https://arxiv.org/abs/2306.04488 [5] https://arxiv.org/abs/2209.04836 [6] https://arxiv.org/pdf/2403.05171v1
Thanks for the authors' response. I don't have additional questions / concerns.
This paper looks at the issue of reward hacking with reward model ensembles. The datasets, models, and training methods are all from previous work, except for how the individual reward models are trained and ensembled together. Experiments first show that different reward models predict similar scores for in-distribution data, but disagree substantially on out-of-distribution data. Experiments with reward ensembles show that ensembles usually provide higher performance in best-of-n reranking as well as RLHF finetuning. However, reward hacking is still observed even when reward model ensembles are used in RLHF.
Strengths:
- The issue of reward hacking and the prevention of it are important problems.
- The experiments presented in the paper are thorough and support most of the claims in the paper.
- Analysis and discussion provide good insights into the issue.
Weaknesses:
- Lack of human evaluation. The only human evaluation is qualitative in nature.
- Ensembling is a well-known method for reducing overreliance on spurious features. The paper does not contribute to the method itself.
Reasons to Accept
See strengths.
Reasons to Reject
See weaknesses.
Thanks for the feedback on our submission! We were pleased to read that you found that the experiments and analysis were thorough and insightful.
While we do not offer a new ensembling method, our paper sheds new light on the relevance of ensembling in the pretraining era, and in particular for LLM alignment. First, it is necessary to revisit the ideas behind ensembling given the widespread use of pretraining as an initial point for the ensemble members. To our knowledge, we are the first to compare ensembles within and across pretrains. Second, we are among the first to consider the relevance of ensembling to the phenomenon of reward hacking in RLHF, which, as we argue in the paper, is related to but not coextensive with overfitting.
Regarding the lack of human evaluation, we agree that this would strengthen the empirical results of the paper and hope to explore it in future work. In particular, it would be good to ensure that the gains offered by reward ensembles in automated evaluations also hold up in human evaluations. This evaluation challenge is shared by a large number of recent studies (e.g., all those using [1] for evaluation) where language models generate long responses, which render human evaluation prohibitively expensive. Due to the cost and time-intensity, unfortunately we cannot commit to performing such an evaluation for the next version of this paper. We emphasize that prompt-based and fine-tuned autoeval models tell a similar story about the effects of pretrain and finetune ensembles, increasing our confidence that these effects are real.
[1] AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. Dubois et al., NeurIPS 2023.
Summary of the paper
The paper studies the effect of ensembling reward models (RMs) in the context of RLHF for LLMs. Specifically, the authors show:
- RMs are underspecified: when trained with different random seeds, they give different rankings on OOD inputs.
- Ensembling over pretrains leads to better uncertainty estimates than ensembling over finetunes.
- Ensembling RMs can lead to better results in RLHF and in best-of-N reranking.
- Ensembling does not remove all reward hacks, as some are consistent across retrains.
Strengths
- This is a high-quality study, with very carefully designed experiments. The authors consider a collection of models of different sizes, complementary evaluations (PaLM 2 win rate + T5-XXL reward), and multiple tasks.
- The paper is well-written, with good structure, writing and figures.
- The paper studies a question which is very relevant in practice. Improvements to RMs can lead to major qualitative improvements in current LLMs.
- Ensembling is also a pretty generic and fundamental tool for improving models, so I expect the results are likely to generalize to other setup variations.
- Results are overall positive and promising.
- Section 5 on failure modes is very interesting.
Weaknesses
- Ensembling is a generic method, and similar studies with similar conclusions have been conducted many times in other settings (RL, computer vision, ...)
- Given the prior studies in other domains, the qualitative results are not very surprising
- Only very limited results on training an ensemble vs training a larger RM. Ultimately, it would be interesting to see a collection of scaling laws for different ways of growing the RM: growing the ensemble and growing individual models.
- No human evaluations of the quality of the policy model trained with the ensemble RM.
Question
I am not sure if I missed it, but I would be quite curious to know if reusing the same pretrain for the policy and RM in RLHF is worse than using different pretrains?
Recommendation
I think this is a good paper which would be a valuable contribution to the conference.