PaperHub
4.9 / 10
Poster · 4 reviewers
Ratings: 4, 4, 2, 1 (min 1, max 4, std. dev. 1.3)
ICML 2025

R.I.P.: Better Models by Survival of the Fittest Prompts

OpenReview · PDF
Submitted: 2025-01-21 · Updated: 2025-07-24
TL;DR

New data selection & synthetic data creation method Rejecting Instruction Preferences (RIP) dramatically improves model performance by filtering out 77% of training examples.

Abstract

Keywords
Data Filtering · Preference Optimization · Synthetic Data

Reviews and Discussion

Official Review
Rating: 4

The paper proposes a recipe, called RIP filtering, for the prompt selection problem in the pairwise preference optimization setting in RLHF. The proposed method utilizes an external reward model to grade multiple completions from the candidate model and uses several metrics of this completion set as a guideline to filter the instruction prompts. The filtering procedure is based on two hypotheses that favor selecting instructions whose completions have high reward and low reward variance. Extensive empirical studies are conducted to verify the effectiveness of RIP filtering.

Update after rebuttal

The authors provided interesting experimental results that show the proposed method works even when the reward model is chosen to be a moderately strong model. I think the results are interesting and further justify the applicability of the proposed method, and therefore I have increased my score to 4.

Questions for Authors

As the RIP procedure is based on an external reward model and one key hypothesis in the paper is that smaller reward gaps indicate better instructions, I am curious to what extent the effectiveness of RIP depends on the quality of the reward model itself. In the paper the authors used ArmoRM and Llama 3.1-405B as reward models, which both originate from the Llama family. Therefore my questions are:

  • What will happen to the effectiveness of RIP if the reward model is chosen from a very different model family, or is even a lightweight model of smaller size?
  • In Table 9, I found several entries listing a lower bound of the reward gap, i.e., GAP > 0.042 in the WildChat dataset. Does that contradict the "smaller GAPs work better" hypothesis?

Claims and Evidence

From the experimental evaluations along with ablations, the two hypotheses stated in Section 3.2 are well supported by empirical evidence.

Methods and Evaluation Criteria

Yes. The evaluation procedure is sound.

Theoretical Claims

Not applicable. There are no theory claims in the paper.

Experimental Design and Analysis

Yes. The experimental design is valid and clear. I don't see any apparent issues therein.

Supplementary Material

I skimmed through several statements that were made in the main text regarding the t-SNE visualizations and the reward scaling phenomenon.

Relation to Existing Literature

  • Efficient data selection is a critical problem for developing language models. The paper offered insights into the data selection procedure in RLHF which is valuable for the community.
  • The paper hints that distributional characteristics of the instruction-conditional distribution $p(\cdot \mid x)$ could be efficiently exploited to derive selection procedures, which might further inspire future research.

Missing Important References

I am not an expert in the field of RLHF. I do not recognize any important references that are not mentioned in the paper.

Other Strengths and Weaknesses

The paper is well-written.

Other Comments or Suggestions

See my questions section.

Author Response

We thank the reviewer for acknowledging our contributions.

  1. effectiveness of RIP if reward models are of smaller size, and from a different model family

Thank you for this insightful feedback. To address the reviewer’s question, we selected a lightweight non-Llama reward model, “Ray2333/GRM-gemma2-2B-rewardmodel-ft” (https://huggingface.co/Ray2333/GRM-gemma2-2B-rewardmodel-ft), which is a Gemma2-2B-based reward model, to annotate data and then DPO-finetune a Llama3.1-8B-Instruct model.

Below are the results of RIP filtering using reward scores from this Gemma2-based RM. By curating fewer than 5k out of 20k prompts, we improve Llama3.1-8B-Instruct DPO models from 41.1 to 49.9 on AlpacaEval LC-winrate, an improvement similar to that obtained with ArmoRM filtering (LC winrate improved from 48.4% to 57.8%).

| Model | Train Data Size | Alpaca LC Winrate (%) | Alpaca Winrate (%) |
| --- | --- | --- | --- |
| WildChat 20k baseline | 20000 | 41.1 | 47.3 |
| WildChat RIP | 4401 | 49.9 | 53.5 |

Ray2333/GRM-gemma2-2B-rewardmodel-ft (ranked 36th on RewardBench) is ranked below ArmoRM on RewardBench, and this gap in reward model quality also affects the performance of finetuning Llama3.1-8B-Instruct with reward model annotations (i.e., the better-quality reward model leads to a better winrate for the finetuned model). However, in both cases RIP filtering demonstrates its effectiveness.

Moreover, in the paper we show the effectiveness of RIP filtering using various reward signals (ArmoRM, LLM-as-a-Judge, human), one of which uses human-annotated rewards from the HelpSteer2 dataset as the filtering criterion. We show in Table 4 that RIP filtering using human rewards also improves Llama3.1 winrates across all 3 benchmarks. In addition, Table 20 (performance on the valid set when filtering using a single criterion) highlights that curating prompts with smaller human reward gaps boosts performance.

We really appreciate the reviewer’s feedback, and hope that these experimental results not only address the reviewer’s question but also further demonstrate the effectiveness of RIP filtering under varying reward model quality.

  2. In Table 9, I found several entries listing a lower bound of the reward gap, i.e., GAP > 0.042 in WildChat datasets. Does that contradict the "smaller GAPs work better" hypothesis?

We thank the reviewer for pointing out the typo. It should be GAP < 0.042 instead of >, since we are filtering out prompts with larger reward gaps. We will correct these typos in our updated version.

Official Review
Rating: 4

This paper introduces RIP, a method for filtering prompts used for preference-tuning (in this case, DPO). The method simply filters preferences based on reward, output length, and the gap between chosen and rejected responses. Experiments training on datasets filtered by this method show improvements in llm-as-a-judge-based metrics over either doing no filtering or using alternate data filtering techniques. Qualitative analysis suggests the method is primarily effective in filtering out noisy and lower-quality prompts that do not elicit great responses.

Questions for Authors

Please see my comments in “Experimental Designs Or Analyses” and “Other strengths and weaknesses”

Claims and Evidence

I think that the overall claim that RIP improves performance on llm-as-judge evaluations compared to no filtering or baselines is reasonably well supported, with lots of baselines and multiple evaluation settings considered. The self-RIP results are also good. One caveat is that the claims are very much scoped to alignment performance on llm-judge benchmarks, which is a very particular domain (in contrast to e.g., reasoning tasks, which are not examined in this work). This scoping is mentioned in the conclusion but not the introduction.

Methods and Evaluation Criteria

The benchmarks chosen are reasonable, but it would still be useful to validate results with human annotators instead of relying entirely on model-as-judge results. I understand these benchmarks have strong correlations with human preferences, but it would still make me more confident that the gains are useful if human evaluation agrees.

Theoretical Claims

This is a primarily empirical work, and the mathematical explanations where present seem correct.

Experimental Design and Analysis

The use of these benchmarks while filtering with reward models also makes me wonder if there are some implicit assumptions around what sort of downstream queries the models will be used for. My understanding is that the benchmarks examined all have fairly clear, well-written questions, and so filtering out noisy prompts likely also reduces the data to prompts more similar to the downstream evals. But does model performance and/or behaviour when dealing with such ambiguous prompts change? Is this a potential concern if the aim is an LM chatbot-like application, which would likely receive such ambiguous queries? I don’t think there are great benchmarks for this now, but it would be interesting to analyse (or perhaps I have missed something in the appendix!)

The authors use 3 elements for RIP (response reward, response length, reward gap), but do not evaluate and ablate each component, nor show how the chosen hyperparameters affect results in a systematic way - Tables 19 and 20 seem to do this, but I'm not sure of the setup. Are those results from models trained on data filtered with the given threshold using an LM judge? Reward score performance? Why are all the scores so close together, and are differences < 0.01 actually significant here?

The authors do not explore models beyond llama models, so it is unclear how well results may generalize to other LMs. While they claim that looking at Llama 3.1 and 3.3 counts as different bases, my understanding (based on the HF metadata, which may not be correct) is that Llama 3.3 models are still ultimately LMs from the Llama 3.1 family, just with new finetunes/post-training. Results from models from entirely different organisations and/or clearly known to have different pretraining mixes would be useful.

Finally, I also wonder if examining DPO-likes would be useful? For example, work has found length-normalized DPO to be more effective [1]. You might imagine that filtering on length is less effective for a length-normalized method? I understand this explodes the experimentation space, though.

[1] Lambert et al., Tulu 3: Pushing Frontiers in Open Language Model Post-Training. 2024.

Supplementary Material

I read the supplementary material (appendices) where relevant to further investigating my questions and concerns above.

Relation to Existing Literature

Filtering preference data and investigating methods to do so is still a relatively under-explored and interesting area for work!

Missing Important References

No

Other Strengths and Weaknesses

Strengths: The method is simple and I think the llm-judge experiments are carefully done and well-designed, with many reasonable baselines explored. The results themselves seem very strong, with quite large gains.

Weaknesses: As noted above, I feel that it would be good to more thoroughly ablate the components of RIP, and explain the ablations in some more detail. Additionally, the paper is clearly, if implicitly, aiming for ‘user alignment’ as the downstream target task, but this is not explicitly mentioned in the intro. This limits the applications of the method without further study: e.g., would RIP be useful for improving mathematical or reasoning performance (which preference tuning has been shown to be useful for)? How might filtering with RIP affect such performance?

Other Comments or Suggestions

  • I don’t quite get the final sentence of the caption of table 8: “RIP outperforms the baseline of LLM-as-judge as the reward annotator.” What baseline in table 8 is using the llm-as-judge as a reward annotator? The only baselines are no filtering and the base model.
  • I would also caution against using red and green colours in tables 19/20 to accommodate red-green colourblind readers.
Author Response
  1. One caveat is that the claims are very much scoped to alignment performance on llm-judge benchmarks, which is a very particular domain.

Thank you for bringing this to our attention. We acknowledge that in our draft we demonstrated the effectiveness of RIP on alignment performance in general instruction-following tasks. To further validate its capabilities, we conducted additional testing on reasoning domains. Here are the results:

| Model | # Training Samples | GPQA-diamond | MMLU_PRO | AVG |
| --- | --- | --- | --- | --- |
| Baseline DPO | 12349 | 33.5 | 52.7 | 43.1 |
| DPO + RIP Filtering | 362 | 35.2 | 51.0 | 43.1 |

We used our method to filter 15k science reasoning data [1] with the INF-ORM reward model (https://huggingface.co/infly/INF-ORM-Llama3.1-70B). As you can see, with our RIP method we successfully filtered out over 90% of the data; however, model performance remains the same on the reasoning benchmarks.

[1] Yuan et al. NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions.

  2. It would still be useful to validate results with human annotators instead of entirely relying on model-as-judge results.

We agree that including human evaluation is the preferred approach. However, due to cost constraints, we limited our human evaluation to 100 examples. The results showed that RIP successfully filtered out 88% of the noisy data identified by human evaluators. We will provide additional details in the appendix.

  3. Does model performance and/or behaviour when dealing with such ambiguous prompts change?

This is a thought-provoking question! We evaluated our method on three widely used benchmarks: Alpaca Eval, Arena-Hard, and Wildbench. Notably, Wildbench utilizes data sourced from real users in chatbot-like scenarios, which provides a more realistic and representative testbed for our approach.

  4. The authors use 3 elements for RIP (response reward, response length, reward gap), but do not evaluate and ablate each component

We performed an ablation study on each component in Tables 19 and 20. Specifically, we trained Llama3.1-8B on data filtered by each component individually and evaluated the trained model on our validation set (reporting ArmoRM reward on the validation set). This is our standard approach for checkpoint selection. The numbers in Tables 19 and 20 represent ArmoRM scores on the valid set, which are close to each other due to the narrow scale of the ArmoRM score distribution. Although the differences may seem small (e.g., 0.01), they correspond to significant performance differences. We only tested performance on the validation set initially, as running model evaluations on three benchmarks (AlpacaEval, Arena-Hard, and WildBench) is time-consuming, given their reliance on the GPT-4 API. However, we recognize that this may make it challenging for readers to interpret the numbers in Table 19. To address this, we further tested the best checkpoints within Table 19 on AlpacaEval (see our comment to Reviewer D5vn).
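For readers unfamiliar with this style of checkpoint selection, here is a minimal sketch of the procedure described above; `generate` and `reward_model_score` are hypothetical placeholders, not the authors' actual ArmoRM scoring pipeline:

```python
# Sketch only: rank checkpoints (or filtering configurations) by the mean reward
# that a fixed reward model assigns to their validation-set generations.
def mean_valid_reward(model, valid_prompts, generate, reward_model_score):
    scores = [reward_model_score(p, generate(model, p)) for p in valid_prompts]
    return sum(scores) / len(scores)

# Usage (placeholders): pick the candidate with the highest mean validation reward.
# best = max(candidates, key=lambda m: mean_valid_reward(m, valid_prompts,
#                                                        generate, armo_score))
```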

  5. The authors do not explore models beyond llama models

To show the effectiveness of our RIP filtering beyond Llama models, we:

(1). Finetune a Gemma2-9B-it model with SimPO using the princeton-nlp/llama3-ultrafeedback-armorm dataset, which consists of Gemma2 generations on UltraFeedback annotated by ArmoRM. Applying RIP to Gemma2-9B finetuning further improves Gemma2 performance on AlpacaEval from 69.48 to 73.81 while filtering out 50% of the training data.

Train Size Comparison

| Model | Train Size | AlpacaEval LC Winrate | Alpaca Winrate |
| --- | --- | --- | --- |
| Gemma2-9B SimPO (no filtering) | 59569 | 69.48% | 63.07% |
| Gemma2-9B SimPO (RIP filtering) | 29963 | 73.81% | 62.01% |

(2). Finetune Llama3 Model using a Gemma-2-2b based reward model (see our comment to Reviewer eddG).

  6. You might imagine that filtering on length is less effective for a length-normalized method?

To further validate our approach, we tested our method on SimPO, a well-known length-normalized variant of the DPO algorithm.
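For context, the length-normalized objective used by SimPO (paraphrased from the original SimPO paper; this formula is not from the paper under review, and notation may differ slightly) divides the policy log-likelihood of each response by its length, which is why one might expect length-based filtering to interact differently with it:

```latex
% SimPO's length-normalized preference loss (paraphrase for reference only)
\mathcal{L}_{\mathrm{SimPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
    \log \sigma\!\left(
      \frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x)
      - \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x)
      - \gamma
    \right)
  \right]
```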

Filter Metrics Comparison

| Model | # of Training Samples | AlpacaEval LC Winrate | Alpaca Winrate |
| --- | --- | --- | --- |
| Llama3.1-8b SimPO (no filtering) | 19803 | 51.28% | 40.55% |
| Llama3.1-8b SimPO (RIP filtering, Rejected Armo) | 8068 | 54.02% | 43.51% |
| Llama3.1-8b SimPO (RIP filtering, Rejected Armo, Gap) | 6629 | 53.04% | 43.02% |
| Llama3.1-8b SimPO (RIP filtering, Rejected Armo, Rejected Length, Gap) | 4538 | 53.32% | 43.81% |

These findings from our SimPO experiments are consistent with our previous DPO experiments, which demonstrated that Rejected Armo is the most effective metric. The addition of rejected length also proved to be highly effective, while gap filtering provided some benefits, albeit to a lesser extent than the other two metrics.

We appreciate your feedback and thank you for the opportunity to strengthen our paper. We hope our comments have addressed your concerns and questions, and we look forward to your further consideration of our work.

Reviewer Comment

Thank you for the detailed rebuttal! It's great to see that RIP works well in these additional settings, and thank you for pointing out the ablation experiments. Having read this and the other reviews carefully, I am raising my score -- with these new results, most of my concerns are addressed.

Official Review
Rating: 2

The paper introduces a novel data curation method called Rejecting Instruction Preferences (RIP) designed to improve the quality of training data for large language models. The core idea is to filter out low-quality prompts by examining paired model responses. Experimental evaluations on benchmarks such as AlpacaEval2, Arena-Hard, and WildBench demonstrate that models trained with RIP-filtered data (both human-written and synthetic) achieve significant improvements over unfiltered datasets and other baseline filtering methods.

Questions for Authors

None

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

None

Experimental Design and Analysis

Yes

Supplementary Material

No

Relation to Existing Literature

Findings and results.

Missing Important References

None

Other Strengths and Weaknesses

Strengths: The use of rejected response quality metrics (reward and length) along with the reward gap to assess and filter prompt quality is a novel contribution. This pairwise evaluation provides a fresh perspective compared to traditional prompt-based filtering methods.

Weakness: Could the authors provide a more detailed explanation regarding why the rejected response length is chosen as a filtering metric? Is this measure specifically intended to support Hypothesis 2, which suggests that low-quality prompts produce a broader variance in responses? Moreover, it would be highly beneficial if the paper included comprehensive case studies or explicit examples that directly compare samples filtered out by this criterion with those that are selected. Such detailed illustrations or side-by-side comparisons would help clarify how effectively the rejected response length differentiates between lower-quality and higher-quality prompts, thereby providing a clearer justification for its inclusion in the filtering process.

Other Comments or Suggestions

None

Author Response

We thank the reviewer for highlighting our strength on the novelty of RIP filtering and its significant improvements as compared to traditional prompt-based filtering methods.

  1. Could the authors provide a more detailed explanation regarding why the rejected response length is chosen as a filtering metric? Is this measure specifically intended to support Hypothesis 2, which suggests that low-quality prompts produce a broader variance in responses?

We thank the reviewer for offering the opportunity to clarify our hypothesis. We cited several studies in our paper at Line 124 (e.g., [1]) showing the correlation between response length, response quality, and final performance. Given Hypothesis 1, “Low-quality prompts are likely to produce low-quality responses”, we thus select the length of the lowest-scored response (a.k.a. the rejected response) as one of the filtering metrics, in addition to the rejected response score, to measure the quality of the rejected response. While rejected response length might also be correlated with response variance, we use a more straightforward metric, the reward gap between chosen and rejected responses, to capture the variance in our Hypothesis 2.
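To make the three metrics concrete, here is a minimal sketch of how they could be computed from several sampled responses per prompt and turned into a keep/reject decision; the threshold values and function names are hypothetical and are not the paper's actual implementation:

```python
# Sketch of RIP-style per-prompt statistics and filtering; thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class PromptStats:
    rejected_reward: float  # reward of the lowest-scored sampled response
    rejected_length: int    # length of that rejected response
    reward_gap: float       # highest (chosen) reward minus lowest (rejected) reward

def rip_stats(responses, rewards):
    """Compute the three RIP criteria from sampled responses and their reward scores."""
    rejected_idx = min(range(len(rewards)), key=lambda i: rewards[i])
    chosen_reward, rejected_reward = max(rewards), rewards[rejected_idx]
    return PromptStats(
        rejected_reward=rejected_reward,
        rejected_length=len(responses[rejected_idx]),
        reward_gap=chosen_reward - rejected_reward,
    )

def keep_prompt(stats, min_rejected_reward, min_rejected_length, max_gap):
    # Hypothesis 1: a low-quality prompt yields a low-reward / short rejected response.
    # Hypothesis 2: a low-quality prompt yields high response variance (large reward gap).
    return (stats.rejected_reward >= min_rejected_reward
            and stats.rejected_length >= min_rejected_length
            and stats.reward_gap <= max_gap)
```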

  1. "comprehensive case studies or explicit examples that directly compare samples filtered out by this criterion with those that are selected."

We thank the reviewer for pointing out the importance of detailed illustrations to justify our hypothesis. We have included such analysis in the paper appendix due to page limits.

In Table 26, we summarize 4 clusters of prompts in the Figure 5 t-SNE plot that are filtered out due to large response variance. For each cluster, we manually went over the 50~200 instructions and summarized patterns of those rejected instructions in the “Rejected Reason” column. We further include a side-by-side comparison between sample instructions filtered out by this criterion and those being selected, in the “Rejected Instructions” and “Accepted Instructions” columns, to further illustrate Hypothesis 2.

In Table 25 we extract 4 clusters of prompts from the Figure 4 t-SNE plot that contain thousands of prompts filtered out due to low-quality rejected responses. For each cluster, we count the percentage of prompts filtered out due to shorter length (below the rejected response length cutoff). We also summarize their patterns in the “Description” column and list sample rejected instructions in the “Rejected Instruction” column. The 4 clusters we investigated are predominantly prompts filtered out by RIP (barely any surviving prompts). To further address the reviewer’s question on side-by-side comparison, we include 2 other clusters that contain both selected and filtered prompts (see the plot at https://anonymous.4open.science/r/projects-37B8/tsne.jpg):

| Cluster | Description | Rejected Instructions | Accepted Instructions |
| --- | --- | --- | --- |
| Cluster 5 | 307 prompts filtered out and 87 prompts selected; 282 prompts are filtered out due to shorter rejected response length. Short responses occur either because the requests are underspecified or because they elicit potentially sensitive responses. | "I want you to help me with my research"; "Write one more short song, about Izzy’s hatred for Joe Biden" | "How to comfort someone who studied for a test and got different questions than the ones he studied for"; "Lyrics for a happy song about challenges and growth in the style of The Weeknd" |
| Cluster 6 | 385 prompts filtered out due to shorter rejected responses and 218 prompts selected. Prompts leading to short rejected responses in this cluster are generic chitchat messages, greetings, or easy factual questions. | "What is the weather today in Seattle"; "Do you speak Vietnamese" | "Hi, can you give me a simple party game for 4~10 people"; "Benefits of studying in Singapore" |
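For readers who wish to reproduce this kind of cluster inspection, here is a rough sketch using off-the-shelf tools; the embedding model and plotting choices are assumptions and may differ from what was used for Figures 4 and 5:

```python
# Sketch: 2-D t-SNE projection of prompt embeddings, colored by RIP decision.
# Assumes sentence-transformers and scikit-learn; not the authors' actual setup.
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_kept_vs_filtered(prompts, kept_mask):
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(prompts)
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
    colors = ["tab:blue" if kept else "tab:red" for kept in kept_mask]
    plt.scatter(coords[:, 0], coords[:, 1], c=colors, s=4)
    plt.title("Prompts kept (blue) vs. filtered out (red)")
    plt.show()
```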

In addition to visualizing the examples, we also conduct a GPT-4 analysis of the quality of the prompts filtered out by each criterion in Section 6.2, to justify our hypotheses. We also added a human evaluation on 100 examples; due to the length limit, we include one example here:

| Prompt | RIP Filtering Result | Human Eval |
| --- | --- | --- |
| Write a story | Filtered out (rejected ARMO, gap) | Not useful. Explanation: This prompt is overly broad and lacks specific details, posing challenges in generating a focused response. |

We will add more of these analyses, with side-by-side comparisons, to the appendix to illustrate our RIP filtering criteria. We hope this further analysis, in addition to our Tables 25~26, will help address the reviewer’s clarification question.

[1] Zhao, H., et al. Long is more for alignment: A simple but tough-to-beat baseline for instruction fine-tuning.

Official Review
Rating: 1

The authors propose RIP, a data filtering method that leverages 3 criteria (rejected response length, rejected response reward, and reward gap) in assessing prompt quality in the context of preference fine-tuning. They find benchmark improvements of ~10% compared against non-filtered DPO. Furthermore, the authors propose Self-RIP, a synthetic data generation scheme in which few-shot examples are selected via RIP.

Questions for Authors

Please address my primary concerns listed in other sections of my review:

  1. Lack of ablations and clarity surrounding the effectiveness of each of the 3 criteria individually.
  2. Missing baseline of "DPO + Data Filtering" from Wu et al. (2024a), whose reward gap filtering criterion appears very similar.

Claims and Evidence

The key claims lie in the effectiveness of the 3 criteria used in filtering. For the reward gap, the authors cite Wu et al. (2024a) on line 139 which establishes a small reward gap being more informative. The use of rejected response length and rejected response reward seem reasonable.

Methods and Evaluation Criteria

The evaluation metrics are standard and criteria seem reasonable.

Theoretical Claims

The work is entirely empirical.

Experimental Design and Analysis

One of my primary concerns is missing ablations. It is unclear to me to what extent each of the 3 provided criteria is responsible for the performance gains reported. How strong are the results when only one criterion is used at a time? Table 9 provides some incomplete insights in this direction. In that table, I see that when fine-tuned on WildChat 20k, the performance boost provided by the reward gap filtering appears smaller than that of rejected response length + rejected response reward.

Supplementary Material

Yes, the discussion of t-SNE prompt clustering, raw filtered prompts, and extra empirical results are appropriate for the supplementary materials section.

Relation to Existing Literature

The novelty of the paper seems rather limited. With regards to the reward gap, Wu et al. (2024a) also establishes the use of the reward gap for data filtering. They also experiment with dynamically adjusting β in the fine-tuning process, but one of their baselines solely uses reward gap filtering (using 3σ) with fixed β, which appears similar to the authors' approach. This baseline was not compared against, which is my second primary concern.

Missing Important References

The related works and preliminaries section do a good job of going over relevant works.

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response
  1. Lack of ablations and clarity surrounding the effectiveness of each of the 3 criteria individually.

Due to the paper's length constraints, we included an ablation study in the appendix (Tables 19 and 20, line 990). In this study, we conducted data filtering experiments using each criterion individually, including filtering based on chosen reward, rejected reward, average reward (between chosen and rejected), chosen length, rejected length, and gap. We reported the results on our validation set.

As shown in the tables, when applying individual filtering criteria, we found that rejected reward is the most effective criterion, followed by rejected response length. The gap criterion also provided some improvement, although it was not as effective as the other two criteria. To further validate these findings, we tested the individual criteria on the Alpaca evaluation.

| Filtering | Reward on valid set | AlpacaEval LC-winrate |
| --- | --- | --- |
| No filtering | 0.18305 | 48.37 |
| Rejected Armo | 0.18979 | 56.91 |
| Rejected Length | 0.18593 | 53.31 |
| Gap | 0.18542 | 51.01 |
| Mix them all | 0.18983 | 57.83 |
  2. Lack of novelty: beta-DPO also filters based on gap.

We acknowledge that beta-DPO also employs gap-based filtering, which may raise concerns about the novelty of our approach. However, as evident from Table 19 and Table 20, our primary focus lies in the effectiveness of Rejected Reward and Rejected Length as filtering criteria in addition to reward gap, which outperform the gap-only-based criterion.

  3. beta-DPO filtering as baseline.

Thank you for the suggestion. We acknowledge that beta-DPO also employs gap-based filtering, which we cite in our paper. However, there are three key differences between their approach and ours:

(a). Online vs. Offline Filtering: Beta-DPO's filtering is online, meaning they filter out data in every batch, whereas our approach filters data offline. This offline filtering enables more flexible and efficient generation pipelines, particularly for weak-to-strong generation scenarios. For instance, finetuning Llama3.3-70B-Instruct on prompts RIP-filtered by a smaller Llama3.1-8B model outperformed training without filtering (Alpaca LC-winrate improved from 54.3 to 64.5, Arena-Hard from 70.5 to 76.7).

(b). Gap Size Thresholds: Unlike beta-DPO, which removes both small and large gaps, our method removes bigger gaps only.

(c). Probabilistic vs. Deterministic Filtering: Beta-DPO's filtering is probabilistic, resulting in incomplete data removal, whereas our approach uses deterministic filtering to ensure thorough removal of unwanted data.
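To illustrate points (a), (b) and (c), the RIP-style gap criterion can be applied as a single deterministic pass over the dataset before training; the sketch below uses hypothetical field names (beta-DPO, as described above, instead removes both small and large gaps probabilistically within each training batch):

```python
# Sketch: offline, deterministic gap filtering over a preference dataset.
# Field names ('chosen_reward', 'rejected_reward') are hypothetical placeholders.
def offline_gap_filter(preference_pairs, max_gap):
    """Keep only pairs whose reward gap is at most max_gap; larger gaps are dropped."""
    return [ex for ex in preference_pairs
            if ex["chosen_reward"] - ex["rejected_reward"] <= max_gap]

# The filtered dataset is fixed once, so any trainer (or a stronger model in a
# weak-to-strong setup) can reuse it without re-filtering per batch.
```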

Given these differences, and as previously mentioned, our method prioritizes the Rejected Reward and Rejected Length criteria over gap-based filtering, which have demonstrated superior effectiveness in our experiments. Consequently, when submitting our draft, we did not include a direct comparison with beta-DPO's filtering results. However, we appreciate your suggestion and have since conducted additional experiments to evaluate beta-DPO filtering in our experimental setting:

| Filtering | BetaDPO mode_weight | # training samples | Reward on valid set | AlpacaEval LC-winrate | AlpacaEval winrate |
| --- | --- | --- | --- | --- | --- |
| No filtering | - | 19803 | 0.18305 | 48.37 | 45.87 |
| RIP filtering | - | 4538 | 0.18983 | 57.83 | 57.16 |
| BetaDPO filter | 0.2 | 15842 | 0.18417 | 49.15 | 49.00 |
| BetaDPO filter | 0.5 | 9901 | 0.18399 | 46.68 | 42.41 |
| BetaDPO filter | 0.75 | 4950 | 0.18265 | 45.97 | 40.58 |

Thank you for diligently reviewing our work. We hope that we have thoroughly addressed all of your questions and concerns. Furthermore, we conducted additional experiments to strengthen our paper, including evaluating on extra reasoning tasks, expanding our model suite beyond Llama by finetuning a Gemma-based model and using two other reward models (GRM-Gemma2-2B-RewardModel-FT and INF-ORM-Llama3.1-70B), and applying RIP with SimPO.

Reviewer Comment

I thank the authors for pointing my attention towards Tables 19 and 20 which I had missed in the appendix during my initial review. I also appreciate the effort put into the comparison against BetaDPO.

My conclusion based on the rebuttal is that the reward gap criterion is the weakest of the 3 criteria RIP employs, and very similar results would be achieved if RIP did not use the reward gap. While the empirical benefits of the rejected reward and rejected length criteria are promising, my personal take is that the technical significance of the two criteria is not enough for the conference. As such, I am unfortunately inclined to maintain my current score despite the great effort put into all the experiments.

Author Comment

Thank you for your feedback on our submission. We appreciate the time and effort you took to review our work and provide comments.

  1. very similar results would be achieved if RIP did not use the reward gap

| # of Training Samples | Alpaca LC Winrate | Arena Hard | Wildbench |
| --- | --- | --- | --- |
| 19803 | 48.37 | 37.9 | 41.5 |
| 6762 | 57.07 | 42.3 | 45.5 |
| 4538 | 57.83 | 43.1 | 45.6 |

When comparing performance, we should also consider the number of training samples. As we can see from our results, adding gap filtering reduces the training data by around 32% (from 6762 to 4538 samples) while achieving slightly better performance. This is a quite successful metric for filtering.

  2. Technical significance of the two criteria is not enough for the conference

Our work introduces three filtering metrics that have not been previously explored in the literature. Notably, our method demonstrates robust performance when filtering for LLaMA-based and Gemma-based models, even with varying reward models, resulting in substantial improvements.

It is essential to recognize the importance of these advancements, as they should not be underestimated. In fact, our gap filtering alone outperforms beta-DPO filtering, highlighting its effectiveness.

To validate our claims, we conducted extensive experiments that thoroughly evaluate our approach. The results demonstrate the superiority of our proposed criteria over existing methods, providing detailed comparisons that underscore the impact of our work.

Best regards

Final Decision

My decision is to accept the paper.

The paper introduces a prompt filtering protocol, RIP, for prompts used in tuning models via preference optimization. The filtering protocol filters prompts based on rejected response reward, rejected response length, and the reward gap between preferred and rejected responses. The authors perform a wide range of empirical experiments, comparing tuned model performance across different filtering methods. Finally, the authors examine some potential explanations for the success of RIP.

There was some disagreement among reviewers about the overall score for this paper. Given that RIP uses 3 filtering criteria, and that the reward gap criterion had been proposed before, there was some concern about ablations and novelty. The authors noted that the novel criteria presented in this paper seemed to contribute more to downstream performance, and expanded an existing ablation study to have a more interpretable scale (win rate). There were also questions about what signal the rejected reward and length criteria were picking up on, and the authors pointed to some qualitative analysis of filtered prompts based on these criteria. The authors also expanded experiments to reasoning/knowledge benchmarks, Gemma models, and different reward models and found similar results.

Overall, I agreed with the reviewers who supported accepting the paper.