Progressively Label Enhancement for Large Language Model Alignment
Abstract
Reviews and Discussion
The paper proposes PLE (Progressively Label Enhancement) for LLM alignment, which can make use of all model-generated responses. The proposed algorithm learns the contrast between the principle-guided response and the original response through a ranking loss when the reward difference between the two exceeds a threshold, and learns from both responses through a weighted SFT loss when the reward difference is smaller. Experiments are conducted on three tasks: multi-turn dialogue, controlled text generation, and summarization. The results show that PLE can effectively align LLMs with human preferences.
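For concreteness, my reading of the per-example training rule can be sketched as follows; the names (`ple_step_loss`, `tau`, the softmax weighting) are illustrative placeholders, not the authors' actual implementation:

```python
import torch

def ple_step_loss(logp_principle: torch.Tensor,
                  logp_original: torch.Tensor,
                  r_principle: float,
                  r_original: float,
                  tau: float) -> torch.Tensor:
    """Per-example PLE loss as I understand it: a ranking loss when the reward
    gap is large, otherwise a reward-weighted SFT loss over both responses.
    logp_* are log-likelihoods of the two responses under the current policy;
    tau is the (dynamic) threshold."""
    gap = r_principle - r_original
    if gap > tau:
        # Contrastive term: push the principle-guided response above the
        # original one (hinge on the log-likelihood margin).
        return torch.clamp(logp_original - logp_principle, min=0.0)
    # Small gap: learn from both responses, weighted by normalized rewards.
    w = torch.softmax(torch.tensor([r_principle, r_original]), dim=0)
    return -(w[0] * logp_principle + w[1] * logp_original)

# Toy usage with dummy log-probabilities.
loss = ple_step_loss(torch.tensor(-12.3, requires_grad=True),
                     torch.tensor(-10.1, requires_grad=True),
                     r_principle=0.8, r_original=0.2, tau=0.3)
loss.backward()
```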
Questions for Authors
Please see other parts.
Claims and Evidence
Please see other parts.
Methods and Evaluation Criteria
The proposed method makes sense to me.
Theoretical Claims
Yes, I checked the proofs in the paper.
Experimental Design and Analysis
(1) If I understand correctly, the paper seems to assume that the optimal policy $\pi^*$ is the principle-guided policy (the base model prompted with the principles). If that's the case, then why bother designing such a complicated algorithm instead of just doing context distillation through SFT (basically, use the principle-guided response as the target while providing only the original query)? If you want to claim that PLE has better sample efficiency or better performance, then context distillation should be the SFT baseline, which makes much more sense compared to using offline data for SFT.
(2) According to the description of the experimental setup, it is not clear what reward model is used during training. Do training and evaluation use the same reward model?
(3) Using BLEU and PPL as evaluation metrics for model alignment does not make sense to me either. Those two metrics focus on surface-level similarity to the references, which is not the goal of alignment. In addition, the references are already quite outdated; SOTA aligned models are expected to differ substantially from them.
(4) The ablation studies show that both the ranking loss and the weighted SFT loss are required to achieve optimal performance, highlighting the importance of making use of all model-generated responses. However, this conclusion probably rests on keeping the other hyperparameters fixed. One could imagine that tuning the threshold hyperparameter would also make the performance of ranking-loss-only or weighted-SFT-only variants fluctuate, so each single-loss variant should have its own sweet spot. It is necessary to check whether the performance at those points is still worse than that of the PLE-trained model.
(5) The case study is quite cursory, showing only a few examples without more quantitative methods to demonstrate the helpfulness and harmlessness of model-generated responses at a larger scale.
(6) It is a bit hard to put the results of this paper in context. The trained models are evaluated with customized metrics rather than on currently popular alignment benchmarks such as AlpacaEval or Arena-Hard, which have standard evaluation protocols.
Supplementary Material
Yes, I have read the Appendix.
Relation to Prior Literature
The paper explores a new algorithm for aligning LLMs that makes full use of model-generated responses. The idea of using all of the data stands somewhat in contrast to the line of work exploring "less is more for alignment".
Missing Important References
I think the paper is missing some literature and important comparisons to context distillation methods such as [1].
[1] Learning by Distilling Context
Other Strengths and Weaknesses
Please see other parts.
Other Comments or Suggestions
Please see other parts.
Thank you for taking the time to review the paper and providing valuable feedback. We appreciate your efforts in ensuring the quality of the research. Regarding your concerns, we would like to provide the following explanations:
If I understand correctly, the paper seems to assume that the optimal policy $\pi^*$ is the principle-guided policy (the base model prompted with the principles). If that's the case, then why bother designing such a complicated algorithm instead of just doing context distillation through SFT (basically, use the principle-guided response as the target while providing only the original query)? If you want to claim that PLE has better sample efficiency or better performance, then context distillation should be the SFT baseline, which makes much more sense compared to using offline data for SFT.
Thank you for your thoughtful question. We would like to clarify that the optimal policy $\pi^*$ and the principle-guided policy are fundamentally different in our framework. As defined in Equation (2), $\pi^*$ represents the optimal policy that maximizes expected reward. In contrast, the principle-guided policy is only a component used within the PLE algorithm to generate a reference response under principle-guided prompting.
Our goal is not to imitate the principle-guided response directly via context distillation, but rather to progressively guide the base model toward $\pi^*$ by comparing its outputs to those generated under the principle prompt, and selectively applying ranking or weighted learning based on reward differences. As we show in Theorem 5.3, this progressive strategy ensures that the trained policy converges toward $\pi^*$ under certain conditions.
According to the description of the experimental setup, it is not clear what reward model is used during training. Do training and evaluation use the same reward model?
For the HH dataset, we used RM-Gemma-2B as the reward model during both training and evaluation. For the IMDb dataset, we trained a sentiment classifier using the 0/1 labels provided in the dataset; the reward score is defined as the logit of the positive class predicted by this classifier. For the TL;DR dataset, we trained a reward model on the preference pairs from the tldr-preference-trl-style dataset, which was then used consistently during both training and evaluation. In all cases, the same reward model is used throughout training and evaluation to avoid distribution mismatch.
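For illustration, the IMDb reward described above can be computed along these lines; the checkpoint path is a placeholder and this is a simplified sketch rather than our exact training code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint: any binary sentiment classifier fine-tuned on the
# IMDb 0/1 labels would play this role.
CKPT = "path/to/imdb-sentiment-classifier"

tokenizer = AutoTokenizer.from_pretrained(CKPT)
classifier = AutoModelForSequenceClassification.from_pretrained(CKPT)
classifier.eval()

@torch.no_grad()
def imdb_reward(text: str) -> float:
    """Reward = logit of the positive class (index 1) for the generated text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    logits = classifier(**inputs).logits  # shape: (1, 2)
    return logits[0, 1].item()
```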
Using BLEU and PPL as evaluation metrics for model alignment does not make sense to me either. Those two metrics focus on surface-level similarity to the references, which is not the goal of alignment. In addition, the references are already quite outdated; SOTA aligned models are expected to differ substantially from them.
We agree that BLEU and perplexity (PPL), while widely used, do not fully capture the goals of alignment. However, our evaluation does not rely solely on BLEU and PPL. We additionally report:
• Reward model scores to directly reflect alignment with learned human preference signals,
• Human evaluation based on qualitative assessments of response helpfulness and harmlessness,
• Evaluations using a strong LLM (Claude API) to provide an automated yet high-quality comparative judgment.
These complementary metrics together offer a more comprehensive view of alignment performance. As shown in our results (Tables 1–3 and Figure 2), our method consistently outperforms baselines across these diverse evaluation settings.
Due to space limitations, other responses can be found at the anonymous link https://anonymous.4open.science/r/ICML_rebuttal_PLE-1F6E/responses_2.md
Thanks for your efforts; I have increased my score.
Thank you for your valuable feedback and for reconsidering your evaluation. We appreciate your time and thoughtful suggestions.
The paper introduces a novel framework named PLE that addresses inefficiencies in aligning large language models (LLMs) with human preferences. Current methods like RLHF face stability and scalability challenges, while alternative approaches rely heavily on large high-quality datasets and treat data generation and model training as decoupled processes, leading to suboptimal data utilization. PLE dynamically integrates these phases by generating both principle-guided and original responses to each query and employing a reward-based dynamic threshold to select the training strategy: it uses a ranking loss to prioritize high-margin improvements when the reward difference exceeds the threshold, and a weighted loss to incorporate both responses proportionally when the difference is small. Theoretical analysis demonstrates PLE's convergence to an optimal policy with bounded error rates.
Questions for Authors
- Equation (2) defines the optimal policy as $\pi^{*}=\arg\max_{\pi}\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi(\cdot\mid x)}[r(x,y)]$, omitting the KL divergence regularization term commonly used in RLHF (e.g., $\beta\,\mathbb{D}_{\mathrm{KL}}(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x))$) to prevent excessive deviation from the initial policy. Could the authors clarify why KL regularization is absent in their theoretical framework?
- In Table 1 (HH dataset), the PPO-aligned model shows higher perplexity and lower reward scores compared to SFT. What might explain this? A discussion on the failure modes of PPO in specific tasks (e.g., multi-turn dialogue) would strengthen the paper’s critique of existing methods.
Claims and Evidence
The authors’ claim that existing methods inefficiently utilize generated data by treating training and data generation as static, separate processes is well-supported. The dynamic threshold mechanism, which adaptively selects training strategies based on reward differences between principle-guided and original responses, directly addresses this inefficiency by incorporating both high- and mid/low-quality data into training—evidenced by improved reward scores and human evaluations across tasks. Theoretical guarantees (Lemma 5.2, Theorem 5.3) further validate that progressive threshold updates bound approximation errors and ensure convergence, aligning with the core motivation.
Methods and Evaluation Criteria
The experimental design and evaluation methodology in this paper are comprehensive and well-justified. The authors validate PLE across three distinct tasks—multi-turn dialogue (HH dataset), controlled generation (IMDb), and summarization (TL;DR)—using both large-scale (LLaMA-8B, Qwen-7B) and smaller (GPT-2) models, and employ both objective metrics (PPL and reward model scores) and subjective assessments (human and model evaluations) for comparison.
Theoretical Claims
I checked the theoretical claims in the main text (Lemma 5.2 and Theorem 5.3), which are logically coherent. Lemma 5.2 establishes that the dynamic thresholding strategy progressively expands the "pure level set" by iteratively tightening the threshold, while Theorem 5.3 builds on this to bound the approximation error between the learned policy and the optimal policy, ensuring convergence.
Experimental Design and Analysis
The experimental design is generally sound: the authors evaluate PLE across multiple tasks (dialogue, text generation, summarization) using diverse models (LLaMA-8B, Qwen-7B, GPT-2) and metrics (PPL, reward scores, BLEU, human/API evaluations), demonstrating broad applicability. Ablation studies (Tables 4–5) confirm the necessity of both ranking and weighted losses, and training curves (Figure 3) empirically validate progressive improvement.
Supplementary Material
I reviewed all the supplementary material.
Relation to Prior Literature
The paper’s key contributions build on prior alignment methods like Self-Instruct [1] (using self-generated principles to guide responses) and RLHF [2] (reward models for preference learning) but introduce a novel dynamic threshold mechanism to address inefficiencies in data utilization.
[1] Self-instruct: Aligning language models with self-generated instructions
[2] Training language models to follow instructions with human feedback
Missing Important References
I think all related work has already been mentioned.
Other Strengths and Weaknesses
Strengths:
- The idea of coupling data generation and training via dynamic thresholds to mitigate the problem of low data utilization is interesting.
- The convergence proofs strengthen the method’s credibility.
Weaknesses:
- The proofs lack intuitive explanations for key parameters, leaving their practical impact unclear.
- Figure 1’s current design fails to clearly illustrate the interplay between principle-guided responses, thresholding, and training phases; a redesigned figure with step-wise visuals would improve understanding.
Other Comments or Suggestions
No other comments or suggestions; see above.
Thank you for taking the time to review the paper and providing valuable feedback. We appreciate your efforts in ensuring the quality of the research. Regarding your concerns, we would like to provide the following explanations:
The proofs lack intuitive explanations for key parameters, leaving their practical impact unclear.
Thank you for your valuable feedback. We would like to clarify that all key parameters and symbols used in the theoretical analysis are formally defined and explained in Sections 4 and 5 of the paper. If there are any specific parameters whose meaning remains unclear, we would be happy to provide further clarification or improve the exposition accordingly.
Figure 1’s current design fails to clearly illustrate the interplay between principle-guided responses, thresholding, and training phases; a redesigned figure with step-wise visuals would improve understanding.
Thank you for your helpful suggestion. We will improve the framework diagram in the next version to provide a clearer representation of the overall process.
Equation (2) defines the optimal policy as $\pi^{*}=\arg\max_{\pi}\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi(\cdot\mid x)}[r(x,y)]$, omitting the KL divergence regularization term commonly used in RLHF (e.g., $\beta\,\mathbb{D}_{\mathrm{KL}}(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x))$) to prevent excessive deviation from the initial policy. Could the authors clarify why KL regularization is absent in their theoretical framework?
Thank you for pointing this out. Equation (2) presents a general formulation of the alignment objective, which captures the goal of maximizing expected reward. Common approaches such as PPO introduce a KL divergence regularization term to prevent the learned policy from deviating too far from the reference policy. Such regularized objectives can still be encompassed within our formulation by considering the reward function to implicitly include regularization terms. This perspective has also been adopted in prior work [1].
[1] Fundamental Limitations of Alignment in Large Language Models, ICML 2024
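Concretely, using standard notation (our paper's symbols may differ slightly), the KL-regularized objective can be folded into an augmented reward as follows, which is the sense in which Equation (2) subsumes it:

$$
\mathbb{E}_{x\sim\mathcal{D}}\!\Big[\mathbb{E}_{y\sim\pi(\cdot\mid x)}[r(x,y)]-\beta\,\mathbb{D}_{\mathrm{KL}}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\Big]
=\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi(\cdot\mid x)}\big[\tilde{r}(x,y)\big],
\qquad
\tilde{r}(x,y)=r(x,y)-\beta\log\frac{\pi(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}.
$$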
In Table 1 (HH dataset), the PPO-aligned model shows higher perplexity and lower reward scores compared to SFT. What might explain this? A discussion on the failure modes of PPO in specific tasks (e.g., multi-turn dialogue) would strengthen the paper’s critique of existing methods.
To ensure a fair comparison across methods, we adopted an offline training setup for PPO, using the same set of queries from the HH dataset. This constraint limits PPO’s ability to fully explore and optimize responses beyond the fixed dataset, which may explain the relatively modest performance gains compared to SFT. Similar observations have been reported in prior work [1].
[1] KTO: Model Alignment as Prospect Theoretic Optimization, ICML 2024
We hope that our revisions have addressed all of your concerns, but please let us know if there is anything else we can do to improve the manuscript. We would be happy to answer any additional questions or provide any further information you may need.
Thank you for the clarifications. My concerns have been addressed. After considering other reviewers' feedback, I will maintain my positive recommendation.
Thank you for taking the time to review our responses. We sincerely appreciate your thoughtful feedback and consideration.
The authors propose a novel framework that couples data generation and model training to address the inefficient utilization of generated data. The authors provide a theoretical proof that, with the progressively updated threshold strategy, the approach can bound the error rate between the trained model and the optimal model, ensuring convergence within a controlled range. The authors use the LLaMA3-8B base model and the Qwen2.5-7B model for tests on the HH dataset, and the GPT-2 model for tests on the IMDb and TL;DR datasets. The authors compare the proposed approach with SFT, DPO, PPO, and RAFT, showing consistent improvements across tasks.
Update after rebuttal
I think this is a solid paper and keep my score at 4.
Questions for Authors
N/A
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
Yes, I tried to check Sec. 5 Theoretical Analysis, although it is possible I missed some details.
Experimental Design and Analysis
The authors compare the proposed method with various alternative tuning methods on three popular transformer models and show consistent improvements.
Supplementary Material
No
Relation to Prior Literature
The authors propose a novel method to train models with reward signals.
Missing Important References
N/A
Other Strengths and Weaknesses
The paper is clear and well-written. Experiments are performed with a variety of models on different datasets and are convincing.
Other Comments or Suggestions
N/A
Thank you for taking the time to review the paper and providing valuable feedback. We appreciate your efforts in ensuring the quality of the research. We would be happy to answer any additional questions or provide any further information you may need.
The paper introduces Progressively Label Enhancement (PLE) for LLM Alignment, a framework designed to improve the alignment of Large Language Models (LLMs) with human expectations, addressing ethical and legal concerns. PLE tackles these issues by dynamically adjusting the model's training process based on the quality of generated data.
Questions for Authors
Refer to the above points.
Claims and Evidence
Yes.
Methods and Evaluation Criteria
- The proposed approach PLE shares considerable similarity with prior alignment methods, raising concerns about novelty. In particular, RAFT (Dong et al., 2023) already “expands the SFT dataset by generating additional samples and selecting those with high reward scores”. PLE’s core idea – generate model responses and utilize reward scores to decide how to train on them – is a direct generalization of RAFT. The main difference is that instead of discarding low-scoring samples entirely (as RAFT does), PLE continues to use them with reduced weight or as negative examples. This is a fairly incremental change: it addresses the data utilization inefficiency noted in RAFT but does not introduce a fundamentally new alignment paradigm. In essence, PLE combines RAFT’s data generation with a ranking loss akin to RRHF (Yuan et al., 2023), which “encourages the model to generate preferred responses with higher probability and poor responses with lower probability” by integrating a preference-based regularization. PLE’s ranking loss, applied when the reward difference exceeds the threshold, serves the same role as RRHF’s regularizer (see the sketch after this list), and its weighted fine-tuning when the difference is small is reminiscent of label smoothing techniques.
- The idea of using a set of principles to guide a model’s own generated responses for further fine-tuning has been explored in recent “self-alignment” or “constitutional AI” approaches (e.g., Sun et al., 2023). The authors do cite Sun et al. (2023) as motivation for designing the principle prompt, but PLE does not substantially go beyond those ideas – it uses principles in a straightforward way (simply prepending a fixed set of rules to the query during generation). Other works (like Constitutional AI by Bai et al., 2022b) used principles to generate feedback or to iteratively refine outputs, which is arguably a more novel use of model-generated data. PLE’s use of a principle-guided prompt is comparatively basic and could be seen as a minor variation on these existing alignment techniques.
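For reference, the RRHF-style ranking term I have in mind looks roughly like the following; this is a minimal sketch over length-normalized log-probabilities, with hypothetical variable names rather than anyone's actual code:

```python
import torch

def rrhf_ranking_loss(logp: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking term: for every pair where response i is scored below
    response j by the reward model, penalize the policy if it assigns i a
    higher (length-normalized) log-probability than j. `logp` and `rewards`
    are 1-D tensors of the same length."""
    loss = logp.new_zeros(())
    n = logp.shape[0]
    for i in range(n):
        for j in range(n):
            if rewards[i] < rewards[j]:
                loss = loss + torch.clamp(logp[i] - logp[j], min=0.0)
    return loss

# Toy usage: two candidate responses scored by a reward model.
logp = torch.tensor([-1.2, -0.8], requires_grad=True)  # per-token avg log-probs
rewards = torch.tensor([0.1, 0.9])
rrhf_ranking_loss(logp, rewards).backward()
```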
Theoretical Claims
While a theoretical convergence analysis is provided, it relies on strong assumptions about the relationship between model probabilities and reward rankings. For instance, Eq. (8) introduces an assumption that bounds the model’s error by a term related to ranking inconsistencies with respect to the reward model. This is a non-trivial assumption – essentially presuming the reward model’s ranking is a reliable guide to the optimal policy. If this assumption does not hold (e.g., the reward model is imperfect or there are multiple optimal responses), the convergence proof may not apply. Moreover, the paper only sketches the proof idea (due to space), so a reader cannot fully verify the claims. In summary, the methodology section leaves some gaps in understanding (reward model setup, training dynamics, hyperparameter choice) that could undermine confidence in the approach’s soundness.
Experimental Design and Analysis
- Missing Baseline Comparisons: While the paper evaluates PLE against several state-of-the-art baselines (SFT, PPO, DPO, RAFT) on multiple datasets, it notably omits some relevant comparisons. For example, RRHF (which was discussed in related work) is not included in the experiments. Since PLE’s loss essentially incorporates an RRHF-like term, it would be important to see a direct comparison to the RRHF method. Additionally, other “principle-driven” alignment methods (like Constitutional AI or self-critiquing approaches) are not quantitatively compared. The absence of these baselines leaves a gap in demonstrating PLE’s superiority. A reviewer might question whether PLE’s gains are simply due to using more data (since it utilizes all generated samples) rather than the particular strategy, something a comparison with a simpler “use all data” baseline could clarify.
- A potential concern is that PLE is tuned to these specific setups and might not generalize broadly. For instance, how would PLE perform on aligning a model to code and math reasoning?
Supplementary Material
Yes.
Relation to Prior Literature
The topic is important.
Missing Important References
The paper generally does a good job citing relevant literature. It references the key prior works in large-language-model alignment, including RLHF methods, instruction-following tuning, and recent alternatives like DPO, RAFT, LIMA, and RRHF. We did not identify major foundational papers that were omitted. This suggests the authors are well-versed in the literature. Moreover, when introducing concepts like “self-align” principles or “label enhancement,” they cite sources (e.g., Sun et al., 2023 for self-alignment, and Xu et al., 2021 for label enhancement), which helps place their contributions in context.
Other Strengths and Weaknesses
Refer to the above points.
Other Comments or Suggestions
Refer to the above points.
Thank you for taking the time to review the paper and providing valuable feedback. We appreciate your efforts in ensuring the quality of the research. Regarding your concerns, we would like to provide the following explanations:
The proposed approach PLE shares considerable similarity with prior alignment methods, raising concerns about novelty. In particular, RAFT (Dong et al., 2023) already “expands the SFT dataset by generating additional samples and selecting those with high reward scores”. PLE’s core idea – generate model responses and utilize reward scores to decide how to train on them – is a direct generalization of RAFT. The main difference is that instead of discarding low-scoring samples entirely (as RAFT does), PLE continues to use them with reduced weight or as negative examples. This is a fairly incremental change: it addresses the data utilization inefficiency noted in RAFT but does not introduce a fundamentally new alignment paradigm. In essence, PLE combines RAFT’s data generation with a ranking loss akin to RRHF (Yuan et al., 2023), which “encourages the model to generate preferred responses with higher probability and poor responses with lower probability” by integrating a preference-based regularization. PLE’s ranking loss, applied when the reward difference exceeds the threshold, serves the same role as RRHF’s regularizer, and its weighted fine-tuning when the difference is small is reminiscent of label smoothing techniques.
We would like to emphasize that the core motivation and overall framework of PLE are fundamentally different. As stated in our paper, many existing methods treat data generation and model training as separate, static processes, which leads to suboptimal use of generated data. In contrast, PLE explicitly couples these two stages through a dynamic and adaptive training strategy, where the model’s behavior and reward feedback guide the evolving use of generated responses. This synergy between data generation and adaptive training is central to PLE’s novelty. Therefore, rather than being an incremental extension of RAFT or a hybrid with RRHF, PLE proposes a new unified framework that systematically addresses the inefficiency in prior alignment approaches.
The idea of using a set of principles to guide a model’s own generated responses for further fine-tuning has been explored in recent “self-alignment” or “constitutional AI” approaches (e.g., Sun et al., 2023). The authors do cite Sun et al. (2023) as motivation for designing the principle prompt, but PLE does not substantially go beyond those ideas – it uses principles in a straightforward way (simply prepending a fixed set of rules to the query during generation). Other works (like Constitutional AI by Bai et al., 2022b) used principles to generate feedback or to iteratively refine outputs, which is arguably a more novel use of model-generated data. PLE’s use of a principle-guided prompt is comparatively basic and could be seen as a minor variation on these existing alignment techniques.
As we mentioned above, the core contribution of our work lies not in the specific design of the principle-guided prompt itself, but in the overall framework that couples data generation and model training in a dynamic and synergistic manner.
The principle-guided prompt serves as one component within this framework—its role is to produce alternative responses that reflect desirable alignment properties, which are then evaluated and integrated into the training pipeline using our adaptive strategy. While our current implementation adopts a simple prompting mechanism, we do not claim novelty in the prompt design itself.
In fact, one of the strengths of our method is that it is modular and compatible with more sophisticated principle-guided strategies, such as feedback-based refinement or multi-turn rewriting used in works like Constitutional AI. We chose a straightforward setup to demonstrate the effectiveness of our approach even under minimal settings. We appreciate the suggestion and will explore integrating more advanced principle-guided generation techniques in future work to further enhance the overall performance.
Due to space limitations, other responses can be found at the anonymous link https://anonymous.4open.science/r/ICML_rebuttal_PLE-1F6E/responses_1.md
My concern has been addressed; I will increase my score.
The paper proposes a method, PLE (Progressively Label Enhancement), for language model alignment. PLE is motivated by the intuition that model training should be considered jointly with data generation. Specifically, the authors design a set of principles to guide the model's decoding process in accordance with human expectations, and model training leverages the generated data based on normalized reward scores. The reviewers overall found the method to be interesting and reasonable, though they also raised concerns that the empirical evaluations diverge from standard alignment setups (e.g., comparisons with stronger alignment baselines and popular benchmarks like AlpacaEval and Arena-Hard). These concerns were partially addressed in the rebuttal, leading to increased scores. We recommend that the authors incorporate these results in the revision.