PaperHub
Overall rating: 7.5/10 · Spotlight · 4 reviewers
Individual ratings: 6, 8, 8, 8 (min 6, max 8, std dev 0.9)
Confidence: 3.3 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 2.8
ICLR 2025

Advantage-Guided Distillation for Preference Alignment in Small Language Models

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-03-02

Abstract

Keywords
Preference Alignment; Large language model; Knowledge Distillation; Advantage Function

Reviews and Discussion

Review (Rating: 6)

This paper explores knowledge distillation (KD) for LLMs during the preference alignment stage. It first introduces a simple baseline, DCKD, which applies vanilla KLD-based KD to both the positive and negative examples. The paper then proposes ADPA to enhance the contrastive signals of KD. Experiments show the effectiveness of ADPA and its components, suggesting the importance of leveraging preference signals for KD.

Strengths

  1. How to leverage the preference signals for KD is an important but under-explored problem.
  2. The methods are concise and easy to implement.
  3. The empirical results of ADPA are strong.

Weaknesses

  1. Although ADPA seems effective empirically, it is still unclear how the improvement is related to the motivation of the method. From Section 3.3 and Algorithm 1, it seems ADPA does not need the preference labels: only the prompts and the ground-truth responses are needed for ADPA. How is ADPA related to preference alignment?
  2. From Table 2, it seems that the reference teacher model is critical for the effectiveness of ADPA. It would be better to add more explanation on why the difference between $\pi_{\text{dpo}}$ and $\pi_{\text{ref}}$ should be considered, rather than $\pi_{\text{dpo}}$ alone. Furthermore, what if $\pi_{\text{dpo}}$ is removed from Equation (11)? (to show the effect of $-\log \pi_{\text{sft}}$ alone)

Questions

See Weaknesses.

Comment

Thank you for your thoughtful review and for recognizing the importance of our contribution to leveraging preference signals in knowledge distillation.

W1: Unclear how the improvement is related to the motivation of the method; ADPA seems not to need preference labels during training; how is ADPA related to preference alignment?

Response:
Thank you for the insightful question. While ADPA does not directly use preference labels during the student model's training phase, it crucially relies on the advantage function
$$A_{\text{dpo}}(s_t, a_t) = \beta \left( \log \pi_{\text{dpo}}(a_t \mid s_t) - \log \pi_{\text{ref}}(a_t \mid s_t) \right),$$
which requires log probabilities from both the DPO teacher and the reference teacher.

The DPO teacher is trained using preference data, and thus, the advantage function inherently captures the preference information learned by the teacher. By distilling this advantage function, the student model learns to prioritize actions that are preferred according to the teacher's alignment with human preferences.

In essence, ADPA transfers the preference alignment from the teacher to the student, even though the student does not directly use the preference labels during its own training. We have clarified this relationship in the revised paper, emphasizing how the advantage function serves as a channel for preference information.
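For concreteness, the distribution-level advantage signal described above could be computed roughly as in the following PyTorch sketch. This is a minimal illustration based on the formula in the response, not an excerpt from the paper's implementation; the function names and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def advantage_from_teachers(dpo_logits, ref_logits, beta=0.1):
    # A_dpo(s_t, a) = beta * (log pi_dpo(a|s_t) - log pi_ref(a|s_t)),
    # evaluated over the full vocabulary at every time step.
    log_p_dpo = F.log_softmax(dpo_logits, dim=-1)  # [batch, seq, vocab]
    log_p_ref = F.log_softmax(ref_logits, dim=-1)  # [batch, seq, vocab]
    return beta * (log_p_dpo - log_p_ref)

def expected_advantage_under_student(student_logits, dpo_logits, ref_logits, beta=0.1):
    # E_{a ~ pi_stu(.|s_t)}[A_dpo(s_t, a)]: maximizing this pushes the student
    # toward actions the DPO teacher prefers over the reference teacher.
    advantage = advantage_from_teachers(dpo_logits, ref_logits, beta)
    pi_stu = F.softmax(student_logits, dim=-1)     # [batch, seq, vocab]
    return (pi_stu * advantage).sum(dim=-1)        # [batch, seq]
```

In this sketch the preference information enters only through the gap between the two teachers' log-probabilities; the student itself never touches the preference labels.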


W2: Why consider the difference between $\pi_{\text{dpo}}$ and $\pi_{\text{ref}}$ instead of $\pi_{\text{dpo}}$ alone? What happens if $\pi_{\text{dpo}}$ is removed from Equation (11)?

Response:
Thank you for raising this important point and for providing us with the opportunity to explain our design choice in detail. The reference model $\pi_{\text{ref}}$ serves as a baseline to quantify the improvement in alignment achieved by the DPO teacher $\pi_{\text{dpo}}$. The advantage function $A_{\text{dpo}}(s_t, a_t)$ measures how much more preferable an action is compared to the reference policy. This relative measure is crucial because it highlights the specific adjustments made during preference alignment.

If we remove $\pi_{\text{ref}}$ and use $\pi_{\text{dpo}}$ alone, the training objective becomes minimizing the reverse cross-entropy between the student model and the DPO teacher. In that case, the student model would attempt to mimic the teacher's policy without understanding which aspects are aligned with human preferences. In our experiments, we observed that removing $\pi_{\text{ref}}$ from the advantage function leads to diminished performance, as shown in the ablation study (Table 2: ADPA w/o Ref teacher).

If we remove $\pi_{\text{dpo}}$ and use the $-\log \pi_{\text{ref}}$ term alone, the student model's objective becomes maximizing the reverse cross-entropy with the reference teacher. This encourages the student model to produce outputs that diverge as much as possible from the reference teacher's, actively pushing it away from the reference teacher without positive guidance on where to move. Without the positive guidance provided by $\pi_{\text{dpo}}$, the supervised fine-tuning loss $\mathcal{L}_{\text{SFT}}$ alone is insufficient to regularize training. Consequently, this setting is prone to instability or even collapse: the student model lacks direction on which actions are preferable and may deviate in unintended ways, leading to degraded performance. In practice, we found that student models trained in this setting often produce repetitive or meaningless content, so we did not report their performance on the test set.
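To make the contrast between these settings explicit, the LaTeX schematic below summarizes the three objectives discussed above. It is a simplified restatement that omits weighting coefficients and is not a verbatim copy of Equation (11).

```latex
% Full ADPA: expected advantage under the student, plus SFT regularization
\max_{\theta}\ \mathbb{E}_{a_t \sim \pi_{\theta}(\cdot \mid s_t)}
  \big[\log \pi_{\text{dpo}}(a_t \mid s_t) - \log \pi_{\text{ref}}(a_t \mid s_t)\big]
  - \mathcal{L}_{\text{SFT}}(\theta)

% w/o reference teacher: reduces to minimizing the reverse cross-entropy
% between the student and the DPO teacher
\max_{\theta}\ \mathbb{E}_{a_t \sim \pi_{\theta}(\cdot \mid s_t)}
  \big[\log \pi_{\text{dpo}}(a_t \mid s_t)\big] - \mathcal{L}_{\text{SFT}}(\theta)

% w/o DPO teacher: only a repulsive term away from the reference teacher remains,
% which provides no positive guidance and can destabilize training
\max_{\theta}\ \mathbb{E}_{a_t \sim \pi_{\theta}(\cdot \mid s_t)}
  \big[-\log \pi_{\text{ref}}(a_t \mid s_t)\big] - \mathcal{L}_{\text{SFT}}(\theta)
```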

Comment

Thank you for the response. I think the current rating is appropriate.

Review (Rating: 8)

This work combines the concepts of RLHF and knowledge distillation to propose an advantage-guided distillation method for preference alignment. This achieves impressive performance improvements for small models in the preference alignment stage.

Strengths

  1. The idea of improving the performance of small models in the preference alignment stage is interesting.
  2. Advantage-guided distillation from the preference-aligned teacher model to the student model is novel for knowledge distillation.
  3. The experiments are detailed and the presentation is good.

Weaknesses

This is a good work. The proposed method is simple yet effective. I do not have additional concerns for it.

Questions

  1. Why experiment on H2O-Danube3-500M and H2O-Danube2-1.8B-Base, given the availability of more popular models today, such as LLaMA-3.2-1B and Qwen2.5-0.5B? If experiments on more popular models can be provided, the results will be more solid.
Comment

We appreciate your encouraging remarks and are glad that you find our work interesting and effective.

Question: Experiments on more popular models.

Response:
Thank you for this valuable suggestion. In the revised paper, we have added experiments using LLaMA-3.2-1B as the student model and LLaMA-3.1-8B as the teacher model. We chose LLaMA-3.2-1B over other models like Qwen-0.5B due to its wider usage in the research community.

Below, we present the evaluation results for LLaMA-3.2-1B trained on DPO-MIX-7K and HelpSteer2. We can observe that our proposed ADPA+ surpasses all baseline methods, indicating that ADPA+ is a robust and generalizable approach for the preference alignment of small language models.

| Student | Dataset | Method | MT-Bench | AlpacaEval | Dataset | Method | MT-Bench | AlpacaEval |
|---|---|---|---|---|---|---|---|---|
| LLaMA-3.2-1B | DPO-MIX-7K | SFT | 3.34 | 35.7 | HelpSteer2 | SFT | 3.13 | 38.7 |
| LLaMA-3.2-1B | DPO-MIX-7K | DPO | 3.40 | 33.2 | HelpSteer2 | DPO | 3.38 | 39.3 |
| LLaMA-3.2-1B | DPO-MIX-7K | VanillaKD | 3.40 | 34.1 | HelpSteer2 | VanillaKD | 3.58 | 40.2 |
| LLaMA-3.2-1B | DPO-MIX-7K | SeqKD | 3.74 | 29.7 | HelpSteer2 | SeqKD | 3.44 | 44.4 |
| LLaMA-3.2-1B | DPO-MIX-7K | ATKD | 3.62 | 32.4 | HelpSteer2 | ATKD | 3.59 | 42.2 |
| LLaMA-3.2-1B | DPO-MIX-7K | PLAD | 3.42 | 29.3 | HelpSteer2 | PLAD | 3.36 | 37.8 |
| LLaMA-3.2-1B | DPO-MIX-7K | DDPO | 3.21 | 28.7 | HelpSteer2 | DDPO | 3.34 | 37.3 |
| LLaMA-3.2-1B | DPO-MIX-7K | DPKD | 3.29 | 28.9 | HelpSteer2 | DPKD | 3.10 | 36.5 |
| LLaMA-3.2-1B | DPO-MIX-7K | DCKD | 3.50 | 37.5 | HelpSteer2 | DCKD | 3.44 | 40.7 |
| LLaMA-3.2-1B | DPO-MIX-7K | ADPA | 3.88 | 50.0 | HelpSteer2 | ADPA | 3.62 | 50.0 |
| LLaMA-3.2-1B | DPO-MIX-7K | ADPA+ | 4.02 | 53.8 | HelpSteer2 | ADPA+ | 3.99 | 60.9 |
Comment

Thanks for your response. I will keep my rating for the paper.

Review (Rating: 8)

This paper studies the alignment of small language models. It points out that directly applying alignment techniques to small language models may not work well due to their limited capacity. To this end, the authors present 1) a straightforward approach, Dual-Constrained Knowledge Distillation (DCKD), which integrates both positive and negative signals, and 2) an enhanced approach, Advantage-Guided Distillation for Preference Alignment (ADPA), which involves a distribution-level reward signal given by an advantage function. Extensive experiments with three teacher-student model settings demonstrate the effectiveness of the proposed method.

优点

  1. This paper investigates the challenge of aligning small language models; the proposed methods are well-motivated and principled.

  2. The ADPA method considers the distribution-level reward signal, which is an advancement compared to the previous KD methods.

  3. The KD baselines are comprehensively compared, and the studied teacher-student settings are representative.

  4. This paper is well organized and the presentation is decent.

Weaknesses

  1. Limited baseline of alignment methods. In Table 1, while the KD baselines are comprehensively compared, the alignment baseline only includes DPO.

  2. Computation overhead of the KD methods. It is not clear whether the proposed methods (and other baseline KD methods) consume more training or inference resources compared to directly applying alignment techniques to the student models.

Questions

  1. I wonder if directly applying some advanced alignment methods to the smaller language models can present a competitive performance.

  2. Could you elaborate more on the computational cost of the proposed methods vs. directly applying alignment techniques?

Comment

Thank you for your positive assessment and constructive suggestions regarding our work.

W1: Limited baselines of alignment methods; only DPO is included as an alignment baseline.

Response:
We sincerely appreciate your insightful feedback regarding the selection of alignment baselines in our study. To clarify, our original submission included additional baselines beyond DPO. Specifically, we evaluated DPKD (utilizing the teacher model as the reference in DPO), DDPO (aligning implicit rewards derived from the student model with those from the teacher model), and PLAD (leveraging teacher-student generated pseudo-preference pairs). These methods investigate diverse dimensions of alignment through knowledge distillation or model guidance, and are closely related to the preference alignment objective of our work.

We fully agree that incorporating more recent preference alignment techniques can further strengthen the robustness of our comparative analysis. In response to your suggestion, we have extended our experiments to include the latest preference alignment methods, such as Simple Preference Optimization (SimPO) [1] and Weighted Preference Optimization (WPO) [2]. Our updated empirical results indicate that while these advanced preference alignment approaches yield moderate improvements in alignment metrics over the DPO baseline, our proposed ADPA framework consistently demonstrates more substantial performance gains.

| Policy Model | Dataset | Method | MT-Bench | AlpacaEval |
|---|---|---|---|---|
| Danube2-1.8B | DPO-MIX-7K | DPO | 3.87 | 30.6 |
| Danube2-1.8B | DPO-MIX-7K | SimPO | 3.97 | 16.8 |
| Danube2-1.8B | DPO-MIX-7K | WPO | 4.03 | 31.8 |
| Danube2-1.8B | DPO-MIX-7K | ADPA | 4.33 | 50.0 |
| Danube2-1.8B | DPO-MIX-7K | ADPA+ | 4.74 | 61.3 |
| Danube2-1.8B | HelpSteer2 | DPO | 3.87 | 40.7 |
| Danube2-1.8B | HelpSteer2 | SimPO | 3.95 | 58.4 |
| Danube2-1.8B | HelpSteer2 | WPO | 4.23 | 54.5 |
| Danube2-1.8B | HelpSteer2 | ADPA | 4.02 | 50.0 |
| Danube2-1.8B | HelpSteer2 | ADPA+ | 4.33 | 62.4 |

References:
[1] SimPO: Simple preference optimization with a reference-free reward. 2024. In NeurIPS.
[2] WPO: Enhancing RLHF with Weighted Preference Optimization. 2024. In EMNLP.


W2: Computational overhead of the KD methods compared to directly applying alignment techniques.

Response:
We appreciate the reviewer’s insightful comment regarding the computational overhead of KD-based alignment methods compared to directly applying alignment techniques.

To address this, we provide the following clarifications. First, while KD methods involve the use of additional teacher models during training, the associated computational cost can be mitigated by precomputing the required teacher outputs offline before training begins. Second, once the student model is trained, the inference stage operates without incurring any additional computational overhead. Third, importantly, the primary challenge in aligning small language models lies not in the computational cost during training, but in overcoming the inherent limitations in their capacity and capabilities. Therefore, KD has been widely used as a crucial technique to address these challenges. Notable examples include its application in models such as LLaMA3.2 [3] and Gemma2 [4].

References:
[3] LLaMA 3.2: Revolutionizing edge AI and vision with open, customizable models. Meta AI.
[4] Gemma 2: Improving Open Language Models at a Practical Size. Gemma Team, Google DeepMind.

Comment

Thank you for your response. I believe the current rating is appropriate.

Comment

Dear Reviewer 346S,

We sincerely appreciate the time and effort you have devoted to reviewing our work, especially considering your busy schedule. As the discussion phase between authors and reviewers draws to a close, we kindly request your attention to our responses. In light of your feedback, we have incorporated additional alignment baselines, including recent methods like SimPO and WPO, and found that our ADPA framework consistently outperforms them. Regarding your concern about computational overhead, we have clarified that our method involves minimal additional computations during training, which can be mitigated through offline precomputations, and there is no added cost during inference.

We hope that we have effectively addressed your concerns and would welcome any additional questions or points you might like to discuss. We highly value your feedback and look forward to further discussion.

Thank you for your thoughtful consideration.

Best regards,

The Authors

Review (Rating: 8)

The paper explores how to enhance the effectiveness of small language models, making their generated outputs more aligned with human preferences through preference alignment techniques. To address the issue that techniques like RLHF do not align small language models well with human preferences, the paper proposes two methods: Dual-Constrained Knowledge Distillation (DCKD) and Advantage-Guided Distillation for Preference Alignment (ADPA). The experimental results show that both methods can significantly improve the alignment of small language models and narrow the performance gap with large models.

Strengths

  • The two methods proposed in the paper combine RLHF with distillation, providing insights into the preference alignment of small language models and solving problems that previous methods had not addressed.
  • The experimental section is well-designed, with numerous comparative and ablation experiments to verify the performance improvements of the methods.
  • The paper is well-written, with detailed descriptions of the two methods, including algorithm steps and formulas, which are easy to understand.

Weaknesses

  • ADPA provides more nuanced guidance signals, but the additional computations introduced by fine-grained signals may increase the computational overhead, and the paper seems to lack specific quantitative metrics on this point.
  • The experimental section utilizes a rather singular evaluation dataset, without providing results on a broader range of models, domains, and data. Additionally, there is a lack of experiments assessing the impact of teachers of varying proficiency levels on the results.

Questions

  • In Formula (11), $L_s$ seems to be undefined. Does it refer to $\mathcal{L}_{\text{SFT}}$?
Comment

W2 (the latter part): Absence of experiments assessing the impact of teachers of varying proficiency levels.

Response: We appreciate the reviewer's insightful comment regarding the impact of teachers with varying proficiency levels on small (student) language models. To address this, we conducted additional experiments to directly assess this factor. Specifically, we compared the effects of two teacher models with distinct proficiency levels: H2O-Danube2-1.8B, representing a lower-proficiency teacher, and Mistral-7B, a higher-proficiency teacher previously used in our experiments. In both cases, the student model was H2O-Danube3-0.5B, and all experiments were performed on the DPO-MIX-7K dataset.

The empirical results are summarized in the following table. We observe that the teacher model's proficiency influences knowledge transfer results. Notably, higher-proficiency teacher models, such as Mistral-7B, consistently enable superior performance in student models through the distillation process. These results validate the critical role of teacher proficiency in knowledge transfer and will be included in the revised manuscript for clarity and completeness.

| Student | Teacher Model | KD Method | MT-Bench | AlpacaEval |
|---|---|---|---|---|
| H2O-Danube3-0.5B | H2O-Danube2-1.8B | VanillaKD | 2.55 | 37.8 |
| H2O-Danube3-0.5B | H2O-Danube2-1.8B | SeqKD | 2.61 | 36.6 |
| H2O-Danube3-0.5B | H2O-Danube2-1.8B | DCKD | 2.67 | 39.8 |
| H2O-Danube3-0.5B | H2O-Danube2-1.8B | ADPA | 2.71 | 50.0 |
| H2O-Danube3-0.5B | H2O-Danube2-1.8B | ADPA+ | 2.75 | 54.2 |
| H2O-Danube3-0.5B | Mistral-7B | VanillaKD | 2.60 | 39.5 |
| H2O-Danube3-0.5B | Mistral-7B | SeqKD | 2.53 | 39.3 |
| H2O-Danube3-0.5B | Mistral-7B | DCKD | 2.77 | 39.2 |
| H2O-Danube3-0.5B | Mistral-7B | ADPA | 2.67 | 50.0 |
| H2O-Danube3-0.5B | Mistral-7B | ADPA+ | 2.78 | 48.3 |

Q1: In Formula (11), $L_s$ seems to be undefined. Does it refer to $\mathcal{L}_{\text{SFT}}$?

Response: Thank you for pointing this out. You are correct that in Equation (11), $L_s$ refers to the supervised fine-tuning loss $\mathcal{L}_{\text{SFT}}$. We apologize for the oversight in notation. In the revised manuscript, we have thoroughly reviewed and corrected the notation to ensure all terms are clearly defined and consistent throughout. Specifically, we have replaced $L_s$ with $\mathcal{L}_{\text{SFT}}$ in Equation (11) and throughout the manuscript where applicable.

Comment

We appreciate your positive feedback and thoughtful review of our work.

W1: Computational overhead introduced by fine-grained signals.

Response: We appreciate the reviewer's concern regarding the computational overhead introduced by the fine-grained reward signals in our ADPA framework. We would like to clarify that the training process of ADPA does not significantly increase computational overhead, especially when compared to traditional knowledge distillation (KD).

Specifically, the fine-grained reward signal in ADPA requires calculating
$$\mathbb{E}_{a_t \sim \pi_{\text{stu}}(\cdot \mid s_t)} \left[ \log \pi_{\text{dpo}}(a_t \mid s_t) - \log \pi_{\text{ref}}(a_t \mid s_t) \right],$$
which is structurally similar to the KLD objective commonly employed in traditional KD: both require computing an expectation over the difference between two log-probability distributions. Although ADPA introduces an additional teacher, the associated computational cost is minimized by precomputing these values offline prior to training.

During our experiments training an H2O-Danube2-1.8B student model on the DPO-MIX-7K dataset using 8 RTX 3090 GPUs with a per-device batch size of 8 (global batch size of 128) and DeepSpeed ZeRO-2, ADPA resulted in only a slight increase in memory usage and training time compared to Vanilla KD: memory usage rose by less than 3% (from 21,957 MB to 22,541 MB), and training time increased by less than 4% (from 78 minutes to 81 minutes). These minor overheads are well within acceptable limits for practical applications.

Thus, the overall impact on computational resources remains minimal and well within acceptable bounds.
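As an illustration of the offline precomputation mentioned above, the sketch below caches a frozen teacher's token-level log-probabilities before training, so no extra teacher forward pass is needed inside the distillation loop. This is a hedged sketch assuming Hugging Face-style causal LMs and padded batches; in practice one would likely store only top-k probabilities per token to keep the cache small, which is an assumption rather than a detail stated in the rebuttal.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def precompute_teacher_logprobs(teacher, dataloader, device="cuda"):
    # Run the frozen teacher once over the (padded) training sequences and
    # cache its token-level log-probabilities on the CPU, so the training
    # loop only needs forward/backward passes through the student.
    teacher.eval().to(device)
    cached = []
    for batch in dataloader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        logits = teacher(input_ids=input_ids, attention_mask=attention_mask).logits
        cached.append(F.log_softmax(logits, dim=-1).cpu())  # [batch, seq, vocab]
    return torch.cat(cached, dim=0)

# At training time, the cached log-probs of the DPO teacher and the reference
# teacher are loaded and subtracted to form the advantage, so the per-step
# cost stays close to that of vanilla KD with a single teacher.
```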


W2: Utilizing a rather singular evaluation dataset; lack of results on a broader range of models, domains, and data.

Response: Thank you for this valuable feedback. To address this concern, we have expanded our experimental scope by incorporating additional models and evaluation datasets, as outlined below:

  1. Additional Models: We have included experiments with an additional student model, LLaMA-3.2-1B, trained under the guidance of the teacher model LLaMA-3.1-8B on the DPO-MIX-7K dataset. This addition aims to demonstrate the adaptability of our proposed method to popular models widely utilized within the research community.

  2. Broader Evaluation Datasets: To ensure a more comprehensive assessment, we evaluated our trained models on the Open LLM Leaderboard, which benchmarks open-source language models across six diverse domains, including scientific reasoning, commonsense reasoning, multi-task understanding, and factual accuracy. This broader evaluation provides a robust analysis of our approach's effectiveness across various tasks.

We present the evaluation results for LLaMA-3.2-1B on MT-Bench, AlpacaEval (ADPA as reference) and the Open LLM Leaderboard below. Our proposed methods, ADPA and ADPA+, consistently outperform all baseline approaches across most benchmarks, demonstrating exceptional performance on ARC-Challenge, HellaSwag, and TruthfulQA. This underscores their effectiveness in tackling diverse tasks. These results will be included in the revised manuscript for clarity and completeness.

| Student | Method | MT-Bench | AlpacaEval | ARC-C | HellaSwag | TruthfulQA | MMLU | Winogrande | GSM8k | Open LLM Leaderboard Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3.2-1B | SFT | 3.34 | 35.7 | 41.04 | 68.38 | 40.91 | 34.38 | 60.46 | 6.82 | 42.00 |
| LLaMA-3.2-1B | DPO | 3.40 | 33.2 | 40.27 | 68.51 | 45.17 | 34.38 | 61.25 | 6.60 | 42.70 |
| LLaMA-3.2-1B | VanillaKD | 3.40 | 34.1 | 41.04 | 67.24 | 42.91 | 34.04 | 62.90 | 7.05 | 42.53 |
| LLaMA-3.2-1B | SeqKD | 3.74 | 29.7 | 40.02 | 66.49 | 44.11 | 35.28 | 61.25 | 5.84 | 42.17 |
| LLaMA-3.2-1B | ATKD | 3.62 | 32.4 | 40.27 | 67.33 | 44.66 | 34.38 | 60.77 | 6.29 | 42.28 |
| LLaMA-3.2-1B | PLAD | 3.42 | 29.3 | 39.93 | 68.22 | 44.55 | 34.42 | 61.56 | 5.16 | 42.31 |
| LLaMA-3.2-1B | DDPO | 3.21 | 28.7 | 39.51 | 66.84 | 44.55 | 33.91 | 61.72 | 5.61 | 42.02 |
| LLaMA-3.2-1B | DPKD | 3.29 | 28.9 | 38.74 | 66.18 | 45.18 | 34.57 | 61.32 | 5.23 | 41.87 |
| LLaMA-3.2-1B | DCKD | 3.50 | 37.5 | 41.21 | 67.64 | 43.38 | 34.57 | 61.80 | 7.51 | 42.69 |
| LLaMA-3.2-1B | ADPA | 3.88 | 50.0 | 41.81 | 68.66 | 47.98 | 34.74 | 61.80 | 5.31 | 43.38 |
| LLaMA-3.2-1B | ADPA+ | 4.02 | 53.8 | 42.15 | 70.49 | 45.89 | 34.60 | 59.91 | 5.16 | 43.03 |

Comment

Dear Reviewer PkN8,

We sincerely appreciate the time and effort you have devoted to reviewing our work, especially given your undoubtedly busy schedule. As the discussion phase between authors and reviewers draws to a close, we kindly invite you to review our responses to your comments.

Our primary aim has been to address your feedback comprehensively and to engage further on any remaining questions or suggestions you may have. Your insights have been invaluable, and we are eager to continue this constructive dialogue.

To address your comments and strengthen our findings, we have undertaken the following actions:

  1. Demonstrated that the additional cost is minimal.
  2. Expanded our experiments to include additional models, such as LLaMA-3.2-1B, and evaluated them on the OpenLLM Leaderboard.
  3. Conducted experiments with teacher models of varying proficiency levels, which reaffirmed the critical role of teacher quality in effective knowledge transfer.

We sincerely thank you once again for your thoughtful feedback and careful consideration. We look forward to any further insights you may have.

Best regards,
The Authors

Comment

Thanks for your response. All the questions I raised have been well resolved, and I have updated the rating accordingly.

AC Meta-Review

This paper studies alignment for small language models. The motivation is that we’d prefer to use small language models for efficiency reasons, but standard alignment approaches directly applied to these struggle. Instead, the authors propose a distillation-style approach where a (larger) aligned teacher is used to guide the small model.

Both the motivation for the problem and the approach are nice. The results are quite strong.

There are no strong weaknesses here; the reviewers brought up questions around the clarity of the writing, but these were addressed by the authors during rebuttal. Another potential question is whether the results hold up for other small models (or more generally, other combinations of teacher and student).

Overall, this is a solid work and worth accepting.

Additional Comments from Reviewer Discussion

Reviewers generally asked about aspects of the writing, baselines, and experimental details. The authors clarified everything successfully.

Final Decision

Accept (Spotlight)