On Teacher Hacking in Language Model Distillation
We study teacher hacking: does over-optimization of the distillation objective harm the ground-truth performance?
Abstract
Reviews and Discussion
The paper introduces the phenomenon of "teacher hacking," where using a fixed offline dataset for distillation degrades performance, and proposes solutions like online data generation and increased data diversity to mitigate this issue.
Questions for Authors
See weaknesses above.
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
Yes.
Experimental Design and Analyses
Yes.
Supplementary Material
Yes.
Relation to Broader Scientific Literature
The key contributions of the paper are well-grounded in the existing scientific literature and provide new insights and practical strategies for improving the distillation process of language models. The paper effectively builds upon previous research while addressing an understudied limitation of knowledge distillation, contributing to the broader understanding of language model training and optimization.
Essential References Not Discussed
No.
Other Strengths and Weaknesses
Strengths:
- The paper introduces a novel phenomenon, teacher hacking, and provides a systematic framework for its analysis. The findings have important implications for the development of more robust language models.
- The authors conduct extensive experiments on multiple datasets and model configurations, providing a thorough understanding of the teacher hacking phenomenon and its mitigation strategies.
- The paper offers practical strategies to mitigate teacher hacking, such as using online data generation and increasing dataset diversity, which can be directly applied in real-world scenarios.
Weaknesses:
- The paper lacks significant theoretical claims or proofs, focusing primarily on empirical observations. A more theoretical understanding of the teacher hacking phenomenon could strengthen the paper.
- The paper does not provide a comprehensive comparison with other distillation methods or techniques that could potentially mitigate teacher hacking. This limits the understanding of the relative effectiveness of the proposed strategies.
- While the authors use multiple datasets, the scope is limited to specific tasks such as summarization, translation, and instruction following. Expanding the scope to include other tasks and datasets could provide a more comprehensive understanding of the phenomenon.
- The paper does not provide a clear differentiation between teacher hacking and reward hacking, nor does it explore how insights from one phenomenon could inform the other. This limits the understanding of the unique aspects of teacher hacking and its broader implications.
- The paper lacks detailed descriptions of the experimental procedures, and the code used for the experiments is not open-sourced. This makes it difficult for other researchers to reproduce the results and verify the findings, limiting the transparency and credibility of the research.
Other Comments or Suggestions
See weaknesses above.
We would like to thank Reviewer jZCq for the valuable feedback. In the following, we address the questions raised in the review.
The paper lacks significant theoretical claims or proofs, focusing primarily on empirical observations. A more theoretical understanding of the teacher hacking phenomenon could strengthen the paper.
We agree that theoretical statements could strengthen the paper, but we find deriving such a theory very challenging: the statement of teacher hacking relies heavily on the presence of multiple models and requires taking the optimization procedure itself into account. We leave a theoretical explanation as a promising direction for future work.
The paper does not provide a comprehensive comparison with other distillation methods or techniques that could potentially mitigate teacher hacking. This limits the understanding of the relative effectiveness of the proposed strategies.
In the Appendix, we provide an experiment with multiple distillation losses and data generation strategies, none of which were shown to be resistant to the effect of teacher hacking.
While the authors use multiple datasets, the scope is limited to specific tasks such as summarization, translation, and instruction following. Expanding the scope to include other tasks and datasets could provide a more comprehensive understanding of the phenomenon.
While more experiments are always welcome, we believe our experiments encompass three relatively different tasks.
The paper does not provide a clear differentiation between teacher hacking and reward hacking, nor does it explore how insights from one phenomenon could inform the other. This limits the understanding of the unique aspects of teacher hacking and its broader implications.
The effects of teacher hacking and reward hacking are different: there is no reward function during the distillation procedure, and no teacher during reinforcement learning from human feedback. However, teacher hacking can be interpreted as the analogue of reward hacking in the other part of the post-training pipeline: both effects stem from over-optimization of imperfect proxy objectives, even though the nature of the proxies differs.
The paper lacks detailed descriptions of the experimental procedures, and the code used for the experiments is not open-sourced. This makes it difficult for other researchers to reproduce the results and verify the findings, limiting the transparency and credibility of the research.
Unfortunately, we cannot open source our code. However, we believe we have provided ample experimental details in the appendix to reproduce all our experiments.
Thank you for your detailed response and clarifications. The paper introduces teacher hacking, an interesting and novel phenomenon. However, it lacks sufficient theoretical analysis to support and contextualize this finding. A more robust theoretical foundation would help fully realize its potential impact. Given the theoretical focus of the paper and the modest experimental improvements, I am inclined to maintain my current evaluation. While the work shows promise, it would benefit from further theoretical exploration in future research directions.
We thank Reviewer jZCq for highlighting that "The paper introduces teacher hacking, an interesting and novel phenomenon," which is the main goal of our paper. We also agree with Reviewer jZCq that "it would benefit from further theoretical exploration in future research directions." We hope this current paper would be a first step in that direction.
This paper identifies and formally defines the phenomenon of teacher hacking, in which a student LM "overfits" to the teacher model instead of the ground-truth oracle distribution we want it to learn. The authors identify the use of a fixed offline dataset as a key reason teacher hacking occurs, and advocate the use of online data generation to avoid teacher hacking during LM distillation. They also highlight data diversity as a key factor and suggest that teacher hacking can be effectively detected by observing when the optimization deviates from polynomial convergence laws.
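The detection criterion mentioned above (deviation from polynomial convergence laws) can be sketched in a few lines. This is an illustrative example, not the authors' code: the function name `powerlaw_deviation`, the tolerance, and the synthetic loss curve are all assumptions of this sketch.

```python
import numpy as np

def powerlaw_deviation(steps, losses, fit_frac=0.5, tol=0.1):
    """Fit log(loss) ~ a + b*log(step) on the early portion of training,
    then flag later steps whose loss sits above the extrapolated fit."""
    steps = np.asarray(steps, dtype=float)
    losses = np.asarray(losses, dtype=float)
    n_fit = max(2, int(len(steps) * fit_frac))
    b, a = np.polyfit(np.log(steps[:n_fit]), np.log(losses[:n_fit]), deg=1)
    predicted = np.exp(a) * steps ** b          # extrapolated power law
    rel_dev = (losses - predicted) / predicted  # relative deviation from the fit
    return rel_dev, rel_dev > tol

# Synthetic example: a clean power-law decay that plateaus late in training,
# mimicking a golden metric that stops improving while training continues.
t = np.arange(1, 101, dtype=float)
clean = t ** -0.5
plateaued = np.where(t < 60, clean, clean[59])
_, flags = powerlaw_deviation(t, plateaued)
```

On a log-log plot such a deviation appears as the loss curve bending away from the straight line fitted to early training, which matches the visual signature the reviewer summarizes.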
Update after rebuttal
I am satisfied with the authors' rebuttal and keep my score of 3.
Questions for Authors
None.
Claims and Evidence
The claims made by this paper are quite clear, and the evidence shows satisfying support for them. I don't have questions regarding this part.
Methods and Evaluation Criteria
The evaluation criteria and dataset selection make sense to support the claim of this work.
Theoretical Claims
This paper mainly discusses an empirical phenomenon and does not emphasize theoretical analysis; there is some theoretical discussion based on the JS and KL divergences used as losses in LM distillation, and this part looks fine to me.
Experimental Design and Analyses
The soundness of the experimental settings is generally fine. I have the following questions about the experimental design and result analysis:
Q1: You designed two stages for the experiment; in the first stage, the oracle LM generates an oracle dataset used for SFT on both the teacher and student models to provide an initial checkpoint. Why is this necessary for the student model? The student model learns from the teacher model in the second stage by distillation (and thus implicitly learns from the oracle model anyway).
Q2: There is not enough analysis of the absence of teacher hacking when using online data sources. Why does the teacher hacking problem naturally disappear in the online-data distillation setting? Intuitively, the more (diverse) data samples used in distillation, the more closely the student model should mimic the teacher's behaviour. It is not surprising that the proxy metric continues to decrease with longer training, but it is surprising (and unexplained) that the golden metric also generalizes without degradation. Can the authors elaborate on this and provide some insights?
Supplementary Material
I have reviewed each section of the Appendix and don't have additional questions regarding them.
Relation to Broader Scientific Literature
The paper identifies degradation of learning effects in the context of language model distillation. The core idea is not very new and the results obtained here are not so surprising, but this work indeed provides some take-away messages for language model training practitioners.
Essential References Not Discussed
None.
Other Strengths and Weaknesses
Paper writing is well-structured, and the take-away information is quite clear and easy to consume, which is an advantage of this paper.
Other Comments or Suggestions
None.
We would like to thank Reviewer KCu2 for the helpful feedback! In the following, we answer the questions raised in the review.
Q1: You designed two stages for the experiment, and in the first stage, oracle LM generates oracle dataset for SFT on both teacher and student models to provide an initial checkpoint. Why would it be necessary to do this for student model?
We totally agree; it is indeed possible to distill directly, using the pretrained (PT) model as an initial checkpoint. There are two main reasons why we utilize SFT instead of PT:
- The quality of the student generation should not be too poor in order to benefit from online student generations;
- We want to draw a more direct analogy with RLHF, which is typically performed with an initial SFT checkpoint.
Q2: Not enough analysis for the absence of teacher hacking when using online data sources. Why does the teacher hacking problem naturally disappear in the online data source distillation setting? [...] Can authors further elaborate on this and provide some insights?
We do not have a solid theoretical explanation for this behavior. In our understanding, when the student model observes the teacher's logits in the same context for too long, it starts memorizing the teacher's logits rather than generalizing to the ground-truth behavior. However, the threshold between memorization and generalization in our setting is not clear, and we leave its identification as an interesting direction for further work. Additionally, we can connect the observed behavior to the increase in dataset diversity, which is known to be highly beneficial for the generalization abilities of the network (see, e.g., Bukharin et al., 2024; Chen et al., 2024; Zhang et al., 2025).
References
Bukharin, A., Li, S., Wang, Z., Yang, J., Yin, B., Li, X., Zhang, C., Zhao, T., & Jiang, H. (2024). Data Diversity Matters for Robust Instruction Tuning. Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3411–3425, Miami, Florida, USA.
Chen, H., Waheed, A., Li, X., Wang, Y., Wang, J., Raj, B., & Abdin, M. I. (2024). On the Diversity of Synthetic Data and its Impact on Training Large Language Models. arXiv preprint arXiv:2410.15226.
Zhang, C., Zhong, H., Zhang, K., Chai, C., Wang, R., Zhuang, X., et al. (2025). Harnessing Diversity for Important Data Selection in Pretraining Large Language Models. ICLR 2025.
Thank you for the rebuttal. The generalization/memorization point of view is interesting and makes sense.
This work investigates a novel phenomenon termed "teacher hacking," where student language models (LMs) over-optimize to imperfections in the teacher model during knowledge distillation, leading to degraded performance on the true objective. The authors propose a controlled experimental setup involving an oracle model (ground-truth distribution), a teacher model distilled from the oracle, and a student model distilled from the teacher. Through systematic experiments, they demonstrate that teacher hacking occurs when using fixed offline datasets for distillation but can be mitigated using online data generation techniques. The study highlights data diversity as a critical factor in preventing teacher hacking and provides practical strategies to address the issue.
Pros:
- The paper identifies and formally defines the "teacher hacking" phenomenon, drawing an insightful analogy to "reward hacking" in reinforcement learning from human feedback (RLHF). This perspective bridges gaps in understanding distillation limitations.
- The semi-synthetic framework with an oracle model provides a rigorous way to measure ground-truth performance (golden metrics) and proxy metrics (teacher-student alignment). This setup allows clear detection of teacher hacking through U-shaped proxy-golden curves.
- The authors provide actionable solutions, such as online data generation, increasing prompt diversity, and expanding datasets with multiple completions. These strategies are validated across multiple tasks and model sizes.
- The study includes experiments across diverse tasks (summarization, translation, instruction following), model architectures (T5 variants), and loss functions (forward/reverse KL, Jensen-Shannon). This breadth strengthens the generalizability of findings.
Cons:
- Dataset Diversity Limitations:
- While the paper emphasizes data diversity, the experiments on the WMT-14 en-de translation task show minimal impact of dataset diversity on teacher hacking (Fig. 9). This inconsistency suggests the phenomenon may be task-dependent, with translation tasks being less sensitive to diversity manipulations.
- The "x0.5 prompts, 2x gen" and "x0.2 prompts, 5x gen" experiments (Fig. 6) reduce prompt diversity but increase generations per prompt. However, the analysis does not quantify the trade-off between prompt diversity and generation redundancy, leaving ambiguity about optimal resource allocation.
- Model Size and Capacity Mismatch:
- The experiments distilling T5-large to T5-small show proxy metric increases indicative of classical overfitting rather than teacher hacking (Fig. 11). This suggests the framework may conflate overfitting with teacher hacking when model capacity gaps are large, weakening the specificity of the teacher hacking diagnosis.
- The study focuses on T5-based models, limiting generalizability to other architectures (e.g., transformer-decoder LMs). The phenomenon's dependence on architectural differences remains unexplored.
- Experimental Design Gaps:
- The offline-online data mixture experiments (Fig. 13) use fixed α values (10%, 50%, 90%) but do not systematically vary α across a continuous range. This prevents identifying the minimum online data proportion required to suppress teacher hacking.
- The golden metric improvements from increased generation budgets (Fig. 7) are marginal for proxy metrics, suggesting diminishing returns. The paper does not analyze cost-benefit trade-offs for different generation strategies.
- Task-Specific Sensitivity:
- Teacher hacking effects are more pronounced in instruction-following tasks (Natural Instructions) than translation tasks (WMT-14 en-de). The paper does not investigate why certain tasks are more susceptible, potentially due to inherent dataset properties or evaluation metrics.
- Evaluation Metric Limitations:
- The golden metric (distance to oracle) is task-agnostic and may not correlate with downstream task performance. The paper lacks task-specific evaluations (e.g., BLEU for translation, ROUGE for summarization) to validate practical implications.
Questions for Authors
See weaknesses above.
Claims and Evidence
The claims in the submission are generally supported by clear and convincing evidence, though with some notable exceptions:
Supported Claims
- Teacher hacking occurs during knowledge distillation with fixed offline datasets: Strongly supported by experimental results showing U-shaped proxy-golden curves where golden metrics (distance to oracle) deteriorate while proxy metrics (distance to teacher) improve (Fig. 4, Fig. 5).
- Teacher hacking can be detected by deviations from polynomial convergence laws: Convincingly demonstrated through log-log plots comparing online and offline training dynamics, where offline methods show clear deviations from expected convergence patterns (Fig. 5).
- Online data generation effectively mitigates teacher hacking: Well-supported across multiple tasks and model sizes, showing consistent improvement in golden metrics when using online data sources (Fig. 5, Fig. 8).
Problematic Claims
- Data diversity is the key factor in preventing teacher hacking: While supported by experiments manipulating dataset diversity (Fig. 6, Fig. 9), this claim has limitations:
- The impact of diversity varies significantly across tasks (minimal effect on translation tasks)
- The analysis doesn't quantify the trade-off between prompt diversity and generation redundancy
- The experiments don't establish diversity as the sole or primary factor, as other aspects of data quality may also play roles
Methods and Evaluation Criteria
The proposed methods and evaluation criteria are appropriate for investigating teacher hacking in LM distillation:
- Controlled experimental setup: Effectively isolates and measures teacher hacking through oracle, teacher, and student models.
- Golden and proxy metrics: Capture both ground-truth performance and distillation alignment.
- Diverse experimental scenarios: Demonstrate generalizability across tasks, model sizes, and loss functions.
Some aspects could be strengthened with task-specific evaluations and more rigorous dataset diversity quantification. Overall, the methods and criteria are well-suited to the problem.
Theoretical Claims
The paper makes several theoretical claims about teacher hacking, including its definition, detection via deviation from polynomial convergence laws, and mitigation strategies. These claims are logically consistent and supported by mathematical formulations and experimental evidence. The theoretical framework for measuring distances between language model distributions (forward/reverse KL divergence, Jensen-Shannon divergence) is sound and appropriately applied in the analysis. The paper does not present formal proofs requiring rigorous verification but rather builds its arguments on established information-theoretic measures and empirical validation. The theoretical claims appear correct within the context of the problem and experimental setup presented.
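The divergence measures referenced above can be made concrete with a small example. This is a generic sketch of the standard definitions, not the paper's implementation, and the toy distributions are invented for illustration:

```python
import numpy as np

def kl(p, q):
    """Forward KL divergence KL(p || q) between two categorical distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence: symmetric, bounded KL against the mixture."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy next-token distributions over a 3-token vocabulary
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]
forward, reverse = kl(teacher, student), kl(student, teacher)
```

The forward and reverse KL generally differ, which is why the direction of the distillation loss matters, while the JS divergence is symmetric by construction.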
Experimental Design and Analyses
Some minor issues could be addressed:
- The impact of dataset diversity shows variability across tasks (particularly translation tasks), suggesting task-dependent effects that aren't fully explored.
- The distinction between classical overfitting and teacher hacking could be more clearly established in some model size comparisons.
- The experimental analysis would benefit from more systematic variation of parameters in the offline-online data mixture experiments.
Overall, the experimental designs are robust and valid for investigating the teacher hacking phenomenon.
Supplementary Material
I reviewed relevant parts of the supplementary material that relate to the experimental designs and analyses, including additional experiments on different datasets, model size variations, loss function comparisons, and details on hyperparameters and dataset configurations.
Relation to Broader Scientific Literature
The paper's key contributions relate to broader scientific literature as follows:
- Teacher hacking analogy: Extends reward hacking research in RLHF, linked to Goodhart's law (Amodei et al., 2016; Gao et al., 2023).
- Controlled setup: Similar to experimental designs in reward hacking studies (Gao et al., 2023), using golden/proxy metrics.
- Mitigation strategies: Align with ML principles emphasizing data quality/diversity.
- Convergence analysis: Relates to scaling law research (Kaplan et al., 2020).
- Knowledge distillation foundations: Builds on established techniques (Hinton et al., 2015; Sanh et al., 2020).
Essential References Not Discussed
The paper adequately covers relevant prior work in knowledge distillation and reward hacking, but some recent works could be discussed, e.g., KD for LMs [1,2] and reward hacking in RLHF [3].
[1] Dual-Space Knowledge Distillation for Large Language Models. In Proc. of EMNLP 2024.
[2] Revisiting Knowledge Distillation for Autoregressive Language Models. In Proc. of ACL 2024.
[3] Mitigating Reward Hacking via Information-Theoretic Reward Modeling. In Proc. of NeurIPS 2024.
Other Strengths and Weaknesses
Strengths
- Originality: The paper introduces the novel concept of "teacher hacking" in LM distillation, analogous to reward hacking in RLHF, and provides a systematic framework to study it.
- Significance: The findings have practical implications for developing more reliable and safe language model distillation pipelines.
- Clarity: The paper is well-structured with clear explanations of the experimental setup, methodology, and results.
Weaknesses
- Limited architectural exploration: The study focuses on T5-based models, limiting generalizability to other architectures.
- Task-specific evaluations: The paper lacks comprehensive task-specific metric evaluations to validate practical implications.
Other Comments or Suggestions
- Terminology: Ensure consistent terminology usage.
- Task metrics: Include task-specific metrics for comprehensive assessment.
- Architectures: Experiment with different model architectures.
- Dataset diversity: Use quantitative metrics to measure dataset diversity.
We would like to thank Reviewer z8Ux for the detailed and valuable feedback! In the following, we address the issues raised in the review.
Dataset diversity limitations.
The analysis doesn't quantify the trade-off between prompt diversity and generation redundancy
The experiments don't establish diversity as the sole or primary factor, as other aspects of data quality may also play roles
Dataset diversity: Use quantitative metrics to measure dataset diversity.
We emphasize that our study focuses not just on the diversity of the prompts but on the diversity of the entire prompt-completion dataset. We found that our approach is the only one that satisfies the following three key properties:
(1) Preserving the conditional distribution of completions given the prompt – ensuring that answer quality remains the same.
(2) Reducing the total diversity of the dataset.
(3) Preserving dataset size – eliminating data quantity as a confounding factor.
We are unaware of any experimental setup that maintains dataset size and generation redundancy while modifying prompt diversity. However, if the reviewer can suggest one, we would happily implement it. The closest related work, Song et al. (2024), varied prompt dataset diversity by increasing the number of prompts sampled. This violates (3) and prevents us from isolating the effect of data diversity from total data quantity.
Following Song et al. (2024), we report dataset diversity as the ratio of unique bigrams in the tokenized prompt-completion pairs to the total number of tokenized bigrams, multiplied by the dataset size. In parentheses, we additionally report the ratio of each dataset's diversity to its diversity at a 1:1 prompt-generation ratio.
| Dataset \ prompt-generation ratio | 0.2:5 | 0.5:2 | 1:1 | 1:2 | 1:3 |
|---|---|---|---|---|---|
| XSum | 4813.3 (0.47) | 7292.1 (0.72) | 10140.1 (1.0) | 10645.8 (1.05) | 11112.9 (1.1) |
| WMT-14 | 48079.4 (0.77) | 54511.5 (0.87) | 62652.0 (1.0) | 85855.5 (1.37) | 105242.1 (1.68) |
| Natural Instructions | 8464.1 (0.45) | 13311.6 (0.71) | 18751.5 (1.0) | 19631.7 (1.05) | 20430.0 (1.08) |
These measurements explain why the influence of diversity reduction on WMT is much smaller: diversity changes in the prompt-completion dataset are not as dramatic as in the summarization or instruction-following datasets. Another factor is that the prompts and completions in the WMT-14 dataset are much shorter than in XSum and Natural Instructions. We will integrate this discussion into the manuscript.
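For concreteness, the diversity measure described above could be computed along these lines. This is a minimal sketch assuming whitespace tokenization and invented toy data, not the exact implementation (which presumably uses the model tokenizer):

```python
def bigram_diversity(pairs):
    """Diversity = (unique bigrams / total bigrams) * dataset size,
    computed over tokenized prompt-completion pairs."""
    total, unique = 0, set()
    for prompt, completion in pairs:
        tokens = (prompt + " " + completion).split()  # whitespace tokenization
        bigrams = list(zip(tokens, tokens[1:]))
        total += len(bigrams)
        unique.update(bigrams)
    return len(unique) / total * len(pairs) if total else 0.0

# Reusing a prompt with identical completions lowers the measured diversity
# while keeping the dataset size fixed, as in the "x0.5 prompts, 2x gen" setting.
varied   = [("summarize:", "the cat sat"), ("translate:", "der hund lief")]
repeated = [("summarize:", "the cat sat"), ("summarize:", "the cat sat")]
```

Multiplying by the dataset size keeps the measure comparable across settings with the same number of examples, matching property (3) above.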
Reference:
Song et al. (2024). Scaling data diversity for fine-tuning language models in human alignment. COLING-2024.
Model Size and Capacity Mismatch
[...] the framework may conflate overfitting with teacher hacking when model capacity gaps are large, weakening the specificity of the teacher hacking diagnosis.
Limited architectural exploration
We agree that it would be interesting to study the phenomenon of teacher hacking on different architectures and with different model capacity gaps. However, we leave this direction for further work.
Experimental Design Gaps
[...] The paper does not analyze cost-benefit trade-offs for different generation strategies.
The experimental analysis would benefit from more systematic variation of parameters in the offline-online data mixture experiments.
We agree that it would be valuable to identify the minimal additional compute needed to suppress the effect of teacher hacking. We leave the exploration of optimal trade-offs as a direction for further work.
Task-Specific sensitivity
[...] The paper does not investigate why certain tasks are more susceptible, potentially due to inherent dataset properties or evaluation metrics.
We agree that studying teacher hacking on a larger variety of datasets and tasks might be interesting, especially given the task-dependent strength of some of our recommendations. We leave this study as a promising direction for further work.
Evaluation metric limitations
Task metrics: Include task-specific metrics for comprehensive assessment.
Our decision to use the distance to the oracle as the only ground-truth evaluation metric follows from the problem we are ultimately solving. In particular, ROUGE/BLEU scores between oracle and student generations would serve only as a very noisy and imperfect proxy for the distance between the student and oracle distributions, which we can estimate directly with less noise. At the same time, using these metrics with human references would violate the assumption of our setup that the oracle model is the source of the ground-truth distribution.
Missing related work
We will happily add suggested references to the related work section.
This paper presents and studies the phenomenon termed "teacher hacking." Essentially, the authors observe that large language models (LLMs) can over-optimize for an imperfect teacher during knowledge distillation, similar to reward hacking in RLHF. The reviewers agree that this observation is both novel and insightful. Although the paper lacks a fundamental understanding of this phenomenon, I still believe it is valuable to share these findings with the community.