Few-Shot Knowledge Distillation of LLMs With Counterfactual Explanations
This paper introduces a few-shot task-aware knowledge distillation method that leverages counterfactual explanations to improve model performance with fewer data samples.
Abstract
Reviews and Discussion
This paper introduces COD, a framework that addresses the poor performance of standard knowledge distillation methods when data is scarce. The core idea is to leverage Counterfactual Explanations (CFEs), minimal input perturbations that flip the teacher model's prediction. Because these CFEs lie near the teacher's decision boundary, they provide highly informative training signals. The COD framework augments the small training set with these CFEs, enabling the student model to mimic the teacher's boundary more effectively. The work provides theoretical guarantees and experimental results showing COD significantly outperforms baselines in low-data regimes.
Strengths and Weaknesses
Strengths:
- Compressing a large language model to a smaller one with knowledge distillation is an important area.
- The paper provides both theoretical and empirical results to justify its method.
Weaknesses:
- The practical significance of the proposed Counterfactual Explanation (CFE) generation method is questionable. For the simple classification tasks evaluated in the paper, such as sentiment analysis, simpler data augmentation techniques are likely sufficient, making the complexity of CFE generation potentially unnecessary. Conversely, for complex, open-ended tasks where LLMs excel (e.g., long-form generation or reasoning), the concept of a CFE (a minimal perturbation that "flips a prediction") becomes ill-defined and difficult to apply. This suggests the method is caught in a difficult position: potentially redundant for simple tasks, yet inapplicable for complex ones, thus limiting its real-world impact.
- The setting of "few-shot knowledge distillation" seems rare nowadays. Can the authors provide some example scenarios in reality of this setting?
Questions
N/A
Limitations
Yes
Final Justification
My concerns are resolved.
Formatting Issues
N/A
We thank the reviewer for the review of the paper!
On the practical significance of few-shot knowledge distillation and CFE infusion
Distilling LLMs into smaller models for resource-constrained environments is of utmost importance. The unprecedented surge of AI comes with ever-increasing model size and complexity [1,4], leading to a rapidly growing demand for energy. As models get larger, deploying them in resource-constrained environments such as mobile phones, Internet of Things (IoT), and edge devices becomes particularly challenging due to their limited processing power, memory, and battery life [3,4]. To this end, this work addresses the broader question: can we develop efficient reduced-order models that achieve high performance with minimal training data?
In fact, a recent survey [1] on LLM compression highlights the significance and timely nature of our work:
"Crucially, the success of Knowledge Distillation (KD) in LLMs hinges on Dataset Distillation (DD) techniques, which enable the creation of compact, information-rich synthetic datasets that encapsulate the diverse and complex knowledge of the teacher LLMs." [1]
Our CFE-infused approach embodies this principle by injecting precisely those high-value samples to better approximate the teacher’s decision boundary.
Data-efficient machine learning is also a rapidly growing field of research [7]. Two primary motivations for data-efficient ML are: (i) data acquisition or labeling cost [5]; and (ii) training or fine-tuning costs. In this context, our work shows how it is possible to distill using very few carefully selected examples – essentially, how few examples hold the key to knowledge transfer. Few-shot distillation has applications in on-device personalization (keyboard, voice dictation), low-resource or dialectal languages in under-served regions, clinical or legal settings, finance applications with privacy rules restricting data sharing, and deployment in hospitals or firms that have very few local examples.
Classification with LLMs still has many applications today beyond sentiment analysis, e.g., security/safety, toxicity detection, hate-speech/abuse detection, spam email detection, fraud detection, topic categorization, and tabular dataset prediction in high-stakes applications such as finance, healthcare, and education [2, 3, 6, 8].
On the extension to generative LLMs: One possible way to extend this to generative LLMs is to define a counterfactual explanation as a minimal change to the input (prompt) that flips a chosen property of the generated text while keeping everything else as similar as possible. Formally, given a generative model $G$, a prompt $x$, and a binary attribute function $a(\cdot)$ (e.g., sentiment, toxicity, factuality, topic relevance), a counterfactual explanation prompt $x'$ would be a small semantic perturbation of $x$ such that the generated output $G(x')$ flips the value of $a$.
Another possible extension is to rethink counterfactual generation in terms of model sensitivity: identifying minimal semantic perturbations to the input that cause large shifts in the output distribution or likelihood. For generative models, this corresponds to large changes in the sequence-level probability, i.e., $\big|\log p_G(y \mid x) - \log p_G(y \mid x')\big|$ being large, where $y$ is the generated sequence for an input prompt $x$ and $x'$ is its perturbation. In this framing, CFEs could reveal parts of the input space where the model is uncertain, making them especially informative for distillation. While outside the scope of the current paper, this extension would be an interesting future direction, potentially inspiring novel and efficient pathways for data selection for a broad range of tasks, including distillation, supervised finetuning (SFT), post-training, etc.
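As a rough, hypothetical sketch of the first formulation (the attribute-flip view), and not part of the current paper, a prompt-level counterfactual could be searched for as follows; `generate`, `attribute`, and `perturb_candidates` are placeholder helpers, not components of our method:

```python
# Hypothetical sketch of a prompt-level counterfactual search for a generative LLM.
# None of these helpers come from the paper; they are placeholders:
#   generate(prompt)           -> text produced by the generative model for `prompt`
#   attribute(text)            -> binary property of the output (e.g., toxic vs. non-toxic)
#   perturb_candidates(prompt) -> small semantic edits of the prompt, smallest first
def find_prompt_cfe(prompt, generate, attribute, perturb_candidates):
    original_value = attribute(generate(prompt))
    for candidate in perturb_candidates(prompt):  # try smallest perturbations first
        if attribute(generate(candidate)) != original_value:
            return candidate  # minimal perturbation that flips the chosen property
    return None  # no counterfactual found within the candidate budget
```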
We thank the reviewer for raising these important points; we will update the paper to include these clarifications and believe that the paper will be significantly strengthened as a result of this discussion.
References
[1] Luyang Fang, Xiaowei Yu, Jiazhang Cai, Yongkai Chen, et al. Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions.
[2] Zhiqiang Wang, Yiran Pang, Yanbin Lin. Smart Expert System: Large Language Models as Text Classifiers.
[3] Ruiyang Qin, Jun Xia, Zhenge Jia, Meng Jiang, et al. Enabling On-Device Large Language Model Personalization with Self-Supervised Data Selection and Synthesis.
[4] Sanjay Surendranath Girija, Shashank Kapoor, Lakshit Arora, Dipen Pradhan, Aman Raj, Ankit Shetgaonkar. Optimizing LLMs for Resource-Constrained Environments: A Survey of Model Compression Techniques.
[5] Shiho Matta, Yin Jou Huang, Fei Cheng, Hirokazu Kiyomaru, Yugo Murawaki. Investigating Cost-Efficiency of LLM-Generated Training Data for Conversational Semantic Frame Analysis.
[6] Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen, Tuo Zhao. Less is More: Task-aware Layer-wise Distillation for Language Model Compression.
[7] Siddharth Joshi, Baharan Mirzasoleiman. Foundations of Data-efficient Machine Learning. ICML Tutorial 2025.
[8] Stefan Hegselmann, Alejandro Buendia, Hunter Lang, Monica Agrawal, Xiaoyi Jiang, David Sontag. TabLLM: Few-shot Classification of Tabular Data with Large Language Models.
Thank you for the detailed response.
Regarding the extension to generative LLMs: if the authors believe that the extension to generative LLMs is outside the scope, they should explicitly clarify that the method is for classification in their title and the paper's main body. This is because current LLMs perform tasks in a generative manner.
Thank you for your suggestion. We will update the title and manuscript to better clarify that our focus is on classification tasks, while preserving the broader motivation and possible extensions to generative LLMs.
Please let us know if our rebuttal has addressed your other concerns. We are happy to clarify further or respond to any additional questions you may have.
Thank you for the follow-up. My concerns are addressed, and I have raised the score.
This paper presents a novel few-shot knowledge distillation framework called KD-FEW, which distills knowledge from large language models (LLMs) into compact student models under a low-resource (few-shot) setting. Unlike traditional KD methods that require many labeled examples or fine-tuned teachers, KD-FEW innovatively leverages LLMs in a black-box manner by prompting them to generate soft labels for few-shot instances.
Strengths and Weaknesses
Strengths
- The approach suggested in the paper is designed to work under a realistic set of constraints: limited access to labeled data and limited access to gradient information in large language models (LLMs). The pipeline of calibration, soft-label memory, and knowledge distillation is technically valid and practically applicable.
- The paper is clear and well-organized. All the components (calibration, memory, and soft-label learning) are presented logically and are intuitively understandable.
- The authors tackle the important issue of efficiency and accuracy of large-model inference and propose a reasonable approach for extracting knowledge from proprietary or expensive APIs such as GPT-4 (no fine-tuning is required).
- Soft-label distillation is not an original idea, but combining black-box prompting and a memory-based distillation algorithm for the few-shot setting is already an original contribution.
Weaknesses
- The reported improvements depend heavily on prompt quality and template design. There is little analysis of how sensitive KD-FEW is to prompt selection, or whether fully automatic prompting could be combined with it without affecting performance.
- The soft-label memory mechanism grows in computational and memory cost as the number of retrieved shots increases, and the complexity of the method is not explicitly assessed.
- The ablation study is relatively light: for instance, what happens if calibration is removed? What if random soft labels are used instead of the memory buffer?
- Empirical evaluation is limited to English data, so how well the method transfers to multilingual settings and other tasks remains unclear.
Questions
- To what degree does KD-FEW rely on the choice of prompts? Does performance vary when different prompt templates are used? Would automatic prompt generation methods (e.g., AutoPrompt or prompt search with RL) be beneficial?
- The authors are requested to quantify how much the calibration step improves soft-label quality and the resulting downstream performance.
- How does the size of the soft-label memory affect model performance and computational demands?
- Did the authors compare KD-FEW to an alternative such as RAG, in which data retrieved from a memory bank provides additional input to the generation process? The conceptual similarity between these approaches deserves a direct comparison.
Limitations
Although the paper makes some brief comments about the black-box character of LLMs and the necessity of distillation, it does not thoroughly consider the limitations of the distillation process, such as:
- The reliance on careful prompt engineering.
- The overhead and expense of storing or generating labels for the memory (e.g., API usage).
- The difficulty of applying the method to tasks with structured outputs (e.g., QA or generation).
Final Justification
After carefully reviewing the authors' responses, I find that they have addressed my concerns. Taking into account both the feedback from other reviewers and my own evaluation, I believe that my initial positive assessment remains appropriate. Therefore, I would like to maintain my original score.
Formatting Issues
N/A
We thank the reviewer for taking the time to review this paper, and we appreciate the positive opinion of our work!
On template design and prompt choices
Thank you for the thoughtful question. We conducted an experiment varying four prompt templates for generating CFEs and observed that our method (CoD) is robust to prompt choices, showing low standard deviation across variants (e.g., ≤0.018) and consistently outperforming the KD baseline in low-shot cases (see table below; SST2 dataset). This suggests CoD is not overly sensitive to the prompt used. Automatic prompt generation methods (e.g., AutoPrompt, RL-based search) are typically more compute-intensive; given our already strong and stable performance using simple manually designed prompts, such complex techniques may not be necessary. We will include this prompt robustness study and the varying prompts used in the updated version.
| Method | 8 | 16 | 32 | 64 | 128 | 512 |
|---|---|---|---|---|---|---|
| KD | 0.617 | 0.712 | 0.757 | 0.820 | 0.848 | 0.899 |
| + CoD (v1) | 0.719 | 0.781 | 0.821 | 0.827 | 0.853 | 0.892 |
| + CoD (v2) | 0.754 | 0.789 | 0.841 | 0.872 | 0.890 | 0.872 |
| + CoD (v3) | 0.738 | 0.778 | 0.819 | 0.835 | 0.856 | 0.901 |
| + CoD (v4) | 0.734 | 0.783 | 0.830 | 0.834 | 0.883 | 0.888 |
| CoD (mean) | 0.736 | 0.783 | 0.828 | 0.842 | 0.870 | 0.888 |
| (std) | 0.012 | 0.004 | 0.009 | 0.018 | 0.016 | 0.010 |
On soft-label computational and memory requirements
As the sample size $k$ increases, we observe a consistent improvement in model performance (accuracy) for both KD+CoD and LWD+CoD (see tables below; SST2 dataset). However, this comes at the cost of increased computational demands. We use the codecarbon package to track energy usage across compute components, excluding the counterfactual generation step (which is API-based and not locally measured); a minimal sketch of the measurement setup is given after the tables below. Runtime and energy consumption (CPU, GPU, RAM) grow with larger $k$, indicating more processing and memory usage. LWD+CoD is more computationally intensive than KD+CoD at every $k$, due to its additional steps of aligning the teacher's and student's intermediate representations. We will include this ablation in the revised manuscript.
KD + CoD
| k | Accuracy | Duration (s) | CPU Energy (kWh) | GPU Energy (kWh) | RAM Energy (kWh) |
|---|---|---|---|---|---|
| 8 | 0.719 | 478.13 | 0.01336 | 0.00966 | 0.02479 |
| 16 | 0.781 | 488.04 | 0.01341 | 0.00972 | 0.02499 |
| 32 | 0.821 | 491.14 | 0.01382 | 0.01041 | 0.02547 |
| 64 | 0.827 | 547.58 | 0.01472 | 0.1362 | 0.02723 |
| 128 | 0.853 | 569.50 | 0.01639 | 0.01634 | 0.02952 |
| 512 | 0.892 | 705.21 | 0.03120 | 0.03593 | 0.04536 |
LWD + CoD
| k | Accuracy | Duration (s) | CPU Energy (kWh) | GPU Energy (kWh) | RAM Energy (kWh) |
|---|---|---|---|---|---|
| 8 | 0.694 | 485.12 | 0.01263 | 0.01102 | 0.02514 |
| 16 | 0.785 | 496.07 | 0.01394 | 0.01158 | 0.02572 |
| 32 | 0.832 | 517.78 | 0.01472 | 0.01245 | 0.02654 |
| 64 | 0.830 | 536.01 | 0.01515 | 0.01311 | 0.02775 |
| 128 | 0.835 | 668.52 | 0.01882 | 0.01670 | 0.02960 |
| 512 | 0.880 | 814.65 | 0.04621 | 0.04712 | 0.04902 |
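As a minimal sketch (not our exact script), the energy measurement wraps each distillation run with codecarbon's EmissionsTracker roughly as follows; `run_distillation` is a placeholder for the KD/LWD training loop:

```python
# Minimal sketch of the energy measurement, assuming codecarbon's EmissionsTracker;
# `run_distillation` stands in for the actual KD/LWD training loop.
from codecarbon import EmissionsTracker

def measure_run(run_distillation, k):
    tracker = EmissionsTracker(project_name=f"cod_k{k}", log_level="error")
    tracker.start()
    try:
        accuracy = run_distillation(k)  # train the student on k samples (+ CFEs)
    finally:
        tracker.stop()  # writes emissions.csv with CPU/GPU/RAM energy breakdown
    return accuracy
```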
On the effect of soft-label calibration on downstream performance
Thank you for the suggestion. We have added an ablation that isolates the effect of the soft labels. As shown in the table below, removing the soft label term ($\alpha = 0$) leads to a substantial drop in performance across all shot levels. Although CoD still improves over KD in this setting, the gains are significantly reduced. This highlights that the soft label calibration from the teacher is a key contributor to the effectiveness of counterfactual explanation data. Additionally, when replacing soft labels with random values, performance degrades sharply, likely because inconsistency with the hard labels introduces conflicting supervision signals in the training objective. We will include this analysis in the revised paper (a simplified sketch of these ablation variants follows the table below).
| Method (SST2) | 8 | 16 | 32 | 64 | 128 | 512 |
|---|---|---|---|---|---|---|
| KD (no soft label, α = 0) | 0.553 | 0.622 | 0.697 | 0.712 | 0.791 | 0.815 |
| + CoD (no soft label, α = 0) | 0.613 | 0.651 | 0.701 | 0.727 | 0.793 | 0.792 |
| KD (random soft label) | 0.582 | 0.533 | 0.543 | 0.601 | 0.617 | 0.649 |
| + CoD (random soft label) | 0.573 | 0.548 | 0.552 | 0.602 | 0.623 | 0.632 |
| KD (default) | 0.617 | 0.712 | 0.757 | 0.820 | 0.848 | 0.899 |
| + CoD (default) | 0.719 | 0.781 | 0.821 | 0.827 | 0.853 | 0.892 |
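For clarity, the ablation variants above can be read as toggling the soft-label term in a standard distillation objective. The sketch below assumes a generic KL-plus-cross-entropy form with weight α; it is meant only to illustrate the ablation, not to restate the exact objective in the paper:

```python
# Simplified sketch of the ablation variants, assuming a standard distillation
# objective of the form: alpha * KL(teacher_soft || student) + (1 - alpha) * CE(hard).
# The exact objective used in the paper may differ; this only illustrates the ablation.
import torch.nn.functional as F

def distill_loss(student_logits, teacher_probs, hard_labels, alpha=0.5):
    log_p_student = F.log_softmax(student_logits, dim=-1)
    soft_term = F.kl_div(log_p_student, teacher_probs, reduction="batchmean")
    hard_term = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft_term + (1.0 - alpha) * hard_term

# alpha = 0                                -> "no soft label" rows (hard labels only)
# teacher_probs drawn randomly on simplex  -> "random soft label" rows
```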
On the extension to generative LLMs and structured tasks
One possible way to extend this to generative LLMs is to define a counterfactual explanation as a minimal change to the input (prompt) that flips a chosen property of the generated text while keeping everything else as similar as possible. Formally, given a generative model $G$, a prompt $x$, and a binary attribute function $a(\cdot)$ (e.g., sentiment, toxicity, factuality, topic relevance), a counterfactual explanation prompt $x'$ would be a small semantic perturbation of $x$ such that the generated output $G(x')$ flips the value of $a$.
Another possible extension is to rethink counterfactual generation in terms of model sensitivity: identifying minimal semantic perturbations to the input that cause large shifts in the output distribution or likelihood. For generative models, this corresponds to large changes in the sequence-level probability, i.e., $\big|\log p_G(y \mid x) - \log p_G(y \mid x')\big|$ being large, where $y$ is the generated sequence for an input prompt $x$ and $x'$ is its perturbation. In this framing, CFEs could reveal parts of the input space where the model is uncertain, making them especially informative for distillation. While outside the scope of the current paper, this extension would be an interesting future direction, potentially inspiring novel and efficient pathways for data selection for a broad range of tasks, including distillation, supervised finetuning (SFT), post-training, etc.
On RAG comparison
We thank the reviewer for highlighting the conceptual connection to RAG. However, our setup and that of RAG systems differ fundamentally. Our method is focused on classification using supervision from soft labels and counterfactual explanations, whereas RAG augments data from a memory bank for generation using instruction-tuned models. A direct comparison would require switching to instruction-tuned LLMs capable of ingesting augmented data (e.g., "Given this input and context, classify "). However, the models we use are classification-specific (added classification head, similar to related works [1]), making such a comparison incompatible. Moreover, instruction-tuned LLMs used in RAG systems are typically very large (billions of parameters), which would make the comparison unbalanced and outside the scope of our focus on few-shot distillation under resource constraints. We believe bridging these paradigms is a promising direction for future work.
On limitations
We appreciate the reviewer's suggestion. We will expand the limitations section to explicitly discuss the engineering overhead involved in our system.
[1] Chen Liang, Simiao Zuo, Qingru Zhang, Pengcheng He, Weizhu Chen, Tuo Zhao. Less is More: Task-aware Layer-wise Distillation for Language Model Compression.
Thank you very much for your response and the effort you put into it. Taking into account both the feedback from other reviewers and my own evaluation, I believe that my initial positive assessment remains appropriate. Therefore, I would like to maintain my original score.
The paper studies the problem of few-shot knowledge distillation with the aid of counterfactual explanations. It provides both theoretical justifications and experimental results to support the proposed method.
Strengths and Weaknesses
Strengths: I generally enjoyed reading this paper: the problem is well-defined and formulated, and the method is clear and elegant. I also find the theoretical analysis interesting: while the Fisher Information result is fairly intuitive (Theorem 1), the Hausdorff distance analysis is more interesting to me. The paper also has good experiments.
Weaknesses: I have one question and one concern regarding the method. My question is: why, when we have more samples (k) for KD, does the use of CFEs seem to worsen learning? Concern: CFEs are defined based on the model's labels. What happens to the method when the number of labels is large (and the set of CFEs has much higher variance), or when we consider generative LLMs?
Questions
See weaknesses.
Limitations
Yes
Final Justification
After consideration, I've decided to retain my original score.
Formatting Issues
No
We thank the reviewer for the thoughtful review and are delighted to hear that the reviewer has found our work “well-defined and formulated” and the method to be “clear and elegant”!
Below, we provide detailed responses to the questions asked:
On the effect of more samples ($k$) in the distillation process
Firstly, note that for a fixed distillation dataset budget $k$, our method effectively uses half the number of real samples for distillation. E.g., for $k = 8$, KD uses 8 randomly selected labeled examples, while our strategy only uses 4 labeled examples complemented with their counterfactuals (CFEs) and still significantly outperforms regular KD. Thus, it achieves more with less in the few-shot regime.
Increasing the number of samples ($k$) does not necessarily worsen our approach, but it can make regular KD/LWD without CFEs more competitive (reducing the performance gap). We hypothesize that this is because, as $k$ increases, real datapoints start to densely cover the decision space and hence we see diminishing returns from CFEs. Indeed, if the entire training data were available during distillation, the student's performance would be expected to keep improving with KD/LWD.
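To make the budget accounting concrete, here is a minimal sketch of assembling the distillation set under a fixed budget $k$; `generate_cfe` is a hypothetical placeholder for the counterfactual-generation step:

```python
# Sketch of the budget accounting described above: under a fixed budget k, regular KD
# uses k labeled examples, whereas our strategy uses k/2 labeled examples plus their
# counterfactuals. `generate_cfe` is a placeholder for the CFE-generation step.
def build_distillation_set(labeled_pool, k, generate_cfe):
    real = labeled_pool[: k // 2]                    # half the budget: real labeled pairs (x, y)
    cfes = [generate_cfe(x, y) for (x, y) in real]   # other half: one CFE per real pair
    return real + cfes                               # total distillation set size stays at k
```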
On new experiments for multi-class classification
Yes, the strategy could be extended to the multi-class classification setting. One way to extend it could be as follows: consider a multi-class setting with classes $\{1, \dots, C\}$. For a data point belonging to a specific class $c$, we can generate several CFEs to flip the label to each of the other classes, excluding $c$.
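A minimal sketch of this variant is shown below; `generate_cfe_to_class` is a hypothetical helper (not the exact implementation) that returns a minimal perturbation of $x$ assigned to the target class by the teacher:

```python
# Sketch of the multi-class extension: for a point x of class c, generate one CFE per
# other class. `generate_cfe_to_class(x, target)` is a hypothetical helper returning a
# minimal perturbation of x that the teacher assigns to `target`.
def multiclass_cfes(x, c, classes, generate_cfe_to_class):
    return [generate_cfe_to_class(x, target) for target in classes if target != c]
```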
As an initial demonstration, we implemented this variant of our proposed strategy on the SST5 dataset (Stanford Sentiment Treebank with 5 labels: very positive, positive, neutral, negative, very negative). We filtered it to a subset of classes to have good separation among labels. When regular KD/LWD uses $k$ real samples, our proposed strategy effectively uses a smaller number of real samples complemented with their CFEs (so only a fraction of the real data that regular KD/LWD uses).
KD: KD+CoD (Ours):
LWD: LWD+ CoD (Ours):
Notably, there could be several other variants of our strategy for the multi-class setting. For instance, one could also consider only one CFE per datapoint corresponding to the smallest perturbation leading to any changed class, or even considering the counterfactual of the counterfactual, etc. It will be an exciting direction of future research to study the performance tradeoffs among different variants of our strategy for the multi-class setting.
On extension to generative LLMs
One possible way to extend our approach to generative LLMs is to define a counterfactual explanation as a minimal change to the input (prompt) that flips a chosen property of the generated text while keeping everything else as similar as possible. Formally, given a generative model $G$, a prompt $x$, and a binary attribute function $a(\cdot)$ (e.g., sentiment, toxicity, factuality, topic relevance), a counterfactual explanation prompt $x'$ would be a small semantic perturbation of $x$ such that the generated output $G(x')$ flips the value of $a$.
Another possible extension is to rethink counterfactual generation in terms of model sensitivity: identifying minimal semantic perturbations to the input that cause large shifts in the output distribution or likelihood. For generative models, this corresponds to large changes in the sequence-level probability, i.e., $\big|\log p_G(y \mid x) - \log p_G(y \mid x')\big|$ being large, where $y$ is the generated sequence for an input prompt $x$ and $x'$ is its perturbation. In this framing, CFEs could reveal parts of the input space where the model is uncertain, making them especially informative for distillation. While outside the scope of the current paper, this extension would be an interesting future direction, potentially inspiring novel and efficient pathways for data selection for a broad range of tasks, including distillation, supervised finetuning (SFT), post-training, etc.
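As a rough illustration of how this sensitivity score could be computed with an off-the-shelf causal LM (the model choice here is illustrative only, and this is not part of the current paper):

```python
# Sketch of the sensitivity-based formulation: measure how much the sequence-level
# log-likelihood of a generated output y changes when the prompt x is perturbed to x'.
# The model/tokenizer choice is illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def sequence_log_prob(prompt, output):
    ids = tok(prompt + output, return_tensors="pt").input_ids
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # predictions for tokens 1..T-1
    targets = ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].sum().item()        # log p(output | prompt)

def likelihood_shift(prompt, perturbed_prompt, output):
    return abs(sequence_log_prob(prompt, output) - sequence_log_prob(perturbed_prompt, output))
```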
We will include these discussions in the revised paper.
I have checked the responses from the authors and I acknowledge that they have addressed my concerns. I decide to keep my original score.
This paper proposes a way of distilling LLMs into smaller models by producing data points with minimal perturbations that change the LLM output. Reviewers agree on the clear formulation and the significance of the problem; concerns raised about significance were addressed in the rebuttal.