PaperHub
Overall score: 7.3 / 10
Poster · 4 reviewers
Ratings: 4, 5, 5, 4 (min 4, max 5, std 0.5)
Confidence: 3.5
Novelty: 2.8 · Quality: 3.0 · Clarity: 3.0 · Significance: 3.3
NeurIPS 2025

Learning Task-Agnostic Representations through Multi-Teacher Distillation

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We show that interval-estimation-based methods produce better distilled embedders in multi-teacher distillation settings than MSE- or cosine-based methods.

Abstract

Keywords
molecular representation · nlp · knowledge distillation · embedding models · representation learning

Reviews and Discussion

Official Review
Rating: 4

This paper proposes a novel framework for task-agnostic multi-teacher distillation that leverages a majority vote loss formulation, theoretically grounded in mutual information between student and teacher embeddings. The authors develop a principled loss function using conditional entropy and demonstrate the approach across three domains: NLP, CV, and molecular modeling. Extensive experiments show that the distilled student models achieve competitive or superior performance compared to existing baselines of similar sizes, improving the Pareto frontier of model size vs performance.
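For readers less familiar with the information-theoretic framing the review refers to, here is a minimal sketch of the standard identity behind this kind of bound (the notation $S$ for the student embedding and $T$ for the teacher embeddings is ours, used only for illustration):

```latex
% Standard identity linking mutual information and conditional differential entropy:
%   I(S; T) = h(T) - h(T | S).
% The teachers are frozen, so h(T) is a constant; minimizing the conditional entropy
% h(T | S) is therefore equivalent to maximizing the mutual information I(S; T),
% which does not reference any particular downstream task and hence yields a
% task-agnostic training signal.
\[
  I(S;\,T) \;=\; h(T) \;-\; h(T \mid S).
\]
```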

Strengths and Weaknesses

Strengths:

  1. The paper connects Bayesian classifier disagreement bounds with conditional entropy and mutual information, offering a principled and task-agnostic loss formulation.

  2. The method is validated on three diverse domains (text, vision, molecules), strengthening the claim of generality. Across multiple benchmarks (MTEB, TDC, fine-grained vision tasks), the distilled models consistently outperform or match larger models, pushing the size-performance tradeoff frontier. The comparison to standard MSE, cosine, and CompRess baselines is thorough and convincing.

Weaknesses:

  1. While the proposed method is compared to MSE, cosine similarity, and CompRess, the set of baselines remains relatively narrow. There exist several established feature-distillation methods from the literature (e.g., Correlation Congruence for Knowledge Distillation [ICCV 2019]) that, while originally developed for task-specific distillation, can be straightforwardly adapted to a task-agnostic multi-teacher setup (e.g., by averaging teacher features). Including one or two such baselines (even only on a single modality like vision) would strengthen the empirical case and more convincingly isolate the advantages of the proposed information-theoretic loss.

  2. Although the method is claimed to be more stable than MSE or cosine methods, the paper lacks deeper analysis (e.g., convergence curves, variance across runs) to support this point.

  3. Additionally, there are some minor presentation issues. Several figures (e.g., in Section 4 and 5) are rasterized and appear blurry when zoomed in — using vector graphics would improve readability. There are also a few small typos and formatting inconsistencies: 1) Line 39: missing the “T” in “To our knowledge”; 2) Line 97: verb tense (“introduce”) is in past tense, inconsistent with surrounding text; 3) Line 176–177: the formatting of the letter “M” is inconsistent (italicized in one line but not the other). These issues do not affect the technical content but could be improved in the final version.

Questions

  1. Have you considered adapting other feature-based distillation methods as task-agnostic baselines? For example, Correlation Congruence (ICCV 2019) and Relational Knowledge Distillation (CVPR 2019) could be easily applied in your setup by averaging over teacher features. Including them on a single domain like vision is enough to provide a more comprehensive comparison.

  2. Could you analyze the stability of training compared to MSE or cosine baselines? Maybe you can provide convergence curves and variance across runs.

  3. Could the proposed method be extended to cross-modal distillation, as hinted at in the conclusion? What challenges do you anticipate?

Limitations

Yes. The authors acknowledge that maximizing mutual information does not explicitly preserve the structure of the embedding space (e.g., cosine similarity), since information-theoretic objectives are invariant to invertible transformations. While this is theoretically valid, it does not appear to pose a practical problem in the reported experiments. The student models still achieve strong performance across tasks, suggesting that the learned representations remain effective despite the lack of explicit structure preservation. As such, I view this limitation as mostly theoretical and not impactful to the empirical validity or applicability of the method.

Final Justification

I appreciate the authors' response. The paper is overall well-executed, and the experimental section is quite thorough. That said, I believe that a score of 4 (borderline accept) remains the most suitable assessment.

Formatting Concerns

No

Author Response

We thank reviewer 3Aq4 for their thorough and insightful review. We appreciate that they found our method principled and our experimental setup thorough and convincing. We added additional baselines to our vision experiments and we are thankful to the reviewer for pointing out the unclear reference to the MSE training stability, as well as the mistaken reference in the conclusion. We will correct these in the revised version of the paper.

MSE distillation stability

We would like to clarify that our reference to the instability of Mean Squared Error (MSE) was drawn from previous works in the field of reinforcement learning [1, 2, 3], where stability is context-specific. Our intention was to use this as motivational background for our work, rather than making a direct claim about stability within our own research. We acknowledge that this distinction was not clear in our initial presentation, and we will revise those references. Most importantly, we mistakenly referenced stability in our conclusion, which we will correct in the revised version.

Additionally, we will add a dedicated section comparing the MSE loss and our NLL loss (with the training curves):

We observed that when training with the MSE loss, the loss reaches a minimum in only a few epochs (~40), but the distilled students achieve lower performance on downstream tasks. This could be because the NLL loss is more expressive and harder to optimize (see below). As a result, the student learns more informative features than when trained with the MSE loss. (Training curves will be included.)

We can provide a theoretical insight to explain this phenomenon. Training with the negative log-likelihood of a Gaussian kernel is a simple generalization of the MSE. For a multivariate Gaussian kernel parameterized by $\mu$ and $\Sigma$, we have:

$$-\log p_{\mu, \Sigma}(x) = -\log C + \frac{1}{2} \log\det \Sigma + \frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)$$

Minimizing the MSE loss boils down to minimizing this expression over $\mu$ only, with $\Sigma = I$. Therefore, minimizing the negative log-likelihood of a Gaussian kernel is strictly more expressive than minimizing the MSE directly, which could account for the performance gains we observe.
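For illustration, a minimal PyTorch-style sketch of the two objectives discussed above (the diagonal-covariance parameterization and the layer names are our assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class GaussianNLLHead(nn.Module):
    """Predicts a Gaussian over a teacher embedding given the student embedding.

    With log_var held at zero (i.e. Sigma = I), the NLL below reduces to the MSE
    up to a constant, illustrating that the Gaussian NLL strictly generalizes MSE.
    """
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.mu = nn.Linear(student_dim, teacher_dim)       # predicted mean
        self.log_var = nn.Linear(student_dim, teacher_dim)  # diagonal log-covariance

    def forward(self, s: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        mu, log_var = self.mu(s), self.log_var(s)
        # 0.5 * [ log|Sigma| + (t - mu)^T Sigma^{-1} (t - mu) ], dropping the constant
        nll = 0.5 * (log_var + (t - mu).pow(2) * torch.exp(-log_var)).sum(dim=-1)
        return nll.mean()

def mse_loss(mu_layer: nn.Linear, s: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """The MSE baseline: the same predictor with Sigma fixed to the identity."""
    return ((t - mu_layer(s)) ** 2).sum(dim=-1).mean()
```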

[1] Jesse Farebrother et al. "Stop Regressing: Training Value Functions via Classification for Scalable Deep RL." 2024.

[2] Marc G. Bellemare et al. "A Distributional Perspective on Reinforcement Learning." 2017.

[3] Lawrence Stewart et al. "Regression as Classification: Influence of Task Formulation on Neural Network Features." 2022.

Extension to multi-modal setting

The method can indeed be extended to a multimodal setting, where modality-specific encoders would share the same backbone to embed different types of data. The challenge we would like to address is training a distillation model when limited cross-modal labels are available, to enable pretraining on larger datasets. So far, we did not observe a significant advantage compared to single-modality training, and we are running additional experiments to answer the question: "How much cross-modal information is required to distill multimodal representations that outperform the single-modality ones?"

Additional baselines

We have added Relational KD (RKD) (Park et al., CVPR 2019) and Correlation Congruence with Gaussian RBF and Bilinear (normalized features) kernels (CC-grbf, CC-Bilinear) (Peng et al., ICCV 2019) for vision. As shown in the following table, our method (NLL) performs better than these additional baselines. Our intuition for the difference in accuracy is that both RKD and CC were proposed to work alongside the task loss, which could be an important signal for their optimization in practice.

| Method | CIFAR10 | DTD | STL10 | SVHN | FGVC-Aircraft | CUB |
|---|---|---|---|---|---|---|
| RKD | 87.64 | 52.23 | 89.63 | 61.66 | 30.54 | 47.85 |
| CC-grbf | 84.07 | 61.86 | 93.03 | 59.96 | 33.48 | 57.55 |
| CC-Bilinear | 92.95 | 61.22 | 95.42 | 63.71 | 35.16 | 64.70 |
| NLL | 94.76 | 65.85 | 96.45 | 76.91 | 48.13 | 69.37 |
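For concreteness, here is a rough sketch of how the distance-wise RKD loss can be adapted to a task-agnostic multi-teacher setting; averaging the per-teacher relational losses is one possible adaptation and is not necessarily the implementation used for the table above:

```python
import torch
import torch.nn.functional as F

def pdist_normalized(e: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Pairwise Euclidean distances within a batch, normalized by their mean (as in RKD)."""
    d = torch.cdist(e, e, p=2)
    off_diag = ~torch.eye(e.shape[0], dtype=torch.bool, device=e.device)
    return d / (d[off_diag].mean() + eps)

def multi_teacher_rkd_distance_loss(student_emb, teacher_embs):
    """Distance-wise RKD loss averaged over teachers.

    student_emb:  (B, d_s) student embeddings for a batch.
    teacher_embs: list of (B, d_k) tensors, one per teacher; dimensions may differ
                  across teachers since only the pairwise-distance structure is compared.
    """
    ds = pdist_normalized(student_emb)
    losses = [F.smooth_l1_loss(ds, pdist_normalized(t.detach()))  # Huber loss as in RKD
              for t in teacher_embs]
    return torch.stack(losses).mean()
```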

Presentation Issues

Thank you for your thoroughness. You are absolutely right, we’ll update the figures with vectorized versions, and we’ll fix the formatting inconsistencies you pointed out.

Thank you again for your valuable feedback and suggestions, which have significantly improved our paper. In particular, thank you for pointing out the lack of clarity in our reference to the instability of MSE value estimation in reinforcement learning, which we have corrected.
We hope we addressed all of your concerns so that you might consider raising the score of your review.

Comment

Thanks for the detailed rebuttal! You have responded thoroughly to most of my concerns.

Just a quick question: do you expect the multimodal experiments to be ready within the next week? It’d be great to see even a preliminary result if feasible.

Comment

We are glad we answered most of reviewer 3Aq4’s feedback.

We have begun extending our method to the multimodal setting, with a focus on the medical domain, where unpaired (i.e., images and texts are not matched to one another), unlabeled, or inconsistently labeled data is especially prevalent.

For training, we used unpaired data from different modalities: histopathology image datasets for vision [1, 2] and medical textbook-style corpora [3] for text. Our current setup comprises 3 teachers per modality. We trained one multimodal student on both modalities to encourage knowledge coordination, inspired by work in coordinated representation learning, and one student trained only on the vision datasets.

As an initial evaluation, we compared unimodal and multimodal training setups on downstream classification tasks using PCAM and CRC benchmarks. Results are as follows:

| Modality | CRC | PCAM |
|---|---|---|
| Vision only | 95.91 | 85.95 |
| Vision + text | 95.69 | 86.02 |

Overall, the performance of the multimodal student is on par with that of the single-modality student. Hence, training a student to embed both images and text does not seem to degrade the quality of the vision embeddings, but it does not make them more informative for the moment.

We are working on adding the cross-modal information, but unfortunately we don’t believe we will have results by this week for this.

[1] Jewsbury, Robert, et al. "StainFuser: Controlling diffusion for faster neural style transfer in multi-gigapixel histology images." arXiv preprint arXiv:2403.09302 (2024).

[2] Kather, J. N., Zöllner, F. G., Bianconi, F., Melchers, S. M., Schad, L. R., Gaiser, T., Marx, A., & Weis, C.-A. (2016). Collection of textures in colorectal cancer histology [Data set]. Zenodo.

[3] Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H., & Szolovits, P. (2021). What disease does this patient have? A large-scale open-domain question answering dataset from medical exams. Applied Sciences, 11(14).

Comment

I see. Thank you for your quick response!

Official Review
Rating: 5

This paper presents a novel approach to Multi-Teacher Knowledge Distillation (KD), addressing the problem of generating general-purpose, task-agnostic embeddings. Specifically, the core contribution is a task-enabling setting for multi-teacher distillation. Instead of traditional MSE-based losses, which can be unstable in high-dimensional spaces, the proposed method trains a student model to align its downstream task predictions with the collective predictions of an ensemble of teacher models. This is achieved through an ensembling loss that measures agreement between Bayesian predictors derived from student and teacher embeddings. A key theoretical finding is that this loss can be bounded independently of the specific task, utilizing the conditional differential entropy of the teachers' embeddings given the student's output, thereby providing a robust, task-agnostic student-teacher reconstruction loss. To evaluate the conditional entropy of the teachers’ embeddings given the student’s embedding, the authors propose using a parametric Gaussian model whose parameters are learned during the student’s training. Finally, the paper demonstrates high-quality generalized embedders across molecular modeling, natural language processing, and computer vision, with trained student models achieving competitive performance on a range of downstream tasks (e.g., classification, regression, clustering, sentence similarity).

Strengths and Weaknesses

Strengths: This well-crafted paper proposes a fairly novel and interesting method, complemented by comprehensive and compelling experimental findings.

Weaknesses: No major weaknesses.

Questions

I feel the paper should be accepted (I gave it a 5) but I am not sure if there's anything that could be added for me to give a higher score.

Limitations

Yes

Final Justification

Having read the authors' rebuttals and the other comments from the reviewers, I still maintain my score and positive opinion about this paper, which I feel should be accepted.

Formatting Concerns

No concerns.

Author Response

We thank reviewer zNKm for their review, and we are glad they appreciated our work, and believe the paper should be accepted. We remain available to answer any new question if needed.

Official Review
Rating: 5

This paper proposes a task-agnostic framework for multi-teacher distillation that learns general-purpose representations without requiring task-specific labels or supervision. The approach introduces a novel loss function based on a “majority vote” principle, which is shown to be bounded by the mutual information between the student and teacher embeddings. This results in a task-agnostic objective that encourages the student to align with the ensemble of teachers across a wide range of potential downstream tasks.

The method leverages a differentiable, Gaussian mixture-based estimator of conditional entropy to implement this loss in practice. The training procedure minimizes the negative log-likelihood of the teacher embeddings conditioned on the student’s output, allowing end-to-end learning of a compact, informative student embedder.

The paper evaluates the approach across three domains - natural language processing, computer vision, and molecular modeling - using a range of classification, regression, and clustering tasks. The results demonstrate that the distilled student models achieve competitive or superior performance relative to both teacher models and size-matched baselines. The distilled model by the proposed method also shows strong size-performance trade-offs, advancing the Pareto frontier for efficient representation learning.
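A minimal sketch of such a Gaussian-mixture conditional-likelihood head, in the style of a mixture density network (the component count, diagonal covariances, and layer names are illustrative assumptions, not the authors' exact estimator):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalGMMHead(nn.Module):
    """Models p(teacher_emb | student_emb) as a K-component diagonal Gaussian mixture
    and returns the negative log-likelihood of the observed teacher embeddings."""
    def __init__(self, student_dim: int, teacher_dim: int, n_components: int = 8):
        super().__init__()
        self.K, self.D = n_components, teacher_dim
        self.logits = nn.Linear(student_dim, n_components)
        self.mu = nn.Linear(student_dim, n_components * teacher_dim)
        self.log_var = nn.Linear(student_dim, n_components * teacher_dim)

    def forward(self, s: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        B = s.shape[0]
        log_pi = F.log_softmax(self.logits(s), dim=-1)      # (B, K) mixture weights
        mu = self.mu(s).view(B, self.K, self.D)             # (B, K, D) component means
        log_var = self.log_var(s).view(B, self.K, self.D)   # (B, K, D) diagonal log-variances
        # log N(t; mu_k, diag(exp(log_var_k))) for each component k
        log_comp = -0.5 * (log_var
                           + (t.unsqueeze(1) - mu).pow(2) * torch.exp(-log_var)
                           + math.log(2 * math.pi)).sum(dim=-1)  # (B, K)
        return -torch.logsumexp(log_pi + log_comp, dim=-1).mean()
```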

Strengths and Weaknesses

Strengths

  • The paper introduces a theoretically grounded, task-agnostic distillation objective based on minimizing the conditional entropy of teacher embeddings given the student’s. This formulation allows the student to learn diverse, informative representations without reliance on task-specific labels, offering a general and conceptually sound approach.
  • The method is validated across three domains - language, vision, and molecular modeling - demonstrating its versatility. The breadth of evaluation strengthens the claim that the learned representations are useful across a wide range of downstream tasks.
  • The distilled student models consistently achieve high performance for their parameter count. In many cases, they match or outperform significantly larger models, indicating the method’s effectiveness in capturing information from multiple teachers.

Weaknesses

  • While the authors acknowledge that the proposed objective does not preserve structural relationships (e.g., cosine similarity) in the embedding space, which helps clarify the scope and nature of the learned representations, the paper does not include a dedicated discussion of the limitations of the method in general. For instance, the handling of teacher inconsistency, conditions where the method may underperform, etc., are not discussed.

Questions

  1. Despite the strengths of the work, there is no dedicated discussion of its limitations, either theoretical or practical. For example, the method may face challenges in embedding structure preservation (as the authors briefly mentioned), memory costs from storing teacher embeddings, or reliance on high-quality teachers. Could the authors add an explicit limitations section/paragraph discussing practical boundaries of the approach, including cases where it may underperform or become inefficient? Can the authors elaborate on the practical implications of unstructured embeddings? Have the authors observed cases where this limitation actually causes issues?
  2. It may also be related to the above - Line 253, Page 8: Is there any reasoning behind why the default 8-teacher model struggles and performs worse than the 1- or 2-teacher models on the BBB (Distribution) task?

Minor issues

  • Line 39, Page 2: “o our knowledge, …”
  • No caption for Figure (probably) 4

Limitations

Please refer to the weaknesses and the questions above.

Final Justification

This paper proposed a novel approach for task-agnostic multi-teacher distillation. The paper is well written, and the authors addressed the concerns raised, mainly regarding the discussion of limitations. The paper should be accepted so that the community can benefit from the work and extend it.

Formatting Concerns

no

Author Response

We thank reviewer dFG5 for their insightful review and are pleased that they found our work interesting. Following the reviewer's advice, we will include a section dedicated to the limitations of our approach, referencing some results of other sections.

Limitation discussions

While we discussed some limitations in the appendices and experimental section, we agree our work would benefit from a dedicated limitation section. We will add a specific section covering the following limitations of our method:

  • Application Scope: Our method develops student embedding models primarily for diverse, unknown tasks. For single pre-defined tasks, task-specific distillation approaches might be more suitable.
  • Overhead in Distillation: Like any distillation setting, especially multi-teacher distillation, there is an overhead due to distilling large teachers into a smaller model. This overhead can be computational (if teacher embeddings are obtained by running inference during each training run) or memory-related (if teacher outputs are precomputed and used during training). We opted for the latter, as it drastically speeds up student training, by initially storing all necessary embeddings on disk (a rough sketch of this precompute-and-load pattern is shown after this list). For text (our most computationally demanding application), this amounts to approximately 100GB of embeddings for the largest teacher.
  • Teacher Quality: Our approach requires high-quality teachers relevant to the downstream tasks. In Section D.4, Table 26, we explore the impact of training a student embedder with task-specific teachers (classification, object detection and segmentation). We demonstrate that while task-specific teachers may offer limited benefits outside their domain, they do not negatively impact the students’ learning when used alongside task-relevant teachers.
  • Structural relationship: As pointed out in Section 4.2, our metric only optimizes the mutual information between the student and the teachers; it does not directly enforce any structure on the embedding space, which could harm the performance of our models on clustering tasks, for instance. On the textual embedding benchmark, we observe clear gains on classification tasks (where a small classifier is trained on top of the embeddings), but the gains are less clear for clustering and STS tasks that rely on the dot product between embeddings to assess text similarity (see full results in App. C.2.2).
  • Representative dataset: To effectively embed data for future tasks, our method requires a training set that is representative of the data distributions of these tasks. This limitation, however, is common to all embedding models, which all require a diverse and representative dataset for training.
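As referenced in the overhead item above, here is a rough sketch of the precompute-and-load pattern (the file name, shapes, and dtype are illustrative assumptions, not the authors' actual pipeline):

```python
import numpy as np

# One-off precompute: write a teacher's embeddings for the training corpus into a
# memory-mapped file on disk (hypothetical sizes, for illustration only).
n_samples, teacher_dim, bs = 100_000, 1024, 256
store = np.memmap("teacher_A_embeddings.f16", dtype=np.float16,
                  mode="w+", shape=(n_samples, teacher_dim))
# for i, batch in enumerate(dataloader):                    # teacher inference loop
#     store[i * bs:(i + 1) * bs] = teacher_encode(batch).astype(np.float16)
store.flush()

# During student training: reopen read-only and slice mini-batches lazily, so the
# full embedding store (e.g. ~100GB for the largest text teacher) never has to fit in RAM.
teacher_embs = np.memmap("teacher_A_embeddings.f16", dtype=np.float16,
                         mode="r", shape=(n_samples, teacher_dim))
first_batch = np.asarray(teacher_embs[:bs], dtype=np.float32)  # materialize one slice
```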

Comparison of 8-teachers to 2-teachers on the BBB benchmark.

The student trained in molecular modeling with 8 teachers indeed shows slightly lower performance on the BBB (blood-brain barrier) benchmark compared to students trained with 1 or 2 teachers. The BBB benchmark data distribution differs significantly from our training set, representing a domain shift relative to the training set of the students. Furthermore, it is one of the benchmarks where teacher performance is most tightly packed, with variations within 1.45 times the average standard deviation of the results. This could explain why training with 8 teachers performs similarly to training with 1 or 2 (the differences in AUROC being half the standard deviation), as all teachers demonstrate comparable performance on this specific task. We believe this explains the slightly lower average performance of the 8-teacher student compared to the 1- or 2-teacher students.

Thank you again for your valuable feedback. We believe these clarifications and updates will address your concerns and enhance the quality and relevance of our work. We remain available to clarify any additional points if needed.

Comment

Thank you for addressing the concerns that I had and asked. I hope the discussion about the limitations as well as the BBB result are included in the revised paper. Also, please correct and update minor issues, including typos and missing figure captions.

Official Review
Rating: 4

This paper proposes a multi-teacher distillation technique that is task agnostic. Through a majority vote objective function and ensembling loss, they show that this loss can be bounded independently of the task, making the distillation process task agnostic.

Strengths and Weaknesses

Strengths:

  • This paper formalizes the problem of task agnostic distillation well.
  • The use of a Gaussian mixture-based estimator to formulate the loss is quite interesting and novel.
  • They show a good understanding of the problem in designing embedding models, and show the relevance of this technique.
  • The paper shows application of this method for distilling molecular embedders, which is an interesting and important application.

Weaknesses:

  • The paper is a bit outdated.
  • The paper does not motivate the need for task agnostic distillation well. A good comparison of modern embedding models on the benchmarks would have been useful, with more elaborate benchmarks.
  • The existing evaluation has multi-teacher distillation techniques as the baseline, but overall, it should have newer embedding models in the baseline as well.
  • the evaluations are done on really small models and small benchmarks.

Questions

please see weakness

Limitations

no

Final Justification

I would like to keep the initial score. This paper addresses tiny models, and from this discussion it's pretty clear that the technique proposed by the authors won't be beneficial for models that are >1B params, or even >700M params. This casts doubt on the motivation for the paper and the relevance of the technique in light of the modern embedding models out there.

Formatting Concerns

no major concerns

Author Response

We thank reviewer Aw1B for their thorough review, and we are glad they appreciate the novelty of our proposed method. We politely disagree that our experiments focused on 'really small models and small benchmarks', and we aim to provide a clear justification below.

Small Models

We trained models of up to 300M parameters for textual embeddings; while this is not particularly large (<1B), it is a standard size for text embedding models. Contrary to text generation, where clear performance gains are seen when using very large models, text embedding models can achieve very competitive scores with far fewer parameters (e.g., the Stella-500M model outperforming several 8B models on the MTEB). This scale is comparable to other well-established and widely used embedders, such as Stella and GIST, which are recognized for their strong performance in the field.

Small benchmarks

We politely disagree with this statement: across the three modalities evaluated in our study, we conducted comprehensive assessments on a total of 71 widely used datasets. These included 6 datasets for computer vision, 32 for molecular modeling, and 33 for text analysis (i.e., the Massive Text Embedding Benchmark, which constitutes the reference for evaluating textual embedders).

Motivation of the task-agnostic distillation

In our introduction, we aim to motivate task-agnostic distillation in two steps. First, we highlight the development of embedding models, which are inherently task-agnostic. These models compress objects into numerical representations, thereby facilitating a wide array of downstream tasks. Next, we argue that the diversity of embedding models available in each field can be leveraged to build an embedder that benefits from these diverse representations through distillation.

We will extend the paragraph motivating the use of embedding models by mentioning their current applications in a wide range of scenarios, including classification, clustering, and information retrieval. We will also emphasize their key advantage in terms of computational efficiency, particularly in scenarios with limited labeled data.

'The paper is a bit outdated'

We are not sure what part of the paper you are referring to when you say it is a bit outdated. Could you be more specific so we can update and improve our work accordingly? Nevertheless, we would like to make a few remarks, as the field of text embedding models has gained a lot of research focus lately:

  • We chose our baselines based on their leading performance on the MTEB benchmark at the time of submission. Notably, Stella 400m v5 was the top-performing model in its weight category, only recently surpassed by the Qwen embedders released on June 15th 2025.
  • We compared our students against the best models in each category, including GIST models, which were among the highest-performing embedders across most weight categories until early 2025.

In any case, we will update the MTEB benchmark tables in our paper to include the most recent results published.

Choice of baselines

We did not compare our approach only with multi-teacher distillation methods. While our analysis mainly focuses on comparing our method with multi-teacher distillation methods, we also compared its performance with several embedders for each modality (of the same weight category). The objective of our experimental section is to validate that our distillation approach is efficient in compressing the information of several teachers into a smaller student, hence our focus on distillation baselines. It is hard to provide a fair comparison when comparing with other embedders, since they have been trained with widely different settings, datasets, infrastructures, and training objectives, whereas our distillation comparisons are run in a controlled setting that enables fair comparisons and answers our initial scientific question.

We hope we have successfully addressed all your concerns, and we would be grateful if you could reconsider your score based on our revisions. Thank you again for your insightful review.

Comment

We thank reviewer Aw1B for their involvement in the rebuttal process.

We want to insist that, in the main part of the paper, we do compare with the most recent/modern embedders of similar sizes at the time of submission (only the best one for each weight category; the full MTEB results are provided in Appendix C.2), and thus already provide such a head-to-head comparison. Indeed, our method produces models that outperform all modern embedders in their size categories (for a fair comparison), suggesting that the multi-teacher setting provides significant advantages. For comprehensiveness' sake, we include in this rebuttal the most recent MTEB results for the biggest/best models, compared to our own models (we had to trim the full table to fit in this answer, but we can provide any part of it). The most recent of these is the Qwen 600M embedder, released only a few weeks ago.

We agree there is no harm in comparing with larger models (we provide the results here as well); however, it is important to keep in mind that it is not an apples-to-apples comparison. Models of different sizes and computational costs have different applications. Showing that we can achieve higher information density with our models has practical applications for low-resource or on-edge deployment settings for which larger models are impractical.

Our medium model (109M parameters) is on par with models 5 times its size (average performance 80.23), only outperformed by Stella 400M (still the best model of its category, released early 2025 and included in the paper) and KaLM (494M). The best-performing and most recent model by far is Qwen 600M, which only outperforms our models by 5 points. Only models above 1B parameters achieve significant gains over our medium (109M-parameter) model. If you have any additional specific model in mind that you deem more recent, please let us know and we will add it if it is present on the MTEB benchmark. We will update the final version of the paper with the most recent version of the MTEB.

| Model | #Params (M) | Emb. Dim | AmazonCtf | Banking77 | IMDB | MTOPDomain | MassiveInt | MassiveScen. | ToxicConv. | TweetSent. | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Student-s-nll | 32 | 384 | 77.3 | 86.7 | 88.3 | 95.5 | 76.7 | 80.7 | 66.1 | 60.6 | 79.0 |
| Student-m-nll | 109 | 768 | 79.6 | 88.0 | 88.3 | 96.2 | 78.6 | 82.7 | 67.1 | 61.3 | 80.2 |
| stella_en_400M_v5 | 435 | 4096 | 94.3 | 89.3 | 96.5 | 98.3 | 80.5 | 89.6 | 84.0 | 73.6 | 88.2 |
| KaLM-embedding-multilingual-mini-instruct-v1 | 494 | 896 | 81.5 | 84.9 | 95.0 | 92.2 | 69.8 | 74.2 | 89.0 | 76.5 | 82.9 |
| KaLM-embedding-multilingual-mini-v1 | 494 | 896 | 76.4 | 79.2 | 91.6 | 92.5 | 70.9 | 76.1 | 70.8 | 62.7 | 77.5 |
| KaLM-embedding-multilingual-mini-instruct-v2 | 494 | 896 | 95.3 | 89.5 | 95.2 | 98.9 | 77.8 | 86.0 | 89.3 | 78.6 | 88.8 |
| jina-embeddings-v3 | 572 | 1024 | 90.9 | 84.1 | 91.9 | – | 75.2 | 84.1 | 91.3 | 71.4 | 84.1 |
| snowflake-arctic-embed-l-v2.0 | 568 | 1024 | 65.6 | 81.8 | 72.8 | 93.5 | 71.5 | 76.2 | 65.9 | 59.6 | 73.4 |
| Qwen3-Embedding-0.6B | 595 | 1024 | 91.5 | 81.0 | 95.4 | 96.0 | 80.4 | 83.6 | 82.1 | 76.0 | 85.8 |
| stella_en_1.5B_v5 | 1500 | 8960 | 94.1 | 89.8 | 96.7 | 98.7 | 84.5 | 89.7 | 86.8 | 74.8 | 89.4 |
| jasper_en_vision_language_v1 | 1000 | 8960 | 93.8 | 87.2 | 97.0 | 99.2 | 85.3 | 91.2 | 91.3 | 77.2 | 90.3 |
| Qwen3-Embedding-4B | 4000 | 2560 | 93.7 | 86.3 | 97.2 | 97.8 | 85.0 | 88.8 | 91.4 | 78.4 | 89.8 |
Comment

I thank the authors for their comments.

My biggest point of disagreement so far is that this paper does not compare its approach with modern embedding models. And that is the reason why I mentioned that the paper might be a bit outdated, since the embedding models that the paper compares to are increasingly outdated. Even if the authors feel that comparing it with modern embedding models is not a fair comparison, it would still be good to have those experiments in the paper. And that might directly question the motivation of the paper. The question is, "In light of how well modern embedding models perform, do we really need a multi-teacher distillation approach to train embedding models? And if so, could you motivate it by including head-to-head comparisons with modern embedding models?"

What's the harm in comparing it with models with a larger size and a larger context window?

Comment

Thank you for the updated experiments. I would still like to keep my current score, as I believe that the role of multi-teacher distillation is increasingly being replaced by more powerful models. While the authors focus specifically on tiny to small models that are sub-0.5 to 1B parameters (mostly sub-0.5B params), the SOTA embedding models are largely >1B params. The cost of using these models is already very low and going down further. Further, it can be safely extrapolated from the current table that almost all, if not all, models of size >1B params will outperform the multi-teacher distillation strategy presented in this paper.

Comment

We would like to thank reviewer Aw1b for their reviews and engagement in the rebuttal process.

We believe that to assess the value of multi-teacher distillation to train embedding models, models of similar sizes should be compared for a fair comparison (ideally in a controlled setting).

To compare with these larger models, a larger student should be trained. We acknowledge this is a limitation of our work (and of distillation that often aims to distill large models into smaller ones) and we will discuss it in the limitation section of the revised version of the paper.

We thank reviewer Aw1B again for their reviews and engagement throughout this rebuttal period, and despite our disagreements we completely respect their final decision.

Comment

Yes, distillation is from larger to smaller. But this paper deals with tiny models. I am just curious about how models on the order of a billion parameters fare with this methodology. I believe it's unfair to frame the inability to compare with models larger than the ones this paper studies as a limitation of distillation.

Comment

We agree it would have been interesting to experiment with >=1B embedders.

To perform these experiments, we would have had to use large embedders which were, at the time of the submission, not as performant relative to their size (Qwen3 and KALM models being released a month after this submission).

Besides, expanding the text experiments in this way would have required additional resources that we unfortunately did not have.

We will discuss this limitation (comparison to these new embedders) in the revised version of the paper.

We thank reviewer Aw1B once more for their continuous feedback.

Final Decision

This paper proposes a task-agnostic method for multi-teacher distillation utilizing a newly proposed loss function which is bounded, resulting in a task-agnostic objective. All reviewers agree that this is a paper with good methodological contribution and novelty (which is rare for the fairly saturated topic of knowledge distillation), and strong results across multiple datasets and domains. The authors did a good job with the rebuttal, addressing most of the reviewers' concerns. The only remaining issue raised pertains to the extent to which the method can be used for distilling very big models, but the authors have already shown results with a 300M model, so this limitation can be considered only minor. Overall, a clear accept.