PaperHub
ICLR 2024 · Poster · 4 reviewers
Overall rating: 6.5 / 10 (scores 6, 6, 8, 6; min 6, max 8, std 0.9)
Average confidence: 3.0

Towards the Fundamental Limits of Knowledge Transfer over Finite Domains

OpenReview · PDF
Submitted: 2023-09-18 · Updated: 2024-03-06
TL;DR

We settle the sample complexity of knowledge transfer at various levels of privileged information in the tabular setting.

Abstract

We characterize the statistical efficiency of knowledge transfer through $n$ samples from a teacher to a probabilistic student classifier with input space $\mathcal{S}$ over labels $\mathcal{A}$. We show that privileged information at three progressive levels accelerates the transfer. At the first level, only samples with hard labels are known, via which the maximum likelihood estimator attains the minimax rate $\sqrt{{|\mathcal{S}||\mathcal{A}|}/{n}}$. The second level has the teacher probabilities of sampled labels available in addition, which turns out to boost the convergence rate lower bound to ${{|\mathcal{S}||\mathcal{A}|}/{n}}$. However, under this second data acquisition protocol, minimizing a naive adaptation of the cross-entropy loss results in an asymptotically biased student. We overcome this limitation and achieve the fundamental limit by using a novel empirical variant of the squared error logit loss. The third level further equips the student with the soft labels (complete logits) on $\mathcal{A}$ given every sampled input, thereby provably enabling the student to enjoy a rate ${|\mathcal{S}|}/{n}$ free of $|\mathcal{A}|$. We find any Kullback-Leibler divergence minimizer to be optimal in the last case. Numerical simulations distinguish the four learners and corroborate our theory.
Keywords

knowledge transfer, classification, minimax optimality, density estimation, knowledge distillation

Reviews and Discussion

Review
Rating: 6

The authors consider the problem of knowledge transfer between a teacher (e.g., LLMs) and a student classifier (given limited privileged information). They show the fundamental limit of the transfer under three regimes:

  1. only input-label pairs are observed. The optimal rate is $\sqrt{|\mathcal{S}||\mathcal{A}|/n}$, achieved by the MLE.
  2. teacher probabilities of sampled labels are also available; then the optimal rate is $|\mathcal{S}||\mathcal{A}|/n$, achieved by the empirical SEL loss.
  3. complete logits are also given; then the optimal rate is $|\mathcal{S}|/n$, achieved by KL divergence minimization.

Numerical simulations are also provided.
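
For concreteness, here is a minimal toy simulation of the first (hard labels) setting; this is an illustrative sketch under assumed toy parameters, not the paper's experiment code. It fits the MLE student (empirical conditional frequencies) on $n$ hard-label samples and reports the $\rho$-expected TV error next to the $\sqrt{|\mathcal{S}||\mathcal{A}|/n}$ scale it is expected to track.

```python
# Illustrative sketch only (assumed toy sizes; not the paper's code): a random
# tabular teacher, hard-label samples, and the MLE student given by empirical
# conditional frequencies.
import numpy as np

rng = np.random.default_rng(0)
S, A, n = 20, 5, 10_000

rho = rng.dirichlet(np.ones(S))               # input distribution over S
pi_star = rng.dirichlet(np.ones(A), size=S)   # teacher conditionals, one row per input

# Draw n (input, hard label) pairs from rho x pi_star.
s = rng.choice(S, size=n, p=rho)
a = np.array([rng.choice(A, p=pi_star[si]) for si in s])

# MLE student: empirical conditional frequencies (uniform on unseen inputs).
counts = np.zeros((S, A))
np.add.at(counts, (s, a), 1)
row_sums = counts.sum(axis=1, keepdims=True)
pi_hat = np.where(row_sums > 0, counts / np.maximum(row_sums, 1), 1.0 / A)

# rho-expected TV error; the minimax rate scales like sqrt(|S||A| / n).
tv = 0.5 * np.abs(pi_hat - pi_star).sum(axis=1)
print("E_rho[TV] =", float(rho @ tv), " sqrt(SA/n) =", float(np.sqrt(S * A / n)))
```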

Strengths

Quality: high. The paper provides a thorough investigation under three regimes, including matching upper and lower bounds. Clarity: good. The setting is easy to understand, and the structure of the paper is clear.

Weaknesses

The setting of "knowledge transfer" seems to be fancy. But according to the definition, the teacher distribution is considered as "ground truth" that is already given, thus the "hard labels" regime seems to fall in the common statistical learning framework. I am not sure whether tabular settings under usual statistical learning framework have been studied.

Questions

  1. What is the main difference between your setting and the usual statistical learning setting? For example, if I rename your notation as follows: "input" $s$ becomes "feature" $x$ and label $a$ becomes $y$, then your (hard labels) setting seems to be exactly the usual setting of observing data $\{(x_i, y_i)\}_{i=1}^{n}$ with the goal of predicting $y$ from $x$. It seems there is no "knowledge transfer" (unless you call usual machine learning "knowledge transfer from the ground truth").

  2. The results are fully tabular. Can you generalize them to the continuous case?

Comment (Authors)

Thank you for your positive feedback; let us address your questions as follows.

Q1: I am not sure whether tabular settings under usual statistical learning framework have been studied.

A1:

  • As mentioned at the end of our Section 1.1, previous works [1, 2] similar to our $\mathsf{H}\text{ard }\mathsf{L}\text{abels}$ setting consider the teacher to also be an estimator that learns from samples generated via the Bayes (ground truth) probability, and [1, 2] largely fall into the ERM (empirical risk minimization) framework. In contrast, motivated by modern (proprietary) language (teacher) models, we treat the teacher model as the Bayes (ground truth) probability, and our target of interest is a symmetric closeness metric between the student and the teacher. This differs from ERM-related works, which mainly concern generalization bounds or the comparison between the generalization bound of the student and that of the teacher.

  • The term "knowledge transfer" we considered is to transfer any information from the teacher classifier to the student classifier as we mentioned in our Introduction section, any portion of {π(si)}i=1n\{\pi^\star(\cdot|s_i)\}_{i=1}^n is considered as privileged (i.e., additional) information.

  • At the very beginning of Section 3 (for $\mathsf{H}\text{ard }\mathsf{L}\text{abels}$), we already emphasize that

    "This is equivalent to the standard setting for estimating the conditional density π()\pi^\star(\cdot|\cdot)"

Q2: Can you generalize your results to the continuous case?

A2: In the case of log-linear models $\pi_{\theta}(\cdot|\cdot) \propto \exp(\langle \theta, \phi(\cdot, \cdot)\rangle)$, we do have some preliminary results showing that the conditional total variation between two models can be upper bounded by the $\ell_2$-norm of the difference between the model parameters, given a uniformly bounded feature mapping $\phi$. We have added these discussions and related conjectures (for log-linear or general softmax $\pi^\star$) on function approximation when the state space is allowed to be any standard Borel space in Appendix H (highlighted in blue), and we leave the complete characterization as future work.
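
To make the preliminary log-linear observation concrete, here is a small numerical sketch (assumed toy features and a uniform $\rho$; it is not the Appendix H argument) that prints the $\ell_2$ distance between two parameter vectors alongside the average conditional total variation of the induced softmax models:

```python
# Toy illustration (assumed bounded random features, uniform rho; not the
# Appendix H construction): compare the l2 parameter distance of two
# log-linear models with their average conditional total variation.
import numpy as np

rng = np.random.default_rng(1)
S, A, d = 30, 6, 8

phi = rng.uniform(-1.0, 1.0, size=(S, A, d))    # bounded feature map phi(s, a)
theta1 = rng.normal(size=d)
theta2 = theta1 + 0.1 * rng.normal(size=d)      # small parameter perturbation

def softmax_policy(theta):
    logits = phi @ theta                         # shape (S, A)
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

pi1, pi2 = softmax_policy(theta1), softmax_policy(theta2)
avg_tv = 0.5 * np.abs(pi1 - pi2).sum(axis=1).mean()    # uniform rho over S
print("||theta1 - theta2||_2 =", float(np.linalg.norm(theta1 - theta2)))
print("average conditional TV =", float(avg_tv))
```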

References:

[1] Aditya Krishna Menon, Ankit Singh Rawat, Sashank J Reddi, Seungyeon Kim, and Sanjiv Kumar. Why distillation helps: a statistical perspective. arXiv preprint arXiv:2005.10419, 2020.

[2] Tri Dao, Govinda M Kamath, Vasilis Syrgkanis, and Lester Mackey. Knowledge distillation as semiparametric inference. arXiv preprint arXiv:2104.09732, 2021.

Review
Rating: 6

This paper attempts to create a mathematical model for a simplified version of "knowledge transfer" from a teacher to a student within the context of multi-class learning. Notably, the authors make two key assumptions: that both 1) $\mathcal{S}$ (representing the space of all possible queries a student can pose to a teacher) and 2) $\mathcal{A}$ (denoting the space of all possible answers provided by the teacher) are finite.

In mathematical terms, for each query $s\in\mathcal{S}$, let $\pi^*(\cdot\vert s)$ represent the (optimal) teacher policy, or the conditional distribution of labels for the query $s$. The student's goal is to learn $\hat{\pi}$, a finite distribution with the same complexity as $\pi^*$, using only $n$ i.i.d. queries (and answers) from the teacher, represented by the set $\{(a_i,s_i)\}$ for $i\in[n]$.

In this context, the authors aim to establish minimax bounds for $\mathbb{E}_{s\sim\rho}\left[\mathsf{TV}\left(\hat{\pi},\pi^*\vert\rho\right)\right]$, which measures the "$\rho$-expected total variation distance" between the teacher and student's policies. Here, $\rho$ signifies the distribution that generates queries across $\mathcal{S}$. The paper successfully derives minimax rates for this problem under three different scenarios: i) when only hard labels are provided, ii) in cases where, in addition to labels, the probability of the label from $\pi^*$ is also given, and iii) when all class probabilities are given for each query.

The paper is well-written, and it meticulously references pertinent prior research (as far as I can tell). The mathematical derivations appear sound, with no conspicuous errors. Moreover, there are several experiments, though I haven't examined them closely. Overall, this is a commendable theoretical contribution. However, I believe that the assumption of finiteness for both the "query" and "answer" spaces oversimplifies the problem, leading to bounds that might not be as intriguing as one might hope (please refer to the weaknesses section for further clarification).

My vote is a weak accept.

Strengths

  • The paper is very well-written. In particular, the "Prior Works" section stands out as highly informative.

  • The problem formulation and motivations, especially the connections to Large Language Models (LLMs) and other foundational models, are immensely intriguing.

  • The theoretical framework developed in this work appears robust and sufficiently rigorous for a theory paper at ICLR. I scrutinized some of the proofs and did not find any significant errors. Furthermore, the other results seem mathematically sound.

  • The problem is analyzed under three distinct regimes, with the gradual introduction of additional side information to the original dataset. This approach allows the reader to observe, in a step-by-step manner, the mathematical impact of incorporating each layer of side information.

  • The paper also includes some experimental validations, although I did not review them in detail.

Weaknesses

I suppose that the assumptions regarding the finiteness of both $\mathcal{S}$ and $\mathcal{A}$ may have oversimplified the problem. Here is my interpretation of Theorems 3.1 and 3.2 (please correct me if I am mistaken):

  • In essence, the paper aims to "learn" a distribution of dimension $|\mathcal{S}|\times|\mathcal{A}|$, denoted as $\rho\times\pi^*$. By "learn," I mean the process of minimizing a specific expected total variation (TV) distance, as mentioned earlier. Let me introduce the notation $F = |\mathcal{S}|\times|\mathcal{A}|$. It is already established that having approximately $\tilde{O}(F)$ samples (where $\tilde{O}$ hides polylogarithmic factors), denoted as $n$, is sufficient to capture almost all the probability mass within $\mathcal{S}\times\mathcal{A}$ according to the distributions $\rho$ and $\pi^*$. This can be directly derived from the coupon collector theorem. Consequently, in the worst-case scenario where both $\rho$ and $\pi^*$ are not sparse, we typically have, on average, $\tilde{O}(n/F)$ samples for each query-answer pair $(a, s)$. Consequently, the overall distribution $\rho\times\pi^*$, as well as any of its derivatives (such as the aforementioned "expected TV distance," which is the primary focus of this paper), can be approximated with an error of at most $\tilde{O}(\sqrt{F/n})$, as a direct consequence of general inequalities like Hoeffding's. In my opinion, this summarizes most of Theorems 3.1 and 3.2. However, the rigorous derivation of the final formulations, which has been done in this paper, is still a nice addition to the ML theory community.

  • The bounds pertaining to other scenarios discussed in the paper can also be derived using mathematical methods similar to the ones mentioned above.

Nonetheless, I commend the authors for their precise determination of minimax rates and their detailed analysis of individual cases. It would, however, be more intriguing if the paper explored more intricate scenarios for $\mathcal{S}$ and $\mathcal{A}$. For example, one would naturally expect other (and more interesting) notions of complexity (such as VC-dimension, etc.) of the distribution families $\rho$ and $\pi^*$ to show up in the bounds, instead of the mere cardinality of the query-answer spaces.

Questions

Please see the "Weaknesses" section.

Comment (Authors)

Thank you for your detailed feedback; we are happy to address your questions as follows.

Q1: The reviewer's interpretation of Theorem 3.1, Theorem 3.2, and other results.

A1: Let us respectfully emphasize some key points to accurately deliver the messages as follows.

  • The learner (student model) does not learn $\rho\times\pi^\star$; it only aims to learn $\pi^\star(\cdot|\cdot)$, which is a discrete conditional density. $\rho\times\pi^\star$ is purely a probability distribution indexed by $(s, a)$ pairs, and learning a good $\widehat{\rho\times\pi^\star}$ does not imply learning a good $\widehat{\pi^\star}$.

  • Thus, the target of interest is the conditional total variation $\mathsf{TV}(\hat\pi, \pi^\star|\rho)$, while, as we mentioned in the paper, a uniform $\rho$ indeed suffices to provide some intuition about the worst-case upper bound in expectation in Theorem 3.2.

  • We respectfully disagree with the reviewer regarding our contributions to understanding the method efficiency in the latter two settings. When a certain portion of $\{\pi^\star(\cdot|s_i)\}_{i=1}^n$ is provided to the student, vanilla concentration bounds concerning only samples (such as Hoeffding's inequality, the Angluin-Valiant bound, Bernstein's inequality, Bennett's inequality, or McDiarmid's inequality) are no longer applicable, so our novel technical roadmap is instead inspired by the missing mass analysis of [1].

  • We also respectfully disagree with the reviewer regarding our contributions to understanding the problem complexity of each setting. The three information-theoretic lower bounds cannot be easily derived from standard concentration bounds. For the first case, the constructive proof of Theorem 3.1 is based on Assouad's hypercube reduction. For the latter two cases, Theorem 4.1 and Theorem 5.1 cannot even be derived via standard reductions like Le Cam's, Assouad's, or Fano's [2], because these three standard methods can only handle the case with no ground-truth leakage, i.e., they can only deal with estimators of $\pi^\star$ that learn purely from samples of $\rho\times\pi^\star$. In contrast, our proofs of Theorem 4.1 and Theorem 5.1 are based on two different, explicit constructions of Bayes priors over $\rho\times\pi^\star$ and the reduction from bounding the minimax risk to lower bounding the worst-case Bayes risk [3]. Thus, all three constructive proofs, for Theorem 3.1, Theorem 4.1, and Theorem 5.1, are not derived from standard concentration bounds and differ from one another, even at an intuitive level.

  • We also want to mention an important contribution related to the instance-dependent/benign-case analysis in Theorem 3.2: the complexity measure $\xi(\pi^\star)$, which provably explains the $O(1/n)$ rate of $\mathsf{CE_{sgl}}$ in expectation, rather than $O(1/\sqrt{n})$, when $\xi(\pi^\star)\asymp n^{-1}\mathrm{poly}(|\mathcal S|, |\mathcal A|)$.

  • Another contribution is that while the worst-case high-probability upper bound in Theorem 3.2 has $\mathrm{poly}\log(|\mathcal S|, |\mathcal A|)$ factors, the worst-case upper bound in expectation in Theorem 3.2 does not.

Q2: More intricate scenarios for $\mathcal S$ and $\mathcal A$?

A2: We answer this question in two parts. First, the rationale behind considering discrete $\rho$ and $\pi^\star$ is that modern language models are often classifiers over tokens, with a token or a collection of tokens as input. Second, we have added some discussions and conjectures (for log-linear or general softmax $\pi^\star$) on function approximation in the face of a large state space in Appendix H (highlighted in blue), which builds a quantitative relation between the $\ell_2$-distance of model parameters and the conditional total variation between models, and we leave the complete characterization as future work.

References:

[1] David McAllester and Luis Ortiz. Concentration inequalities for the missing mass and for histogram rule error. Journal of Machine Learning Research, 4(Oct):895–911, 2003.

[2] Bin Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics, pp. 423–435. Springer, 1997.

[3] Yury Polyanskiy and Yihong Wu. Information theory: From coding to learning. Book draft, 2022.

Comment (Reviewer)

I would like to thank the authors for addressing my comments. They have addressed the weaknesses I raised at three nearly distinct points in their response:

  • The first concerns the distinction between learning $\rho\times\pi^*$ and $\pi^*\left(\cdot\vert\cdot\right)$ (although the latter needs to be learned in a way that minimizes some given expected loss with respect to $\rho$). The authors rightly assert that these are distinct problems. I did not intend to imply otherwise. However, it appears that they are closely related "in essence." In the tabular setting you've presented, both are tables of size $\vert\mathcal{A}\vert\times\vert\mathcal{S}\vert$, and they are tightly related through simple mathematical relations. I remain unconvinced that learning one is fundamentally different from learning the other. In other words, I can revisit my initial comments and substitute all the statements related to $\rho\times\pi^*$ with analogous statements about $\pi^*\left(\cdot\vert\cdot\right)$, and they still resonate with me (at least). This issue has been raised by another reviewer as well: assuming finite $\mathcal{A}$ and $\mathcal{S}$ (the tabular setting) might be too basic.

  • The second point mentioned by the authors revolves around the authors' idea that including what they call 'privileged information' fundamentally changes the problem, making traditional tools (such as traditional concentration bounds, etc.) less effective. And therefore, the technical contributions around this matter might be significant.

  • Similarly, the third point suggests that the tools used to derive the minimax bounds in this paper are quite different from the usual methods like Le Cam's, Assouad's, or Fano's techniques, again showcasing the high level of mathematical sophistication in this work.

I'm not entirely sure about the second and third points right now since I haven't delved deeply into the paper. However, it's worth mentioning that these points don't seem to be highlighted enough in the paper, possibly getting lost in the midst of all the technical details. So, I kindly recommend that the authors consider a major rewrite to better showcase these contributions.

Additionally, I suggest that the authors take into consideration the feedback from Reviewer v1b9 regarding the definition and use of 'knowledge transfer' in this work. It seems that the primary focus of the authors' contribution lies more in the realm of "learning" with privileged information rather than what is commonly understood as "knowledge transfer" in the machine learning community today. In your study, the teacher is treated as the ground truth, and the student's ability to emulate the 'true' optimal classifier solely by accessing the teacher is somewhat diminished.

I maintain my vote, which leans towards acceptance.

Comment (Authors)

We are grateful for your attentive replies and would like to further address the points you mentioned as follows.

  • In the latest revision, we highlighted in blue brief discussions on the technical novelty of our work in the original text; namely, the corresponding paragraph in Section 1.1 and the corresponding sentence in Section 3.1. Due to the space limit, we may also consider further detailing the technical arguments in the Appendix in subsequent revisions.

  • To inspire future work on function approximation, we have highlighted our newly added quantitative inequality relating the $\ell_2$ distance between two model parameters (assuming both models are log-linear) to the conditional total variation between the corresponding models, and we have proposed relevant conjectures on the lower bound of the $\ell_2$ distance between two model parameters.

  • Regarding the term "knowledge transfer", we would like to mention that the three cases we analyzed are idealized versions of "knowledge distillation" in deep learning practice [3]. Recently, there has been a trend of using (prompt, response) pairs queried from (Chat)GPT(-4) to train open-source language models; for example, see [1, 2]. These "black-box" knowledge transfer (distillation) paradigms can essentially be idealized as the first case we considered.

Thank you so much for your nice and scrupulous review.

References

[1] Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E. and Stoica, I. (2023). Judging llm-as-a-judge with mt-bench and chatbot arena. https://github.com/lm-sys/FastChat/

[2] Wang, G., Cheng, S., Yu, Q. and Liu, C. (2023a). OpenChat: Advancing Open-source Language Models with Imperfect Data. https://github.com/imoneoi/openchat

[3] Hinton, G., Vinyals, O. and Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 .

Review
Rating: 8

This paper considers the problem of knowledge transfer through empirical samples from a teacher to a student and characterizes the sample complexity for three different settings. The first setting is hard labels, where the student observes (input, label) samples from the teacher; in the second setting, the student observes (input, label, prob(label|input)) samples; and in the third setting, the student observes (input, label, prob(.|input)). The paper characterizes the lower bound and matching upper bounds in each of the settings, showing the statistical role of the extra information available to the student.
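
To spell out the three data-acquisition protocols summarized above, here is a toy sketch of what a single training observation looks like at each level (illustrative only; the variable names are ours, not the paper's):

```python
# One toy observation under each protocol (illustrative only).
import numpy as np

rng = np.random.default_rng(2)
S, A = 4, 3
rho = rng.dirichlet(np.ones(S))
pi_star = rng.dirichlet(np.ones(A), size=S)   # teacher conditionals

s = rng.choice(S, p=rho)
a = rng.choice(A, p=pi_star[s])

hard_label   = (s, a)                          # setting 1: (input, label)
partial_soft = (s, a, pi_star[s, a])           # setting 2: + prob(label | input)
soft_label   = (s, a, pi_star[s].copy())       # setting 3: + full prob(. | input)
print(hard_label, partial_soft, soft_label, sep="\n")
```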

Strengths

  • Neat results and message

Weaknesses

  • The tabular setting is quite basic and does not capture many practical settings where the state space is large.

Questions

  • How can these guarantees generalize to more practical settings where the state space is large? (speculation is fine)

Comment (Authors)

Thank you for your appreciation. We have added some discussions and conjectures (for log-linear or general softmax $\pi^\star$) on function approximation in the face of a large $\mathcal S$ in Appendix H (highlighted in blue), which builds a quantitative relation between the $\ell_2$-distance of model parameters and the conditional total variation between models.

Review
Rating: 6

The authors provide minimax rates for transfer learning, expressed as the total variation between a learner and a reference policy or a ground-truth distribution, for different levels of knowledge shared with the learner and for losses such as cross entropy and squared loss. They also present simulations to demonstrate the performance of their results in the various transfer settings.

Strengths

  • Their work explores various ways of sharing knowledge and allows for flexibility on how teachers can assist learners.
  • Their theoretical results show that learners might not require too much data. Since their results are for commonly used losses in practice, this can be helpful in engineering other ways of knowledge sharing.

Weaknesses

For someone who is not familiar with the literature in this field, many notations lack proper definitions or sufficient rigor. For example, there is no clear definition provided for "privileged information," and it would be beneficial for completeness to include such definitions. The technical aspects of their three settings are not distinctly emphasized. Many variables, such as $CE_{sgl}$ or $CE_{ful}$, are referenced without prior definitions. Furthermore, $\Delta(\mathcal{A} \vert \mathcal{S})$ is mentioned without an initial definition, and it's only later clarified that $\Delta(\mathcal{A})$ represents a simplex.

It took multiple reads to understand the impact of their main contribution. This paper might need several revisions because as it stands, it is not easily understandable.

Questions

It would be helpful to have a more specific explanation of the transfer settings using Partially Soft Labels and Soft Labels in Sections 3 and 4. It looks like $\mathcal{D}$ is already generated using $(\rho \times \pi^\star)^n$. It is not clear what additional information is provided with $\pi^\star(a_i\vert s_i)$ in the case of partially soft labels.

Comment (Authors)

Thank you for your valuable feedback. We would like to address your concerns as follows.

Q1: What is the definition of "privileged information"?

A1: Our formulations, which can indeed be viewed as concrete instantiations of the learning using privileged information (LUPI) framework [1], are already illustrated via relevant LLM examples in the third paragraph. To be more precise, the privileged information we consider for one sample pair $(s_i, a_i)$ is defined as a certain portion of $\pi^\star(\cdot|s_i)$. We apologize for the vague introduction of this concept in the third paragraph of the Introduction; we have incorporated your suggestions (highlighted in blue at the end of the Introduction) and polished the related descriptions in the current revision.

Q2: Notation referenced before definition.

A2: In the current revision, we have equipped all four losses with hyperlinks (highlighted in red) pointing to the corresponding definitions, especially in Table 1. We have also moved the formal definitions of $\Delta(\cdot|\cdot)$ and $\Delta(\cdot)$ back to the main text (highlighted in blue); they were previously placed in the appendix (before Appendix A) due to limited space. Thank you for helping us improve the writing.

Q3: What additional information is provided with $\pi^\star(a_i|s_i)$ in the case of partially soft labels?

A3: $\pi^\star(a_i|s_i)$ itself is exactly the additional information in Section 4. A plainer explanation: a student $\hat\pi$ who wants to learn $\pi^\star(\cdot|\cdot)$ uses only $\mathcal{D}$ in Section 3, while in Section 4 she can utilize $\mathcal{D}\cup\{\pi^\star(a_i|s_i)\}_{i=1}^n$. As we mentioned in the paper, $\pi^\star(a_i|s_i)$ is the probability of choosing action $a_i$ conditioned on the state $s_i$, and $\pi^\star(\cdot|s_i)$ is the probability distribution over the entire action space $\mathcal A$ given the state $s_i$, where $(s_i, a_i)$ is one sample pair in $\mathcal D$. For example, suppose $\mathcal S = \{0, 1\}$, $\mathcal{A} = \{0, 1\}$, $\pi^\star(\cdot|0) = [0.3, 0.7]$, and $(s_1, a_1)$ happens to be $(0, 1)$; then $\pi^\star(a_1|s_1) = \pi^\star(1|0) = 0.7$ is the additional (privileged) information provided to the student together with the sample pair $(s_1, a_1)$ in the case of $\mathsf{P}\text{artial }\mathsf{SL}\text{s}$.
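
A two-line check of the toy example above, with the second row of the teacher filled in arbitrarily since it is not specified in the example (illustrative only):

```python
# The worked example: S = {0, 1}, A = {0, 1}, pi_star(.|0) = [0.3, 0.7],
# and the sampled pair (s_1, a_1) = (0, 1).
pi_star = {0: [0.3, 0.7], 1: [0.5, 0.5]}   # pi_star(.|1) is arbitrary (assumed)
s1, a1 = 0, 1
privileged_info = pi_star[s1][a1]          # revealed together with (s_1, a_1)
print(privileged_info)                     # 0.7
```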

[1] Vladimir Vapnik, Rauf Izmailov, et al. Learning using privileged information: similarity control and knowledge transfer. J. Mach. Learn. Res., 16(1):2023–2049, 2015.

Comment (Reviewer)

I am satisfied with the authors' responses to some of the weaknesses that I mentioned and I am increasing the score to 6.

Comment (Authors)

We are glad that our response has effectively addressed your concerns and suggestions.

AC Meta-Review

This paper attempts to create a mathematical model for a simplified version of "knowledge transfer" from a teacher to a student within the context of multi-class learning under the assumption that both the input and output domains are finite. The paper successfully derives minimax rates for this problem under three different scenarios: i) when only hard labels are provided, ii) in cases where, in addition to labels, the probability of the label is also given, and iii) when all class probabilities are given for each query.

While this paper only studies an idealized model, all reviewers agreed that the contribution in this paper is an important step towards an important problem. The AC agrees and recommends acceptance.

Why Not a Higher Score

This paper only studies the finite-domain setting, and the bounds scale with the domain size. Therefore, there is still a gap between this theoretical analysis and practice.

Why Not a Lower Score

The minimax bounds for this new setting are novel.

Final Decision

Accept (poster)