Provable and Efficient Dataset Distillation for Kernel Ridge Regression
For dataset distillation of kernel ridge regression, we show theoretically that one data point per class is necessary and sufficient to recover the original model's performance in many settings.
Abstract
Reviews and Discussion
The paper presents provable algorithms for computing distilled sets for various problems.
Starting with LRR, they prove that it is enough to have m = number of classes to compute the exact same solution as on the whole data. Then they directly show how to make these distilled points "realistic" by initializing m points from the data and computing a matrix D (the distilled data) that satisfies the conditions required for the distilled set while staying close to these m points.
Then they prove that distilling the labels only can yield the same performance with d points (which notably follows from [16], which essentially proved the same for the more general case of KRR via RFF).
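To convince myself of the label-only claim, I wrote the following minimal numpy sanity check (my own sketch, not the authors' construction; it assumes the same regularization for the original and distilled problems and takes the d distilled points to be arbitrary full-rank data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, lam = 500, 20, 5, 1e-2

# Original data, labels, and the ridge solution W = (X^T X + lam*I)^{-1} X^T Y.
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, k))
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Label-only distillation with d points: keep any full-rank X_S (here, d real rows of X)
# and solve X_S^T Y_S = (X_S^T X_S + lam*I) W for the distilled labels Y_S.
X_S = X[:d]
Y_S = np.linalg.solve(X_S.T, (X_S.T @ X_S + lam * np.eye(d)) @ W)

# Training ridge regression on the d distilled points recovers W exactly.
W_S = np.linalg.solve(X_S.T @ X_S + lam * np.eye(d), X_S.T @ Y_S)
print(np.allclose(W, W_S))  # True
```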
Section 5 is an extension of the previous results from LRR to KRR which is done in the same way after swapping the original feature with the mapped features. However, what was missing was how to construct the distilled set of points in the original space from the mapped distilled points -- which is easy if \phi is bijective. But when it is not, the authors show how to at least compute an approximation for these points.
Section 6 is crucial to draw a connection as to why practical algorithms were working in the first place!
My summary:
- I like the contribution of this paper, and I think it is super crucial to have theoretical results in this practical field.
- However, I have some major and minor concerns and questions I wish to resolve in order to accept the paper, specifically about the correctness of the theorems and experiments.
- I am willing to increase my score once my questions/concerns are addressed.
Strengths
- Provides important provable guarantees and algorithms for dataset distillation, which has been mainly practical.
- It is very important to provide a better understanding of distillation and why it works in the first place -- a gap that has persisted for so long.
- The method is very efficient, which is very useful.
Weaknesses
My main concerns can be split into three:
a. The framing and guarantees:
- Why is \lambda_S not equal to \lambda? How do you formulate the distillation with both \lambda and \lambda_S?
- For Theorem 4.1 (line 118), can you provide an explanation of why \lambda_S \leq 1/(4\alpha'^2)? This seems to be a very rare case not related to practice. Or do you suggest solving the LRR problem on the small data with another lambda? If so, how can it be determined?
- Same question as before for Theorem 5.1. How do you find these \lambda_S?
- Section 5.2: I did not fully understand the result. If \phi is injective, can you guarantee W_S = W with k+1 points? If yes, do you prove the existence of such a set or also provide an algorithm to compute it?
b. Experimental results:
- Table 4 in the experimental results is not clear and might have some issues. For example, https://proceedings.neurips.cc/paper_files/paper/2022/file/5a28f46993c19f428f482cc59db40870-Paper-Conference.pdf and https://arxiv.org/pdf/2307.08086 conducted similar experiments and got much higher results, both for KIP and for their own methods. Could you elaborate on why the results are different, or at least why you used different architectures?
c. A simple but important comment:
- Related work is missing. Specifically, for LRR: https://proceedings.neurips.cc/paper/2019/hash/475fbefa9ebfba9233364533aafd02a3-Abstract.html showed how to construct a set of d^2 points that yields the same exact solution and is a subset of the input; then https://ieeexplore.ieee.org/abstract/document/9677460 extended the result and showed how to compute a distilled set (not a subset) with only d points that yields the same solution.
- For KRR, https://arxiv.org/pdf/2307.08086 also showed the first basic theoretical guarantees for full network distillation.
- https://arxiv.org/pdf/2406.04284 provided research on the information content and dynamics of distilled data after the distillation process.
Following the authors' rebuttal (response), my concerns have been addressed and I am raising my score.
Questions
Please see all my questions in the Weaknesses section
Limitations
I suggest the authors clearly frame the distillation problem and its goal -- explaining the confusing \lambda_S and stating the main results more clearly.
We thank the reviewer for the positive feedback, thoughtful comments, and for appreciating the novelty and value of our work!
Q1. Why is \lambda_S not equal to \lambda?
A1. Our goal is to find a distilled dataset such that W_S = W. \lambda and \lambda_S are predefined hyperparameters in our setting. We allow them to be different so that the framework is more general and flexible. We can also simply set them to be the same. In our framework, \lambda does not matter much because it is only used to compute the original model's parameter W, which is supposed to be given or easily computed in our framework. \lambda_S needs to satisfy some conditions in Theorems 4.1 and 5.1 to guarantee W_S = W. We will make it clearer in the revised paper.
Q2. Explanation of why \lambda_S \leq 1/(4\alpha'^2). This seems to be a very rare case not related to practice. How to determine the \lambda_S in Theorems 4.1 and 5.1.
A2. The condition \lambda_S \leq 1/(4\alpha'^2) is required because the distilled data needs to be solvable from a given W. From Eq (2) (line 420), given a W, the equation on line 427 must have a solution in order for the distilled data to exist. From this, we obtain the requirement relating \lambda_S and \alpha'. When \lambda_S = 0, it is simpler and there is no such requirement.
This requirement is easy to satisfy. For example, given the relevant matrix and its largest singular value \alpha', we can simply set \lambda_S = 1/(4\alpha'^2) or any number less than it. If we want to fix a predefined \lambda_S, we need to resample that matrix so that its largest singular value \alpha' satisfies the condition \lambda_S \leq 1/(4\alpha'^2).
Practical algorithms, e.g. KIP, FRePo, and RFAD, usually use a very small regularization \lambda_S, which may already satisfy this requirement (KIP, for example, uses a very small value). Theorem 4.1 generally suggests a smaller \lambda_S is better for constructing distilled data that can perfectly recover the original model's performance.
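As a minimal numerical illustration (a sketch only, with made-up sizes; we assume here, for the sake of the example, that \alpha' is the largest singular value of the sampled distilled labels -- the precise quantity is defined in Theorem 4.1):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 10, 10

# Hypothetical sampled distilled labels Y_S (in the paper, generated via a free matrix).
Y_S = rng.standard_normal((m, k))

alpha = np.linalg.svd(Y_S, compute_uv=False)[0]  # largest singular value
lam_S = 1.0 / (4.0 * alpha ** 2)                 # any lambda_S <= this satisfies the condition
print(alpha, lam_S)
```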
Q3. Section 5.2: if \phi is injective, can you guarantee W_S = W with k+1 points? If yes, do you prove the existence of such a set or also provide an algorithm to compute it?
A3. For deep linear NNs (Theorem 5.3), we can guarantee W_S = W with k+1 points if the conditions are satisfied: the relevant matrix is full-rank and the last submatrix of its right singular vectors is full rank. If these conditions are satisfied, we prove the existence of the distilled dataset and provide the algorithm to compute the distilled data in the proof (lines 607-612).
For a general injective mapping, we cannot guarantee W_S = W with k+1 points.
Q4. Why are the experimental results different from previous papers, and why did you use different architectures?
A4. Please refer to Q1 in the global response.
Q5. Missing related works
A5. Thanks for the relevant papers! We will make sure to properly cite and discuss these papers in the revised version.
I want to thank the authors for their response. They have addressed my comments.
Most importantly, I want to ask them to make sure the clarity and detail of the responses to Q1, Q2, and Q3 are clearly incorporated into the paper - it makes things much easier to understand.
For Q4, I would say that the main motivation of this work is the provable guarantees, which are rare in the field. While I am aware of methods that get better practical results, I still think that the paper is robust due to the formal guarantees. Hence, I recommend the authors explain that in the experimental results section - as the goal is to show how these guarantees transfer to practice.
For Q5, it is very important to add all of these missing citations to the related work to clearly explain the difference between the paper and those works.
Assuming all of these notes are addressed (as should be) I am raising my score to 7, and I vote for accepting the work.
Thanks for the great suggestions and for raising the score! We will make sure to clarify Q1-Q3 in the revised version. For Q4, we will clarify that our motivation is a provable guarantee and the purpose of experiments is to show how these guarantees transfer to practice. For Q5, we will clearly discuss the relation and differences with the related papers. Thank you again for reviewing our work and for recommending acceptance!
The authors study theoretical aspects of dataset distillation for kernel ridge regression. They first provide analytical results for KRR with a k-dimensional label and linear features. They then extend these results to certain types of finite-dimensional feature maps, including feature maps which can be computed by a certain class of neural networks. They also study the privacy properties of their distilled datasets. They conduct experiments on standard benchmarks to verify their theory and show the computational advantage over another DD method designed for kernel regression (KIP).
Strengths
Dataset distillation (DD) is a relevant topic for the community, as it offers the potential to provide significant cost reductions in terms of both model training and data storage. This seems especially crucial as the size of both models and training data continues to grow. Furthermore, most work on DD has been empirical leaving the theoretical aspect under-studied. Theoretical insights may be helpful in moving towards more effective DD methods, as well as understanding the limits of what can be expected from DD.
I generally found the writing clear and easy to follow.
Weaknesses
The main claims of the paper seem to be somewhat trivial extensions of known results [1] (reference [6] in the submitted manuscript) for linear regression. There are three extensions as compared to [1]:
- The present paper considers a k-dimensional label instead of a 1-dimensional real-valued label. The k-dimensional case can be derived with essentially the same techniques as the 1-dimensional case, as all the same formulas for ridge-regularized linear regression still hold.
- The present paper studies additional possibilities for distilled datasets when more than the minimum number of points are used. Most of these results are again direct consequences of applying the formulas used in point 1. The authors also motivate the use of more than the minimum number of points in the distilled dataset as a way of enforcing privacy, but there are problems with this claim as well (see below). If privacy is not preserved, then it's not clear what is the value in studying a larger-than-necessary distilled dataset.
- The present paper gives positive results for kernel ridge regression with finite-dimensional feature maps, whereas [1] only gave a negative result for kernel regression. However, these results are again a trivial extension of the linear case: the results of the linear setting hold directly in the feature space, and then some additional assumptions are made (e.g., surjectivity of the feature map) which allow one to recover points in the original data space mapping to the required features. This doesn't seem to provide much additional insight.
On the other hand, the present paper only studies the problem of distilling a single model from the data. As some of the primary motivating examples for the usefulness of DD are neural architecture search/hyperparameter tuning or continual learning, some theoretical results related to these problems would be of greater interest (and these questions were also studied by [1]).
The claims that the proposed method preserves the privacy of the original dataset have also not been adequately supported. Specifically, the authors show in Theorem 6.2 that there are infinitely many possible original datasets which could have led to the learned weights W. This fact is not sufficient to claim that privacy is preserved. For instance, in an image classification setting, Theorem 6.2 does not rule out the possibility that if we restrict the original dataset to contain only realistic images, then only certain images can lead to the final model weights W. This would then constitute a privacy breach. Put another way, Theorem 6.2 is essentially claiming that model training is inherently private since the map from data to model weights is not injective. This is at odds with the consensus of the ML community, since there is a great deal of literature dedicated to making the training of ML models private (e.g. [2] has over 6000 citations).
References:
[1] Izzo, Zachary and James Zou. "A theoretical study of dataset distillation." NeurIPS Workshop on Mathematics of Modern Machine Learning. 2023.
[2] Abadi, Martin, et al. "Deep learning with differential privacy." Proceedings of the ACM SIGSAC conference on computer and communications security. 2016.
Questions
Do the results in the paper have implications for practical uses of DD, e.g. hyperparameter tuning/NAS or continual learning?
Is there a formal privacy definition (e.g., differential privacy) which the proposed DD method obeys?
Limitations
Some limitations are discussed in the Future Work section. Potential negative societal impact is N/A.
We thank the reviewer for the detailed comments. We believe there are some misunderstandings. Please allow us to address your concerns point-by-point.
Q1: The main claims seem to be somewhat trivial extensions of [1].
A1. Our results are totally different from [1]. As discussed in the paper and Table 1, for linear ridge regression, [1, Theorem 3] requires d distilled data points instead of k. In their case, the label is 1-dimensional (k = 1). Our results for linear ridge regression and kernel ridge regression with surjective mapping only require k distilled data points. In [1, Theorem 1], only for a generalized linear model, where the data and label are assumed to follow a specific exponential density function (page 3 in [1]), can they achieve a single distilled data point. This is a very limited case and practical data generally do not satisfy this assumption. In contrast, our analysis does not have any assumptions about the data. We will make it clearer in the revised paper.
According to the proofs of [1, Theorem 3], they simply set the distilled data to be the right singular vectors (multiplied by the singular values) of the original dataset to keep the training loss always the same as on the original dataset. To do so, they need d distilled data points. We do not require the training loss to be the same but only require the trained parameters to be the same, i.e. W_S = W. To solve for the distilled dataset, we developed some SVD techniques to solve the nonlinear equation W_S = W. Because of these new techniques, we can achieve k distilled data points instead of d as in [1].
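For concreteness, here is a small numerical sketch of our reading of that construction (not our method; the label choice Y_S = U^T Y is our completion of the sketch, using the thin SVD X = U diag(S) V^T):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 20, 5
X, Y = rng.standard_normal((n, d)), rng.standard_normal((n, k))

# d distilled points from the thin SVD of X: X_S = diag(S) V^T, Y_S = U^T Y.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_S, Y_S = np.diag(S) @ Vt, U.T @ Y

# Then X_S^T X_S = X^T X and X_S^T Y_S = X^T Y, so the ridge solution matches for all lambda.
for lam in (1e-3, 1e-1, 10.0):
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
    W_S = np.linalg.solve(X_S.T @ X_S + lam * np.eye(d), X_S.T @ Y_S)
    assert np.allclose(W, W_S)
print("ridge solutions match for all tested lambda")
```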
We would appreciate it if you could specify which parts of our work you consider to be “trivial extensions” of [1], so we can address your concerns more effectively.
Q2. The k-dimensional case can be derived with the same techniques as the 1-dimensional case.
A2. It is a good thing that our framework can be easily generalized to the k-dimensional case. However, we would like to emphasize that our techniques and results are totally different from [1].
Q3. More than the minimum number of points are used.
A3.
- First, we would like to emphasize that it is our paper that shows that k data points are necessary and sufficient for linear ridge regression and kernel ridge regression with surjective mapping. Previous results, including [1], did not establish the minimum number of points needed for these cases.
- Second, it is crucial to study the behavior of dataset distillation with different numbers of data points. Our results show that when m = k (the minimum number of points, as you said), there is only one solution for the distilled data (see Figure 1 for example). But when m > k, there are infinitely many distilled datasets and we give analytical solutions to compute them. Only when m > k do we have some freedom to choose the distilled datasets that we want.
- As said in the last point, only with more than the minimum number of points do we have some freedom to choose the distilled datasets we want from the infinitely many solutions. In this case, we can find realistic distilled data or noisy distilled data as shown in Corollary 4.1.1 and Figure 2 in our draft.
- Previous dataset distillation algorithms all use different numbers of distilled points. Generally, with more distilled data, the performance becomes better and the distilled data becomes more realistic.
Q4. Results for kernel ridge regression are a trivial extension of the linear case.
A4. Please refer to Q2 in the global response.
Q5. Only studies distilling a single model from the data...
A5. We believe that single-model distillation is the basis for further theoretical analysis. If we can't even understand single-model distillation thoroughly, we can't analyze more complicated cases such as distillation for multiple models, neural architecture search, and continual learning. In this paper, we give a thorough analysis of single-model distillation and theoretically show that one data point per class is necessary and sufficient to recover the original model's performance in many settings. This will pave the way for the analysis of more complicated scenarios.
Note [1] did not study the distillation of multiple models, neural architecture search, or hyperparameter tuning. Compared with [1], our setting is similar to their setting of hyperparameter distillation, where the same regularization is used for the original model and the distilled-dataset model, i.e. \lambda_S = \lambda. Our framework is even more flexible than [1] in allowing \lambda and \lambda_S to be different. As emphasized in earlier points, our results are totally different from [1].
Our results have implications for hyperparameter tuning: Theorem 4.1 generally suggests a smaller \lambda_S is better for constructing distilled data that can perfectly recover the original model's performance.
Q6. If we restrict the original dataset to contain only realistic images, then only certain images can lead to the final model weights W…
A6. The infinitely many solutions do not depend on the specific training data used. For example, given W, the original data can take the form shown on line 660, where the free matrix can be arbitrary. We can take it to be any random noise with arbitrarily large magnitude, and it will still guarantee the trained parameter to be W.
Q7. Do the results in the paper have implications for practical uses of DD?
A7. See A5. Theorem 4.1 generally suggests a smaller \lambda_S is better for constructing distilled data that can perfectly recover the original model's performance. Our distilled data can be used in applications such as NAS and continual learning. We leave more explorations as future work.
Q8. Formal privacy definition that the proposed DD method obeys.
A8. Please refer to Q3 in the global response.
Thanks to the authors for their response. The authors make a good point that in practice, distilled datasets of various sizes (not just the minimum) are studied, so I agree that having results with more than the minimum number of points is valuable. I also agree that the proof techniques used appear to be significantly different from existing results, and I have raised my score accordingly. However, I believe there are still some issues:
- I agree with the authors that Theorem 1 of [1] has assumed a particular data generating distribution for the data. However, my understanding is that this result still holds if the assumed likelihood is just used as a loss function for training the model, independent of the actual ground truth data distribution. (For instance, linear regression with the square loss implicitly assumes Gaussian errors, but the loss function can still be applied whether or not these assumptions hold.) Thus, the single-point distillation in [1] will still hold for linear regression, and indeed the GLM setup is a generalization of this result.
- My understanding of Theorem 3 of [1] is that the same distilled dataset can be used to recover the entire regularization path on the original data, i.e. it can be used to recover multiple models (those which result from different regularization strengths).
- I think a result of this sort would strengthen the paper greatly. However, the proof as stated isn't complete (for instance, why will ||W-W'|| be bounded? (I-Y_S^+ Y_S)Z is indeed Gaussian, but its distribution depends on the data-dependent quantity Y_S; why can the Gaussian mechanism still be applied in this case?).
We thank the reviewer for the further questions. Please see our response below.
A1. We agree with the reviewer that the assumed likelihood can be used to train the model by maximizing the log-likelihood, with the loss function being the negative log-likelihood. However, this does not include linear regression with the square loss (least squares). Only when the distribution is normal does the linear regression of the generalized linear model match the least-squares solution. Therefore there is no overlap between our results and [1], since the models and loss functions are different, i.e. ridge regression vs. maximum likelihood estimation. [1] proves single-point distillation for the generalized linear model, where maximum likelihood estimation is used. We prove single-point distillation (one data point per class) for least squares and ridge regression. We will clearly discuss the differences in the revised paper. We thank the reviewer for drawing our attention to this point.
That being said, for least squares and ridge regression, we indeed improve [1, Theorem 3] from d data points to k data points. Least squares and ridge regression are often used in practice and in dataset distillation algorithms because their analytical solutions can be easily computed. We also draw a connection between our results and the practical kernel-based dataset distillation algorithms in Theorem 6.1.
In addition, we would like to emphasize our contributions on kernel ridge regression (both the surjective and non-surjective cases), which improve the negative results in [1]. Our results also improve [2] down to k distilled points for random Fourier features/shift-invariant kernels in the surjective case.
[1] Izzo, Zachary and James Zou. "A theoretical study of dataset distillation." NeurIPS Workshop on Mathematics of Modern Machine Learning. 2023.
[2] Alaa Maalouf, Murad Tukan, Noel Loo, Ramin Hasani, Mathias Lechner, and Daniela Rus. On the size and approximation error of distilled sets. NeurIPS 2023.
A2. We agree with the reviewer that the distilled data in [1, Theorem 3] works for the entire regularization path. But they need d data points compared with our k points. As demonstrated in Table 2 of our draft, in practical scenarios d is much larger than k. In practice, we typically select the regularization parameter that yields the best performance.
Our analysis can recover the results of [1, Theorem 3]. When and , is . Taking and to be similar quantities as [1, Theorem 3], we can show and and this hold for all . When , it is generally impossible to have this result (unless is low rank) since is at most rank and is rank . Nevertheless, if we do not require the distilled data to be valid for all , our approach can achieve the distillation with just data points. We will discuss this relationship and the distinctions more explicitly in the revised version of the paper.
Furthermore, a more interesting scenario would be the same distilled data for different models and hyperparameters, which is possible to analyze based on our framework. We leave more explorations as future work.
A3. Bounding : Suppose we have two datasets that differ only in the first point. Suppose , then . Therefore . The Frobenius norm can be bounded as follows: , where the last inequality is because of [3, Theorem 2.2]. Assume the data points are bounded and the smallest singular value of the datasets is bounded from below . Then .
To justify the boundedness of the smallest singular value, suppose the data points are independent sub-gaussian (e.g. bounded) and isotropic; then by [4, Theorem 5.39], with probability at least , , where only depends on the sub-gaussian norm of .
Applying the Gaussian mechanism: Suppose are deterministic and independent of , e.g. we can take to be one-hot labels, then is a deterministic function of . As we have shown above, the sensitivity of is bounded. Since each element of is a zero-mean Gaussian and only its variance depends on , we can apply the Gaussian mechanism to .
We will include the rigorous proof in the revised paper.
[3] L. Meng and B. Zheng. The optimal perturbation bounds of the Moore–Penrose inverse under the Frobenius norm. Linear Algebra Appl., 432:956–963, 2010.
[4] Vershynin, Roman. "Introduction to the non-asymptotic analysis of random matrices." arXiv preprint arXiv:1011.3027 (2010).
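For reference, the standard form of the Gaussian mechanism we invoke (as stated in [4]; here f is a generic function of the dataset with bounded \ell_2-sensitivity over adjacent datasets D, D'):

$$
\Delta_2 f = \max_{D \sim D'} \|f(D) - f(D')\|_2, \qquad
M(D) = f(D) + \mathcal{N}(0, \sigma^2 I), \qquad
\sigma \ge \frac{\sqrt{2\ln(1.25/\delta)}\,\Delta_2 f}{\varepsilon},
$$

which guarantees that M is (\varepsilon, \delta)-differentially private for \varepsilon \in (0,1).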
We hope our responses have addressed all your concerns. Please let us know if you have any further questions! If your concerns are addressed, we would appreciate it if you could kindly consider raising your score.
Thanks for the additional comments. I will maintain my score.
This work presented a theoretical framework of dataset distillation for KRR, showing that (slightly more than) one data point per class is often necessary and sufficient to recover the performance of the original model. The theory led to a provable and efficient algorithm for dataset distillation, whose effectiveness was verified through experiments.
Strengths
- The introduction provides a concise and clear view of the related work on dataset distillation and the contributions of the theoretical findings. The following setup and results are well-organized.
- The theoretical results for the linear and surjective kernel cases are strong, both in terms of improving the existing theories and in terms of providing insights for designing efficient algorithms.
- The visualizations are straightforward and helpful for understanding; e.g., Figure 2 provides good support for the choices made in Corollary 4.1.1.
Weaknesses
- When comparing with the existing theories, in addition to the direct comparison of the required number of distilled data (e.g., k vs. d), it would be helpful to provide more intuition on where such a significant improvement comes from. Is it due to specific settings or assumptions, or due to new techniques or insights?
- It is slightly confusing to refer to the feature map as "surjective", especially in the introduction without clear definitions. It would be helpful to provide more intuition on why a surjective feature map matters.
- It seems that the analysis is relatively limited to the linear and surjective kernel cases, whereas the non-surjective kernel (or neural network) case is a direct extension of the surjective kernel results and relies on several strong assumptions like invertible activation (which excludes ReLU) and full-rank weight matrices.
Questions
- While Theorem 6.2 implies that it's impossible to exactly recover from and , from Figure 2, it seems that when is chosen to be (an affine transform of) real trained data , can still reveal most semantic information of . This is slightly confusing and would be helpful to clarify.
- Figure 3 shows that changing the arbitrary matrix from (an affine transform of) real trained data to a Gaussian random matrix leads to a private dataset distillation, at least to human eyes. But from the expression , it seems that such private distilled data are semantic information + Gaussian noise. With prior knowledge of (the distribution of) , can standard denoising techniques be applied to recover the semantic information?
Limitations
The major limitations are well-discussed in future works. Some further limitations are mentioned in Weaknesses.
We thank the reviewer for the positive feedback, thoughtful review, and for appreciating the merits of our work!
Q1: intuitions on where significant theoretical improvement comes from.
A1. Thanks for the great suggestion! The improvement mainly comes from new techniques for solving for the distilled dataset. In [1], they only use label distillation and the distilled data X_S can be arbitrary; d data points are then needed for label distillation according to our Theorem 4.2. Label distillation only utilizes the parameters of Y_S, and the additional parameters of X_S are not used. Our analysis constructs X_S explicitly, which has more parameters and therefore requires fewer data points.
In [2, Theorem 3], they simply set the distilled data to be the right singular vectors (multiplied by the singular values) of the original dataset to keep the training loss always the same as on the original dataset. To do so, they need d distilled data points. We do not require the training loss to be the same but only require the trained parameters to be the same, i.e. W_S = W. To solve for the distilled dataset, we developed some SVD techniques to solve the nonlinear equation W_S = W. Because of these new techniques, we can achieve k distilled data points instead of d as in [2].
[1] Alaa Maalouf, Murad Tukan, Noel Loo, Ramin Hasani, Mathias Lechner, and Daniela Rus. On the size and approximation error of distilled sets. NeurIPS 2023.
[2] Izzo, Zachary and James Zou. "A theoretical study of dataset distillation." NeurIPS Workshop on Mathematics of Modern Machine Learning. 2023.
Q2: Confusion about the surjective feature map and intuition on why a surjective feature map matters.
A2. A function \phi is surjective if, for every element y in its codomain, there exists at least one element x in the function's domain such that \phi(x) = y. In this case, we can analyze kernel ridge regression in the feature space similarly to a linear model and find the desired features that can guarantee W_S = W. Because of the surjectivity of the mapping, there always exists some distilled data that maps to the desired features. A non-surjective mapping is harder because there may not exist distilled data that corresponds to the desired features. We will add more explanations in the introduction.
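As a toy illustration of why surjectivity helps (a sketch with made-up dimensions and weights, not our Algorithm 1): with full-rank square weight matrices and an invertible activation, any desired distilled feature can be pulled back to a point in the input space.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))  # full rank w.h.p.

leaky = lambda z: np.where(z > 0, z, 0.1 * z)       # invertible activation
leaky_inv = lambda z: np.where(z > 0, z, z / 0.1)

def phi(x):                     # 2-layer FCNN: bijective from R^d to R^d
    return leaky(W2 @ leaky(W1 @ x))

def phi_inv(f):                 # pull a desired feature back to input space
    return np.linalg.solve(W1, leaky_inv(np.linalg.solve(W2, leaky_inv(f))))

f_target = rng.standard_normal(d)       # a desired distilled feature
x_S = phi_inv(f_target)                 # the corresponding distilled point
print(np.allclose(phi(x_S), f_target))  # True
```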
Q3: The analysis is relatively limited to the linear and surjective kernel cases, whereas the non-surjective kernel (or neural network) case is a direct extension of the surjective kernel results.
A3. The non-surjective mapping case (Theorem 5.3) is not a direct extension of the surjective case. As explained in A2, a non-surjective mapping is harder because there may not exist distilled data that corresponds to the desired features. To find the distilled data theoretically, we need to guarantee that 1) the feature can guarantee W_S = W and 2) there is some distilled data corresponding to the feature. These involve many technical challenges: 1) handling the pseudoinverse of a sum of matrices in the distilled feature (Theorem C.1), 2) solving an overdetermined system of linear equations that has solutions only in limited cases (line 572), 3) solving a system of equations that has multiple free variables (line 589). The analysis for linear NNs is already very involved and complicated. Please see the proof of Theorem 5.3, lines 564-613, for reference. We leave more analysis of non-surjective deep non-linear NNs as future work.
Q4: From Figure 2, when is chosen to be (an affine transform of) real trained data, can still reveal most semantic information of .
A4. We can choose the anchor data to be real data that is outside of the training data. In this case, even if the distilled data resembles it, it will not leak the information of the training data. At the same time, the distilled data can still guarantee W_S = W.
Q5: With prior knowledge of (the distribution of) , can standard denoising techniques be applied to recover the semantic information?
A5. Thanks for the insightful question! Given , if we know , we can easily recover the semantic part by . But even if we can recover , there are infinitely many solutions of according to Theorem 6.2. For example, when , (line 660), where can be arbitrary matrix.
It is also possible to prove differential privacy for the distilled data. Please refer to Q3 in the global response.
Thanks for the response. My questions are addressed. I will keep my original evaluation.
Thanks for your response and for recommending acceptance! We really appreciate it! We will make sure to follow the suggestions and clarify the questions in the revised version.
This paper studies the number of synthetic data required in dataset distillation to ensure that the optimal parameter learned on the synthetic dataset is the same of that learned on the original dataset.
The task considered is kernel ridge regression (KRR), where the labels contain k classes. The goal is to preserve the optimal parameter of the KRR problem. Let m denote the number of synthetic samples in the distilled dataset.
- For linear ridge regression (LRR), the paper finds that m = k synthetic samples suffice to recover the parameter.
- The paper assumes access to the original solution. Let Y_S be any choice of synthetic labels such that it has rank k.
- The proof works by essentially setting .
- If we want the synthetic data to be realistic, the paper suggests to set where are sampled from the real dataset -- this means . In other words, can be preserved from a subset of samples in the real data set, as long as these samples span a rank- space.
- For KRR with surjective feature mappings \phi, m = k samples also suffice to recover the parameter.
- Examples of surjective feature mappings include invertible NNs, fully-connected nets with invertible activation and full-rank parameters, and random Fourier features with full-rank parameters.
- Proof idea: first, note that recovering the linear weight on top of the features is the same as LRR. The next step is then to recover the data points from the distilled features, which is feasible when \phi is invertible.
- For KRR with injective feature mappings (e.g. given by NNs), k samples are not sufficient in general. For deep linear NNs, k+1 samples suffice.
Strengths
- The paper provides a theoretical analysis of the minimal number of samples required for a synthetic dataset to preserve the parameter of a ridge regression problem.
- The conditions proposed in this paper can be used to ensure 0 loss in some objectives proposed in prior work.
- The paper empirically verifies the number of samples required, and the method is significantly faster than KIP (Kernel Inducing Points, a method from prior work).
Weaknesses
- The method proposed in this paper requires knowing the original model's solution, which is not always feasible. Moreover, the algorithm explicitly uses the features, which means that the computational time grows with the feature dimension, which could be infinite.
- In contrast, KIP uses the kernel trick and hence does not incur a dependency on the feature dimension.
- I'm concerned about the scalability of the method: although the synthetic dataset can preserve the performance on the full dataset, the accuracy is low (e.g. 42% on CIFAR10). I wonder if the method can preserve the performance of SOTA models on CIFAR10 (e.g. reaching >80% accuracy) and CIFAR100 (e.g. reaching >60% accuracy).
- Some conditions in the theorem statements and wordings in the proofs may be problematic. Please see questions below.
Questions
- In the proof of Thm 5.2, the second bullet point says that the existence of some point corresponding to the desired feature is equivalent to it being recoverable from the feature. I don't see why the equivalence is true; for example, \phi could map all samples from the same class to the same feature embedding. What am I missing?
- Line 578: "for any " should be "there exists some "?
- The choice of is "any matrix of the correct size" at multiple places, whereas Thm C.1 additionally requires it to be full rank. Please fix the conditions.
- Thm 6.1: should be ?
- For completeness of the paper, please include a brief explanation of the baseline KIP.
Limitations
There is no direct societal impact.
References
[1] Noel Loo, et al. Efficient dataset distillation using random feature approximation. NeurIPS 2022.
[2] Nguyen, et al. Dataset Distillation with Infinitely Wide Convolutional Networks. NeurIPS 2021.
[3] Yongchao Zhou, et al. Dataset distillation using neural feature regression. NeurIPS 2022.
[4] Noel Loo, et al. Dataset distillation with convexified implicit gradients. NeurIPS 2023.
We thank the reviewer for their careful reading of our paper and constructive comments. We believe there are some misunderstandings. Please allow us to address your concerns point-by-point.
Q1. Proof idea and techniques of Theorem 4.1: The proof works by essentially setting .
A1. is only a special case when , and is full rank. Our framework is more general for any , and . The proof works by setting , from which we can solve for any . Then we developed some SVD techniques to solve the nonlinear equation of . needs to be computed using SVD or pseudoinverse as in Theorem 4.1.
Q2: Realistic distilled data in Corollary 4.1.1: the paper suggests to set where are sampled from the real dataset -- this means .
A2. In Corollary 4.1.1, we mainly find the distilled data X_S instead of setting the distilled labels Y_S. We solve for the X_S that is closest to the original data points, in the sense that the distance to them is minimized. Then Y_S can be computed from this X_S using Theorem 4.1. Note that in general X_S does not exactly equal those points, because the equation in line 504 can be inconsistent. As you can see from the first two rows of Figure 2 (page 5), there are still some differences between the original and distilled images. After solving for this X_S, we can further minimize the remaining difference (line 510). However, distilling the labels alone, without solving for X_S, will not work, because with only label distillation we need d data points as per Theorem 4.2.
Q3: The proposed method requires knowing the original model. The computational time grows with the feature dimension. KIP uses the kernel trick and avoids the dependency on the feature dimension.
A3. First, the original model can be easily computed given the original dataset, so it should not be a problem. Second, we would like to note that many dataset distillation methods require an original model. For example, Gradient Matching [30, 28, 12, 8 in the paper] and Trajectory Matching [1, 3, 5, 4 in the paper] match the gradients or training trajectories of models trained on the original dataset and the distilled dataset.
Although KIP avoids the dependency on the feature dimension by using the NTK, it is prohibitively slow because the NTK computation scales quadratically with the number of distilled data points [1]. It is even computationally infeasible for CIFAR-100 with larger distilled sets [2]. RFAD [1] proposes to speed up KIP by replacing the NTK with random features. Also, there is still a gap between the NTK and finite-width NNs, leading to performance degradation when transferred to NNs.
More recent methods such as RFAD [1], FRePo [3], and RCIG [4] solve a kernel regression on top of the NN's features, which is more similar to our setting. Their approaches also depend on the feature dimension. However, in general, the dimension of the NN's features is not too large and these algorithms are more computationally feasible than KIP. Our algorithm is even more computationally efficient for MNIST and CIFAR, and much more efficient than KIP, as shown in Table 4. The run time is 17 seconds for MNIST and 13 seconds for CIFAR-10 on an A5000 GPU.
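To make the scaling comparison concrete, here is a small sketch (with toy sizes of our choosing) of the two equivalent ways of solving the same ridge regression: the primal form, whose cost grows with the feature dimension, and the dual/kernel-trick form (as used by kernel methods like KIP), whose cost grows with the number of points instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, lam = 200, 1000, 10, 1e-1            # n points, p-dimensional features
Phi, Y = rng.standard_normal((n, p)), rng.standard_normal((n, k))

# Primal solution: invert a p x p matrix (cost grows with the feature dimension p).
W_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ Y)

# Dual / kernel-trick solution: invert an n x n kernel matrix instead (cost grows with n).
W_dual = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), Y)

print(np.allclose(W_primal, W_dual))          # True: same model, different scaling
```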
Q4: Scalability of the method: the accuracy is low on CIFAR
A4. Please refer to Q1 in the global response.
Q5: Whether the data is recoverable from the feature (second bullet point in the proof of Thm 5.2).
A5. Thanks for pointing it out. We meant that we need to find a distilled point whose feature equals some desired feature. In the case that \phi maps all samples from the same class to the same feature embedding, any sample from that class should suffice as a solution. But we also have another condition: the feature should guarantee W_S = W. We will make it clearer in the revised paper.
Q6: Line 578: "for any " should be "there exists some "?
A6. This is a homogeneous system of linear equations and the matrix is singular (the number of equations is less than the number of variables), so there are infinitely many solutions. Therefore an arbitrary choice will make up a solution. See for reference: Obtaining all solutions of a linear system. Note here we only consider the second condition -- being solvable from the given quantity -- and any choice will guarantee solvability.
Q7: The choice of is "any matrix of the correct size" at multiple places, whereas Theorem C.1 additionally requires it to be full rank.
A7. In Theorem C.1 it is a different matrix, which is different from the one at the other places. Theorem C.1 does not conflict with the other results. In Theorem 4.1, we give conditions through some pseudoinverse calculation, which is hard to handle when doing more analysis because there is no concise formulation for the pseudoinverse of a sum of matrices. In Theorem C.1, we provide additional direct characterizations without the pseudoinverse. Because Theorem C.1 is only used in the proof of Theorem 5.3, we did not include it in the main paper. Also, note that the third condition of Theorem C.1 does not require full rank, as noted in line 495. This condition is used in the proof of Theorem 5.3.
Q8: Thm 6.1: should be ?
A8. It is , because we want to guarantee on the original dataset, where there are data points.
Q9: brief explanation of the baseline KIP.
A9. We will add a brief explanation of KIP algorithm in the appendix.
Dear Reviewer nkTx,
As the discussion period is close to the end and we have not yet heard back from you, we wanted to reach out to see if our rebuttal response has addressed your concerns. We are more than happy to discuss any further concerns and issues you may have. If your concerns have been addressed, we would appreciate it if you could kindly re-evaluate our work. Thank you for your time.
We thank the reviewer for the reply and further questions. Please see our response below.
Q1. There's no reason why we need or want to take .
A1. There are reasons why we want to take m > k. When m = k, given the fixed quantities, the distilled data is deterministic. There is no freedom to choose the distilled data we want. When m > k, as discussed in lines 136-139, we can choose the distilled data by varying the free matrix. For example, in Corollary 4.1.1, we choose it such that the distilled data is close to real data. We can also take it to be random noise such that the distilled data is noisy, as shown in Fig 2 in the draft.
Q2. Comment on the scaling of the two methods.
A2. Thanks for the suggestion! We will add a discussion in the revised version.
Q3. Thm C.1: when .
A3. In Thm C.1, the first two cases share the same conditions and they are complementary ( and ). The third case has its own conditions. To make it clearer, we will write the third case in a separate theorem.
The first case does not require . The results still hold even when . We will delete the sentence ``from Eq (2), ’’, which is misleading, as always hold.
Q4. Cor 4.1.1.: then . How/where is defined?
A4. Note when , in general. Therefore in general.
is defined in the same way as (line 103). We will include a definition to make it clearer.
Q5. In Sec 3.2: are and the same?
A5. Thanks for pointing it out. It is a typo in the equation between lines 93-94. \lambda_S is a predefined hyperparameter in our setting and needs to satisfy some conditions in Theorems 4.1 and 5.1 to guarantee W_S = W. Please also refer to our response to Q1 and Q2 of Reviewer RvWD.
Q6. Line 578: it cannot be "any" , since is "given".
A6. We consider the two conditions separately to find the form of that satisfies each condition and then combine the conditions to solve . If we only consider the second condition, we are trying to find the form of such that is solvable from . Then is solvable from in Eq (7) for any . If we consider the first condition together with the second condition, then the first condition requires to be given in the form of in Eq (6). Then it is indeed for some .
To make it more rigorous, we will revise it to be ``there exists some ’’.
We hope our responses have addressed all your concerns. Please let us know if you have any further questions!
I apologize for my late response, and thank you for the clarifications!
Q1: I understand that it is only a special case (by taking the stated choice); what I meant was that the proof works essentially the same as how one would proceed for this special case (i.e. SVD). Moreover, based on the theorem statement, there's no reason why we need or want to take m > k.
Q3, about the quadratic scaling in the number of samples (as in KIP) vs in the feature dimension : thanks for the clarification. Please add a comment on the scaling, so that it's transparent how the two methods compare.
Q4: thanks for the new numbers, I think they are convincing.
I find the math writing in the paper confusing, especially with quantifiers.
- Thm C.1: when ,
- Cor 4.1.1.: the statement feels tautological: it says that to minimize the distance between and , take , and set -- apologies for the poor formatting; I think the renderer got confused with LaTeX and markdown.
- Moreover, how/where is defined?
- In Sec 3.2: are and the same? I'm confused especially in the equation below line 93.
- Line 578: it cannot be "any" , since is "given".
We would like to thank all the reviewers for taking the time and effort to review our paper! We are delighted that the reviewers found:
- The paper provides a theoretical analysis on the minimal number of samples required for a synthetic dataset to preserve the parameter to a ridge regression problem (Reviewer nkTx);
- The theoretical results for the linear and surjective kernel cases are strong, both in terms of improving the existing theories and in terms of providing insights for designing efficient algorithms (Reviewer 7eAw);
- The visualizations are straightforward and helpful for understanding, e.g., Figure 2 provides good support for the choice of in Corollary 4.1.1 (Reviewer 7eAw);
- Theoretical insights may be helpful in moving towards more effective DD methods, as well as understanding the limits of what can be expected from DD (Reviewer ish1);
- I like the contribution of this paper, and I think it is super crucial to have theoretical results in this practical field (Reviewer RvWD).
Below, we respond to common questions shared by reviewers in this global response. Please don’t hesitate to let us know if you have any additional feedback or questions regarding our response. We would be happy to address any remaining concerns with you in the discussion period if any. If our responses have addressed your concerns and questions, we would appreciate it if you could kindly let us know and consider raising your review score.
Q1: Scalability of the method: the accuracy is low on CIFAR. (Reviewer nkTx and RvWD)
A1. As the Algorithm 1 we proposed is mainly for KRR with surjective mappings, we verified it and compared it with baselines under this setting. In experiment (II), we use a randomly initialized bijective NN as \phi. This matches previous algorithms that start with a randomly initialized NN. If we use a pre-trained NN as \phi, the accuracy can be improved and may match the SOTA.
We conducted additional experiments using a pre-trained 5-layer FCNN as \phi. Under this setting, we get an accuracy of 57.91% ± 0.23 on CIFAR-10 and 31.78% ± 0.23 on CIFAR-100 for IPC=1, which is comparable to SOTA dataset distillation algorithms such as KIP [1] (49.9 ± 0.2 for CIFAR-10 and 15.7 ± 0.2 for CIFAR-100) and MTT [2] (46.3 ± 0.8 for CIFAR-10 and 24.3 ± 0.3 for CIFAR-100) when IPC=1.
[1] Nguyen, et al. Dataset Distillation with Infinitely Wide Convolutional Networks. NeurIPS 2021.
[2] Cazenavette , et al. Dataset distillation by matching training trajectories. CVPR 2022.
Q2. Results for kernel ridge regression are a direct extension of the linear case. (Reviewer 7eAw and ish1)
A2. The non-surjective mapping case (Theorem 5.3) is not a direct extension of the linear case. A non-surjective mapping is much harder because there may not exist distilled data that corresponds to the desired feature. To find the distilled data theoretically, we need to guarantee that 1) the feature can guarantee W_S = W and 2) there is some distilled data corresponding to the feature. These involve many technical challenges: 1) handling the pseudoinverse of a sum of matrices in the distilled feature (Theorem C.1), 2) solving an overdetermined system of linear equations that has solutions only in limited cases (line 572), 3) solving a system of equations that has multiple free variables (line 589). The analysis is very involved and complicated and the results are not a direct extension of the linear case. Please see the proof of Theorem 5.3, lines 564-613, for reference.
The surjective mapping case cannot be fully covered by the linear case. In particular, we give examples of models that satisfy the condition and provide useful cases, including invertible NNs, FCNNs, CNNs, and Random Fourier Features. These models are ubiquitously used in practice. More importantly, for Random Fourier Features, our results theoretically improve the previous distilled-data size in [3] down to k, where the previous size can in general be very large.
[3] Alaa Maalouf, Murad Tukan, Noel Loo, Ramin Hasani, Mathias Lechner, and Daniela Rus. On the size and approximation error of distilled sets. NeurIPS 2023.
Q3. Is there a formal privacy definition (e.g., differential privacy) which the proposed DD method obeys? (Reviewer ish1)
A3. It is possible to prove that the distilled dataset is differentially private with respect to the original dataset. For example, consider the linear case (Theorem 4.1). We can take the elements of the free matrix to be i.i.d. random variables drawn from a zero-mean Gaussian distribution; then the elements of the resulting quantity will still be zero-mean Gaussian random variables. Suppose the original datasets differ in only one data point and their resulting parameters are W and W'. As long as their difference is bounded, by the Gaussian mechanism [4, Theorem 3.22], a suitable variance will guarantee the result to be differentially private with respect to the original dataset. The computation of the distilled data from this quantity is deterministic and therefore preserves differential privacy.
[4] Cynthia Dwork and Aaron Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 2014.
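A minimal sketch of this Gaussian-mechanism step (hypothetical sensitivity and privacy parameters; it omits the part of the argument that establishes the sensitivity bound on the trained parameter):

```python
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, eps, delta, rng):
    """Release `value` with (eps, delta)-DP via the Gaussian mechanism,
    as cited in [4, Theorem 3.22] (requires eps < 1 and bounded l2 sensitivity)."""
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / eps
    return value + rng.normal(scale=sigma, size=np.shape(value))

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 10))   # stand-in for the trained parameter
W_private = gaussian_mechanism(W, l2_sensitivity=0.5, eps=0.8, delta=1e-5, rng=rng)
```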
This paper received high-quality reviews from four knowledgeable reviewers who engaged actively during rebuttal, with diametrically opposing opinions. Reviewers ish1 and nkTx had the two most critical reviews. ish1 believes the main claims are somewhat trivial extensions of known results and questions the novelty and technical contributions. They remained unconvinced of the merits of the paper after a lengthy discussion with the authors. Reviewer nkTx, on the other hand, had concerns about the scalability of the method and the dependency on the feature dimension. The reviewer also found some conditions and wordings in the proofs problematic. However, after the discussion with the authors, this reviewer seemed satisfied with many of the clarifications, and their outstanding comments were mostly about notation and presentation, even though they did not update their score to reflect this.
On the other side of the discussion, we have reviewers 7eAw and RvWD, who were in favor of accepting the paper. Reviewer 7eAw expressed appreciation for the theoretical framework and contributions, especially those for the linear and surjective kernel cases. Reviewer RvWD also highlighted the theoretical contributions of the paper, but initially expressed concerns about the correctness of the theorem statements and the experiments. However, the reviewer conceded that their major concerns were resolved by the authors' response, and updated their score to reflect this. Both of these reviewers had a long list of recommendations for the authors in terms of presentation and framing, which they requested the authors to incorporate in a revised version if the paper were to be accepted.
After reading the paper myself twice, and going through the forum multiple times, I find myself torn and could easily find myself advocating for rejection or acceptance, depending on what issue is being highlighted. I take ish1's points about novelty and extent of contribution, but whether these extensions are considered "trivial" or "substantial" is subjective. In cases of novelty ambiguity, I believe we should fall back to correctness and clarity, and this paper seems to satisfy both. Despite the various questions and concerns about notation raised by the reviewers, it does not seem to me that the mathematical errors/misunderstandings that came up during the discussion are unusually widespread or critical. Most of these were clarified during the rebuttal phase and none of them seem to have undone or substantially weakened the theoretical results. In terms of contribution, it should be pointed out that [6] is a workshop paper, and therefore brief. Even if the only contributions of this paper were extensions over that paper, this one provides a much more detailed analysis and discussion of the results, which makes it a valuable addition to [6] for the community.
All in all, I think this is an interesting but perhaps not ground-breaking paper that provides interesting theoretical guarantees for dataset distillation in certain simplified scenarios, some of which are extensions of known results, while some are not, but which could nevertheless be found interesting and relevant by the NeurIPS community.