PaperHub
Average rating: 5.3/10 · Poster · 4 reviewers (min 3, max 7, std 1.8)
Individual ratings: 7, 4, 7, 3
Confidence: 4.0 · Correctness: 2.5 · Contribution: 2.5 · Presentation: 2.5
NeurIPS 2024

Provable and Efficient Dataset Distillation for Kernel Ridge Regression

OpenReview · PDF
Submitted: 2024-05-09 · Updated: 2025-01-04
TL;DR

For dataset distillation of kernel ridge regression, we show theoretically that one data per class is necessary and sufficient to recover the original model's performance in many settings.

Abstract

Deep learning models are now trained on increasingly larger datasets, making it crucial to reduce computational costs and improve data quality. Dataset distillation aims to distill a large dataset into a small synthesized dataset such that models trained on it can achieve similar performance to those trained on the original dataset. While there have been many empirical efforts to improve dataset distillation algorithms, a thorough theoretical analysis and provable, efficient algorithms are still lacking. In this paper, by focusing on dataset distillation for kernel ridge regression (KRR), we show that one data point per class is already necessary and sufficient to recover the original model's performance in many settings. For linear ridge regression and KRR with surjective feature mappings, we provide necessary and sufficient conditions for the distilled dataset to recover the original model's parameters. For KRR with injective feature mappings of deep neural networks, we show that while one data point per class is not sufficient in general, $k+1$ data points can be sufficient for deep linear neural networks, where $k$ is the number of classes. Our theoretical results enable directly constructing analytical solutions for distilled datasets, resulting in a provable and efficient dataset distillation algorithm for KRR. We verify our theory experimentally and show that our algorithm outperforms previous work such as KIP while being significantly more efficient, e.g. 15840$\times$ faster on CIFAR-100. Our code is available at GitHub.
Keywords
Dataset Distillation · Kernel Ridge Regression

Reviews and Discussion

Review (Rating: 7)

The paper presents provable algorithms for computing distilled sets for various problems.

Starting with LRR, they prove that it is enough to have m equal to the number of classes to compute the exact same solution as on the whole data. Then they directly show how to make these distilled points "realistic" by initializing m points from the data and computing the matrix D (the distilled data) to satisfy the conditions required for the distilled set while staying close to these m points.

Then they prove that distilling only the labels can yield the same performance with d points (which is notably immediate from [16], which basically proved the same for the more generic case of KRR via RFF).

Section 5 is an extension of the previous results from LRR to KRR, done in the same way after swapping the original features with the mapped features. However, what was missing was how to construct the distilled set of points in the original space from the mapped distilled points -- which is easy if $\phi$ is bijective. But when it is not, the authors show how to at least compute an approximation of these points.

Section 6 is crucial to draw a connection as to why practical algorithms were working in the first place!

My summary:

  • I like the contribution of this paper, and I think it is super crucial to have theoretical results in this practical field.
  • However, I have some major and minor concerns and questions I wish to resolve in order to accept the paper, specifically about the correctness of the theorems and experiments.
  • I am willing to increase my score once my questions/concerns are addressed.

Strengths

  • Provides important provable guarantees and algorithms for dataset distillation, which has so far been a mainly practical field.

  • It is very important to provide a better understanding of distillation and why it works in the first place -- an understanding that has been missing for so long.

  • The method is very efficient, which is very useful.

Weaknesses

My main concerns can be split into three:

a. The framing and guarantees:

  1. Why is $\lambda_S$ not equal to $\lambda$? How do you formulate the distillation with both $\lambda$ and $\lambda_S$?

  2. For Theorem 4.1 (line 118), can you provide an explanation of why $\lambda_S \leq 1/(4\sigma_1'^2)$? This seems to be a very rare case not related to practice. Or do you suggest solving the LRR problem on the small data with another lambda? If so, how can it be determined?

  3. Same question as before for Theorem 5.1. How do you find these $\lambda_S$?

  4. Section 5.2: I did not fully understand the result. If $\phi$ is injective, can you guarantee $W_S = W$ with $k+1$ points? If yes, do you prove the existence of such a set or also provide an algorithm to compute it?

b. Experimental results:

  1. Table 4 in the experimental results is not clear and might have some issues. For example, "https://proceedings.neurips.cc/paper_files/paper/2022/file/5a28f46993c19f428f482cc59db40870-Paper-Conference.pdf" and https://arxiv.org/pdf/2307.08086 conducted similar experiments and got much higher results, both for KIP and for their methods. Could you elaborate on why the results are different, or at least why you used different architectures?

c. Simple comment but important:

  1. Related work is missing. Specifically, for LRR: "https://proceedings.neurips.cc/paper/2019/hash/475fbefa9ebfba9233364533aafd02a3-Abstract.html" showed how to construct a set of $d^2$ points that yields the same exact solution and is a subset of the input; then https://ieeexplore.ieee.org/abstract/document/9677460 extended the result and showed how to compute a distilled set (not a subset) with only $d$ points that yields the same solution.

Following the rebuttal (response) from the authors, my concerns have been addressed; I am raising my score.

Questions

Please see all my questions in the Weaknesses section

Limitations

I suggest the authors clearly frame the distillation problem and its goal -- explaining the confusing $\lambda_S$ and stating the main results more clearly.

Author Response

We thank the reviewer for the positive feedback, thoughtful comments, and for appreciating the novelty and value of our work!


Q1. Why is $\lambda_S$ not equal to $\lambda$?

A1. Our goal is to find $X_S$ such that $W_S = W$. $\lambda_S$ and $\lambda$ are predefined hyperparameters in our setting. We allow them to be different so that the framework is more general and flexible. We can also simply set them to be the same. In our framework, $\lambda$ does not matter much because it is only used to compute the original model's parameter $W$, and $W$ is supposed to be given or easily computed in our framework. $\lambda_S$ needs to satisfy some conditions in Theorems 4.1 and 5.1 to guarantee $W_S = W$. We will make it clearer in the revised paper.


Q2. Explanation why $\lambda_S \leq \frac{1}{4 \sigma_1'^2}$. This seems to be a very rare case not related to practice. How to determine $\lambda_S$ in Theorems 4.1 and 5.1.

A2. $\lambda_S \leq \frac{1}{4 \sigma_1'^2}$ is required because $X_S$ needs to be solvable from a given $D$. From Eq (2), line 420, $(X_S^\top X_S + \lambda_S I_m)^{-1} X_S^\top = D$. Given a $D$, in order to have a solution for $X_S$, the equation $\sigma_i' \sigma_i^2 - \sigma_i + \lambda_S \sigma_i' = 0$ on line 427 must have a solution for $\sigma_i$. From this, we obtain the requirement relating $\lambda_S$ and $\sigma_1'$. When $\lambda_S = 0$, it is simpler and there is no such requirement.

This requirement is easy to satisfy. For example, given a $D$ and its largest singular value $\sigma_1'$, we can simply set $\lambda_S = \frac{1}{4 \sigma_1'^2}$ or any number less than it. If we want to fix a predefined $\lambda_S$, e.g. $\lambda_S = \lambda$, we need to sample a different $D$ (by sampling a different $Z$) so that its largest singular value satisfies the condition $\lambda_S \leq \frac{1}{4 \sigma_1'^2}$.

Practical algorithms, e.g. KIP, FRePo, and RFAD, usually use a very small regularization $\lambda_S$, which may already satisfy this requirement. For example, KIP uses $\lambda_S = 10^{-6} \cdot \mathrm{Tr}(K(X_S, X_S)) / m$. Theorem 4.1 generally suggests a smaller $\lambda_S$ is better to construct distilled data that can perfectly recover the original model's performance.
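To make the condition concrete, here is a minimal numpy sketch (an illustrative toy example with made-up sizes, not taken from the paper or its code release) of how it is used: given a candidate $D$, pick $\lambda_S$ just below $1/(4\sigma_1'^2)$, solve the quadratic $\sigma_i'\sigma_i^2 - \sigma_i + \lambda_S\sigma_i' = 0$ for each singular value, and assemble an $X_S$ that reproduces $D$ under the ridge formula.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 10, 50                        # toy sizes: number of distilled points and input dimension
D = rng.standard_normal((m, d))      # stand-in for D = Y_S^+ W + (I_m - Y_S^+ Y_S) Z

# The largest singular value of D fixes the admissible range of lambda_S.
P, s_prime, Qt = np.linalg.svd(D, full_matrices=False)    # D = P diag(s') Q^T
lam_S = 0.99 / (4.0 * s_prime[0] ** 2)                    # "any number less than" 1/(4 sigma_1'^2)

# Solve sigma_i' sigma_i^2 - sigma_i + lambda_S sigma_i' = 0 for each singular value of X_S.
disc = 1.0 - 4.0 * lam_S * s_prime ** 2                   # > 0 because of the choice of lam_S
sigma = (1.0 - np.sqrt(disc)) / (2.0 * s_prime)           # one root of the quadratic

# Assemble X_S and check that it reproduces D under the ridge formula.
X_S = Qt.T @ np.diag(sigma) @ P.T                         # X_S = Q diag(sigma) P^T, shape (d, m)
D_rec = np.linalg.solve(X_S.T @ X_S + lam_S * np.eye(m), X_S.T)
print(np.allclose(D_rec, D))                              # True
```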


Q3. Section 5.2: if $\phi$ is injective, can you guarantee $W_S = W$ with $k+1$ points? If yes, do you prove the existence of such a set or also provide an algorithm to compute it?

A3. For deep linear NNs (Theorem 5.3), we can guarantee $W_S = W$ with $k+1$ points if the conditions are satisfied: $H$ is full-rank and the last $p \times p$ submatrix of its right singular vectors is full rank. If these conditions are satisfied, we prove the existence of the distilled dataset and provide the algorithm to compute the distilled data in the proof (lines 607-612).

For general injective mappings, we can't guarantee $W_S = W$ with $k+1$ points.


Q4. Why are the experimental results different from previous papers, or at least why did you use different architectures?

A4. Please refer to Q1 in the global response.


Q5. Missing related works

A5. Thanks for the relevant papers! We will make sure to properly cite and discuss these papers in the revised version.

Comment

I want to thank the authors for their response. They have addressed my comments.

Most importantly, I want to ask them to make sure that the clarifications and detailed responses to Q1, Q2, and Q3 are clearly stated in the paper - it makes things much easier to understand.

For Q4, I would say that the main motive of this work is the provable guarantees which are rare in the field. While I am aware of methods that get better practical results -- I still think that the paper is robust due to the formal guarantees. Hence, I recommend the authors explain that in the experimental results section - as the goal is to show how these guarantees transfer to practice.

For Q5. It is very important to add all of these missing citations to the related work to clearly explain the difference between the paper and those works.

Assuming all of these notes are addressed (as they should be), I am raising my score to 7, and I vote for accepting the work.

Comment

Thanks for the great suggestions and for raising the score! We will make sure to clarify Q1-Q3 in the revised version. For Q4, we will clarify that our motivation is a provable guarantee and the purpose of experiments is to show how these guarantees transfer to practice. For Q5, we will clearly discuss the relation and differences with the related papers. Thank you again for reviewing our work and for recommending acceptance!

Review (Rating: 4)

The authors study theoretical aspects of dataset distillation for kernel ridge regression. They first provide analytical results for KRR with a k-dimensional label and linear features. They then extend these results to certain types of finite-dimensional feature maps, including feature maps which can be computed by a certain class of neural networks. They also study the privacy properties of their distilled datasets. They conduct experiments on standard benchmarks to verify their theory and show the computational advantage over another DD method designed for kernel regression (KIP).

Strengths

Dataset distillation (DD) is a relevant topic for the community, as it offers the potential to provide significant cost reductions in terms of both model training and data storage. This seems especially crucial as the size of both models and training data continues to grow. Furthermore, most work on DD has been empirical, leaving the theoretical aspect under-studied. Theoretical insights may be helpful in moving towards more effective DD methods, as well as understanding the limits of what can be expected from DD.

I generally found the writing clear and easy to follow.

Weaknesses

The main claims of the paper seem to be somewhat trivial extensions of known results [1] (reference [6] in the submitted manuscript) for linear regression. There are three extensions as compared to [1]:

  1. The present paper considers a k-dimensional label instead of a 1-dimensional real-valued label. The k-dimensional case can be derived with essentially the same techniques as the 1-dimensional case, as all the same formulas for ridge-regularized linear regression still hold.
  2. The present paper studies additional possibilities for distilled datasets when more than the minimum number of points are used. Most of these results are again direct consequences of applying the formulas used in point 1. The authors also motivate the use of more than the minimum number of points in the distilled dataset as a way of enforcing privacy, but there are problems with this claim as well (see below). If privacy is not preserved, then it's not clear what is the value in studying a larger-than-necessary distilled dataset.
  3. The present paper gives positive results for kernel ridge regression with finite-dimensional feature maps, whereas [1] only gave a negative result for kernel regression. However, these results are again a trivial extension of the linear case: the results of the linear setting hold directly in the feature space, and then some additional assumptions are made (e.g., surjectivity of the feature map) which allow one to recover points in the original data space mapping to the required features. This doesn't seem to provide much additional insight.

On the other hand, the present paper only studies the problem of distilling a single model from the data. As some of the primary motivating examples for the usefulness of DD are neural architecture search/hyperparameter tuning or continual learning, some theoretical results related to these problems would be of greater interest (and these questions were also studied by [1]).

The claims that the proposed method preserves the privacy of the original dataset have also not been adequately supported. Specifically, the authors show in Theorem 6.2 that there are infinitely many possible original datasets which could have led to the learned weights W. This fact is not sufficient to claim that privacy is preserved. For instance, in an image classification setting, Theorem 6.2 does not rule out the possibility that if we restrict the original dataset to contain only realistic images, then only certain images can lead to the final model weights W. This would then constitute a privacy breach. Put another way, Theorem 6.2 is essentially claiming that model training is inherently private since the map from data to model weights is not injective. This is at odds with the consensus of the ML community, since there is a great deal of literature dedicated to making the training of ML models private (e.g. [2] has over 6000 citations).

References:

[1] Izzo, Zachary and James Zou. "A theoretical study of dataset distillation." NeurIPS Workshop on Mathematics of Modern Machine Learning. 2023.

[2] Abadi, Martin, et al. "Deep learning with differential privacy." Proceedings of the ACM SIGSAC conference on computer and communications security. 2016.

Questions

Do the results in the paper have implications for practical uses of DD, e.g. hyperparameter tuning/NAS or continual learning?

Is there a formal privacy definition (e.g., differential privacy) which the proposed DD method obeys?

Limitations

Some limitations are discussed in the Future Work section. Potential negative societal impact is N/A.

Author Response

We thank the reviewer for the detailed comments. We believe there are some misunderstandings. Please allow us to address your concerns point-by-point.


Q1: The main claims seem to be somewhat trivial extensions of [1].

A1. Our results are totally different from [1]. As discussed in the paper and Table 1, for linear ridge regression, [1, Theorem 3] requires $d$ distilled data points instead of $k$. In their case, $k=1$ and $d \gg k$. Our results for linear ridge regression and kernel ridge regression with surjective mappings only require $k$ distilled data points. In [1, Theorem 1], only for a generalized linear model, where the data and label are assumed to follow a specific exponential density function (page 3 in [1]), can they achieve a single distilled data point. This is a very limited case and practical data generally do not satisfy this assumption. In contrast, our analysis does not make any assumptions about the data. We will make this clearer in the revised paper.

According to the proofs in [1, Theorem 3], they simply set the distilled data to be the right singular vectors (multiplied by the singular values) of the original dataset to keep the training loss always the same as on the original dataset. To do so, they need $d$ distilled data points. We do not require the training loss to be the same but only require the trained parameters to be the same, i.e. $W_S = W$. To solve for the distilled dataset, we developed SVD techniques to solve the nonlinear equations for $X_S$. Because of these new techniques, we can achieve $k$ distilled data points instead of the $d$ in [1].

We would appreciate it if you could specify which parts of our work you consider to be “trivial extensions” of [1], so we can address your concerns more effectively.


Q2. The k-dimensional case can be derived with the same techniques as the 1-dimensional case.

A2. It is a good thing that our framework can be easily generalized to the $k$-dimensional case. However, we would like to emphasize that our techniques and results are totally different from [1].


Q3. More than the minimum number of points are used.

A3.

  • First, we would like to emphasize that it is our paper that shows $k$ data points are necessary and sufficient for linear ridge regression and kernel ridge regression with surjective mappings. Previous results, including [1], did not establish the minimum number of points needed for these cases.
  • Second, it is crucial to study the behavior of dataset distillation with different numbers of data points. Our results show that when $m=k$ (the minimum number of points, as you said), there is only one solution for the distilled data (see Figure 1 for an example). But when $m > k$, there are infinitely many distilled datasets, and we give analytical solutions to compute them. Only when $m > k$ do we have some freedom to choose the distilled datasets that we want.
  • As said in the last point, only with more than the minimum number of points do we have the freedom to choose the distilled datasets we want from the infinitely many solutions. In this case, we can find realistic distilled data or noisy distilled data, as shown in Corollary 4.1.1 and Figure 2 in our draft.
  • Previous dataset distillation algorithms all use different numbers of distilled points. Generally, with more distilled data, the performance becomes better and the distilled data becomes more realistic.

Q4. Results for kernel ridge regression are a trivial extension of the linear case.

A4. Please refer to Q2 in the global response.


Q5. Only studies distilling a single model from the data...

A5. We believe that single-model distillation is the basis for further theoretical analysis. If we can't even understand single-model distillation thoroughly, we can't analyze more complicated cases such as distillation for multiple models, neural architecture search, and continual learning. In this paper, we give a thorough analysis of single-model distillation and theoretically show that one data point per class is necessary and sufficient to recover the original model's performance in many settings. This will pave the way for the analysis of more complicated scenarios.

Note that [1] did not study the distillation of multiple models, neural architecture search, or hyperparameter tuning. Compared with [1], our setting is similar to their setting of hyperparameter distillation, where the same regularization is used for the original model and the model trained on the distilled dataset, i.e. $\lambda_S = \lambda$. Our framework is even more flexible than [1] in allowing $\lambda_S$ and $\lambda$ to be different. As emphasized in earlier points, our results are totally different from [1].

Our results have implications for hyperparameter tuning: Theorem 4.1 generally suggests a smaller $\lambda_S$ is better for constructing distilled data that can perfectly recover the original model's performance.


Q6. If we restrict the original dataset to contain only realistic images, then only certain images can lead to the final model weights $W$…

A6. The infinitely many solutions do not depend on the specific training data used. For example, when $\lambda = 0$, given $W$, the original data can be $\phi(X) = [Y^+W + (I_n - Y^+Y) Z]^+$ (line 660), where $Z \in \mathbb{R}^{n \times p}$ can be an arbitrary matrix. We can take $Z$ to be any random noise with arbitrarily large magnitude, and $\phi(X) = [Y^+W + (I_n - Y^+Y) Z]^+$ will still guarantee that the trained parameter is $W$.
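As an illustration of this non-identifiability claim, here is a minimal numpy sketch (a toy example with made-up dimensions, not taken from the paper or its code release): two very different choices of $Z$ produce two very different candidate "original" feature matrices, yet both refit exactly to the same $W$ when $\lambda = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
k, p, n = 3, 8, 40                          # toy sizes: classes, feature dim, original dataset size
Y = np.eye(k)[:, np.arange(n) % k]          # one-hot labels, shape (k, n), full row rank
W = rng.standard_normal((k, p))             # the given original model parameters

def candidate_features(Z):
    # phi(X) = [Y^+ W + (I_n - Y^+ Y) Z]^+, the family of solutions quoted above
    Yp = np.linalg.pinv(Y)
    return np.linalg.pinv(Yp @ W + (np.eye(n) - Yp @ Y) @ Z)

def refit(phiX):
    # ridge solution in the limit lambda -> 0, i.e. W = Y phi(X)^+
    return Y @ np.linalg.pinv(phiX)

phi1 = candidate_features(rng.standard_normal((n, p)))
phi2 = candidate_features(100.0 * rng.standard_normal((n, p)))   # wildly different "training data"

print(np.allclose(refit(phi1), W), np.allclose(refit(phi2), W))  # True True
print(np.allclose(phi1, phi2))                                   # False
```

Whether this rules out every realistic-image reconstruction attack is exactly the reviewer's concern; the sketch only demonstrates that the map from feature matrices to $W$ is highly non-injective.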


Q7. Do the results in the paper have implications for practical uses of DD?

A7. See A5. Theorem 4.1 generally suggests a smaller $\lambda_S$ is better for constructing distilled data that can perfectly recover the original model's performance. Our distilled data can be used in applications such as NAS and continual learning. We leave further exploration as future work.


Q8. Formal privacy definition that the proposed DD method obeys.

A8. Please refer to Q3 in the global response.

Comment

Thanks to the authors for their response. The authors make a good point that in practice, distilled datasets of various sizes (not just the minimum) are studied, so I agree that having results with more than the minimum number of points is valuable. I also agree that the proof techniques used appear to be significantly different from existing results, and I have raised my score accordingly. However, I believe there are still some issues:

  1. I agree with the authors that Theorem 1 of [1] has assumed a particular data generating distribution for the data. However, my understanding is that this result still holds if the assumed likelihood is just used as a loss function for training the model, independent of the actual ground truth data distribution. (For instance, linear regression with the square loss implicitly assumes Gaussian errors, but the loss function can still be applied whether or not these assumptions hold.) Thus, the single-point distillation in [1] will still hold for linear regression, and indeed the GLM setup is a generalization of this result.

  2. My understanding of Theorem 3 of [1] is that the same distilled dataset can be used to recover the entire regularization path on the original data, i.e. it can be used to recover multiple models (those which result from different regularization strengths).

  3. I think a result of this sort would strengthen the paper greatly. However, the proof as stated isn't complete (for instance, why will ||W-W'|| be bounded? (I-Y_S^+ Y_S)Z is indeed Gaussian, but its distribution depends on the data-dependent quantity Y_S; why can the Gaussian mechanism still be applied in this case?).

Comment

We thank the reviewer for the further questions. Please see our response below.


A1. We agree with the reviewer that the assumed likelihood can be used to train the model by maximizing the log-likelihood, with the loss function being the negative log-likelihood. However, this does not include linear regression with square loss (least squares). Only when the distribution is normal does the linear regression of the generalized linear model match the least-squares solution. Therefore there is no overlap between our results and [1], since the models and loss functions are different, i.e. ridge regression vs. maximum likelihood estimation. [1] proves single-data-point distillation for the generalized linear model, where maximum likelihood estimation is used. We prove single-data-point distillation (one data point per class) for least squares and ridge regression. We will clearly discuss the differences in the revised paper. We thank the reviewer for drawing our attention to this point.

That being said, for least squares and ridge regression, we indeed improve [1, Theorem 3] from $d$ data points to $k$ data points. Least squares and ridge regression are often used in practice and in dataset distillation algorithms because their analytical solutions can be easily computed. We also draw a connection between our results and practical kernel-based dataset distillation algorithms in Theorem 6.1.

In addition, we would like to emphasize our contributions on kernel ridge regression (both the surjective and non-surjective cases), which improve the negative results in [1]. Our results also improve [2] from $p$ to $k$ for random Fourier features/shift-invariant kernels in the surjective case.

[1] Izzo, Zachary and James Zou. "A theoretical study of dataset distillation." NeurIPS Workshop on Mathematics of Modern Machine Learning. 2023.

[2] Alaa Maalouf, Murad Tukan, Noel Loo, Ramin Hasani, Mathias Lechner, and Daniela Rus. On the size and approximation error of distilled sets. NeurIPS 2023.


A2. We agree with the reviewer that the distilled data in [1, Theorem 3] works for all $\lambda \geq 0$ with $\lambda = \lambda_S$. But they need $d$ data points compared with our $k$ points. As demonstrated in Table 2 of our draft, in practical scenarios $d \gg k$. In practice, we typically select the regularization parameter that yields the best performance.

Our analysis can recover the results of [1, Theorem 3]. When $n \geq d$ and $m \geq d$, $W_S = W$ is $Y_S X_S^\top (X_S X_S^\top + \lambda I_d)^{-1} = Y X^\top (X X^\top + \lambda I_d)^{-1}$. Taking $X_S$ and $Y_S$ to be similar quantities as in [1, Theorem 3], we can show $X_S X_S^\top = X X^\top$ and $Y_S X_S^\top = Y X^\top$, and this holds for all $\lambda \geq 0$. When $m < d$, it is generally impossible to have this result (unless $X$ is low rank), since $X_S X_S^\top$ is at most rank $m$ while $X X^\top$ has rank $d$. Nevertheless, if we do not require the distilled data to be valid for all $\lambda \geq 0$, our approach can achieve the distillation with just $k$ data points. We will discuss this relationship and the distinctions more explicitly in the revised version of the paper.
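For concreteness, here is a toy numpy instantiation of the construction described above (our own sketch with columns as samples, matching the formula quoted in this response; the exact conventions in [1] may differ): taking $X_S = U\Sigma$ and $Y_S = YV$ from the SVD $X = U\Sigma V^\top$ gives $X_S X_S^\top = XX^\top$ and $Y_S X_S^\top = YX^\top$, so the ridge solution is matched for every $\lambda \geq 0$, but at the cost of $d$ distilled points.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d, n = 3, 20, 200                        # toy sizes: classes, input dim, original dataset size
X = rng.standard_normal((d, n))             # original data, columns are samples
Y = np.eye(k)[:, np.arange(n) % k]          # one-hot labels, shape (k, n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)    # X = U diag(s) V^T
X_S = U * s                                 # = U @ diag(s): d distilled "points" (columns)
Y_S = Y @ Vt.T                              # matching distilled labels, shape (k, d)

def ridge(Xm, Ym, lam):
    # W = Y X^T (X X^T + lam I_d)^-1; a plain inverse is fine at these toy sizes
    return Ym @ Xm.T @ np.linalg.inv(Xm @ Xm.T + lam * np.eye(Xm.shape[0]))

for lam in [0.0, 0.1, 1.0, 10.0]:
    print(lam, np.allclose(ridge(X, Y, lam), ridge(X_S, Y_S, lam)))   # True for every lam
```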

Furthermore, a more interesting scenario would be the same distilled data for different models $\phi_1$ and $\phi_2$, which is possible to analyze based on our framework. We leave further exploration as future work.

Comment

A3. Bounding $\|W - W'\|_F$: Suppose we have two datasets $X = [x_1, \dots, x_n], X' = [x_1', x_2, \dots, x_n] \in \mathbb{R}^{d \times n}$ that differ only in the first point. Suppose $\lambda = 0$; then $W = Y X^+$. Therefore $W - W' = Y (X^+ - X'^+)$. The Frobenius norm can be bounded as follows: $\|W - W'\|_F \leq \|Y\|_F \|X^+ - X'^+\|_F \leq \|Y\|_F \|X^+\|_2 \|X'^+\|_2 \|x_1 - x_1'\|_2$, where the last inequality holds because $\|X^+ - X'^+\|_F \leq \|X^+\|_2 \|X'^+\|_2 \|X - X'\|_F$ [3, Theorem 2.2]. Assume the data points $x_i$ are bounded, $\|x_i\|_2 \leq B$, and the smallest singular value of the datasets is bounded from below, $\sigma_{\min}(X) \geq \sigma > 0$. Then $\|W - W'\|_F \leq \|Y\|_F \frac{2B}{\sigma^2}$.

To justify the boundedness of the smallest singular value, suppose the data points $x_i$ are independent sub-gaussian (e.g. bounded) and isotropic; then by [4, Theorem 5.39], with probability at least $1 - 2e^{-ct^2}$, $\sigma_{\min}(X) \geq \sqrt{n} - C\sqrt{d} - t$, where $c, C$ depend only on the sub-gaussian norm of $x_i$.

Applying the Gaussian mechanism: Suppose $Y_S$ is deterministic and independent of $X$, e.g. we can take $Y_S$ to be one-hot labels; then $Y_S^+ W$ is a deterministic function of $X$. As we have shown above, the sensitivity of $Y_S^+ W$ is bounded. Since each element of $(I_m - Y_S^+ Y_S)Z$ is a zero-mean Gaussian and only its variance depends on $Y_S$, we can apply the Gaussian mechanism to $Y_S^+ W + (I_m - Y_S^+ Y_S)Z$.

We will include the rigorous proof in the revised paper.

[3] L. Meng and B. Zheng. The optimal perturbation bounds of the Moore–Penrose inverse under the Frobenius norm. Linear Algebra Appl., 432:956–963, 2010.

[4] Vershynin, Roman. "Introduction to the non-asymptotic analysis of random matrices." arXiv preprint arXiv:1011.3027 (2010).


We hope our responses have addressed all your concerns. Please let us know if you have any further questions! If your concerns are addressed, we would appreciate it if you could kindly consider raising your score.

Comment

Thanks for the additional comments. I will maintain my score.

Review (Rating: 7)

This work presented a theoretical framework of dataset distillation for KRR, showing that (slightly more than) one data point per class is often necessary and sufficient to recover the performance of the original model. The theory led to a provable and efficient algorithm for dataset distillation, whose effectiveness was verified through experiments.

Strengths

  • The introduction provides a concise and clear view of the related work on dataset distillation and the contributions of the theoretical findings. The following setup and results are well-organized.
  • The theoretical results for the linear and surjective kernel cases are strong, both in terms of improving the existing theories and in terms of providing insights for designing efficient algorithms.
  • The visualizations are straightforward and helpful for understanding, e.g., Figure 2 provides good support for the choice of $Z$ in Corollary 4.1.1.

Weaknesses

  • When comparing with the existing theories, in addition to the direct comparison of the required number of distilled data points (e.g., $k$ vs. $d$), it would be helpful to provide more intuition on where such a significant improvement comes from. Is it due to specific settings or assumptions, or due to new techniques or insights?
  • It is slightly confusing to refer to the feature map $\phi$ as "surjective", especially in the introduction without a clear definition. It would be helpful to provide more intuition on why a surjective feature map matters.
  • It seems that the analysis is relatively limited to the linear and surjective kernel cases, whereas the non-surjective kernel (or neural network) case is a direct extension of the surjective kernel results and relies on several strong assumptions like invertible activation (which excludes ReLU) and full-rank weight matrices.

Questions

  • While Theorem 6.2 implies that it's impossible to exactly recover $\phi(X)$ from $\phi(X_S)$ and $W$, from Figure 2 it seems that when $Z$ is chosen to be (an affine transform of) real trained data $\hat{X}_S$, $X_S$ can still reveal most of the semantic information of $X$. This is slightly confusing and would be helpful to clarify.
  • Figure 3 shows that changing the arbitrary matrix $Z$ from (an affine transform of) real trained data $\hat{X}_S$ to a Gaussian random matrix leads to a private dataset distillation, at least to human eyes. But from the expression $D = Y_S^\dagger W + (I - Y_S^\dagger Y_S) Z$, it seems that such private distilled data are semantic information + Gaussian noise. With prior knowledge of (the distribution of) $Z$, can standard denoising techniques be applied to recover the semantic information?

Limitations

The major limitations are well-discussed in future works. Some further limitations are mentioned in Weaknesses.

Author Response

We thank the reviewer for the positive feedback, thoughtful review, and for appreciating the merits of our work!


Q1: intuitions on where significant theoretical improvement comes from.

A1. Thanks for the great suggestion! The improvement mainly comes from new techniques for solving the distilled datasets. In [1], they only use label distillation and $X_S$ can be arbitrary data. $d$ data points are needed for label distillation according to our Theorem 4.2. Label distillation only utilizes the $k \times m$ parameters of $Y_S$; the additional parameters of $X_S$ ($d \times m$) are not used. Our analysis constructs $X_S$ explicitly, which has more parameters and therefore requires fewer data points.

In [2, Theorem 3], they simply set the distilled data to be the right singular vectors (multiplied by the singular values) of the original dataset to keep the training loss always the same as on the original dataset. To do so, they need $d$ distilled data points. We do not require the training loss to be the same but only require the trained parameters to be the same, i.e. $W_S = Y_S (X_S^\top X_S + \lambda_S I)^{-1}X_S^\top = W$. To solve for the distilled dataset, we developed SVD techniques to solve the nonlinear equation for $X_S$. Because of these new techniques, we can achieve $k$ distilled data points instead of the $d$ in [2].

[1] Alaa Maalouf, Murad Tukan, Noel Loo, Ramin Hasani, Mathias Lechner, and Daniela Rus. On the size and approximation error of distilled sets. NeurIPS 2023.

[2] Izzo, Zachary and James Zou. "A theoretical study of dataset distillation." NeurIPS Workshop on Mathematics of Modern Machine Learning. 2023.


Q2: Confusion about surjective feature map and intuitions why a surjective feature map matters.

A2. A function is surjective if, for every $y$ in its codomain, there exists at least one element $x$ in the function's domain such that $f(x) = y$. In this case, we can analyze kernel ridge regression in the feature space similarly to a linear model and find the desired features that guarantee $W_S = W$. Because of the surjectivity of the mapping, there always exist some distilled data that map to the desired features. A non-surjective mapping is harder because there may not exist distilled data that correspond to the desired features. We will add more explanation in the introduction.
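As a concrete toy example of the kind of surjective (indeed bijective) feature map listed in the paper, the sketch below builds a fully-connected net with an invertible activation and full-rank square weight matrices; the layer size and the leaky-ReLU slope are arbitrary choices for illustration, not the paper's architecture. Any desired feature vector can then be mapped back to an input exactly, which is the property used above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                                  # input dim == feature dim (square layers)
Ws = [rng.standard_normal((d, d)) for _ in range(3)]    # random square weights, full rank a.s.

leaky     = lambda t: np.where(t > 0, t, 0.1 * t)       # invertible activation (plain ReLU is not)
leaky_inv = lambda t: np.where(t > 0, t, t / 0.1)

def phi(x):
    # forward pass: x -> leaky(W3 @ leaky(W2 @ leaky(W1 @ x)))
    for W in Ws:
        x = leaky(W @ x)
    return x

def phi_inverse(f):
    # invert layer by layer; every step is invertible, so phi is bijective (hence surjective)
    for W in reversed(Ws):
        f = np.linalg.solve(W, leaky_inv(f))
    return f

target = rng.standard_normal(d)       # any desired feature, e.g. a distilled feature phi*
x_S = phi_inverse(target)             # a point in the original space mapping exactly to it
print(np.allclose(phi(x_S), target))  # True
```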


Q3: The analysis is relatively limited to the linear and surjective kernel cases, whereas the non-surjective kernel (or neural network) case is a direct extension of the surjective kernel results.

A3. The non-surjective mapping case (Theorem 5.3) is not a direct extension of the surjective case. As explained in A2, a non-surjective mapping is harder because there may not exist distilled data that correspond to the desired features. To find the distilled data theoretically, we need to guarantee that 1) the features guarantee $W_S = W$, and 2) there are some distilled data corresponding to the features. This involves several technical challenges: 1) handling the pseudoinverse of a sum of matrices in the distilled features (Theorem C.1), 2) solving an overdetermined system of linear equations that has solutions only in limited cases (line 572), and 3) solving a system of equations with multiple free variables (line 589). The analysis for linear NNs is already very involved and complicated. Please see the proof of Theorem 5.3, lines 564-613, for reference. We leave further analysis of non-surjective deep non-linear NNs as future work.


Q4: From Figure 2, when $Z$ is chosen to be (an affine transform of) real trained data, $X_S$ can still reveal most of the semantic information of $\hat{X}_S$.

A4. We can choose $\hat{X}_S$ to be real data outside of the training data $X$. In this case, even if $X_S$ resembles $\hat{X}_S$, it will not leak information about the training data. At the same time, $X_S$ can still guarantee $W_S = W$.


Q5: With prior knowledge of (the distribution of) $Z$, can standard denoising techniques be applied to recover the semantic information?

A5. Thanks for the insightful question! Given $D = Y_S^+ W + (I_m - Y_S^+Y_S) Z$, if we know $Y_S$, we can easily recover the semantic part $W$ via $Y_S D = W$. But even if we can recover $W$, there are infinitely many solutions for $\phi(X)$ according to Theorem 6.2. For example, when $\lambda = 0$, $\phi(X) = [Y^+W + (I_n - Y^+Y) Z]^+$ (line 660), where $Z \in \mathbb{R}^{n \times p}$ can be an arbitrary matrix.

It is also possible to prove differential privacy for the distilled data. Please refer to Q3 in the global response.

Comment

Thanks for the response. My questions are addressed. I will keep my original evaluation.

Comment

Thanks for your response and for recommending acceptance! We really appreciate it! We will make sure to follow the suggestions and clarify the questions in the revised version.

Review (Rating: 3)

This paper studies the number of synthetic data required in dataset distillation to ensure that the optimal parameter learned on the synthetic dataset is the same of that learned on the original dataset.

The task considered is kernel ridge regression (KRR), where the labels contain $k$ classes. The goal is to preserve the optimal $W$ in $\min_W \|W\phi(X) - Y\|_F^2 + \lambda \|W\|_F^2$. Let $m$ denote the number of synthetic samples in the distilled dataset.

  • For linear ridge regression (LRR), the paper finds that $m = k$ suffices to recover the parameter.
    • The paper assumes access to $W \in \mathbb{R}^{k \times d}$. Let $Y_S \in \mathbb{R}^{k \times m}$ be any choice of synthetic labels such that it has rank $k$.
    • The proof works by essentially setting $X_S = W^\dagger Y_S$.
    • If we want the synthetic data to be realistic, the paper suggests setting $Y_S = W\hat{X}_S$, where $\hat{X}_S$ are sampled from the real dataset -- this means $X_S = \hat{X}_S$. In other words, $W$ can be preserved from a subset of $k$ samples in the real dataset, as long as these samples span a rank-$k$ space.
  • For KRR with surjective feature mappings $\phi$, $m = k$ samples also suffice to recover the parameter.
    • Examples of surjective feature mappings include invertible NNs, fully-connected nets with invertible activations and full-rank parameters, and random Fourier features with full-rank parameters.
    • Proof idea: first, note that recovering the linear weights on top of the features $\phi(X)$ is the same as LRR. The next step is then to recover $X$ from $\phi(X)$, which is feasible when $\phi$ is invertible.
  • For KRR with injective feature mappings (e.g. given by NNs), $k$ samples are not sufficient in general. For deep linear NNs, $k+1$ samples suffice.

Strengths

  • The paper provides a theoretical analysis of the minimal number of samples required for a synthetic dataset to preserve the parameters of a ridge regression problem.
  • The conditions proposed in this paper can be used to ensure 0 loss in some objectives proposed in prior work.
  • The paper empirically verifies the number of samples required, and the method is significantly faster than KIP (Kernel Inducing Points, a method from prior work).

Weaknesses

  • The method proposed in this paper requires knowing $W$, which is not always feasible. Moreover, the algorithm explicitly uses $W$, which means that the computational time grows with the feature dimension, which could be infinite.
    • In contrast, KIP uses the kernel trick and hence does not incur a dependency on the feature dimension.
  • I'm concerned about the scalability of the method: although the synthetic dataset can preserve the performance on the full dataset, the accuracy is low (e.g. 42% on CIFAR10). I wonder if the method can preserve the performance of SOTA models on CIFAR10 (e.g. reaching >80% accuracy) and CIFAR100 (e.g. reaching >60% accuracy).
  • Some conditions in the theorem statements and wordings in the proofs may be problematic. Please see questions below.

Questions

  • In the proof of Thm 5.2, the second bullet point says that the existence of some $X_S$ corresponding to $\phi(X_S)$ is equivalent to $X_S$ being recoverable from $\phi(X_S)$. I don't see why the equivalence is true; for example, $\phi$ could map all samples from the same class to the same feature embedding. What am I missing?
  • Line 578: "for any $Z_1$" should be "there exists some $Z_1$"?
  • The choice of $Z$ is "any matrix of the correct size" in multiple places, whereas Thm C.1 additionally requires it to make $X_{\lambda_S}$ full rank. Please fix the conditions.
  • Thm 6.1: $n$ should be $m$?
  • For completeness of the paper, please include a brief explanation of the baseline KIP.

Limitations

There is no direct societal impact.

Comment

References

[1] Noel Loo, et al. Efficient dataset distillation using random feature approximation. NeurIPS 2022.

[2] Nguyen, et al. Dataset Distillation with Infinitely Wide Convolutional Networks. NeurIPS 2021.

[3] Yongchao Zhou, et al. Dataset distillation using neural feature regression. NeurIPS 2022.

[4] Noel Loo, et al. Dataset distillation with convexified implicit gradients. NeurIPS 2023.

Author Response

We thank the reviewer for their careful reading of our paper and constructive comments. We believe there are some misunderstandings. Please allow us to address your concerns point-by-point.


Q1. Proof idea and techniques of Theorem 4.1: the proof works by essentially setting $X_S = W^+Y_S$.

A1. $X_S = W^+Y_S$ is only a special case when $m=k$, $\lambda_S = 0$, and $W$ is full rank. Our framework is more general, covering any $m \geq k$, $\lambda_S \geq 0$, and $W$. The proof works by setting $W_S = Y_S (X_S^\top X_S + \lambda_S I)^{-1}X_S^\top = W$, from which we can solve $(X_S^\top X_S + \lambda_S I)^{-1}X_S^\top = Y_S^+W + (I - Y_S^+Y_S)Z$ for any $Z$. Then we developed SVD techniques to solve the nonlinear equation for $X_S$. $X_S$ needs to be computed using the SVD or pseudoinverse as in Theorem 4.1.
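For intuition, here is a minimal end-to-end numpy sketch of this construction (an illustrative toy example, not the paper's Algorithm 1) in the simplest special case $\lambda = \lambda_S = 0$ with $m > k$, where the distilled data reduces to $X_S = D^+$ with $D = Y_S^+ W + (I_m - Y_S^+ Y_S)Z$; for $\lambda_S > 0$ one would instead recover $X_S$ from the SVD of $D$ as described above.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, n, m = 3, 30, 500, 6             # toy sizes: classes, input dim, original size, distilled size
X = rng.standard_normal((d, n))        # original data (columns are samples)
Y = np.eye(k)[:, np.arange(n) % k]     # one-hot labels, shape (k, n)

# Original model: least-squares fit of min_W ||W X - Y||_F^2 (lambda = 0 for simplicity).
W = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T           # W = Y X^+, shape (k, d)

# Distilled labels of full row rank, and an arbitrary Z picking one of infinitely many solutions.
Y_S = np.hstack([np.eye(k), np.eye(k)])                  # shape (k, m) with m = 2k = 6
Z = rng.standard_normal((m, d))
Y_Sp = np.linalg.pinv(Y_S)
D = Y_Sp @ W + (np.eye(m) - Y_Sp @ Y_S) @ Z              # D = Y_S^+ W + (I - Y_S^+ Y_S) Z

X_S = np.linalg.pinv(D)                                  # with lambda_S = 0, X_S = D^+, shape (d, m)

# "Train" on the distilled set: the minimum-norm least-squares fit recovers W exactly.
W_S = np.linalg.lstsq(X_S.T, Y_S.T, rcond=None)[0].T     # W_S = Y_S X_S^+
print(np.allclose(W_S, W))                               # True
```

Varying $Z$ changes $X_S$ but, as the final check shows, not the recovered parameters.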


Q2: Realistic distilled data in Corollary 4.1.1: the paper suggests setting $Y_S = W \hat{X}_S$, where $\hat{X}_S$ are sampled from the real dataset -- this means $X_S = \hat{X}_S$.

A2. In Corollary 4.1.1, we mainly find the distilled data $X_S$ rather than only setting the distilled labels $Y_S = W \hat{X}_{\lambda_S}$. We solve for the $D$ that is closest to the original data $\hat{X}_S$ in the sense that $\|D - \hat{X}_{\lambda_S}^+\|_F$ is minimized. Then $X_S$ can be computed from this $D$ using Theorem 4.1. Note that $X_S \neq \hat{X}_S$ in general because the equation in line 504 can be inconsistent. As you can see from the first two rows of Figure 2 (page 5), there are still some differences between the original and distilled images. After solving for this $D$, we find that $Y_S = W \hat{X}_{\lambda_S}$ further minimizes the difference $\|D - \hat{X}_{\lambda_S}^+\|_F$ (line 510). However, $Y_S = W \hat{X}_{\lambda_S}$ alone, without solving for $D$, will not work, because with only label distillation we need $d$ data points as per Theorem 4.2.


Q3: The method proposed requires knowing $W$. The computational time grows with the feature dimension $d$. KIP uses the kernel trick and avoids the dependency on $d$.

A3. First, $W$ can be easily computed given the original dataset, so this should not be a problem. Second, we would like to note that many dataset distillation methods require an original model. For example, Gradient Matching [30, 28, 12, 8 in the paper] and Trajectory Matching [1, 3, 5, 4 in the paper] match the gradients or training trajectories of models trained on the original dataset and the distilled dataset.

Although KIP avoids the dependency on $d$ by using the NTK, it is prohibitively slow because the NTK computation scales with $O(m^2)$ [1], where $m$ is the number of distilled data points. It is even computationally infeasible for CIFAR-100 with $m=50$ [2]. RFAD [1] proposes to speed up KIP by replacing the NTK with random features. Also, there is still a gap between the NTK and finite-width NNs, leading to performance degradation when transferring to NNs.

More recent methods such as RFAD [1], FRePo [3], and RCIG [4] solve a kernel regression on top of the NN's features, which is more similar to our setting. Their approaches also depend on the feature dimension $d$. However, in general, the dimension of an NN's features is not too large, and these algorithms are more computationally feasible than KIP. Our algorithm is even more computationally efficient for $d=784$ (MNIST) and $d=3072$ (CIFAR) and much more efficient than KIP, as shown in Table 4. The run time is 17 seconds for MNIST and 13 seconds for CIFAR-10 on an A5000 GPU.


Q4: Scalability of the method: the accuracy is low on CIFAR

A4. Please refer to Q1 in the global response.


Q5: $X_S$ is recoverable from $\phi(X_S)$

A5. Thanks for pointing this out. We meant that we need to find an $X_S$ such that $\phi(X_S)=\phi^*$, where $\phi^*$ is some desired feature. In the case where $\phi$ maps all samples from the same class to the same feature embedding, any sample from that class would suffice as a solution. But we also have another condition: $\phi(X_S)$ should guarantee $W_S = W$. We will make this clearer in the revised paper.


Q6: Line 578: "for any $Z_1$" should be "there exists some $Z_1$"?

A6. This is a homogeneous system of linear equations and the matrix is singular (the number of equations is less than the number of variables), so there are infinitely many solutions. Therefore an arbitrary $Z_1$ will make up a solution. See for reference: "Obtaining all solutions of a linear system". Note that here we only consider the second condition -- $X_S$ is solvable from a given $\phi^*$ -- and any $Z_1$ will guarantee $X_S$ is solvable.


Q7: The choice of $Z$ is "any matrix of the correct size" in multiple places, whereas Theorem C.1 additionally requires it to make $X_{\lambda_S}$ full rank.

A7. In Theorem C.1 it is $Z' \in \mathbb{R}^{d \times m}$, which is different from the $Z \in \mathbb{R}^{m \times d}$ at the other places. Theorem C.1 does not conflict with the other results. In Theorem 4.1, we give conditions for $X_S$ through some pseudoinverse calculations, which are hard to handle in further analysis because there is no concise formulation for the pseudoinverse of a sum of matrices. In Theorem C.1, we provide additional direct characterizations of $X_S$ without the pseudoinverse. Because Theorem C.1 is only used in the proof of Theorem 5.3, we did not include it in the main paper. Also, note that the third condition of Theorem C.1 does not require $X_{\lambda_S}$ to be full rank, as noted in line 495. This condition is used in the proof of Theorem 5.3.


Q8: Thm 6.1: $n$ should be $m$?

A8. It is $n$, because we want to guarantee $W_S \phi(X) = Y$ on the original dataset, where there are $n$ data points.


Q9: brief explanation of the baseline KIP.

A9. We will add a brief explanation of the KIP algorithm in the appendix.

Comment

Dear Reviewer nkTx,

As the discussion period is close to the end and we have not yet heard back from you, we wanted to reach out to see if our rebuttal response has addressed your concerns. We are more than happy to discuss any further concerns and issues you may have. If your concerns have been addressed, we would appreciate it if you could kindly re-evaluate our work. Thank you for your time.

Comment

We thank the reviewer for the reply and further questions. Please see our response below.


Q1. There's no reason why we need or want to take $Z \neq 0$.

A1. There are reasons why we want to take $Z \neq 0$. When $Z = 0$, given fixed $W$ and $Y_S$, $D = Y_S^+ W$ is deterministic and so is $X_S$. There is no freedom to choose the distilled data we want. When $Z \neq 0$, as discussed in lines 136-139, we can choose $X_S$ by varying $Z$. For example, in Corollary 4.1.1, we choose $Z = \hat{X}_{\lambda_S}^+ - Y_S^+ W$ such that $X_S$ is close to real data. We can also take $Z$ to be random noise such that the distilled data is noisy, as shown in Fig 2 in the draft.


Q2. Comment on the scaling of the two methods.

A2. Thanks for the suggestion! We will add a discussion in the revised version.


Q3. Thm C.1: when $k \leq m \leq d$.

A3. In Thm C.1, the first two cases share the same conditions and they are complementary ($m \leq d$ and $m \geq d$). The third case, $m \geq k$, has its own conditions. To make this clearer, we will state the third case as a separate theorem.

The first case does not require $m \geq k$. The results still hold even when $m < k$. We will delete the sentence "from Eq (2), $W_S = Y_S^+ X_{\lambda_S}$", which is misleading, as $W_S = Y_S^+ X_{\lambda_S}$ always holds.


Q4. Cor 4.1.1: $Y_S = W \hat{X}_{\lambda_S}$, then $D = Y_S^+ W = \hat{X}_{\lambda_S}^+$. How/where is $\hat{X}_{\lambda_S}$ defined?

A4. Note that when $d > k$, $Y_S^+ = (W \hat{X}_{\lambda_S})^+ \neq \hat{X}_{\lambda_S}^+ W^+$ in general. Therefore $Y_S^+ W = (W \hat{X}_{\lambda_S})^+ W \neq \hat{X}_{\lambda_S}^+$ in general.

$\hat{X}_{\lambda_S}^+ = (\hat{X}_S^\top \hat{X}_S + \lambda_S I_m)^{-1} \hat{X}_S^\top$ is defined in the same way as $X_{\lambda_S}$ (line 103). We will include the definition to make it clearer.


Q5. In Sec 3.2: are $\lambda$ and $\lambda_S$ the same?

A5. Thanks for pointing this out. It is a typo; it should be $\lambda$ between lines 93-94. $\lambda$ is a predefined hyperparameter in our setting. $\lambda_S$ needs to satisfy some conditions in Theorems 4.1 and 5.1 to guarantee $W_S = W$. Please also refer to our responses to Q1 and Q2 of Reviewer RvWD.


Q6. Line 578: it cannot be "any" $Z$, since $\phi$ is "given".

A6. We consider the two conditions separately to find the form of $\phi(X_S)$ that satisfies each condition, and then combine the conditions to solve for $X_S$. If we only consider the second condition, we are trying to find the form of $\phi(X_S)$ such that $X_S$ is solvable from $\phi(X_S)$. Then $X_S$ is solvable from $\phi(X_S) = \prod_{l=1}^L W^{(l)} (W^{(1)})^+ Z_1$ in Eq (7) for any $Z_1$. If we consider the first condition together with the second condition, then the first condition requires $\phi(X_S)$ to be given in the form $\phi(X_S) = Y_S^+ W + (I_m - Y_S^+Y_S) Z$ in Eq (6). Then it is indeed for some $Z_1$.

To make it more rigorous, we will revise it to "there exists some $Z_1$".


We hope our responses have addressed all your concerns. Please let us know if you have any further questions!

Comment

I apologize for my late response, and thank you for the clarifications!

Q1: I understand that $X_S = W^\dagger Y_S$ is only a special case (obtained by taking $Z=0$); what I meant was that the proof works essentially the same as how one would proceed for this special case (i.e. via the SVD). Moreover, based on the theorem statement, there's no reason why we need or want to take $Z \neq 0$.

Q3, about the quadratic scaling in the number of samples $m$ (as in KIP) vs in the feature dimension $p$: thanks for the clarification. Please add a comment on the scaling, so that it's transparent how the two methods compare.

Q4: thanks for the new numbers, I think they are convincing.

I find the math writing in the paper confusing, especially with quantifiers.

  • Thm C.1: when $k \leq m \leq d$,
  • Cor 4.1.1: the statement feels tautological: it says that to minimize the distance between $D$ and $\hat{X}_{\lambda_S}^\dagger$, take $Y_S = W\hat{X}_{\lambda_S}$, and set $D = Y_S^\dagger W = \hat{X}_{\lambda_S}^\dagger$ -- apologies for the poor formatting; I think the renderer got confused with LaTeX and markdown.
    • Moreover, how/where is $\hat{X}_{\lambda_S}$ defined?
  • In Sec 3.2: are $\lambda$ and $\lambda_S$ the same? I'm confused especially in the equation below line 93.
  • Line 578: it cannot be "any" $Z_1$, since $\phi$ is "given".
Author Response

We would like to thank all the reviewers for taking the time and effort to review our paper! We are delighted that the reviewers found:

  • The paper provides a theoretical analysis on the minimal number of samples required for a synthetic dataset to preserve the parameter to a ridge regression problem (Reviewer nkTx);
  • The theoretical results for the linear and surjective kernel cases are strong, both in terms of improving the existing theories and in terms of providing insights for designing efficient algorithms (Reviewer 7eAw);
  • The visualizations are straightforward and helpful for understanding, e.g., Figure 2 provides good support for the choice of $Z$ in Corollary 4.1.1 (Reviewer 7eAw);
  • Theoretical insights may be helpful in moving towards more effective DD methods, as well as understanding the limits of what can be expected from DD (Reviewer ish1);
  • I like the contribution of this paper, and I think it is super crucial to have theoretical results in this practical field (Reviewer RvWD).

Below, we respond to common questions shared by reviewers in this global response. Please don’t hesitate to let us know if you have any additional feedback or questions regarding our response. We would be happy to address any remaining concerns with you in the discussion period if any. If our responses have addressed your concerns and questions, we would appreciate it if you could kindly let us know and consider raising your review score.


Q1: Scalability of the method: the accuracy is low on CIFAR. (Reviewers nkTx and RvWD)

A1. As the Algorithm 1 we proposed is mainly for KRR with surjective mappings, we verified it and compared it with baselines under this setting. In experiment (II), we use a randomly initialized bijective NN as $\phi$. This matches previous algorithms that start with a randomly initialized NN. If we use a pre-trained NN as $\phi$, the accuracy can be improved and may match the SOTA.

We conducted additional experiments using a pre-trained 5-layer FCNN as $\phi$. Under this setting, we obtain an accuracy of 57.91%$\pm$0.23 on CIFAR-10 and 31.78%$\pm$0.23 on CIFAR-100 for IPC=1, which is comparable to SOTA dataset distillation algorithms such as KIP [1] (49.9$\pm$0.2 for CIFAR-10 and 15.7$\pm$0.2 for CIFAR-100) and MTT [2] (46.3$\pm$0.8 for CIFAR-10 and 24.3$\pm$0.3 for CIFAR-100) at IPC=1.

[1] Nguyen, et al. Dataset Distillation with Infinitely Wide Convolutional Networks. NeurIPS 2021.

[2] Cazenavette , et al. Dataset distillation by matching training trajectories. CVPR 2022.


Q2. Results for kernel ridge regression are a direct extension of the linear case. (Reviewers 7eAw and ish1)

A2. The non-surjective mapping case (Theorem 5.3) is not a direct extension of the linear case. A non-surjective mapping is much harder because there may not exist distilled data that correspond to the desired features. To find the distilled data theoretically, we need to guarantee that 1) the features guarantee $W_S = W$, and 2) there are some distilled data corresponding to the features. This involves several technical challenges: 1) handling the pseudoinverse of a sum of matrices in the distilled features (Theorem C.1), 2) solving an overdetermined system of linear equations that has solutions only in limited cases (line 572), and 3) solving a system of equations with multiple free variables (line 589). The analysis is very involved and complicated, and the results are not a direct extension of the linear case. Please see the proof of Theorem 5.3, lines 564-613, for reference.

The surjective mapping case cannot be fully covered by the linear case. In particular, we give examples of models that satisfy the condition and provide useful cases, including invertible NNs, FCNNs, CNNs, and Random Fourier Features. These models are ubiquitous in practice. More importantly, for Random Fourier Features, our results theoretically improve the previous $p$ distilled data points [3] to $k$, where $p \in \Omega(\sqrt{n} \log n)$ can in general be very large.

[3] Alaa Maalouf, Murad Tukan, Noel Loo, Ramin Hasani, Mathias Lechner, and Daniela Rus. On the size and approximation error of distilled sets. NeurIPS 2023.


Q3. Is there a formal privacy definition (e.g., differential privacy) which the proposed DD method obeys? (Reviewer ish1)

A3. It is possible to prove that the distilled dataset is differentially private with respect to the original dataset. For example, for the linear case (Theorem 4.1), $D = Y_S^+W + (I_m - Y_S^+ Y_S)Z$. We can take the elements of $Z$ to be i.i.d. random variables drawn from a zero-mean Gaussian distribution; then the elements of $(I_m - Y_S^+ Y_S)Z$ will still be zero-mean Gaussian random variables. Suppose the original datasets $X$ and $X'$ differ in only one data point and their resulting parameters are $W$ and $W'$. As long as $\|W-W'\|_F$ is bounded, by the Gaussian mechanism [4, Theorem 3.22], $Z$ with suitable variance will guarantee that $D$ is differentially private with respect to the original dataset. The computation from $D$ to $X_S$ is deterministic and therefore preserves differential privacy.

[4] Cynthia Dwork and Aaron Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 2014.

Final Decision

This paper received high-quality reviews from four knowledgeable reviewers who engaged actively during rebuttal, with diametrically opposing opinions. Reviewers ish1 and nkTx had the two most critical reviews. ish1 believes the main claims are somewhat trivial extensions of known results and questions the novelty and technical contributions. They remained unconvinced of the merits of the paper after a lengthy discussion with the authors. Reviewer nkTx, on the other hand, had concerns about the scalability of the method and the dependency on the feature dimension. The reviewer also found some conditions and wordings in the proofs problematic. However, after the discussion with the authors, this reviewer seemed satisfied with many of the clarifications, and their outstanding comments were mostly about notation and presentation, even though they did not update their score to reflect this.

On the other side of the discussion, we have reviewers 7eAw and RvWD, who were in favor of accepting the paper. Reviewer 7eAw expressed appreciation for the theoretical framework and contributions, especially those for the linear and surjective kernel cases. Reviewer RvWD also highlighted the theoretical contributions of the paper, but initially expressed concerns about the correctness of the theorem statements and the experiments. However, the reviewer conceded that their major concerns were resolved by the authors' response, and updated their score to reflect this. Both of these reviewers had a long list of recommendations for the authors in terms of presentation and framing, which they requested the authors to incorporate in a revised version if the paper were to be accepted.

After reading the paper myself twice, and going through the forum multiple times, I find myself torn and could easily find myself advocating for rejection or acceptance, depending on what issue is being highlighted. I take ish1's points about novelty and extent of contribution, but whether these extensions are considered "trivial" or "substantial" is subjective. In cases of novelty ambiguity, I believe we should fall back to correctness and clarity, and this paper seems to satisfy both. Despite the various questions and concerns about notation raised by the reviewers, it does not seem to me that the mathematical errors/misunderstandings that came up during the discussion are unusually widespread or critical. Most of these were clarified during the rebuttal phase and none of them seem to have undone or substantially weakened the theoretical results. In terms of contribution, it should be pointed out that [6] is a workshop paper, and therefore brief. Even if the only contributions of this paper were extensions over that paper, this one provides a much more detailed analysis and discussion of the results, which make it a valuable addition to [6] for the community.

All in all, I think this is an interesting but perhaps not ground-breaking paper that provides interesting theoretical guarantees for dataset distillation in certain simplified scenarios, some of which are extensions of known results, while some are not, but which could nevertheless be found interesting and relevant by the NeurIPS community.