PaperHub
Overall: 7.0 / 10 · Poster · 4 reviewers (min 6, max 8, std 1.0)
Individual ratings: 6, 8, 6, 8
Confidence: 4.3 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 2.8
ICLR 2025

Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-training of Deep Networks

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-04-01
TL;DR

We present the first effective method for dataset distillation (i.e. creating a small synthetic dataset to summarize a large real dataset) for self-supervised learning.

Abstract

Keywords
dataset distillation, self-supervised learning

Reviews and Discussion

Review (Rating: 6)

The authors of the paper deal with applying dataset distillation to unlabeled datasets as part of self-supervised pre-training. A fundamental observation presented in the paper is that the dataset distillation optimization with an SSL objective is hard to converge because of high gradient variance within the batch. This observation is demonstrated empirically, and theoretically under some assumptions. To mitigate the high gradient variance, the authors propose to use Matching Training Trajectories following knowledge distillation, which leads to much smoother gradients, allowing stable dataset distillation optimization and accuracy improvements on several downstream tasks.

Strengths

  • In my view, the proposed approach effectively transforms the problem of data distillation in an unlabeled setting (SSL objective) into a supervised approach using knowledge distillation loss.

  • The concept of enhancing self-supervised approaches by incorporating dataset distillation is interesting.

  • The approach presented in the paper is simple and clearly explained.

  • The theoretical and experimental motivations regarding high variance gradients appear reasonable and valid.

Weaknesses

  • Line 75: “… datasets distilled with ConvNets can transfer to other larger architectures such as ResNets.” Note that ResNets are also a type of Convolutional network (ConvNet).

  • What about considering larger (and more modern) architectures? For example, using larger ResNet architectures instead of just 3-4 convolutional layers.

  • Do the authors consider connections to data-free knowledge distillation (DFKD)?

  • The authors' theoretical derivation is based on very strict assumptions. Statements such as "as confirmed theoretically above" (line 242) may be overstated.

  • The paper uses the Barlow Twins SSL objective. Did the authors try other SSL approaches? It would be useful to know if the observations are generalizable and if the accuracy improvements hold across different SSL algorithms.

  • Experiments: the authors only compare linear probes on 1% and 5% of the downstream labeled data. Why not present results for additional subset sizes? Especially when comparing to SAS, which is a data pruning method where the main focus is on higher subset sizes (e.g., SAS provides results for 20% subset size and above). This raises the question of whether the comparison to SAS is fair.

Questions

  • My main question is more general, concerning the concept of dataset distillation (DD). Despite over four years of research on DD, progress appears to be limited, with most work focused on small datasets and controlled settings. I would appreciate the authors' insights on the future potential and development of dataset distillation.

  • I think the authors may consider presenting some connections to the field of data-free knowledge distillation. What do the authors think?

Comment

We thank the reviewer for acknowledging the strengths of our work, namely: 1) the innovative solution of transforming dataset distillation (DD) for SSL into a supervised approach using the knowledge distillation loss; 2) tackling the important and under-explored problem of DD for SSL; 3) simple and clear writing; and 4) the theoretical and empirical motivations for the high variance of SSL gradients.

W1: Corrected in revision.

W2: In Table 5, we have already included results showing our method can be used to train models as large as ResNet-18.

W3: Discussed in Q2

W4: Changed in revision to "as illustrated theoretically in the simplified setting above".

W5: In Table 6, we have already included results for MKDT applied to SimCLR (another SSL approach).

W6: In Table 7, we have already included results evaluating the encoders trained using MKDT distilled sets with larger fractions of labeled data (10%,50%). We mainly focused on very small subsets in MKDT, since dataset distillation has additional benefits over SAS (subset selection for contrastive learning). SAS mainly focuses on reducing training time, while preserving accuracy; whereas dataset distillation is useful to generate extremely small subsets to enable very fast training/adaptation on memory-limited edge devices. Moreover, dataset distillation has the additional orthogonal goal of preserving privacy.

Q1: In supervised learning, the very recent work of [1] has shown promise in generalizing to larger networks and larger datasets (e.g., training a ResNet on a distilled version of ImageNet to 34% accuracy using only 50 images per class). This, however, relies heavily on labels and cannot be applied to dataset distillation for SSL. Currently, dataset distillation for SSL is still underexplored. Our contribution sets the groundwork for how to perform dataset distillation for SSL. As happened with SL dataset distillation, we hope that subsequent work extends this to larger-scale models and dataset sizes.

Q2: Recent work in distribution matching [40 from paper, 47 from paper, 1] that has achieved remarkable results has relied on data-free knowledge distillation. We did mention these works in our related work for distribution matching methods. The key idea from data-free distillation works of optimizing the synthetic data to match data statistics such as mean, variance etc. [2] has proved extremely helpful in dataset distillation for SL. However, as shown in [40 from paper, 47 from paper, 1], this still requires labels to prevent a collapse of synthetic data from different classes. Nonetheless, the connection to data-free knowledge distillation is indeed interesting and we believe future work could adapt DFKD for DD for SSL, using our idea of converting SSL to SL using knowledge distillation.
We have also added additional references to the DFKD literature in our revision.

We are eager to engage in further discussion to resolve any other comments / concerns.

References:

[1] Shao, Shitong, et al. "Generalized large-scale data condensation via various backbone and statistical matching." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[2] Lopes, Raphael Gontijo, Stefano Fenu, and Thad Starner. "Data-free knowledge distillation for deep neural networks." arXiv preprint arXiv:1710.07535 (2017).

Comment

As the discussion period is coming to an end soon, we're hoping to hear if our rebuttal addressed the reviewer's concerns and if we can provide any more clarifications about our work.

Thank you once again for your efforts reviewing our paper!

Comment

As the extended discussion period will end in a few days, we're hoping to hear if our rebuttal addressed the reviewer's concerns and if we can provide any more clarifications about our work.

Thank you once again for your efforts reviewing our paper!

Comment

Thanks for the authors' responses. After reviewing the response and reading the other reviewers' comments, I would like to keep my score.

Comment

Thank you for taking the time to go through our rebuttal!

Review (Rating: 8)

This paper introduces one of the first approaches to dataset distillation for self-supervised learning (SSL) pre-training, addressing the challenge of generating compact, synthetic datasets that can efficiently pre-train deep networks without labeled data. By leveraging a teacher-student framework, the authors demonstrate that using knowledge distillation (KD) significantly reduces the variance in training trajectories, a common issue in SSL due to high variance in gradient updates. This approach, termed "Matching Knowledge Distillation Trajectories" (MKDT), trains a smaller student model to match the embeddings of a larger teacher model, resulting in a low-variance objective that allows effective dataset distillation for SSL. Experimental results indicate that MKDT outperforms prior methods by up to 13% across various downstream tasks with limited labeled data, underscoring its potential for memory- and compute-efficient SSL pre-training.

Strengths

  1. This work is among the first to address dataset distillation for self-supervised pre-training, providing an innovative approach to generate compact, synthetic datasets that enable efficient pre-training without labeled data.
  2. The theoretical motivation for introducing a teacher-student learning model is compelling. By using knowledge distillation, the method effectively reduces the high variance in gradients commonly seen in self-supervised learning objectives, improving upon naive trajectory matching approaches. While I have not thoroughly checked the mathematical proofs, they nonetheless seem intuitively convincing.
  3. The related work section is particularly well-constructed, offering a thorough exploration of the dataset distillation literature and providing valuable insights into the challenges and developments in both supervised and self-supervised settings, positioning this work within the broader context of current research.

Weaknesses

  1. Limited Discussion of Alternative Distillation Paradigms: The paper presents knowledge distillation (KD) as a novel proxy for enhancing trajectory matching in SSL dataset distillation. However, the authors omit a comparison with other established dataset distillation methods, such as gradient matching or distribution matching. The Related Work section briefly acknowledges these methods but dismisses them based on label dependency. This lack of comparative analysis leaves a gap: could these paradigms outperform trajectory matching if adapted to SSL? Addressing this gap would strengthen the paper by either justifying the choice of trajectory matching for SSL or empirically showing why KD-based approaches are superior.

  2. High Variance in SSL Gradients and Potential Regularization Techniques: A core motivation for the KD approach is the high variance in SSL gradients, which makes naive trajectory matching ineffective. Yet, the authors do not explore whether regularization techniques, such as Sharpness-Aware Minimization (SAM), could mitigate this variance issue. SAM, which reduces oscillations in model updates by flattening the loss landscape, could be a promising candidate to stabilize SSL gradients. An empirical analysis incorporating such techniques would clarify whether the KD setup is necessary or if regularization alone could suffice.

  3. Clarification and Takeaways in Figure 1: Figure 1 illustrates the challenges of applying MTT to SSL, emphasizing issues like high gradient variance and chaotic updates to synthetic images. However, the figure captions lack concise takeaways summarizing the implications of each plot. Clearer captions explaining the significance of variance trends and how they justify the proposed KD approach would improve the paper's readability and impact.

Minor Weakness (Formatting and Page Limit): The submission uses a numerical citation format, which does not adhere to the ICLR 2025 requirements of (Author, Year) format. Given the 10-page length, updating the citation style within the limit should be manageable.

Reference:
[1] Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation, CVPR 2023.

Questions

  1. In Figure 1, what are the primary insights that should be drawn from each sub-figure? Specifically, could the authors clarify how each metric in Figure 1 (e.g., variance in weights, distillation loss) directly impacts the effectiveness of the KD-based trajectory matching approach?

  2. The paper argues for the superiority of trajectory matching in the SSL setting. Could the authors elaborate on any specific challenges or limitations of gradient matching and distribution matching that make them unsuitable for SSL dataset distillation, as compared to trajectory matching? (this question ties to one of the points in the weakness section)

  3. Regarding the high-variance problem in SSL gradients, has there been an investigation into alternative approaches, such as sharpness-aware minimization, to reduce gradient variance? What led to the decision to focus solely on KD rather than considering variance-reducing techniques as a complementary approach? (also raised in the weakness section)

  4. In Table 2, could the authors clarify the relationship between initialization choice (e.g., high-loss vs. random subsets) and performance? Is there a grounded rationale behind the preference for high-loss subset initialization?

  5. Could the authors provide more context on how the choice of SSL algorithm (e.g., SimCLR vs. Barlow Twins) affects the efficacy of the MKDT method? Is there a fundamental reason one algorithm would be preferred over another in the KD-based trajectory matching framework? Which of SimCLR and Barlow Twins has faster convergence while learning the distilled samples? Can this method also be extended to other SSL methods such as masked reconstruction?

  6. Lastly, in the methodology, could the authors clarify how they determined the optimal values for hyperparameters like the number of distillation steps, K expert trajectories, and distillation loss thresholds?

Comment

We thank the reviewer SPPk for their appreciation of 1) the innovative approach we provide for the important problem of dataset distillation for SSL 2) the theoretical motivation for using knowledge distillation to generate lower variance trajectories to enable trajectory matching for DD for SSL 3) the well-written related works section, situating the contribution of our work in the current research.

We now address the weaknesses and questions raised by the reviewer:

W1: As the reviewer observed, in our related work we did identify the label dependency of SL methods as preventing them from being applied to SSL. Gradient matching methods necessarily need labels, as it is crucial to match gradients per class to learn class-discriminative features, as shown in [46 from paper]. For distribution matching, a similar concern exists: distilling without labels would lead to a collapse in representations, i.e., all classes would have nearly identical representations. We confirm this empirically in the table below, showing that DM without labels is not able to distill anything meaningful, performing no better than random subsets. Since MTT is the only method that can indeed distill without labels and still learn class-discriminative features, we do believe MTT is the best candidate among SL distillation techniques to be adapted to SSL.

Table: Distribution Matching (DM) without Labels (Distilled Data Size = 5%)

The fraction % next to the dataset name denotes the fraction of downstream labels used. (Compare this to #s in Table 4 from paper)

| Dataset | CIFAR100 | CIFAR10 | TinyImageNet | Aircraft | CUB2011 | Dogs | Flowers |
|---|---|---|---|---|---|---|---|
| C10 1% | 7.09 ± 0.16 | 35.98 ± 0.98 | 2.43 ± 0.19 | 2.15 ± 0.21 | 1.19 ± 0.13 | 1.71 ± 0.18 | 1.60 ± 0.17 |
| C10 5% | 14.77 ± 0.10 | 46.41 ± 0.35 | 5.11 ± 0.32 | 4.94 ± 0.88 | 1.57 ± 0.31 | 2.53 ± 0.13 | 3.10 ± 0.33 |
| C100 1% | 8.74 ± 0.35 | 37.75 ± 0.87 | 2.75 ± 0.22 | 2.51 ± 0.08 | 1.20 ± 0.11 | 2.07 ± 0.20 | 2.53 ± 0.27 |
| C100 5% | 17.01 ± 0.37 | 47.32 ± 0.38 | 6.40 ± 0.21 | 5.08 ± 0.82 | 1.99 ± 0.13 | 2.77 ± 0.28 | 4.62 ± 0.28 |
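
For concreteness, below is a hedged sketch of the distribution-matching objective we refer to, written in our own simplified notation (a standard per-class mean-matching form; the encoder symbol and set names are ours, not from the paper):

```latex
% Simplified sketch in our notation; psi denotes a (randomly initialized) encoder,
% R / S the real / synthetic sets. With labels, the synthetic set S_c for each class c
% must match the class-conditional feature mean of the real set R_c:
\[
\mathcal{L}_{\mathrm{DM}} \;=\; \sum_{c}\,\Big\lVert \frac{1}{|R_c|}\sum_{x \in R_c}\psi_\vartheta(x)
\;-\; \frac{1}{|S_c|}\sum_{s \in S_c}\psi_\vartheta(s) \Big\rVert_2^2 .
\]
% Without labels, only a single global term remains,
\[
\mathcal{L}_{\mathrm{DM}}^{\text{no labels}} \;=\; \Big\lVert \frac{1}{|R|}\sum_{x \in R}\psi_\vartheta(x)
\;-\; \frac{1}{|S|}\sum_{s \in S}\psi_\vartheta(s) \Big\rVert_2^2 ,
\]
% which can be driven to zero by synthetic images whose features all lie near the global
% mean, i.e., a collapsed solution with no class-discriminative structure -- the failure
% mode reflected in the table above.
```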

W2, W3: We combine the responses to these two weaknesses, since a clear explanation of Figure 1 helps explain why other variance-reduction techniques are unlikely to be sufficiently effective.

Firstly, as shown in [1], SAM is a regularization for finding more generalizable minima by penalizing sharpness. This is distinct from the problem of reducing the variance of gradients; in fact, [5] shows that SAM's optimization leads to higher gradient variance and can itself benefit significantly from variance reduction. Secondly, classical variance-reduction techniques have been shown to be unsuccessful for deep neural networks [4]. Thirdly, we provide additional results (see the table "MTT for High-Variance SSL Trajectories" in our general comment) showing that MTT applied directly to the higher-variance SSL trajectories leads to distilled sets that perform worse than even trivial baselines (random subsets / no pre-training).

Additionally, in Figure 1, we do explore a simple alternative for reducing variance in SSL, namely increasing the batch size, and show that this is still insufficient. In Figure 1a, we provide evidence of the higher gradient variance using empirical estimates: since the weights at the end of each iteration are determined exactly by the gradients, we use the variance of the weights at the end of each iteration, across models starting from the same initialization, to estimate it. As we see, the SL (KD) loss has far lower variance than SSL. Moreover, while increasing the batch size by 4x does reduce the variance slightly, it is nowhere near as low as for SL; this indicates that reducing the variance of SSL through other regularization techniques is likewise likely to be insufficient. Figure 1b shows that, as a result of the higher variance, when the distillation process optimizes the distilled set to match the training trajectories, it cannot successfully minimize the distillation loss, because the high variance of SSL makes this optimization very challenging. Finally, Figure 1c confirms that the insufficient minimization of the distillation loss is problematic: for the cases where the distillation loss was insufficiently minimized, there is little to no change between the distilled images and their initializations, indicating that distillation has been unsuccessful.
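
For readers who wish to reproduce the kind of estimate used in Figure 1a, here is a minimal toy sketch of the procedure (train several models from one initialization on different random mini-batches, then compare the variance of the final weights). The linear student, toy contrastive/KD losses, and all dimensions are illustrative assumptions, not our actual setup:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D, H = 2048, 32, 16            # toy dataset size, input dim, embedding dim (assumptions)
STEPS, RUNS, BATCH, LR = 50, 5, 64, 0.5
data = torch.randn(N, D)
teacher = torch.nn.Linear(D, H).requires_grad_(False)  # stands in for the SSL-trained teacher

def ssl_loss(w, x):
    """SimCLR-style toy loss: each example's gradient depends on the whole mini-batch."""
    z1 = F.normalize((x + 0.1 * torch.randn_like(x)) @ w, dim=1)   # two noisy "views"
    z2 = F.normalize((x + 0.1 * torch.randn_like(x)) @ w, dim=1)
    logits = z1 @ z2.t() / 0.1
    return F.cross_entropy(logits, torch.arange(len(x)))

def kd_loss(w, x):
    """KD regression loss: per-example MSE to the frozen teacher's embeddings."""
    with torch.no_grad():
        target = teacher(x)
    return F.mse_loss(x @ w, target)

def final_weights(loss_fn, w0):
    """Train one linear student from w0 with fresh random mini-batches; return its weights."""
    w = w0.clone().requires_grad_(True)
    opt = torch.optim.SGD([w], lr=LR)
    for _ in range(STEPS):
        x = data[torch.randint(0, N, (BATCH,))]
        opt.zero_grad()
        loss_fn(w, x).backward()
        opt.step()
    return w.detach()

w0 = 0.1 * torch.randn(D, H)      # all runs share this initialization
for name, fn in [("SSL", ssl_loss), ("KD", kd_loss)]:
    runs = torch.stack([final_weights(fn, w0) for _ in range(RUNS)])
    print(name, "variance of final weights across runs:", runs.var(dim=0).mean().item())
```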

Q1: Answered with W2/W3.

Q2: Answered with W1.

Q3: Answered with W2/W3.

Comment

Q4: We empirically showed that, on most tasks, initializing with the high-loss subset performs slightly better than initializing with a random subset. Examples with a high loss correspond to more ambiguous data points likely to lie on the boundary of the latent classes of the pretraining data (high loss images from CIFAR100, shown in https://anonymous.4open.science/r/iclr_mkdt_rebuttal-2DE1/iclr_rebuttal_fig.png). As a result, initializing the distilled set with these high-loss examples allows the distilled set to better learn representations of boundary examples, leading the encoder to more closely preserve the teacher model’s representations on them (see average MSE on the top 1% of high-loss examples in https://anonymous.4open.science/r/iclr_mkdt_rebuttal-2DE1/iclr_rebuttal_fig.png). Consequently, since the boundary points are represented more accurately, the linear classifier trained on these representations with downstream data achieves higher accuracy. This is not central to our method but rather an additional component that provides further performance improvements at no extra cost.

Q5: MKDT does indeed enable dataset distillation for various SSL methods (including masked reconstruction), since the only requirement is representations from a teacher model trained with SSL. As we show in Table 6, we have results extending MKDT to SimCLR as well, showing significant gains over baselines. We believe that the effectiveness of the distilled set will be determined by how effective the given SSL algorithm is at training the teacher model.

Q6: We largely borrowed hyperparameters from MTT [1 from paper]. In particular, we use a nearly identical number of synthetic steps (40 for us v/s 30-50 for [1 from paper] on CIFAR10/100, and 10 for both on TinyImageNet) and expert epochs (2 epochs across all datasets, for both us and [1 from paper]). As seen in [1 from paper], when distillation is more challenging, e.g., on larger datasets such as TinyImageNet, the max start epoch is reduced in order to only distill early training dynamics and make the optimization more tractable. Even with knowledge distillation trajectories, we observed that optimizing the distillation loss for SSL is harder (since we deal with the more difficult problem of supervised regression with unique representations as labels), thus we reduced the max start epoch by ~10x (2 for us v/s 20 for [1 from paper] on CIFAR10/100, and 2 for us v/s 10 for [1 from paper] on TinyImageNet).
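
For reference, the values quoted above are collected in the hedged config sketch below; the key names are ours, and the numbers are the ones stated in this response:

```python
# Key names are illustrative (ours); values are those quoted in the response above.
MKDT_HPARAMS = {
    "CIFAR10":      {"synthetic_steps": 40, "expert_epochs": 2, "max_start_epoch": 2},
    "CIFAR100":     {"synthetic_steps": 40, "expert_epochs": 2, "max_start_epoch": 2},
    "TinyImageNet": {"synthetic_steps": 10, "expert_epochs": 2, "max_start_epoch": 2},
}

MTT_REFERENCE = {   # corresponding values reported for MTT [1 from paper]
    "CIFAR10/100":  {"synthetic_steps": "30-50", "expert_epochs": 2, "max_start_epoch": 20},
    "TinyImageNet": {"synthetic_steps": 10, "expert_epochs": 2, "max_start_epoch": 10},
}
```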

We are eager to engage in further discussion to resolve any other concerns or comments!

References:

[1] Foret, Pierre, et al. "Sharpness-aware minimization for efficiently improving generalization." arXiv preprint arXiv:2010.01412 (2020).

[2] Chen, Xuxi, et al. "Data distillation can be like vodka: Distilling more times for better quality." Proceedings of the International Conference on Learning Representations (ICLR), 2024.

[3] Yang, Yu, Hao Kang, and Baharan Mirzasoleiman. "Towards sustainable learning: Coresets for data-efficient deep learning." International Conference on Machine Learning. PMLR, 2023.

[4] Defazio, Aaron, and Léon Bottou. "On the ineffectiveness of variance reduced optimization for deep learning." Advances in Neural Information Processing Systems 32 (2019).

[5] Li, Bingcong, and Georgios Giannakis. "Enhancing sharpness-aware optimization through variance suppression." Advances in Neural Information Processing Systems 36 (2024).

Comment

Thank you for your detailed and thoughtful response to the initial review. I appreciate the effort you have put into addressing the concerns and providing clear explanations and justifications for the choices made in your work. Given that all my concerns have been addressed satisfactorily, I am happy to increase my score for this submission.

Comment

We're glad you found our rebuttal detailed and thoughtful, and that the clear explanations we provided for choices made in our work helped address your concerns.

Thank you once again for taking the time to review our paper so thoughtfully!

Review (Rating: 6)

This paper explores dataset distillation methods for self-supervised learning (SSL). The authors demonstrate that the MTT method cannot be directly applied to SSL due to high trajectory variance, both theoretically and empirically. To address this, they propose a solution leveraging knowledge distillation (KD) to reduce the length and variance of SSL trajectories.

Strengths

  • Self-supervised dataset distillation is an important yet underdeveloped area with wide-ranging applications.
  • The empirical and theoretical evidence provided (Theorem 4.1 and Figure 1) for high gradient and trajectory variance of MTT in SSL settings is convincing and insightful.

Weaknesses

  • The performance improvements over random subset and high-loss subset are marginal in many cases, especially on the larger TinyImageNet dataset (Table 3). This raises concerns about the method’s practical value, as it requires significant computational resources to distill synthetic data while yielding minimal improvement.
  • The absence of KRR-ST results in Tables 4, 5, 6, and 7 limits the ability to assess the proposed method's effectiveness, given that KRR-ST is a closely related baseline.
  • The MKDT method appears to add only a knowledge distillation process to MTT, where a student model mimics the SSL teacher model’s representations. The synthetic data is then learned by matching the student model’s trajectory, rather than the teacher model’s trajectory, as in MTT. It is unclear why introducing a student model as an intermediary reduces trajectory variance, or what specific role the student network plays throughout the data distillation process. The authors are encouraged to provide further discussion or insights into this mechanism.

Questions

  • Line 295: Incorrect format in “[41] trains student ….”
  • The MKDT method uses ResNet-18 as the teacher model and ConvNet as the student model. What is the backbone network used for KRR-ST in Tables 2 and 3?
  • Presenting some distilled images would help demonstrate the proposed method’s effectiveness and its advantage over the baseline KRR-ST.
Comment

We’d like to thank the reviewer YkVh for appreciating 1) the importance of the problem we tackle i.e. dataset distillation for self-supervised learning and 2) the empirical and theoretical evidence we provide for high variance of SSL trajectories that prevent successful application of trajectory matching (MTT).

We now address the weaknesses and questions raised by the reviewer:

W1: Our evaluation measures performance across several downstream datasets, and we see significant improvements of up to 13% when pre-training on CIFAR10/CIFAR100. We acknowledge that, for TinyImageNet, the improvement over subsets is smaller on some downstream datasets. However, this is not a limitation of our method, which proposes using knowledge distillation (KD) trajectories to enable dataset distillation for SSL. Instead, this is a limitation inherited from trajectory matching (MTT) for higher-resolution datasets such as TinyImageNet. As seen in the MTT paper [1 from paper], the improvement over random subsets is far smaller for higher-resolution datasets such as TinyImageNet as compared to CIFAR10/CIFAR100. Remedying this problem for trajectory matching in both SL and SSL is an interesting direction for future work.

W2: We have added these results to our revision (Tables 4, 5, 6, and 7). We also include these results below. As we see, across all these settings, our proposal, MKDT, continues to outperform KRR-ST significantly.

For all the tables below, % in brackets indicates the % of labeled downstream data available.

Table 4: KRR-ST 5%

| Dataset | CIFAR100 | CIFAR10 | TinyImageNet | Aircraft | CUB2011 | Dogs | Flowers |
|---|---|---|---|---|---|---|---|
| CIFAR10 1% | 8.69 ± 0.32 | 36.69 ± 0.88 | 3.20 ± 0.23 | 2.26 ± 0.13 | 1.33 ± 0.09 | 1.91 ± 0.34 | 2.39 ± 0.18 |
| CIFAR10 5% | 16.95 ± 0.53 | 47.40 ± 0.34 | 7.10 ± 0.27 | 5.56 ± 0.77 | 1.98 ± 0.07 | 2.78 ± 0.16 | 4.38 ± 0.04 |
| CIFAR100 1% | 9.02 ± 0.24 | 37.86 ± 1.14 | 2.94 ± 0.13 | 2.42 ± 0.35 | 1.50 ± 0.07 | 1.99 ± 0.19 | 3.04 ± 0.36 |
| CIFAR100 5% | 17.24 ± 0.47 | 47.53 ± 0.11 | 6.60 ± 0.32 | 5.37 ± 0.85 | 2.31 ± 0.33 | 2.87 ± 0.27 | 5.23 ± 0.14 |
| TinyImageNet 1% | 7.54 ± 0.35 | 34.27 ± 1.36 | 3.19 ± 0.22 | 2.11 ± 0.23 | 1.30 ± 0.12 | 1.68 ± 0.20 | 2.65 ± 0.64 |
| TinyImageNet 5% | 13.71 ± 0.30 | 42.82 ± 0.46 | 6.50 ± 0.23 | 4.36 ± 0.49 | 1.97 ± 0.06 | 2.75 ± 0.37 | 3.97 ± 0.14 |

Table 5: ResNet10

| Dataset | CIFAR100 | CIFAR10 | TinyImageNet | Aircraft | CUB2011 | Dogs | Flowers |
|---|---|---|---|---|---|---|---|
| KRR-ST 5% | 13.84 ± 0.78 | 39.21 ± 0.55 | 8.04 ± 0.52 | 2.12 ± 0.15 | 1.16 ± 0.05 | 1.77 ± 0.14 | 4.56 ± 0.42 |

Table 5: ResNet18

| Dataset | CIFAR100 | CIFAR10 | TinyImageNet | Aircraft | CUB2011 | Dogs | Flowers |
|---|---|---|---|---|---|---|---|
| KRR-ST 5% | 12.30 ± 0.83 | 35.73 ± 1.07 | 7.21 ± 0.35 | 2.32 ± 0.39 | 1.18 ± 0.16 | 1.81 ± 0.14 | 2.45 ± 0.12 |

Table 6: KRR-ST SimCLR

| Dataset | CIFAR100 | CIFAR10 | TinyImageNet | Aircraft | CUB2011 | Dogs | Flowers |
|---|---|---|---|---|---|---|---|
| C10 1% | 8.38 ± 0.17 | 36.90 ± 1.30 | 2.95 ± 0.12 | 2.45 ± 0.13 | 1.19 ± 0.09 | 1.87 ± 0.18 | 2.35 ± 0.06 |
| C10 5% | 16.29 ± 0.37 | 46.87 ± 0.52 | 6.31 ± 0.43 | 5.31 ± 0.63 | 1.89 ± 0.14 | 2.66 ± 0.18 | 4.36 ± 0.16 |
| C100 1% | 8.38 ± 0.36 | 36.57 ± 1.02 | 3.01 ± 0.22 | 2.41 ± 0.15 | 1.28 ± 0.02 | 1.71 ± 0.30 | 1.98 ± 0.24 |
| C100 5% | 15.75 ± 0.46 | 46.76 ± 0.50 | 6.17 ± 0.20 | 5.43 ± 0.65 | 1.93 ± 0.06 | 2.61 ± 0.25 | 3.55 ± 0.29 |

Table 7: KRR-ST 2% Distilled Data with Larger Label Fractions

| Dataset | CIFAR100 | CIFAR10 | TinyImageNet | Aircraft | CUB2011 | Dogs | Flowers |
|---|---|---|---|---|---|---|---|
| C10 10% | 21.01 ± 0.14 | 51.02 ± 0.53 | 8.95 ± 0.26 | 7.91 ± 0.76 | 2.44 ± 0.12 | 3.54 ± 0.29 | 6.82 ± 0.64 |
| C10 50% | 29.01 ± 0.30 | 58.09 ± 0.07 | 15.94 ± 0.31 | 17.60 ± 1.04 | 5.01 ± 0.34 | 6.81 ± 0.35 | 15.92 ± 0.64 |
| C100 10% | 21.39 ± 0.16 | 52.40 ± 0.73 | 8.21 ± 0.07 | 7.64 ± 0.36 | 2.34 ± 0.12 | 3.76 ± 0.25 | 6.52 ± 1.28 |
| C100 50% | 29.46 ± 0.81 | 58.57 ± 0.78 | 15.70 ± 0.07 | 15.89 ± 0.54 | 4.82 ± 0.41 | 7.00 ± 0.18 | 15.07 ± 0.46 |
| Tiny 10% | 17.02 ± 0.26 | 45.48 ± 0.84 | 8.88 ± 0.41 | 5.29 ± 0.16 | 2.08 ± 0.21 | 3.21 ± 0.15 | 6.10 ± 1.44 |
| Tiny 50% | 20.01 ± 0.40 | 48.16 ± 1.18 | 13.59 ± 0.47 | 8.66 ± 1.29 | 3.46 ± 0.22 | 4.98 ± 0.31 | 15.12 ± 0.37 |
Comment

W3: We provide additional results (https://openreview.net/forum?id=c61unr33XA&noteId=am03RRoWCi) where we directly apply trajectory matching (MTT) to the higher-variance SSL trajectories and show that the corresponding distilled sets perform worse than even trivial baselines (random subsets / no pre-training). The key insight of our method is that, by using the student model as an intermediary, we are able to convert the problem from SSL distillation, where the gradient has high variance, to distillation of a supervised regression problem where the labels are representations learned by the teacher model. As discussed in Sections 4.1 and 4.2, the supervised regression problem has lower gradient variance, as the gradient of each example is independent of the other examples in the batch; whereas for SSL, the gradient of each example depends on all other examples in the mini-batch (and is hence very sensitive to the choice of mini-batches). The evidence we provide in Thm 4.1 and Figure 1 supports our claims: 1) the variance of the SL regression (knowledge distillation) is less than that of SSL, and 2) the lower variance makes distillation more effective.
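
To make this mechanism concrete, here is a minimal, self-contained sketch of the two ingredients just described: the supervised KD regression loss onto frozen teacher representations, and MTT-style trajectory matching that back-propagates into the synthetic images. The toy linear encoder, dimensions, and helper names are illustrative assumptions, not our actual implementation:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D, H, N_SYN = 32, 16, 10          # toy input dim, embedding dim, synthetic-set size (assumptions)
teacher = torch.nn.Linear(D, H).requires_grad_(False)   # frozen SSL-pretrained teacher encoder

def kd_loss(w, x):
    """Supervised regression: student embeddings (x @ w) regress onto the teacher's
    representations; the gradient of each example is independent of the rest of the batch."""
    with torch.no_grad():
        target = teacher(x)
    return F.mse_loss(x @ w, target)

def trajectory_matching_loss(syn_x, w_start, w_end, n_steps=5, lr=0.1):
    """MTT-style objective: take a few KD steps on the synthetic set starting from an expert
    checkpoint w_start, then measure the normalized distance to the expert endpoint w_end."""
    w = w_start.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        grad, = torch.autograd.grad(kd_loss(w, syn_x), w, create_graph=True)
        w = w - lr * grad
    return ((w - w_end) ** 2).sum() / ((w_start - w_end) ** 2).sum()

# Toy usage: the matching loss is differentiable w.r.t. the synthetic images themselves.
syn_x = torch.randn(N_SYN, D, requires_grad=True)
w_start, w_end = torch.randn(D, H), torch.randn(D, H)   # stand-ins for expert KD checkpoints
loss = trajectory_matching_loss(syn_x, w_start, w_end)
loss.backward()
print("matching loss:", float(loss), "| grad norm on synthetic images:", syn_x.grad.norm().item())
```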

Q1: We have fixed this in our revision.

Q2: ResNet-18 trained with BarlowTwins (same as MKDT). We follow the exact methodology and hyperparameters specified by KRR-ST and have added this to our Appendix.

Q3: We have added examples of distilled images for CIFAR10, CIFAR100 and TinyImageNet for both MKDT and KRR-ST in Appendix E. From these, we can clearly observe that MKDT distilled images seem to sharpen the salient class-features of the image, while blurring out background and other extraneous details; in contrast, the images distilled by KRR-ST appear noisy and blurred.

We are eager to engage in further discussion to resolve any other concerns or comments!

Comment

As the discussion period is coming to an end soon, we're hoping to hear if our rebuttal addressed the reviewer's concerns and if we can provide any more clarifications about our work.

Thank you once again for your efforts reviewing our paper!

Comment

Thanks for all the efforts on addressing my concerns, I have increased my score.

Comment

Thank you for taking the time to go through our rebuttal!

Review (Rating: 8)
  • This paper proposes a dataset distillation method for self-supervised pretraining. The method, MKDT, is based on knowledge distillation and is mainly motivated by a study of the gradient variance of different loss functions.

Strengths

  • As the paper claims, this is the first effective dataset distillation method for SSL pretraining.
  • Outstanding experimental performance on provided datasets, e.g., CIFAR-10, 100/Tiny ImageNet.

Weaknesses

  • Mismatch between theory and implementation. The toy case in the theoretical analyses is too simple, and the gap between the analyses and the implementation is too large.
    1. The linear model is provided in the main paper only. However, even in the linear probe experiments, the tested model is non-linear. Moreover, in the related literature (e.g., sharpness-aware generalization, the bias-variance trade-off, out-of-distribution generalization), analyses of gradient variance for different loss designs are usually conducted in the non-linear case (e.g., a 2-layer network with a non-linear activation function such as ReLU, in early years) [1,2,3].
    2. The derivation and presentation of the proof are overly complicated, yet the logic is actually trivial (i.e., simply unfold the deviation of the SL and SSL gradients, and then compare the terms, where it is easy to see which is larger).
  • Pretraining is important nowadays. However, the model architectures are too small and simple, and for such models pretraining is not as important as it is for large models (e.g., LLMs); this is not discussed in the main paper.
    1. In the assumptions of the theorem, citation [4] on sparse coding concerns multi-modal pretraining, but a similar architecture is not mentioned in the rest of the paper.
  • The experiments on generalization are confusing.
    1. In the distillation setup, it is claimed that the teacher model is ResNet-18 and the student model is a ConvNet. However, the models in the experiments on generalization to larger architectures are ResNet-18 and ResNet-10. These models are not larger at all, and ResNet-L (L>18) is available in the provided code.
  • Typos and inconsistent presentation, e.g., "synhronous" in Line 215, and "pre-training" vs. "pretraining".

[1] Estimating Example Difficulty using Variance of Gradients. CVPR 2022.
[2] Fishr: Invariant Gradient Variances for Out-of-Distribution Generalization. ICML 2022.
[3] Rethinking Bias-Variance Trade-off for Generalization of Neural Networks. ICML 2022.
[4] Data-efficient contrastive language-image pretraining: Prioritizing data quality over quantity. AISTATS 2024.

Questions

It seems that the logic of the paper is: the gradient variance of SSL is larger than that of SL; thus, knowledge distillation is needed to lower the variance, given the relationship between gradients and trajectory matching. The questions are:

  • Since the motivation is the high variance of the SSL gradient, why not show the empirical gradient variance directly?
  • How straightforward is the relationship between model performance and gradient variance, e.g., if an ablation study on trajectory sampling is considered?

I initially give a 6 here; actually, I lean toward 5.5. Therefore, some important concerns should be discussed during the rebuttal period.

Comment

We’d like to thank reviewer 1SyT for appreciating our work’s contribution as 1) the first effective dataset distillation method for SSL pre-training, as well as 2) our approach’s outstanding experimental performance on CIFAR-10, CIFAR100 and TinyImageNet.

We now address the weaknesses and questions raised by the reviewer:

W1: We included a simplified theoretical analysis to highlight how the interaction between examples in a batch under the SSL loss leads to higher gradient variance. Thus, we unrolled all the terms to illustrate why the variance of the SSL gradient is larger. The works cited by the reviewer are all for supervised learning; we are not aware of existing analyses of the variance of gradients with non-linear models for contrastive learning. Due to the more complicated nature of the SSL/CL loss, prior works in CL theory [1,2,3] have also relied on analyses of linear models and have demonstrated how such analyses correspond with the behavior of deep nonlinear networks. Providing an expression for the variance of the SSL gradient in a more general setting is a challenging problem in its own right, because the gradient of each example depends on every other example in the batch, which introduces additional complications. Such an analysis is beyond the scope of this work. Nonetheless, we have revised our claims from "confirmed theoretically" to "illustrate theoretically, in a simplified setting" to reflect the gap between theory and implementation.
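
To restate the core structural point in simplified, hedged notation (ours, not the exact statement of Theorem 4.1):

```latex
% Simplified notation (ours), not the exact statement of Theorem 4.1.
% The KD objective is a sum of per-example terms, so the mini-batch gradient is an
% average of terms g_i that each depend only on their own example x_i:
\[
\mathcal{L}_{\mathrm{KD}}(\theta; B) = \frac{1}{|B|}\sum_{i \in B}\big\lVert f_\theta(x_i) - z_i^{\mathrm{teacher}}\big\rVert_2^2,
\qquad
\nabla_\theta \mathcal{L}_{\mathrm{KD}}(\theta; B) = \frac{1}{|B|}\sum_{i \in B} g_i(\theta).
\]
% In contrast, a contrastive SSL loss such as
\[
\mathcal{L}_{\mathrm{SSL}}(\theta; B) = -\frac{1}{|B|}\sum_{i \in B}\log
\frac{\exp\!\big(\operatorname{sim}(f_\theta(x_i), f_\theta(x_i^{+}))/\tau\big)}
     {\sum_{j \in B}\exp\!\big(\operatorname{sim}(f_\theta(x_i), f_\theta(x_j))/\tau\big)}
\]
% couples every example to the rest of the batch through the denominator, so the
% mini-batch gradient depends on (and varies with) the particular sampled batch B.
```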

W2: Although we use 4-5 layer ConvNets for distillation, in Table 5 we demonstrate that datasets distilled with these smaller ConvNets can be used to pre-train models as large as ResNet-18. This pre-training is effective, as it shows generalization to a variety of downstream tasks with only 1-5% of data, when evaluated by training only a linear classifier on the representations of the pretrained ResNets. We note that using 4-5 layer ConvNets is also common practice in dataset distillation for SL [1, 23, 25, 42, 46, 48 etc. from paper]. This is because gradient, weight, or distribution matching with networks larger than ConvNets becomes prohibitive due to the very high cost and difficulty of optimization. Enabling further scaling of this method to larger networks is indeed an interesting direction for future research. Recent work [6, 9 from paper] on improving the scalability and optimization of trajectory matching, orthogonal to our contribution, can be useful to this end. With regard to the citation for the sparse-coding model, we have replaced it with citations from prior literature on unimodal contrastive learning.

W3: There might be a misunderstanding of the goal of our experiment here. We are not trying to demonstrate generalization to architectures larger than the “teacher network”. Instead, our goal is to demonstrate that our methodology can apply to networks larger than the 4-5 layer ConvNets used for distillation. The results on ResNet-10 and ResNet-18 indeed prove that this methodology can generalize to larger networks.

W4: Thank you for catching these! We've addressed them in our revision.

Q1: Since the weight update at each iteration equals the learning rate times the gradient, Figure 1 shows the high variance of the gradients via the variance of the weights after each iteration. The first step shows exactly the high variance of the gradients, and subsequent steps show how this leads to an even further increase in the variance of the weights over iterations. Moreover, we focus on the impact of the high gradient variance on the variance of the weights, since trajectory matching matches weights at different iterations rather than gradients. In Figure 1a, we compute the variance in weights after each iteration (step), across 5 models starting from the same initialization, for both SL (in particular, the knowledge distillation loss, which is a supervised regression loss) and SSL. The figure confirms the smaller variance of our method (denoted as MTT SL) vs. MTT SSL.

Comment

Q2: First, we provide additional results (see https://openreview.net/forum?id=c61unr33XA&noteId=am03RRoWCi) showing that MTT applied directly to higher variance SSL trajectories leads to distilled sets that perform worse than even trivial baselines (random subsets / no-pretraining). Second, as discussed earlier, Fig 1 confirms the connection between variance and distillation performance. Figure 1a established that the variance of the gradient is large for SSL as compared to SL. Figure 1b shows that as a result, distillation loss cannot be effectively minimized by MTT on higher variance trajectories. Figure 1c confirms that the inability to minimize distillation loss leads to no change in the images from their initialization - indicating no distillation has occurred. To further confirm this connection, we considered reducing variance for SSL by increasing the batch size (denoted by 4x batch size SSL). We confirm in Figure 1a that this does indeed reduce the variance. Figure 1b shows that this leads to a slightly better minimization in distillation loss and Figure 1c shows that this leads to a slightly more effective distillation with images changing from their initialization. Overall, the distilled images using the high variance SSL trajectories are nearly identical to the real images used for initialization. Thus, they offer no benefits to pre-training, privacy etc. We use these results together to confirm empirically the connection between gradient variance and effectiveness of trajectory matching distillation.

We are eager to engage in further discussion to resolve any other concerns or comments!

References:

[1] Ji, Wenlong, et al. "The power of contrast for feature learning: A theoretical analysis." Journal of Machine Learning Research 24.330 (2023): 1-78.

[2] Xue, Yihao, et al. "Investigating the Benefits of Projection Head for Representation Learning." Proceedings of the International Conference on Learning Representations (ICLR), 2024.

[3] Xue, Yihao, et al. "Which features are learnt by contrastive learning? On the role of simplicity bias in class collapse and feature suppression." International Conference on Machine Learning. PMLR, 2023.

Comment

As the discussion period is coming to an end soon, we're hoping to hear if our rebuttal addressed the reviewer's concerns and if we can provide any more clarifications about our work.

Thank you once again for your efforts reviewing our paper!

Comment

As the extended discussion period will end in a few days, we're hoping to hear if our rebuttal addressed the reviewer's concerns and if we can provide any more clarifications about our work.

Thank you once again for your efforts reviewing our paper!

Comment

Thanks for the rebuttal. It resolves most of my concerns, and I have raised my score. Good luck! Citing more recent DD papers would be good for this work.

Comment

Thank you for taking the time to go through our rebuttal!

Comment

Table: MTT for High-Variance SSL Trajectories

| Dataset | CIFAR100 | CIFAR10 | TinyImageNet | Aircraft | CUB2011 | Dogs | Flowers |
|---|---|---|---|---|---|---|---|
| C10 (1% Downstream Labels) | 4.63 ± 0.30 | 22.06 ± 1.86 | 1.11 ± 0.16 | 1.92 ± 0.26 | 0.77 ± 0.11 | 1.38 ± 0.28 | 1.26 ± 0.37 |
| C10 (5% Downstream Labels) | 7.10 ± 0.43 | 29.04 ± 0.58 | 1.90 ± 0.17 | 2.81 ± 0.60 | 0.88 ± 0.08 | 1.82 ± 0.19 | 1.42 ± 0.53 |
| C100 (1% Downstream Labels) | 4.26 ± 0.34 | 26.06 ± 1.58 | 1.20 ± 0.14 | 1.49 ± 0.28 | 0.87 ± 0.26 | 1.24 ± 0.07 | 1.41 ± 0.27 |
| C100 (5% Downstream Labels) | 9.66 ± 0.24 | 38.16 ± 0.57 | 2.11 ± 0.19 | 2.83 ± 0.41 | 1.06 ± 0.04 | 1.74 ± 0.15 | 2.38 ± 0.19 |

As confirmed explicitly in this table, applying MTT directly to the high variance SSL trajectories indeed leads to an ineffective distilled set. In fact, training on this distilled set is even worse than Random Subsets (compare with Table 4 in main paper).

Comment

We thank the reviewers for their valuable suggestions to our submission. We've made changes to our submission based on these and highlighted these changes in blue.

AC Meta-Review

This paper introduces a novel approach for dataset distillation in self-supervised learning (SSL) pre-training, addressing the inherent challenge of high gradient variance in SSL objectives. The proposed method, Matching Knowledge Distillation Trajectories (MKDT), leverages a teacher-student framework to stabilize gradients and improve the optimization of synthetic datasets. Its innovative approach, strong empirical performance, and theoretical grounding make it a solid candidate for acceptance. The thorough rebuttal further solidified the case for the paper, addressing all reviewer concerns comprehensively.

Additional Comments from Reviewer Discussion

All reviewers agreed to accept this paper.

Final Decision

Accept (Poster)