PaperHub
Rating: 6.0 / 10 · Spotlight · 4 reviewers (min 6, max 6, std 0.0)
Individual ratings: 6, 6, 6, 6
Confidence: 4.0 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 3.3
NeurIPS 2024

Diversity-Driven Synthesis: Enhancing Dataset Distillation through Directed Weight Adjustment

OpenReview · PDF
Submitted: 2024-05-05 · Updated: 2024-11-19

Abstract

Keywords
Dataset Distillation, Synthetic Data, Diversity, Generalization

Reviews and Discussion

Review
Rating: 6

This paper proposes a new method to enhance dataset distillation by improving the diversity of synthesized datasets. The authors introduce a dynamic and directed weight adjustment technique to maximize the representativeness and diversity of synthetic instances. Theoretical and empirical analyses demonstrate that increased diversity enhances the performance of neural networks trained on these synthetic datasets. Extensive experiments across multiple datasets, including CIFAR, Tiny-ImageNet, and ImageNet-1K, show that the proposed method outperforms existing approaches while incurring minimal computational expense.

Strengths

  • The paper is well-written and structured.
  • The analysis of the decoupled variance component is interesting.
  • The proposed dynamic adjustment mechanism is novel and can enhance the diversity of the synthesized dataset.
  • The paper provides a solid theoretical foundation and extensive empirical validation across multiple datasets and architectures.
  • The findings have important implications for the efficiency of dataset distillation, particularly for large-scale datasets.

Weaknesses

  • The effectiveness of the diversity in generated datasets has been discussed in many prior works [1, 2].
  • The authors should compare the proposed method with recent SOTA dataset distillation methods that extended from SRe2L [2, 3]. Especially for [3], the relation to it needs to be clarified.
  • A comparison with recent diffusion-based generative dataset distillation methods [4, 5] is also needed because they are free of optimization.
  • How long does it typically take to search for the optimal values of $K$ and $\rho$ for ResNet-18 and for other datasets and network architectures? Can the authors provide insights into the computational cost associated with this parameter-tuning process?
  • Are there any scenarios where the directed weight adjustment might negatively impact performance or diversity? If so, how can these be mitigated?
  • Have the authors considered some downstream applications to further verify the effectiveness of the proposed method, such as continual learning and privacy protection?

[1] Liu, Yanqing, et al. "Dream: Efficient dataset distillation by representative matching." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[2] Sun, Peng, et al. "On the diversity and realism of distilled dataset: An efficient dataset distillation paradigm." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[3] Shao, Shitong, et al. "Generalized large-scale data condensation via various backbone and statistical matching." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[4] Gu, Jianyang, et al. "Efficient dataset distillation via minimax diffusion." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[5] Su, Duo, et al. "D4M: Dataset Distillation via Disentangled Diffusion Model." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Questions

See the weaknesses section. I am happy to raise my score if the above concerns can be addressed.

Limitations

The authors discussed the limitations.

Author Response

Thank you for your constructive comments! We sincerely appreciate the time and effort you dedicated to reviewing our work. We will incorporate the mentioned state-of-the-art methods [2,3,4,5] in the related work section and include the downstream experiments in our revised paper. Below are our responses to your questions.

Q1 (Weakness 1): prior works [1, 2].

A1: Yes, we will incorporate those prior works in the Related Works of our revised paper. In fact, Dream [1] is already cited in line 298 in our original submission. We will also include RDED [2] in our revised paper, which enhances diversity by compressing multiple key patches into a single image. These key patches are selected from the original dataset based on a specific metric.

Unlike these prior works, we explore a novel perspective by adjusting weight parameters to introduce greater variance in generating synthetic instances. Our DWA is theoretically complementary to these prior works.

Q2 (Weakness 2 & 3): compare with recent SOTA [2,3,4,5]

A2: Yes, we will compare our proposed DWA with the recent SOTA methods [2,3,4,5] in our revised paper. We first present the comparison results in the tables below.

Tiny-ImageNet

  • ResNet-18

| ipc | DWA (Ours) | RDED [2] | G-VBSM [3] | D4M [5] |
| --- | --- | --- | --- | --- |
| 50 | 52.8 | 58.2 | 47.6 | 46.8 |
| 100 | 56.0 | - | 51.0 | 53.3 |

  • ResNet-101

| ipc | DWA (Ours) | RDED [2] | G-VBSM [3] | D4M [5] |
| --- | --- | --- | --- | --- |
| 50 | 54.7 | 41.2 | 48.8 | 53.2 |
| 100 | 57.4 | - | 52.3 | 55.3 |

ImageNet-1K

  • ResNet-18

| ipc | DWA (Ours) | RDED [2] | G-VBSM [3] | Minimax [4] | D4M [5] |
| --- | --- | --- | --- | --- | --- |
| 50 | 55.2 | 56.5 | 51.8 | 58.6 | 55.2 |
| 100 | 59.2 | - | 55.7 | - | 59.3 |

  • ResNet-101

| ipc | DWA (Ours) | RDED [2] | G-VBSM [3] | D4M [5] |
| --- | --- | --- | --- | --- |
| 50 | 63.3 | 61.2 | 61.0 | 63.4 |
| 100 | 66.7 | - | 63.7 | 66.5 |

Our proposed DWA outperforms the four state-of-the-art methods in the following scenarios: Tiny ImageNet with ResNet-18 and ipc 100, and ResNet-101 with ipc 50 and 100, as well as ImageNet-1K with ResNet-101 and ipc 100. While diffusion-based methods have advantages on the ImageNet-1K dataset using ResNet-18, our DWA performs very close to their level.

RDED[2] enhances diversity by compressing multiple key patches into a single image using a dataset pruning approach, selecting patches from the original dataset based on specific metrics instead of synthesizing distilled datasets. G-VBSM[3] ensures data densification by performing matching based on diverse backbones, layers, and statistics, sampling a random backbone for distillation in each iteration. In contrast, our method introduces perturbations on a single model, whereas G-VBSM requires extra computation and memory for soft labels generated by different architectures, indicating the efficiency of our approach.

Q3 (Weakness 4): computational cost of parameter tuning?

A3: Our method, DWA, involves only two hyperparameters: $K$ and $\rho$. We use consistent values of $K=12$ and $\rho=0.0015$ across the four datasets: CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K. The generalized parameters of our DWA make the parameter-tuning process straightforward to implement.

We demonstrated the results of the hyperparameter search in Figure 5 of our original submission. Figure 5 illustrates that a significant performance gain can be achieved as long as $K>4$ and $\rho>0.001$.

We searched the parameters $K$ and $\rho$ using binary search, 12 times in total (6 times for each). Each search requires 1.5 hours using ResNet-18 on CIFAR-100. The computational cost associated with this parameter-tuning process is therefore low for our DWA method.

Q4 (Weakness 5): negatively impact performance

A4: We would like to thank you for the insightful question. According to our experiments across four datasets and seven architectures, we did not observe any negative impact on performance from using our DWA method.

However, based on our theoretical analysis in Section 3, there can be negative impacts of using our DWA. As we stated in line 160, perturbing converged weight parameters will introduce noise. This issue is verified by our experiments in the table of random perturbation, which also motivated us to investigate the directed perturbation to avoid introducing noise. Therefore, there are two scenarios where DWA can have a negative impact:

First, if the model $f_{\theta_{\mathcal{T}}}$ is not well-optimized or converged. The theoretical guarantee of the directed perturbation is proved in Equation 14. Intuitively, this means that the directed perturbation $\widetilde{\Delta\theta}$ will decrease the integrated loss $L_{\mathcal{T} \setminus \mathbb{B}}$. Equation 14 will not hold if the weight parameters $\theta_{\mathcal{T}}$ have not converged; that is, if the model used to generate the synthetic dataset is not well-optimized.

Second, if we use a very large value of $\rho$. A larger $\rho$ will also make Equation 14 invalid, as our assumption is that $\rho$ is small enough for the first-order Taylor expansion.

However, both scenarios can be easily avoided by using a well-optimized $f_{\theta_{\mathcal{T}}}$ and a smaller $\rho$. Your suggestion inspires us to design a mechanism to test Equation 14 in our code to avoid these two scenarios and improve our DWA.
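
For illustration, a minimal sketch of such a check is given below: before a perturbed teacher is used for synthesis, verify that the directed perturbation does not increase the loss on held-out real data (the behavior Equation 14 is meant to guarantee). This is not the released code; the function name, loop structure, and acceptance criterion are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def perturbation_is_safe(model, perturbed_model, holdout_loader, device="cuda"):
    """Accept a directed perturbation only if it does not increase the average
    loss on held-out real data. Illustrative sketch; names and the acceptance
    criterion are assumptions, not the paper's implementation."""
    def avg_loss(m):
        m.eval()
        total, count = 0.0, 0
        for x, y in holdout_loader:
            x, y = x.to(device), y.to(device)
            total += F.cross_entropy(m(x), y, reduction="sum").item()
            count += y.numel()
        return total / count

    return avg_loss(perturbed_model) <= avg_loss(model)
```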

Q5 (Weakness 6): downstream applications

A5: Thank you for the advice. We first examine the continual learning task to verify our proposed DWA. Our implementation is based on the effective continual learning method GDumb [1]. Class-incremental learning was performed under strict memory constraints on the CIFAR-100 dataset, using 20 images per class (ipc=20). CIFAR-100 was divided into five tasks, and a ConvNet was trained using our distilled dataset, with accuracy reported as new classes were incrementally introduced.

| classes | 20 | 40 | 60 | 80 | 100 |
| --- | --- | --- | --- | --- | --- |
| SRe2L | 15.7 | 10.6 | 9.0 | 7.9 | 6.9 |
| DWA | 34.6 | 25.7 | 22.5 | 20.2 | 18.1 |
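
For reference, a minimal sketch of the class-incremental protocol described above (five tasks on CIFAR-100, ipc=20, a ConvNet retrained from scratch on the distilled memory after each task, GDumb-style). The helper functions and names here are illustrative assumptions, not the actual evaluation code.

```python
def incremental_eval(distilled_by_class, make_convnet, train_fn, test_fn,
                     num_classes=100, num_tasks=5):
    """GDumb-style evaluation: after each task, retrain a fresh ConvNet on the
    distilled images of all classes seen so far and report test accuracy.
    Sketch only; make_convnet/train_fn/test_fn are assumed helpers."""
    classes_per_task = num_classes // num_tasks
    memory, accuracies = [], []
    for t in range(num_tasks):
        new_classes = range(t * classes_per_task, (t + 1) * classes_per_task)
        for c in new_classes:
            memory.extend(distilled_by_class[c])   # ipc=20 distilled images per class
        model = make_convnet()                     # trained from scratch each task
        train_fn(model, memory)
        accuracies.append(test_fn(model, num_seen=(t + 1) * classes_per_task))
    return accuracies
```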

As you suggested, we will also incorporate downstream applications such as Neural Architecture Search (NAS) and privacy protection in our revised paper.

Comment

Thank you for your prompt response, your clarifications have resolved many of my concerns. Regarding A2, I would also like to know the performance comparison when the IPC is set to 10.

Comment

Due to length constraints, these details were not included in the initial response. The comparison with SOTA methods under ipc=10 is summarized below. Benefiting from more powerful supervision, such as multi-patch stitching in RDED [2] and the use of large pre-trained diffusion models in Minimax [4], these two methods outperform ours in the ipc=10 setting. Despite the limitations imposed by a restricted distillation budget, which hinders the diversity of our synthesis results, our method still performs significantly better than the remaining SOTA methods.

ImageNet-1K

  • ResNet-18

| ipc | DWA (Ours) | SRe2L | RDED [2] | G-VBSM [3] | Minimax [4] | D4M [5] |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | 37.9 | 21.3 | 42.0 | 31.4 | 44.3 | 27.9 |

  • ResNet-101

| ipc | DWA (Ours) | SRe2L | RDED [2] | G-VBSM [3] | D4M [5] |
| --- | --- | --- | --- | --- | --- |
| 10 | 46.9 | 30.9 | 48.3 | 38.2 | 34.2 |

Besides the suggested related works, we also review three other recent methods: DATM [6] (a lossless distillation method), CDA [7] (an extension of SRe2L), and D3M [8] (a diffusion-based method). The comparison showcases the cutting-edge performance of our method.

CIFAR-100

ConvNet

| ipc | DWA (Ours) | DATM [6] |
| --- | --- | --- |
| 10 | 47.6 | 47.2 |
| 50 | 59.0 | 55.0 |

ImageNet-1K

ResNet-18

| ipc | DWA (Ours) | CDA [7] | D3M [8] |
| --- | --- | --- | --- |
| 10 | 37.9 | - | 23.57 |
| 50 | 55.2 | 53.5 | 32.23 |
| 100 | 59.2 | 58.0 | - |

[6] Guo Z, Wang K, Cazenavette G, et al. "Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching." International Conference on Learning Representations. 2023

[7] Yin, Zeyuan, and Zhiqiang Shen. "Dataset distillation in large data era." 2023.

[8] Abbasi A, Shahbazi A, Pirsiavash H, et al. "One Category One Prompt: Dataset Distillation using Diffusion Models". arXiv preprint arXiv:2403.07142, 2024.

Comment

Thank you for your time and constructive comments! The comparison mentioned above will be included in our camera-ready version.

Comment

Thank you very much, I have no further concerns. And I have increased my score.

Comment

Thank you for your helpful review. We are glad to have resolved your concerns.

Review
Rating: 6

This paper proposes a dynamic adjustment mechanism to enhance the variation component in the batch normalization loss and thereby increase the diversity of synthetic images. By adjusting the weights, the synthesized images achieve sufficient diversity to cover the widespread distribution of the real dataset.

Strengths

  a. Originality: The authors innovatively analyze the different components of the loss function. This work differs from previous contributions by proposing a directed weight adjustment algorithm to increase the diversity of synthetic images. Related work is adequately cited.
  b. Quality: The submission is technically sound. The claims are well supported by theoretical analysis and experimental results. It is a complete piece of work. The authors are careful and honest in evaluating both the strengths (increased diversity and minimal computational expense) and weaknesses (unexpected noise) of their work.
  c. Clarity: The submission is clearly written and well organized.
  d. Significance: The results are important. The submission addresses the diversity problem more effectively than previous work. Most results advance the state of the art by a large margin, providing a unique theoretical approach that is also demonstrated in experiments.

Weaknesses

  c. Clarity: A few equations, such as (14), need more explanation. The current text does not adequately inform the reader on how to reproduce the results, as there is insufficient connection to the implementation details.
  d. Significance: The results on CIFAR-10 using ConvNet are significantly worse than the current state of the art. Additionally, the authors did not mention whether or how the method can be integrated as a plug-in.

Questions

  1. In Equation 14, can the authors explain why it is ≤ 0? In line 176, is the first term clearly greater than the second term because Equation 14 is ≤ 0, or is the causal relation the reverse, with the first term being greater than the second term implying that Equation 14 is ≤ 0?
  2. In Table 1, can the authors explain why the results of ConvNet on CIFAR-10 are noticeably worse than the state of the art?
  3. Why are there no results for ipc=1 in any of the result tables?

Limitations

The authors address the limitation about the unexpected noise.

Author Response

Thank you for your insightful comments! We answer your questions below.

Q1 (Weakness 1 & Question 1): A few equations, such as (14), need more explanation. The current text does not adequately inform the reader on how to reproduce the results, as there is insufficient connection to the implementation details.

A1: Thank you for your advice on improving the presentation of our paper. We will revise these equations and release our code for reproducibility once our paper is accepted.

The solutions to Equation 11, referenced in Algorithm 1, are obtained using a multi-step first-order gradient descent method as stated in lines 6-7 of Algorithm 1.

Equation 14 aims to prove that directed perturbation will not introduce unanticipated noise while enhancing diversity. This provides a theoretical explanation for using Equation 11 to solve for the optimal perturbation. Additionally, it offers behavioral guidance for the "Directed Weight Adjustment" step in line 7 of Algorithm 1.

Derivation of Equation 14: According to Equation 11, $\Delta\theta$ can achieve the maximum of $L_{\mathbb{B}}$, so we have

$$L_{\mathbb{B}}\left(f_{\theta + \widetilde{\Delta\theta}}\right) - L_{\mathbb{B}}\left(f_{\theta}\right) \leq 0,$$

and then we can obtain Equation 14 $\leq 0$.

Q2 (Weakness 2 & Question 2): The results on CIFAR10 using ConvNet are significantly worse than the current state of the art.

A2: We believe the poor performance on CIFAR-10 is inherited from our baseline, SRE2L, which also underperforms with ConvNet on this dataset. However, our DWA still outperforms SRE2L in this scenario.

The reasons for the poor performance of SRE2L and DWA in using ConvNet on CIFAR-10 are as below:

First, ConvNet is a very small and simple architecture, consisting of only three convolutional layers. The typical bi-level optimization approach used in previous methods has advantages in these simplified architectures. In contrast, SRE2L and our DWA leverage a coarser metric (BN parameters) for distillation, which is more adaptive to complex architectures with more BN layers.

Second, the synthetic datasets in CIFAR-10 are only 1/10 the size of those in CIFAR-100. Our proposed DWA improves distillation performance through enhanced diversity of the synthetic datasets. As the reviewer Uwti stated, “since the notion of ‘diversity’ only becomes significant when each class has multiple prototypes,” our method is expected to achieve greater performance gains with larger synthetic datasets. This is because our DWA can prevent duplicated synthetic instances.

Q3 (Weakness 3): Whether or how the method can be integrated as a plug-in.

A3: Yes, our method is designed as a plug-in approach for generation-based distillation methods. Consequently, our DWA can be integrated with baselines such as [1] and [2], as our theoretical analysis remains valid for these methods. Additionally, our code is easy to implement, requiring only minor adjustments to the weights before synthesis.

However, our DWA is not adaptable to trajectory-matching methods like [3], as these methods rely on stable and consistent gradient trajectories during training to guide the synthesis of distilled data. Using our DWA would disrupt the consistency of these trajectories, making it challenging to condense sufficient knowledge.

We plan to explore the integration of our DWA with more baselines in future work.

Q4 (Question 3): Why there is no results for ipc=1 in all result tables?

A4: We do not present the results for ipc=1 for the following reasons:

First, our proposed DWA is designed to introduce diversity in the dimension of ipc, specifically by introducing intra-class diversity to prevent duplicated synthetic instances within the same class. Therefore, our DWA is effective only when ipc > 1.

Second, the diversity issue becomes a performance bottleneck only when there is a sufficient distillation budget. In the scenario with ipc=1, the bottleneck of performance is the distillation budget rather than diversity.

We also conducted experiments to compare our DWA method to the baseline SRE2L in scenarios with different ipc settings. These experiments were conducted using ImageNet-1k with ResNet-18.

| ipc | 1 | 5 | 10 | 20 | 30 | 40 | 50 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SRe2L | 0.1 | 15.2 | 21.6 | 33.6 | 37.5 | 42.4 | 46.8 |
| DWA (ours) | 3.4 | 20.5 | 37.9 | 47.4 | 51.4 | 53.6 | 55.2 |

The results verify the effectiveness of our DWA across different ipc settings.

[1] Shao, Shitong, et al. "Generalized large-scale data condensation via various backbone and statistical matching." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[2] Gu, Jianyang, et al. "Efficient dataset distillation via minimax diffusion." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[3] George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A. Efros, and Jun-Yan Zhu. Dataset distillation by matching training trajectories. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 10708–10717, 2022.

Comment

Thank you for your thorough explanation. I will maintain the current score.

Comment

Thank you for your time. We are so glad the responses addressed your concerns!

Review
Rating: 6

This paper introduces a random directed perturbation strategy to the teacher models for further strengthening the diversity of the existing sample-wise data synthesis method. Through intuitive empirical and theoretical analysis of the diversity factors, the authors proposed a simple yet overlooked weight adjustment strategy, showing promising results in various vision models and benchmarks including ImageNet.

========== Post rebuttal ==========

Thank you for the detailed responses and additional analysis. Having read the responses, my major concerns were addressed. Therefore, I raise the rating.

Strengths

  • Intuitive motivational analysis of how the existing method, SRe2L, lacks diversity, while providing empirical and theoretical backgrounds for the factors that can affect diversity during the synthesis process.
  • Proposes simple yet overlooked perspectives on the existing data synthesis loss via 1) perturbing the teacher model in a random directed way and 2) decoupled variance matching, both of which collaboratively enhance sample diversity.
  • These simple yet intuitive methods exhibit largely improved performance on various vision models and benchmarks including large-scale ImageNet dataset.
  • Thorough ablation studies with the decoupled variance coefficient, weight adjustment, and cross-architecture additionally provide an intuition of how the proposed method enhances the actual performance.

Weaknesses

  • The proposed method has a heavy reliance on BN statistics matching from CNN-based models. Considering that the recent State-of-the-art models such as ViT are not leveraging BN, the proposed approach might not be directly applicable to these vision transformer-based models.
  • Failure cases of DWA should be clarified further. For example, in Table 1, when the model and the number of classes in the target dataset are small (ConvNet / CIFAR-10), DWA underperforms compared to the other methods.
  • A comparison with the baseline (SRe2L) in terms of computational overhead is required. Since this paper introduces an add-on procedure, weight adjustment, on top of the baseline, the authors should clarify whether the proposed method induces significant computational overhead during synthesis.
  • In Figure 5, the test performance varies depending on the perturbation parameters. Although the authors deterministically set one set of parameters for all the different datasets, it should be clarified if these hyper-parameters are not sensitive and generalizable to all the datasets, to reduce time-consuming trial and error for searching these parameters on the other general datasets.

Questions

  • Is the proposed synthesis method applicable to the ViT-based models that leverage LayerNorm rather than BatchNorm layers? Further experiments (directly synthesizing on ViT-based models or cross-architecture experiments by synthesizing with CNN models and testing on ViT-based models) would be required for the generalization of the proposed method to the recent models.
  • Disregarding the computational overhead, can searching for the optimal perturbation parameters per sample rather than using one set of predefined universal perturbation parameters help improve the performance?
  • Can you provide a comparison of computational overhead with SRe2L on the ImageNet dataset?

Limitations

See above weaknesses and questions.

Author Response

We appreciate your valuable comments and answer your questions in order.

Q1 (Weakness 1 & Question 1): Considering that the recent State-of-the-art models such as ViT are not leveraging BN, the proposed approach might not be directly applicable to these vision transformer-based models.

A1: We acknowledge that our proposed approach cannot be directly applied to models without BN layers, such as ViT. Our baseline solution, SRE2L, involves developing a ViT-BN model that replaces all LayerNorm layers with BN layers and adds additional BN layers between the two linear layers of the feed-forward network. We followed their solution and conducted cross-architecture experiments with DeiT-Tiny on the ImageNet-1K dataset. The results are listed below.

  • Squeezed Model: DeiT-Tiny-BN

| Evaluation Model | DeiT-Tiny | ResNet-18 | ResNet-50 | ResNet-101 |
| --- | --- | --- | --- | --- |
| SRe2L | 25.36 | 24.69 | 31.15 | 33.16 |
| DWA (ours) | 37.0 | 32.64 | 40.77 | 43.15 |

  • Squeezed Model: ResNet-18

| Evaluation Model | DeiT-Tiny | ResNet-18 | ResNet-50 | ResNet-101 |
| --- | --- | --- | --- | --- |
| SRe2L | 15.41 | 46.71 | 55.29 | 60.81 |
| DWA (ours) | 22.72 | 55.2 | 62.3 | 63.3 |

*DeiT-Tiny and DeiT-Tiny-BN refer to the vanilla architecture and the BN-integrated variant, respectively.

The results demonstrate that our approach can be applied to ViT-BN with superior performance compared to the baseline. Theoretically, our approach can be combined with other generation-based distillation methods to avoid the BN limitation. This can be our future exploration.

Q2 (Weakness 2): When the model and the number of classes in the target dataset is small (ConvNet / CIFAR-10), DWA underperforms compared to the other methods.

A2: Our proposed DWA underperforms compared to other methods only with ConvNet on CIFAR-10, although it outperforms them on CIFAR-100. We believe the poor performance on CIFAR-10 is inherited from our baseline, SRE2L, which also underperforms with ConvNet on this dataset. However, our DWA still significantly outperforms SRE2L in this scenario.

The reasons for the poor performance of SRE2L and DWA in using ConvNet on CIFAR-10 are as below:

First, ConvNet is a very small and simple architecture, consisting of only three convolutional layers. The typical bi-level optimization approach used in previous methods has advantages in these simplified architectures. In contrast, SRE2L and our DWA leverage a coarser metric (BN parameters) for distillation, which is more adaptive to complex architectures with more BN layers.

Second, the synthetic datasets in CIFAR-10 are only 1/10 the size of those in CIFAR-100. Our proposed DWA improves distillation performance through enhanced diversity of the synthetic datasets. As the reviewer Uwti stated, “since the notion of ‘diversity’ only becomes significant when each class has multiple prototypes,” our method is expected to achieve greater performance gains with larger synthetic datasets. This is because our DWA can prevent duplicated synthetic instances.

Q3 (Weakness 3 & Question 3): Since this paper introduces add-on procedure, weight adjustment, to the baseline, the authors should clarify if the proposed method does not induce significant computation overhead during synthesis.

A3: Thank you for the constructive comments; we report the computational overhead below.

| Methods | Avg. time used for generating 1 ipc |
| --- | --- |
| SRe2L | 116.58 s (100%) |
| DWA (ours) | 125.12 s (107.32%) |

We compare the averaged time used for generating 1 image for each class using ResNet-18 on CIFAR-100. It can be seen that our proposed DWA only requires 7.32% additional computational overhead to improve the diversity of the synthetic dataset.

This overhead is attributed to the $K$ steps of directed weight perturbation performed before the generation of each ipc, as stated in lines 6-7 of Algorithm 1:

For $k = 1$ to $K$ do:    $\Delta\theta_k = \Delta\theta_{k-1} + \frac{\rho}{K} \nabla L_{S_{k0}} \left( f_{\theta_{\mathcal{T}} + \Delta\theta_{k-1}} \right)$.

Since each ipc requires 1000 iterations of forward-backward propagation for generation, our DWA's requirement of only $K=12$ additional forward-backward propagations is negligible in the overall optimization process. We will add this analysis to our revised paper.
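
For concreteness, a minimal PyTorch-style sketch of these $K$ perturbation steps is given below. A pretrained teacher `model`, a real-image batch `images` with `labels`, and the hyperparameters `K` and `rho` are assumed; this is an illustration of the update rule quoted above, not the released implementation.

```python
import copy
import torch
import torch.nn.functional as F

def directed_weight_adjustment(model, images, labels, K=12, rho=0.0015):
    """Apply K small gradient steps of step size rho/K to a copy of the teacher
    weights, following the update rule in lines 6-7 of Algorithm 1.
    Illustrative sketch; names and details are assumptions."""
    perturbed = copy.deepcopy(model)
    for _ in range(K):
        perturbed.zero_grad()
        loss = F.cross_entropy(perturbed(images), labels)
        loss.backward()
        with torch.no_grad():
            for p in perturbed.parameters():
                if p.grad is not None:
                    p += (rho / K) * p.grad  # step along the gradient of L on the batch
    return perturbed
```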

Q4 (Weakness 4 & Question 2): It should be clarified if these hyper-parameters are not sensitive and generalizable to all the datasets, to reduce time-consuming trial and error for searching these parameters on the other general datasets. And can searching for the optimal perturbation parameters per sample rather than using one set of predefined universal perturbation parameters help improve the performance?

A4: We demonstrated the results of the hyperparameter search in Figure 5 of our submission. It can be seen that a significant performance gain can be achieved as long as $K>4$ and $\rho>0.001$, indicating the effectiveness of our DWA. Our method only involves two hyperparameters, $K$ and $\rho$, and we use consistent values of $K=12$ and $\rho=0.0015$ across the four datasets: CIFAR-10/100, Tiny-ImageNet, and ImageNet-1K. Thus, the cost of searching for hyperparameters is very low for our method.

Comment

Thank you for your insightful and constructive feedback on our submission! We have noted your post-rebuttal response and will incorporate the DeiT results and the computational overhead comparison into our paper, as you suggested.

Review
Rating: 6

This paper introduces a novel method for dataset distillation (DD): DD through Directed Weight Adjustment (DWA). The authors first identify that current dataset distillation methods often fail to ensure diversity in the distilled data, which leads to suboptimal performance. To address this issue, they propose a method that uses dynamic and directed weight adjustment techniques to increase the diversity of synthetic data. The proposed method DWA introduces three modifications on top of the SOTA method SRe2L: 1) decoupled coefficients for the batch norm loss, 2) initialization with real images, and 3) random perturbation on model weights. Extensive experiments on datasets like CIFAR, Tiny-ImageNet, and ImageNet-1K demonstrate the superior performance of this approach, highlighting its effectiveness in producing diverse and representative synthetic datasets with minimal computational expense.

Strengths

  1. Increasing diversity in distilled datasets is a key challenge posed by existing methods, and this paper addresses it effectively. The authors conducted a detailed analysis showing that batch normalization plays a crucial role in enhancing diversity in dataset distillation. By leveraging this observation, they proposed innovative techniques to incorporate batch normalization more effectively within the distillation process. The result is a significant improvement over state-of-the-art (SOTA) dataset distillation methods.
  2. The authors also conduct detailed ablation studies that demonstrate the effectiveness of each component of their proposed method. The contribution of each element is clearly shown, providing a comprehensive understanding of the method's strengths. Additionally, they show that the method's effectiveness is robust across different hyperparameter settings and datasets, which makes the method have wide practical applicability.

Weaknesses

  1. In the ablation study, the authors analyzed and justified the design choices for two key components: the decoupled $L_{var}$ coefficient and the Directed Weight Adjustment (DWA) scheme. Additionally, they proposed using real initialization. However, the study lacks an ablation analysis that separately evaluates the contribution of these three components to the overall performance of DWA. Such an ablation would provide a clearer understanding of the individual and combined effects of the decoupled $L_{var}$ coefficient, the Directed Weight Adjustment scheme, and real initialization on the method's efficacy.
  2. The paper primarily presents results using moderate to large data budgets (10 and 50 IPCs) for most datasets and 100 IPC for ImageNet-1K. Since the notion of “diversity” only becomes significant when each class has multiple prototypes, the proposed method’s benefits are limited to situations when the distillation budget is high. Overall, the proposed method brings sizable performance improvement but only for one SOTA method with specific data budget constraints.

Questions

  1. The paper considered using a directed perturbation to increase the diversity of the model. Is it equivalent or similar to using model trained on different random seeds?
  2. In Section 2, the authors point out that $\vec{x}$ and $\vec{s}$ are in the latent space, mapped by $g_{\theta}$; following the same notation, Eqn. 2 operates on the latent space and the trained classifier $f_{\theta}$. Is the random weight perturbation method proposed in Section 3.2 then applied only to the classifier part (Alg. 1 only refers to $f_{\theta}$), or is the entire network $g$ perturbed?
  3. In Section 3.2, how is the min/max taken over $L$? If I am not mistaken, $L$ is a gradient of the loss function with respect to $\theta$.

Limitations

Yes, the authors adequately addressed the limitations of their work.

Author Response

Thank you for your constructive comments! We give point-to-point replies to your questions in the following.

Q1: an ablation analysis.

A1: We conducted an ablation study on these three components using the CIFAR-100 dataset under the ipc=50 setting, as you suggested. The experimental results are listed in the following table:

| Real initialization | Directed weight adjustment | Decoupled $L_{var}$ | ACC. (%) |
| --- | --- | --- | --- |
|  |  |  | 47.9 |
| ✓ |  |  | 52.1 |
| ✓ | ✓ |  | 57.0 |
| ✓ |  | ✓ | 56.6 |
| ✓ | ✓ | ✓ | 60.3 |

Applying directed weight adjustment without real initialization is not suitable here because the solved $\widetilde{\Delta\theta}$ is equivalent to random noise if $\mathbb{B}$ is initialized from a random distribution (Equation 11), as studied in Table 3 of our submission.

The results illustrate that both directed weight adjustment and the decoupled $L_{var}$ contribute to the performance enhancement. Additionally, they work in an orthogonal manner to improve diversity without mutual interference.

Q2: the proposed method's benefits are limited to situations when the distillation budget is high.

A2: We would like to thank you for your insightful critique. We acknowledge that our proposed method performs worse in scenarios with a smaller distillation budget. However, the original intention of our work is to move further towards lossless distillation with a large distillation budget first.

The existing distillation methods are still inferior to traditional coreset selection methods in scenarios with a large distillation budget (when the size of the set is greater than 30% of the size of the original dataset). This is because existing distillation methods usually overfit to small distillation budget scenarios. For example, the kernel-ridge-regression method (KRR) [1] and the data factorization-based method (RTP) [2] achieve superior performance with ipc = 1 but underperform with ipc = 50. Simply increasing the distillation budget brings limited performance gain in most distillation methods.

Therefore, we chose SRE2L as our baseline, which underperforms with a small distillation budget (ipc = 1) but achieves SOTA performance with a large distillation budget and pioneers dataset distillation in ImageNet-1K. We provide additional results under various ipc settings using ImageNet-1k with ResNet-18, verifying that our method shows a clear advantage over SRE2L, regardless of the ipc setting.

| ipc | 1 | 5 | 10 | 20 | 30 | 40 | 50 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SRe2L | 0.1 | 15.2 | 21.6 | 33.6 | 37.5 | 42.4 | 46.8 |
| DWA (ours) | 3.4 | 20.5 | 37.9 | 47.4 | 51.4 | 53.6 | 55.2 |

We will incorporate your advice to explore the adaptability of our DWA method to more distillation methods for wider application.

Q3: Is it equivalent to using models trained on different random seeds?

A3: Thank you for suggesting this interesting investigation. We conducted experiments to compare our directed weight adjustment with models trained on different random seeds. We trained five ResNet models with different random seeds on the CIFAR-100 dataset. We used the setting of ipc = 50, meaning each model generated 10 images per class. Both approaches use the real initialization, as the directed weight adjustment is based on real initialization for directed perturbation. The results are listed below.

| Real initialization | Directed weight adjustment | 5 models with different random seeds | ACC. (%) | Avg. time used for generating 1 ipc |
| --- | --- | --- | --- | --- |
| ✓ |  |  | 52.1 | 116.58 s (100%) |
| ✓ |  | ✓ | 54.6 | 596.58 s (511%) |
| ✓ | ✓ |  | 57.0 | 125.12 s (107%) |

To some extent, the models trained with different random seeds are similar to our proposed directed perturbation, as they can enhance performance. However, training models with different random seeds is computationally expensive. Our approach incurs only about 7% additional computational overhead compared to the 411% additional computational overhead required for models trained with different random seeds.

Q4: Is the random weight perturbation method proposed in Section 3.2 applied only to the classifier part or to the entire network?

A4: The weight perturbation is actually applied to the entire network. However, $f_\theta$ refers only to the classifier. We clarified this in the footnote below Algorithm 1 in our original submission.

As stated in line 95 of our submission, "Throughout the paper, we explore the properties of synthesized datasets within the latent space". We made this assumption to theoretically prove the enhanced variance in latent space, because comparing variance at the pixel level is meaningless (the variance of each pixel is very large). Therefore, the feature extractor $g$ is assumed to be fixed and only the classifier $f$ is perturbed in Algorithm 1. We use the footnote to indicate that the operation is actually conducted over the entire network $h$ in the actual optimization process.

Thank you for pointing out this misunderstanding; we will revise the footnote and our assumptions to avoid it in our paper.

Q5: Definition of $L$

A5: We sincerely apologize for the typo. $L$ is the sum of instance-wise losses, not the gradients. We incorrectly added $\nabla\theta$ in Equations 11 and 12 in Section 3.3. The corrected definitions of $L$ should be

$$L_{\mathbb{B}}\left(f_{\theta_{\mathcal{T}} + \Delta\theta}\right) = \sum_{x_i \in \mathbb{B}} \ell\left(f_{\theta_{\mathcal{T}} + \Delta\theta}, x_i\right) \quad \text{(Equation 11)}$$

$$L_{\mathcal{T}}\left(f_{\theta_{\mathcal{T}}}\right) = \sum_{x_i \in \mathcal{T}} \ell\left(f_{\theta_{\mathcal{T}}}, x_i\right) \quad \text{(Equation 12)}$$

[1] Timothy Nguyen, et al. Dataset meta-learning from kernel ridge-regression. In Proc. Int. Conf. Learn. Represent. (ICLR), 2021.

[2] Zhiwei Deng, et al. Remember the past: Distilling datasets into addressable memories for neural networks. (NeurIPS), 2022.

Comment

Thank you for the detailed explanation and all the additional experiments! Addressing diversity in dataset distillation and aiming for lossless distillation is an important topic. I would like to keep my score of acceptance!

Comment

Thank you for your time! Your reviews have helped us improve our work, and we’re glad that the response meets your expectations!

Final Decision

The paper introduces a novel directed weight adjustment (DWA) mechanism to enhance diversity in dataset distillation. The reviewers collectively recognised the significance of this paper:

  • Addressing the critical challenge of diversifying synthetic datasets
  • Theoretical foundation
  • Empirical validation across multiple datasets and architectures
  • Detailed analysis of the method’s components
  • Minimal computational overhead and simplicity
  • Clear presentation

Most of the initial concerns were effectively addressed by the rebuttal, as confirmed by the reviewers.

The paper makes a substantial contribution to the field and merits consideration for a spotlight presentation.