PaperHub
5.8 / 10
Poster · 4 reviewers
Ratings: 7, 5, 6, 5 (min 5, max 7, std 0.8)
Confidence: 3.0
Correctness: 2.8
Contribution: 2.8
Presentation: 2.5
NeurIPS 2024

Elucidating the Design Space of Dataset Condensation

OpenReview · PDF
Submitted: 2024-04-25 · Updated: 2025-01-16
TL;DR

We propose a comprehensive designing-centric framework that includes specific, effective strategies. These strategies establish a benchmark for both small and large-scale dataset condensation.

Abstract

Dataset condensation, a concept within data-centric learning, aims to efficiently transfer critical attributes from an original dataset to a synthetic version, while maintaining both the diversity and realism of the syntheses. This approach can significantly improve model training efficiency and is also adaptable for multiple application areas. Previous methods in dataset condensation have faced several challenges: some incur high computational costs which limit scalability to larger datasets (e.g., MTT, DREAM, and TESLA), while others are restricted to less optimal design spaces, which could hinder potential improvements, especially in smaller datasets (e.g., SRe$^2$L, G-VBSM, and RDED). To address these limitations, we propose a comprehensive designing-centric framework that includes specific, effective strategies like implementing soft category-aware matching, adjusting the learning rate schedule and applying small batch-size. These strategies are grounded in both empirical evidence and theoretical backing. Our resulting approach, Elucidate Dataset Condensation (EDC), establishes a benchmark for both small and large-scale dataset condensation. In our testing, EDC achieves state-of-the-art accuracy, reaching 48.6% on ImageNet-1k with a ResNet-18 model at an IPC of 10, which corresponds to a compression ratio of 0.78%. This performance surpasses those of SRe$^2$L, G-VBSM, and RDED by margins of 27.3%, 17.2%, and 6.6%, respectively. Code is available at: https://github.com/shaoshitong/EDC.
Keywords
dataset condensation, efficient computer vision, design space

Reviews and Discussion

Review (Rating: 7)

The paper explores scalable dataset condensation techniques, introducing Elucidate Dataset Condensation (EDC) which integrates multiple design strategies such as soft category-aware matching and learning rate adjustments. These methods achieve state-of-the-art accuracy across different datasets, demonstrating improved efficacy and efficiency over previous methods.

Strengths

  1. The paper conducts a thorough investigation into effective strategies for broadening the design space of dataset distillation while also reducing computational costs. These strategies are underpinned by solid theoretical support, enhancing the robustness of the proposed approaches.
  2. Empirical results demonstrate substantial improvements across various datasets and models, underscoring the practical efficacy and applicability of the proposed methods.

Weaknesses

  1. The comparison experiments presented are not sufficiently comprehensive. Given that the baseline method RDED (which represents the state-of-the-art with convolutional architecture) was used, it is essential to supplement the comparison experiments of the proposed method with other methods that also utilize convolutional architectures. This would provide a more thorough evaluation of the proposed approach against the full spectrum of existing techniques.
  2. The provided code cannot be executed as it lacks several necessary packages and related details. Detailed instructions, including a complete list of dependencies and environment setup guidelines, are essential to ensure the reproducibility of the results and to enable other researchers to effectively utilize and build upon this work.

Questions

See weaknesses part.

Limitations

The limitations and societal impacts are not discussed in the paper.

Author Response

Thank you for your recognition and acknowledgement of the theoretical contributions of our work, and for sharing valuable suggestions. We hope we have addressed your concerns.

Q1: The comparison experiments presented are not sufficiently comprehensive.

A1: Thank you for your valuable suggestions. Due to space constraints in the main paper and for aesthetic reasons, we have not fully presented the experimental results of other methods. However, since the benchmark for dataset distillation is uniform and well-recognized, the performance of other algorithms can be found in their respective papers. We present the related experimental results of the popular convolutional architecture ResNet-18 in the following table:

| Dataset | IPC | MTT | TESLA | SRe$^2$L | G-VBSM | CDA | WMDD | RDED | EDC (Ours) |
|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | 1 | - | - | - | - | - | - | 22.9 ± 0.4 | 32.6 ± 0.1 |
| CIFAR-10 | 10 | 46.1 ± 1.4 | 48.9 ± 2.2 | 27.2 ± 0.4 | 53.5 ± 0.6 | - | - | 37.1 ± 0.3 | 79.1 ± 0.3 |
| CIFAR-10 | 50 | - | - | 47.5 ± 0.5 | 59.2 ± 0.4 | - | - | 62.1 ± 0.1 | 87.0 ± 0.1 |
| CIFAR-100 | 1 | - | - | 2.0 ± 0.2 | 25.9 ± 0.5 | - | - | 11.0 ± 0.3 | 39.7 ± 0.1 |
| CIFAR-100 | 10 | 26.8 ± 0.6 | 27.1 ± 0.7 | 31.6 ± 0.5 | 59.5 ± 0.4 | - | - | 42.6 ± 0.2 | 63.7 ± 0.3 |
| CIFAR-100 | 50 | - | - | 49.5 ± 0.3 | 65.0 ± 0.5 | - | - | 62.6 ± 0.1 | 68.6 ± 0.2 |
| Tiny-ImageNet | 1 | - | - | - | - | - | 7.6 ± 0.2 | 9.7 ± 0.4 | 39.2 ± 0.4 |
| Tiny-ImageNet | 10 | - | - | - | - | - | 41.8 ± 0.1 | 41.9 ± 0.2 | 51.2 ± 0.5 |
| Tiny-ImageNet | 50 | 28.0 ± 0.3 | - | 41.1 ± 0.4 | 47.6 ± 0.3 | 48.7 | 59.4 ± 0.5 | 58.2 ± 0.1 | 57.2 ± 0.2 |
| ImageNet-10 | 1 | - | - | - | - | - | - | 24.9 ± 0.5 | 45.2 ± 0.2 |
| ImageNet-10 | 10 | - | - | - | - | - | - | 53.3 ± 0.1 | 63.4 ± 0.2 |
| ImageNet-10 | 50 | - | - | - | - | - | - | 75.5 ± 0.5 | 82.2 ± 0.1 |
| ImageNet-1k | 1 | - | - | - | - | - | 3.2 ± 0.3 | 6.6 ± 0.2 | 12.8 ± 0.1 |
| ImageNet-1k | 10 | - | 17.8 ± 1.3 | 21.3 ± 0.6 | 31.4 ± 0.5 | - | 38.2 ± 0.2 | 42.0 ± 0.1 | 48.6 ± 0.3 |
| ImageNet-1k | 50 | - | 27.9 ± 1.2 | 46.8 ± 0.2 | 51.8 ± 0.4 | 53.5 | 57.6 ± 0.5 | 56.5 ± 0.1 | 58.0 ± 0.2 |

Q2: The provided code cannot be executed.

A2: We apologize for any distress our oversight may have caused. We have shared the link anonymously: EDC. Additionally, we have included instructions and pre-stored statistics in the new code to allow you to follow the steps and run it directly.

Q3: The limitations and societal impacts are not discussed in the paper.

A3: Thank you for highlighting this issue. In the original paper, we included the limitations and societal impacts in the appendix. In future versions, we will place these sections after the conclusion.

Comment

The authors have effectively addressed the majority of my concerns. Given the satisfactory responses and the improvements made to the manuscript, I am confident that the paper is now well-prepared for acceptance.

Comment

We are delighted that our response addressed your concerns. We appreciate your recognition and support. Your suggestions are invaluable to our work.

Review (Rating: 5)

This paper proposes a design framework to address the limitations of existing dataset condensation methods. Specifically, the authors have introduced some strategies, such as soft category-aware matching and learning rate scheduling. The authors have provided theoretical and empirical analysis of these strategies, proving the superiority of the proposed Elucidate Dataset Condensation (EDC) with extensive experiments.

Strengths

  1. It seems that this is a comprehensive work to address the existing problems in dataset condensation. I am not sure about this because the structure of the paper is a bit confusing, as I will note in the weaknesses of the paper.
  2. It seems that there is solid theoretical analysis in the paper. However, again, because of the confusing structure of the paper, I can hardly understand what these equations aim to prove.
  3. I acknowledge that the authors have conducted extensive experiments for evaluation and also put some effort into the visualization.

Weaknesses

The major weakness is the lack of fluency in the paper, leading to the confusing structure of the paper. More explanations are as follows.

  1. In line 27-41, the authors have compared bi-level and uni-level optimization paradigms. However, I am not sure what “bi-level” and “uni-level” mean, maybe because I am not an expert in dataset condensation.
  2. The authors also mentioned the bi-level paradigm limits the effectiveness, but did not explain why.
  3. The motivation of the entire work is vague. Regarding the limitations of previous works, the authors have mentioned the effectiveness (line 27), accuracy (line 36), potential information loss (line 38). After reading these, I expect to see how the authors tackled these issues. However, later in Section 3, the authors talked about other limitations, involving realism (line 110), matching mechanism (line 115), loss landscape (line 122), and hyperparameter settings (line 128). And then the authors proposed strategies to address these limitations, which sounds reasonable. However, in the end of Section 3, the authors talked about augmentation and backbone choice, which aim to address other specific issues, making me confused again.
  4. Figure 1 mentions CONFIG A to G, but I have no idea what they mean.

Questions

  1. Since the structure of the paper is confusing, I am not really sure what the primary novelty of this work is. I wish the authors can clearly emphasize their motivations and novelties.
  2. It seems that the authors have discussed the limitations of previous works from two levels. In the introduction, the authors talked about effectiveness, accuracy, etc., while in Section 3, the authors talked about realism, matching mechanism, etc. Can the authors clearly state the relationship between these limitations?

Limitations

From my point of view, the major limitation is the organization of the paper. I am not an expert in dataset distillation, but I try to understand the problems of previous works and the corresponding solutions in this paper. Unfortunately, I find it difficult to figure out these, though I think that this seems to be a solid work.

Author Response

Thank you for your thorough review and detailed suggestions on our paper's layout. We will accommodate all your comments in our revision.

Q1: The differences between “bi-level” and “uni-level”.

A1: The main difference between “bi-level” and “uni-level” is that “bi-level” requires updating the dataset and the model alternately, whereas “uni-level” only updates the dataset. Here is the algorithmic process for “bi-level”, taken from a well-known survey [1]:

Input: Original dataset $\mathcal{T}$.

Output: Synthetic dataset $\mathcal{S}$.

Initialize $\mathcal{S}$.

While not converged do

  Get a network $\theta$.

  Update $\theta$ via $\mathcal{S}$ or $\mathcal{T}$ and cache it if necessary.

  Update $\mathcal{S}$ via $L(\mathcal{S}, \mathcal{T})$. ($L$ denotes the loss function)

done

Return $\mathcal{S}$

[1] Dataset Distillation: A Comprehensive Review, 2023.

Q2: Explain the limitation of the bi-level paradigm.

A2: Bi-level optimization in dataset distillation requires alternately updating the synthetic data and the model (e.g., trajectory and gradient matching). This alternating process hinders its application to large-scale datasets. In contrast, statistical matching only necessitates optimizing the synthetic data, making it more effective and efficient than bi-level optimization.
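To make the contrast concrete, here is a minimal, self-contained PyTorch sketch with toy tensors and placeholder losses of our own choosing (not the paper's actual objectives): bi-level optimization alternates between updating a model and updating the synthetic set, while uni-level statistical matching pre-computes statistics of the original data once and then only updates the synthetic set.

```python
# Toy illustration of bi-level vs. uni-level optimization; losses are placeholders.
import torch

torch.manual_seed(0)
real = torch.randn(512, 16)                       # toy "original dataset" T
syn = torch.randn(8, 16, requires_grad=True)      # toy "synthetic dataset" S

# --- bi-level: alternate model updates and synthetic-data updates ---
# (the pseudocode above re-samples networks; we keep a single model for brevity)
model = torch.nn.Linear(16, 4)
opt_model = torch.optim.SGD(model.parameters(), lr=0.1)
opt_syn = torch.optim.SGD([syn], lr=0.1)
for _ in range(100):
    opt_model.zero_grad()
    model(syn.detach()).pow(2).mean().backward()  # inner step: update the model
    opt_model.step()
    opt_syn.zero_grad()
    # outer step: update S via a matching loss that depends on the current model
    (model(syn).mean(0) - model(real).mean(0)).pow(2).sum().backward()
    opt_syn.step()

# --- uni-level: pre-compute global statistics of T once, then only update S ---
mu, var = real.mean(0), real.var(0)               # global statistics, computed in advance
syn2 = torch.randn(8, 16, requires_grad=True)
opt = torch.optim.SGD([syn2], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = (syn2.mean(0) - mu).pow(2).sum() + (syn2.var(0) - var).pow(2).sum()
    loss.backward()
    opt.step()
```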

Q3: The motivation of the entire work is vague.

A3: Thank you for your concerns. The "effectiveness (line 27), accuracy (line 36), potential information loss (line 38)" you mention are issues that EDC addresses at a macro, general level. Specifically, we use statistical matching to ensure effectiveness, and a series of improvements presented in Fig. 1 to ensure the accuracy of EDC on both small- and large-scale datasets. We also compensate for the potential information loss caused by RDED through statistical matching.

By comparison, "realism (line 110), matching mechanism (line 115), loss landscape (line 122), and hyperparameter settings (line 128)" exemplify the limitations of previous work at a more detailed level. We address these shortcomings to ultimately ensure effectiveness, accuracy, and compensation for information loss.

The augmentation and backbone choices are improvements on the irrational hyperparameter settings (line 128) of past algorithms (i.e., G-VBSM, SRe$^2$L, and CDA), and related experiments and discussions can be found in the Appendix (lines 493-499, 679-691).

Q4: The meaning of CONFIG A to G.

A4: Your question is critical. The logical presentation of our work is somewhat challenging because we delve deeply into analyzing the limitations of past algorithms and propose 9 improvements. We borrowed the presentation style from ConvNeXt [2] and EDM [3], but it still caused confusion. Here, we describe the improvements included in CONFIG A to CONFIG G, and in the future we will provide this in the appendix:

CONFIG A: G-VBSM (our baseline).

CONFIG B: G-VBSM, real image initialization.

CONFIG C: G-VBSM, real image initialization, smoothing LR schedule.

⋮

CONFIG F: G-VBSM, real image initialization, smoothing LR schedule, flatness regularization, small batch size, better backbone choice, soft category-aware matching.

CONFIG G: G-VBSM, real image initialization, smoothing LR schedule, flatness regularization, small batch size, better backbone choice, soft category-aware matching, weak augmentation, ema-based evaluation.

[2] A ConvNet for the 2020s, 2022.

[3] Elucidating the Design Space of Diffusion-Based Generative Models, 2022.
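One way to read this ablation is as a cumulative set of switches over the G-VBSM baseline. The sketch below is illustrative Python only: the flag names are ours, and CONFIG D/E are left unspecified because the rebuttal elides them.

```python
# Illustrative encoding of the cumulative CONFIG A-G ablation; flag names are invented
# for readability and CONFIG D/E are omitted because the rebuttal elides their contents.
BASELINE = "G-VBSM"

CONFIGS = {
    "A": [],  # the G-VBSM baseline itself
    "B": ["real_image_init"],
    "C": ["real_image_init", "smoothing_lr_schedule"],
    # CONFIG D and E add further choices cumulatively (elided in the rebuttal).
    "F": ["real_image_init", "smoothing_lr_schedule", "flatness_regularization",
          "small_batch_size", "better_backbone_choice", "soft_category_aware_matching"],
    "G": ["real_image_init", "smoothing_lr_schedule", "flatness_regularization",
          "small_batch_size", "better_backbone_choice", "soft_category_aware_matching",
          "weak_augmentation", "ema_based_evaluation"],
}

def describe(config: str) -> str:
    """Human-readable summary of a config level."""
    choices = CONFIGS[config]
    return f"CONFIG {config}: {BASELINE}" + ("" if not choices else ", " + ", ".join(choices))

print(describe("G"))
```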

Q5: The authors can clearly emphasize motivations and novelties.

A5: Thanks for pointing this out. We are motivated by the discovery that (1) RDED causes information loss; (2) algorithms such as SRe$^2$L, G-VBSM, and CDA do not perform well on small-scale datasets; (3) traditional algorithms like MTT, Gradient Matching, and DataDAM do not scale well to large-scale datasets. Therefore, it is crucial to design a new algorithm to compensate for the information loss caused by RDED through a series of improvements, ensuring good generalization ability on both small and large-scale datasets.

The core novelty of our work is a comprehensive analysis of the limitations of past algorithms, followed by the proposed EDC that generalizes well to both small-scale (e.g., CIFAR-10/100) and large-scale (e.g., ImageNet-1k) datasets. No previous algorithm has performed well on both types of datasets. Up to now, EDC's performance remains the highest on CIFAR-10/100, Tiny-ImageNet, and ImageNet-1k, as you can confirm by comparing the performance of any paper in the field of dataset distillation.

In addition, this is the first time we focus simultaneously on data synthesis, soft label generation, and post-evaluation, whereas past algorithms generally focused only on data synthesis. Specifically for design choices, the core contributions of this paper include the real image initialization, soft category-aware matching, and the discussion and explanation of smoothing LR schedule and small batch size.

Q6: The relationship between two levels of limitation?

A6: As we replied in A3, ensuring “effectiveness, accuracy, etc.” is our overall goal, while “realism, matching mechanism, etc.” are the specific technical issues we need to address. The relationship between the two is that by guaranteeing realism and designing a more advanced matching mechanism (i.e., soft category-aware matching), we can achieve greater effectiveness and accuracy.

Q7: The major limitation is the organization of the paper, though I think that this seems to be a solid work.

A7: Thank you for your efforts and acknowledgement. We apologize for any confusion caused by the organization of our paper. In the future, we will correlate the two levels of limitations more clearly in the main text and add a "Related Work" section in the appendix for additional clarification.

Comment

Thanks for your detailed response. Good luck.

Comment

Thank you for your kind words! We will incorporate all the suggested improvements in our revision.

Review (Rating: 6)

This paper studies the combination of some techniques of data distillation (DD) in terms of data synthesis, soft label generation, and post-evaluation. The limitations of the existing methods, which are solved by these techniques, are provided. The extensive experiments verified the promising improvement of these techniques when using them with certain SOTA DD methods.

Strengths

  1. The limitations of the existing methods, which are solved by these techniques, are provided.
  2. For some design choices, a theoretical analysis is conducted.
  3. The performance of the proposed method is very promising, and the ablation study is sound.

Weaknesses

  1. The definition of generalized data synthesis is somewhat unclear. DM-based methods [1,2] can also efficiently conduct data synthesis on the ImageNet dataset. Statistical matching is essentially a form of distribution matching that uses second-order information, and similar approaches can be seen in certain DM-based methods [3,4]. Some discussion and comparison on this point are necessary.

  2. The solution for the limitation of irrational hyperparameter setting is very heuristic. Could the authors provide a theoretical analysis for it?

  3. Some statements lack clear evidence or explanation. For example, In Line 227, "our findings unfortunately demonstrate that various SM-based loss functions do not converge to zero. This failure to converge contradicts the basic premise that the first-order term in the Taylor expansion should equal zero." In Line 263, "smaller batch sizes" helps prevent model under-convergence during post-evaluation. In Line 267, "The key finding reveals that the minimum area threshold for cropping during data synthesis was too restrictive, thereby diminishing the quality of the condensed dataset".

[1] DataDAM: Efficient Dataset Distillation with Attention Matching. ICCV 2023

[2] DANCE: Dual-View Distribution Alignment for Dataset Condensation. IJCAI 2024

[3] M3D: Dataset Condensation by Minimizing Maximum Mean Discrepancy. AAAI 2024

[4] Exploiting Inter-sample and Inter-feature Relations in Dataset Distillation. CVPR 2024

Questions

See above.

Limitations

See above.

Author Response

We thank the reviewer for the constructive comments and valuable suggestions, such as pointing out several unclear definitions for us to clarify. Below, we provide detailed clarifications for each of the reviewer's questions.

Q1: The definition of generalized data synthesis is somewhat unclear.

A1: Thank you for raising these concerns. We adopted this definition to simplify the pipeline presentation of our work, but we did not anticipate that it would cause confusion.

"Generalized data synthesis" begins with training one or more models on the original dataset to obtain global statistics. These statistics are then used in the data synthesis stage with Eq. 2 to synthesize the condensed dataset, and finally, Eq. 4 is applied to generate the soft label.

As mentioned in our original paper, "generalized data synthesis" avoids inefficient bi-level optimization. Although DM-based methods are also highly efficient, they still involve both inner and outer loops, and extending them to the full 224x224 ImageNet remains challenging in terms of performance and efficiency.

Statistical matching is indeed formally similar to distribution matching, however, statistical matching is better suited to large datasets such as ImageNet-1k. One crucial reason is that statistical matching obtains global statistics across the comprehensive dataset by traversing it in advance, whereas distribution matching gathers local statistics during data synthesis, leading to inaccuracies within the local statistics. This is evidenced by the fact that the works you mentioned [3] and [4] did not experiment on the full 224x224 ImageNet.

Additionally, certain DM-based methods [3,4] differ from EDC. As shown in Eq. 5 and the experiments in the original paper, both Forms (1) and (2) are necessary, while [4] only applies to the case of $\alpha=0$.

Q2: Provide a theoretical analysis for the solution of irrational hyperparameter setting.

A2: Thank you for your insightful suggestion. The smoothing LR schedule is designed to address suboptimal solutions that arise due to the scarcity of samples in condensed datasets. Additionally, a small batch size is used because the gradient of the condensed dataset more closely resembles the global gradient of the original dataset, as illustrated at the bottom of Fig. 2 (c).

For the latter, we can provide a complete chain of theoretical derivation:

$$L_{syn} = \mathbb{E}_{c_i \sim C}\Big[ \big\| p_\theta(\mu \mid X^S, c_i) - p(\mu \mid X^T, c_i) \big\|_2 + \big\| p_\theta(\sigma^2 \mid X^S, c_i) - p(\sigma^2 \mid X^T, c_i) \big\|_2 \Big] \quad \text{(our statistical matching)}$$

$$\frac{\partial L_{syn}}{\partial \theta} = \int_{c_i} \frac{\partial L_{syn}}{\partial p_\theta(\cdot \mid X^S, c_i)}\, \frac{\partial p_\theta(\cdot \mid X^S, c_i)}{\partial \theta}\, \mathrm{d}c_i \approx \int_{c_i} \Big( \big[ p_\theta(\mu \mid X^S, c_i) - p(\mu \mid X^T, c_i) \big] + \big[ p_\theta(\sigma^2 \mid X^S, c_i) - p(\sigma^2 \mid X^T, c_i) \big] \Big)\, \frac{\partial p_\theta(\cdot \mid X^S, c_i)}{\partial \theta}\, \mathrm{d}c_i$$

where $p_\theta(\cdot \mid X^S, c_i)$ and $p(\cdot \mid X^T, c_i)$ refer to a Gaussian component in the Gaussian Mixture Model. Considering post-evaluation, we can derive the gradient of the MSE loss as:

$$\frac{\partial}{\partial \theta}\, \mathbb{E}_{x_i \sim X^S} \big\| f_\theta(x_i) - y_i \big\|_2^2 = 2\, \mathbb{E}_{x_i \sim X^S}\Big[ \big( f_\theta(x_i) - y_i \big)\, \frac{\partial f_\theta(x_i)}{\partial \theta} \Big]$$

$$= 2\, \mathbb{E}_{x_i \sim X^S}\Big[ \big( f_\theta(x_i) - y_i \big) \int_{c_i} \frac{\partial f_\theta(x_i)}{\partial p_\theta(\cdot \mid X^S, c_i)}\, \frac{\partial p_\theta(\cdot \mid X^S, c_i)}{\partial \theta}\, \mathrm{d}c_i \Big]$$

$$\approx 2\, \mathbb{E}_{(x_j, x_i) \sim (X^S, X^T)}\Big[ \big( f_\theta(x_j) - y_j \big) \int_{c_i} \frac{\partial f_\theta(x_i)}{\partial p_\theta(\cdot \mid X^T, c_i)}\, \frac{\partial p_\theta(\cdot \mid X^T, c_i)}{\partial \theta}\, \mathrm{d}c_i \Big]$$

$$\approx \frac{\partial}{\partial \theta}\, \mathbb{E}_{x_i \sim X^T} \big\| f_\theta(x_i) - y_i \big\|_2^2,$$

where $\theta$ stands for the model parameters. The right-hand factor of the penultimate row results from the loss $L_{syn}$, which ensures the consistency of $p_\theta(\cdot \mid X^T, c_i)$ and $p_\theta(\cdot \mid X^S, c_i)$. If the model initialization during training is the same, the left-hand factor of the penultimate row is a scalar and has little influence on the direction of the gradient. Since $X^T$ is the complete original dataset with a global gradient, the gradient of $X^S$ approximates the global gradient of $X^T$, thus enabling the use of a small batch size.
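To make the matching objective above concrete, the following is a minimal PyTorch sketch of a per-class mean-and-variance matching loss in the spirit of $L_{syn}$. It is our own simplification, assuming a single frozen feature extractor and flattened channel features; the actual EDC pipeline matches statistics across multiple backbones and layers.

```python
# Minimal sketch of per-class mean/variance statistical matching (not the authors' code).
import torch

def per_class_channel_stats(features: torch.Tensor, labels: torch.Tensor, num_classes: int):
    """Per-class mean/variance of channel features. features: (N, C), labels: (N,)."""
    means, variances = [], []
    for c in range(num_classes):
        f_c = features[labels == c]
        means.append(f_c.mean(dim=0))
        variances.append(f_c.var(dim=0, unbiased=False))
    return torch.stack(means), torch.stack(variances)

def statistical_matching_loss(syn_feats, syn_labels, real_mu, real_var, num_classes):
    """L2 distance between per-class (mu, sigma^2) of synthetic features and the
    pre-computed global statistics of the original dataset."""
    syn_mu, syn_var = per_class_channel_stats(syn_feats, syn_labels, num_classes)
    return ((syn_mu - real_mu).norm(dim=1) + (syn_var - real_var).norm(dim=1)).mean()

# Toy usage: random tensors stand in for a frozen backbone's feature maps.
torch.manual_seed(0)
num_classes, dim = 4, 32
real_feats = torch.randn(1000, dim)
real_labels = torch.randint(0, num_classes, (1000,))
real_mu, real_var = per_class_channel_stats(real_feats, real_labels, num_classes)  # global stats

syn_feats = torch.randn(40, dim, requires_grad=True)
syn_labels = torch.arange(num_classes).repeat_interleave(10)
loss = statistical_matching_loss(syn_feats, syn_labels, real_mu, real_var, num_classes)
loss.backward()  # gradients flow only into the synthetic features (uni-level)
```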

Q3: Some statements lack clear evidence or explanation.

A3: Thank you for your suggestions to help improve the quality of our manuscript. Here, we clarify the parts you pointed out as unclear, and we will double-check the full paper in future releases to make sure there are no relevant issues:

Line 227: As shown in Lines 122-127, the constraint on flatness needs to ensure that the first-order term $(\theta-\theta^*)^\mathrm{T}\nabla_\theta L(\theta^*)$ of the Taylor expansion equals zero, indicating normal model convergence. However, our exploratory experiments found that, despite the good performance of EDC, the statistical matching loss at the end of data synthesis still fluctuated significantly and did not reach zero. Therefore, we only enforced flatness at the logit level.

Line 263: Since the gradient of the condensed dataset can approximate the global gradient of the original dataset, the inaccurate-gradient-direction problem introduced by a small batch size becomes less problematic. Instead, using a small batch size effectively increases the number of iterations, thereby helping to prevent model under-convergence.

Line 267: The implementation of this crop operation refers to torchvision.transforms.RandomResizedCrop, where the minimum area threshold is controlled by the parameter scale[0]. The default value is 0.08, meaning that the cropped image can be as small as 8% of the original image. Since 0.08 is too small for the model to extract complete semantic information during data synthesis, increasing the value to 0.5 resulted in a significant performance gain.
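For reference, the parameter in question is the lower bound of scale in torchvision.transforms.RandomResizedCrop; the snippet below illustrates the change (the target size 224 is an example, not a statement of the authors' exact setting).

```python
from torchvision import transforms

# Default lower bound (scale[0] = 0.08) allows crops covering as little as 8% of the image.
default_crop = transforms.RandomResizedCrop(224, scale=(0.08, 1.0))

# Raising the lower bound to 0.5 keeps at least half of the image area in every crop,
# which the authors report preserves more complete semantics during data synthesis.
edc_crop = transforms.RandomResizedCrop(224, scale=(0.5, 1.0))
```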

Comment

We hope our prior response has addressed your main concerns. To make our rebuttal more comprehensive and convincing, we would like to further clarify Weakness 1 by specifically listing the differences between the published papers [1, 2, 3, 4] and our work.

DataDAM [1] vs. EDC

  1. Both DataDAM and EDC do not require model parameter updates during training. However, DataDAM struggles to generalize effectively to ImageNet-1k because it relies on randomly initialized models for distribution matching. As noted in SRe$^2$L, models trained for fewer than 50 epochs can experience significant performance degradation.

  2. DataDAM does not explore the soft label generation and post-evaluation phases as EDC does, limiting its competitiveness.

DANCE [2] vs. EDC

  1. DANCE is a DM-based algorithm that, unlike traditional distribution matching, does not require model updates during data synthesis. Instead, it interpolates between pre-trained and randomly initialized models, using this interpolated model for distribution matching. Similarly, EDC also does not need to update the model parameters, but it uses a pre-trained model with a different architecture and does not incorporate random interpolation. The "random interpolation" technique was not adopted because it did not yield performance gains on ImageNet-1k.
  2. Although DANCE considers both intra-class and inter-class perspectives, it limits inter-class analysis to the logit level and intra-class analysis to the feature map level. In contrast, EDC performs both intra-class and inter-class matching at the feature map level, where inter-class matching is crucial. To support this, last year, SRe$^2$L focused solely on inter-class matching at the feature map level and still achieved state-of-the-art performance on ImageNet-1k.
  3. EDC is the first dataset distillation algorithm to simultaneously improve data synthesis, soft label generation, and post-evaluation stages. In contrast, DANCE only addresses the data synthesis stage. While we agree with the reviewer that DANCE can be effectively applied to ImageNet-1k, the introduction of soft label generation and post-evaluation improvements is essential for DANCE to achieve more competitive results.

M3D [3] vs. EDC

  1. M3D is a DM-based algorithm, but its data synthesis paradigm aligns with DataDAM by relying solely on randomly initialized models, which limits its generalization to ImageNet-1k.
  2. M3D, similar to SRe$^2$L, G-VBSM, and EDC, takes into account second-order information (variance), but this is not a unique contribution of EDC. The key contributions of EDC in data synthesis are real image initialization, flatness regularization, and the consideration of both intra-class and inter-class matching.

Deng et al. [4] vs. EDC

  1. Deng et al. [4] is a DM-based algorithm, but its data synthesis paradigm is consistent with M3D and DataDAM, as it considers only randomly initialized models, which cannot be generalized to ImageNet-1k.
  2. Deng et al. [4] considers both interclass and intraclass information, similar to EDC. However, while EDC obtains interclass information by traversing the entire training set, Deng et al. [4] derives interclass information from only one batch, making its information richness inferior to that of EDC.
  3. Deng et al. [4] only explores data synthesis and does not explore soft label generation or post-evaluation. Additionally, Deng et al. [4] only shares some similarity with Soft Category-Aware Matching among the 10 design choices in EDC.

We thank the reviewer for highlighting these relevant papers, and we will include them in our references to further enrich our related work.

[1] DataDAM: Efficient Dataset Distillation with Attention Matching. ICCV 2023

[2] DANCE: Dual-View Distribution Alignment for Dataset Condensation. IJCAI 2024

[3] M3D: Dataset Condensation by Minimizing Maximum Mean Discrepancy. AAAI 2024

[4] Exploiting Inter-sample and Inter-feature Relations in Dataset Distillation. CVPR 2024

Comment

We appreciate the reviewer's suggestions, and we agree with the point that "DM-based methods can be efficiently applied to ImageNet-1k". This perspective is supported by our official comment: "Additional Information, DANCE vs. EDC, ...While we agree with the reviewer that DANCE can be effectively applied to ImageNet-1k...". That said, the performance of these methods may not be optimal, but it can be improved by incorporating the techniques we proposed for soft label generation and post-evaluation.

Furthermore, we believe that these DM-based methods should be included in our references to refine the definition of generalized data synthesis. DM-based methods, which do not require an inner loop, can be regarded as a subset of generalized data synthesis. However, unlike distribution matching, generalized data synthesis also allows data synthesis to be performed on a variety of pre-trained models (e.g., ResNet, MobileNet, and EfficientNet). This approach, as demonstrated in G-VBSM [2], exhibits significant generalization ability on the cross-architecture task.

We respond to the three points you mentioned one by one:

  1. We agree that some DM-based methods do not require inner loops. As mentioned in our official comment "Additional Information", DataDAM, DANCE, and M3D do not include an inner loop. However, when paired with the design choices proposed in our paper, specifically in the context of soft label generation and post-evaluation, and the use of a pre-trained teacher model similar to SRe$^2$L [1], these methods may achieve better performance on the full 224x224 ImageNet-1k.
  2. While DM and DataDAM did work on the full ImageNet-1k dataset, they only considered a 64x64 resolution, and we used a standard 224x224 resolution. As outlined in SRe$^2$L [1], G-VBSM [2], RDED [3], and CDA [4], ImageNet-1k should refer to the full 224x224 ImageNet.
  3. We agree that local statistics can be seen as efficient estimations of global statistics in batch size. In fact, we tried this scenario in our implementation (i.e., our submitted code). You can change the setting in recover/recover.sh by modifying --category-aware "global" to --category-aware "local". However, we found that with very small IPCs, especially IPC 1, global statistics enabled ResNet-18 to achieve up to 12.8% accuracy on the full 224x224 ImageNet, while local statistics only achieved up to 9.4%. Therefore, we used global statistics in our experiments.
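As a rough illustration of the distinction (our own sketch, not the authors' recover.sh implementation): global statistics are accumulated over the full original dataset in a preliminary pass, whereas local statistics are re-estimated from whatever batch is at hand during synthesis and are therefore noisier.

```python
# Sketch contrasting global (full-dataset) and local (per-batch) per-class statistics.
import torch

def global_stats(loader, num_classes, dim):
    """One preliminary pass over the full dataset: exact per-class mean/variance."""
    sums = torch.zeros(num_classes, dim)
    sq_sums = torch.zeros(num_classes, dim)
    counts = torch.zeros(num_classes, 1)
    for feats, labels in loader:                  # feats: (B, dim), labels: (B,)
        for c in labels.unique():
            f_c = feats[labels == c]
            sums[c] += f_c.sum(0)
            sq_sums[c] += (f_c ** 2).sum(0)
            counts[c] += f_c.shape[0]
    mu = sums / counts
    var = sq_sums / counts - mu ** 2
    return mu, var

def local_stats(feats, labels, num_classes):
    """Per-batch estimate used by classic distribution matching (assumes every class
    appears in the batch); noisier for small batches."""
    mu = torch.stack([feats[labels == c].mean(0) for c in range(num_classes)])
    var = torch.stack([feats[labels == c].var(0, unbiased=False) for c in range(num_classes)])
    return mu, var

# Toy usage: two "batches" of random features stand in for a dataset pass.
batches = [(torch.randn(64, 16), torch.randint(0, 4, (64,))) for _ in range(2)]
mu_g, var_g = global_stats(batches, num_classes=4, dim=16)
mu_l, var_l = local_stats(*batches[0], num_classes=4)  # batch-dependent estimate
```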

Overall, we strongly agree that DM-based methods, including DataDAM, DANCE and M3D are efficient on the ImageNet-1k. Furthermore, we will also add those methods to the definition of generalized data synthesis and cite them.

[1] Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale From A New Perspective, NeurIPS 2023.

[2] Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching, CVPR 2024.

[3] On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm, CVPR 2024.

[4] Dataset Distillation in Large Data Era, arXiv 2024.

Comment

Thanks for the detailed response of the authors. It addressed all my major concerns. I decide to increase my score to 6. I hope the authors can polish the presentation of this work as promised.

Comment

Thank you for your thoughtful comments and for raising the score. We will work on polishing the presentation promptly. Your feedback is invaluable, and we greatly appreciate your review and suggestions.

Comment

Thanks for the response and additional Information. I appreciate that the authors discuss the difference between the proposed method and existing DM-based methods, while I still have the major concern about the definition of generalized data synthesis. That is to say, I don't think that DM-based methods have the efficiency problem for large-scale datasets like full ImageNet datasets. There is no evidence to support it:

  • First, most DM-based methods, like DataDAM [1], DANCE [2], and Deng et al. [4], just match the feature distribution with various networks and do not have the inner loop.
  • Second, even those DM-based methods that have both an outer loop and an inner loop do not actually belong to inefficient bi-level optimization [5]. Specifically, the outer loop of M3D [3] is used to change networks, and the inner loop only decides the number of matching iterations on each network, which does not change the linear computational complexity with respect to the real data size and network parameters. In contrast, bi-level optimization [5] requires nested gradient updates, resulting in quadratic computational complexity with respect to the real data size and network parameters. Note that DM and DataDAM [1] have been conducted on the full ImageNet dataset, and the training cost analyses in the M3D [3] and DANCE [2] papers also show that they do not incur large computational consumption compared to DM and DataDAM.
  • Third, I think local statistics can actually be seen as efficient batch-level estimations of global statistics, which does not change the learning target, i.e., matching the distribution of real and condensed data.

Overall, in my opinion, the authors need to include DM-based methods in generalized data synthesis or reconsider the definition of generalized data synthesis.

[1] DataDAM: Efficient Dataset Distillation with Attention Matching. ICCV 2023

[2] DANCE: Dual-View Distribution Alignment for Dataset Condensation. IJCAI 2024

[3] M3D: Dataset Condensation by Minimizing Maximum Mean Discrepancy. AAAI 2024

[4] Exploiting Inter-sample and Inter-feature Relations in Dataset Distillation. CVPR 2024

[5] Investigating bi-level optimization for learning and vision from a unified perspective: A survey and beyond. TPAMI 2021

Review (Rating: 5)

The authors address the limitations of previous methods, such as high computational costs and less optimal design spaces, by proposing a novel framework called Elucidate Dataset Condensation (EDC). EDC incorporates strategies like soft category-aware matching and a smoothing learning rate schedule, achieving state-of-the-art accuracy with a significant improvement over existing methods. The paper also provides empirical and theoretical insights into the design decisions made, demonstrating EDC's effectiveness across various datasets and model architectures.

Strengths

  • Comprehensive Analysis: The authors systematically examine the design space, leading to a nuanced understanding of dataset condensation and the impact of various factors on performance.
  • State-of-the-Art Performance: The reported improvements in accuracy, especially on ImageNet-1k with a ResNet-18 model, are substantial and demonstrate a clear strength of the proposed method.

Weaknesses

  1. Scalability Concerns: While EDC shows promise, the paper might not fully address how the method scales with significantly larger datasets or higher dimensions of data.
  2. Potential Information Loss: The training-free distillation paradigm, while efficient, could potentially lead to information loss, which is not deeply explored in the paper.
  3. Complexity of Implementation: The paper could benefit from a more detailed discussion on the practical implementation of EDC, including computational resources and potential challenges.
  4. Generalization to Other Domains: The paper primarily focuses on image datasets; it is unclear how well EDC's strategies would generalize to other data domains, such as text or time-series data.

Questions

  1. How does EDC perform when faced with adversarial examples or noisy data, and what measures are taken to ensure model robustness?
  2. Is there a plan to extend EDC to other domains beyond image recognition, and what challenges might arise in such extensions?

Limitations

N/A

Author Response

We appreciate the reviewer's recognition of our comprehensive analysis and SOTA performance, as well as the valuable suggestions for improvement. We hope our responses can address your concerns effectively.

Q1: Scalability Concerns.

| SRe$^2$L | CDA | RDED | Ours | Original Dataset |
|---|---|---|---|---|
| 18.5 | 22.6 | 25.6 | 26.8 | 38.5 |

A1: Thanks for raising these concerns. We conducted experiments on the larger-scale ImageNet-21k-P dataset with IPC 10. The results in the table above indicate that our method outperforms the state-of-the-art method CDA [1] on this dataset, demonstrating that EDC can scale to larger datasets.

[1] Dataset Distillation in Large Data Era, 2024.

Q2: Potential Information Loss.

A2: Thanks for pointing this out. RDED, as a training-free distillation paradigm, initially downsamples high-resolution images to obtain low-resolution images. Then, RDED concatenates the low-resolution images to produce the condensed data. This paradigm inevitably loses some fine-grained information due to the downsampling operation.

EDC compensates for the information loss of RDED through training-dependent data synthesis, i.e., the statistical information of the condensed dataset on different feature maps remains the same as that of the original dataset. This is our advantage over the training-free distillation paradigm.

Q3: Complexity of Implementation.

| Configuration | GPU Memory (GB per GPU) | Time Spent (hours) | Top-1 Accuracy (%) |
|---|---|---|---|
| CONFIG A | 4.61 | 69.77 | 31.4 |
| CONFIG B | 4.61 | 64.89 | 34.4 |
| CONFIG C | 4.61 | 64.89 | 38.7 |
| CONFIG D | 4.61 | 64.91 | 39.5 |
| CONFIG E | 4.69 | 74.91 | 46.2 |
| CONFIG F | 4.92 | 35.11 | 48.0 |
| CONFIG G | 4.92 | 35.11 | 48.6 |

Table: Comparison of computational resources on 4×RTX 4090.

A3: EDC is an efficient algorithm, as it halves the number of iterations compared to the baseline G-VBSM. As illustrated in the table above, although transitioning from CONFIG A to CONFIG G adds a small GPU memory overhead, this is minor compared to the reduction in time spent. Additionally, introducing EDC to other tasks often requires significant effort for tuning hyper-parameters or even redesigning statistical matching, which is a challenge EDC should address.

We will add the above table in the revised version.

Q4: Generalization to Other Domains.

| Ratio (r) | Random | Herding | K-Center | GCOND-X | GCOND | Ours |
|---|---|---|---|---|---|---|
| 1.3% | 63.6 | 67.0 | 64.0 | 75.9 | 79.8 | 80.1 |
| 2.6% | 72.8 | 73.4 | 73.2 | 75.7 | 80.1 | 81.0 |
| 5.2% | 76.8 | 76.8 | 76.7 | 76.0 | 79.3 | 81.0 |

Table: EDC is performed with both SGC and GCN and evaluated using GCN on the Cora graph dataset.

A4: With careful redesign of statistical matching, EDC can be extended to other domains, such as graphs. We convert the graph data $G \in \mathbb{R}^{n\times d}$ ($n$ is the number of nodes and $d$ is the feature length) into an image-like format. Specifically, we first derive M = G[None, ..., None].permute(0, 2, 1, 3), then compute the feature map K = cosine(M, M.permute(0, 1, 3, 2)), and finally apply a distillation paradigm similar to that used for image data. We also introduce soft labels during post-evaluation. According to the table above, EDC performs well on the graph classification task, revealing that EDC can generalize to other domains.
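Below is a minimal PyTorch sketch of the conversion described above, based on our reading of the rebuttal; the tensor shapes and the cosine-similarity reduction dimension are assumptions. Node features $G \in \mathbb{R}^{n\times d}$ become an $n \times n$ cosine-similarity map that can be treated as a one-channel image and fed to the image-style pipeline.

```python
# Sketch of the graph-to-image-like conversion; shapes and the reduction dim are assumed.
import torch
import torch.nn.functional as F

def graph_to_image_like(G: torch.Tensor) -> torch.Tensor:
    """G: (n, d) node-feature matrix -> (1, 1, n, n) cosine-similarity 'image'."""
    M = G[None, ..., None].permute(0, 2, 1, 3)                  # (1, d, n, 1)
    K = F.cosine_similarity(M, M.permute(0, 1, 3, 2), dim=1)    # broadcast -> (1, n, n)
    return K.unsqueeze(1)                                       # (1, 1, n, n), image-like

# Toy usage with random node features standing in for Cora.
nodes, feat_dim = 6, 8
G = torch.randn(nodes, feat_dim)
img_like = graph_to_image_like(G)
print(img_like.shape)  # torch.Size([1, 1, 6, 6])
```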

Q5: How robust is EDC, and how is robustness ensured?

| Attack Method / DD Method | MTT | SRe$^2$L | EDC (Ours) |
|---|---|---|---|
| Clean Accuracy | 26.16 | 43.24 | 57.21 |
| FGSM | 1.82 | 5.73 | 12.39 |
| PGD | 0.41 | 2.70 | 10.71 |
| CW | 0.36 | 2.94 | 5.27 |
| VMI | 0.42 | 2.60 | 10.73 |
| Jitter | 0.40 | 2.72 | 10.64 |
| AutoAttack | 0.26 | 1.73 | 7.94 |

Table: Comparison with baseline models with ResNet-18. The perturbation budget is set to $|\epsilon| = 2/255$.

A5: We follow the pipeline in [2] to evaluate the robustness of models trained on condensed datasets, utilizing the well-known adversarial attack library available at [3]. Our experiments are conducted on Tiny-ImageNet with IPC 50, with the test accuracy presented in the table above. Evidently, EDC demonstrates significantly higher robustness compared to other methods. We attribute this to improvements in post-evaluation techniques, such as EMA-based evaluation and smoothing LR schedule, which help reduce the sharpness of the loss landscape.

[2] DD-RobustBench: An Adversarial Robustness Benchmark for Dataset Distillation, 2024.

[3] https://github.com/Harry24k/adversarial-attacks-pytorch
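For context, a hedged sketch of how such an evaluation can be run with the torchattacks library at [3]; the dummy model and data below are placeholders, and the PGD step size and iteration count are assumptions rather than the authors' settings.

```python
import torch
import torchattacks  # https://github.com/Harry24k/adversarial-attacks-pytorch (the library at [3])

def robust_accuracy(model, loader, attack):
    """Top-1 accuracy on adversarially perturbed inputs."""
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        adv_images = attack(images, labels)            # craft adversarial examples
        with torch.no_grad():
            preds = model(adv_images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return 100.0 * correct / total

# Dummy model/data for illustration; in the actual evaluation these would be the
# ResNet-18 trained on the condensed Tiny-ImageNet (IPC 50) and its test loader.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 200))
loader = [(torch.rand(8, 3, 64, 64), torch.randint(0, 200, (8,)))]

eps = 2 / 255
attacks = {
    "FGSM": torchattacks.FGSM(model, eps=eps),
    "PGD": torchattacks.PGD(model, eps=eps, alpha=eps / 4, steps=10),  # alpha/steps assumed
}
for name, atk in attacks.items():
    print(name, robust_accuracy(model, loader, atk))
```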

Q6: How to extend EDC to other domains?

A6: Thank you for your interesting question. As we replied in A4, introducing EDC to graph classification tasks is feasible. In the future, we expect to extend EDC to multimodal tasks. We argue that the biggest challenge in extending EDC is the inconsistent data formats across domains, which necessitates redesigning statistical matching to make EDC adaptable to the target task.

Comment

Thanks for your response. I would like to keep my ratings at this moment.

Author Response

We thank all the reviewers for their insightful comments and suggestions. We are pleased that our work received positive evaluations, with comments such as "Comprehensive Analysis" (gMw3), "a theoretical analysis is conducted" (r3SQ), "solid theoretical analysis" (ZPCz), and "conducts a thorough investigation into effective strategies for broadening the design space of dataset distillation while also reducing computational costs" (RdSF).

The reviewers also raised the following important points:

  1. The structure of the paper is a bit confusing.

  2. The definition of "generalized data synthesis" and some other parts of the paper are not clear.

  3. The submitted code does not execute properly.

  4. Whether EDC can scale to larger datasets and to other domains besides images.

Through this rebuttal, we aim to address unclear aspects of the presentation and typography, provide directly runnable code for EDC (a bit large because it contains pre-stored statistics), and demonstrate EDC's ability to generalize to ImageNet-21k-P (a larger dataset) and graph data (another domain). Additionally, we will revise the manuscript by incorporating the reviewers' detailed comments.

Final Decision

The paper studies dataset distillation and shows that with several proposed techniques, it is possible to achieve a remarkable 48.6% accuracy on ImageNet-1K with IPC=10, which substantially surpasses prior state-of-the-art results. While the degree of novelty may be somewhat limited, the reviewers unanimously commend the paper's thorough exploration of the shortcomings inherent in existing dataset distillation algorithms and its introduction of several effective methods to address these limitations. Therefore, we recommend acceptance of this paper.