Color-Oriented Redundancy Reduction in Dataset Distillation
A novel parameterization method for data distillation by exploring the redundancy in color space.
Abstract
Reviews and Discussion
The authors propose AutoPalette, which reduces color redundancy in dataset distillation. They use a palette network and color-guided initialization to enhance training efficiency and performance by minimizing redundant color information in synthetic images and datasets.
Strengths
Color redundancy is a fundamental aspect of natural scene images but is often overlooked in large-scale image analysis. This study addresses this gap, and the proposed method is effective.
Weaknesses
- In the abstract, the authors summarize their framework as one that minimizes color redundancy at the individual image and overall dataset levels. I think that's a good summary. However, the description is not used when they introduce their framework in the main text. Although they describe it in the last section, it would be better to include the summary earlier in the main text, e.g., when introducing the overview or Figure 1.
- I am a little confused about the definition of the color bit in this manuscript. The authors often describe 8 bits for the original image (e.g., Figure 2). However, if the color bit is based on the number of color palettes, the original image should have 24 bits.
- Typo: "can encoded in fewer bits" should be "can be encoded".
Questions
- When viewing the condensed images in the appendices, the categories of the CIFAR images are hard to recognize perceptually, but those in Figures 7-9 are easy. I'm wondering why this perceptual difference emerges.
- How did you decide the parameters alpha, beta, and gamma in the experiments?
Limitations
- The evaluation was mainly based on a relatively small number of image datasets. I'm not sure to what extent the condensed images change when applied to recent large-scale image datasets.
Weakness 1: In the abstract, the authors summarize their framework as the one that minimizes color redundancy at the individual image and overall dataset levels. I think that’s a good summary. However, the description is not utilized when they introduce their framework in the main text. Although they describe it in the last section, it would be better to include the summary in the middle of the main, e.g., when introducing an overview or Figure 1.
Thank you for your valuable suggestion. We acknowledge that providing a summary of our framework earlier in the main text would improve the clarity and flow of our paper. In the revised version, we will include a concise summary of the framework before introducing the components to give readers a better understanding upfront. In addition, we appreciate your attention to detail and have noted the typo you pointed out. We will also correct this in the revised version.
Weakness 2: I am confused a little about the definition of the color bit in this manuscript. The authors often describe the 8-bits for the original image (e.g., Figure 2). However, if the color bit is based on the number of color palettes, the original image should have 24 bits.
Thanks for highlighting this point of confusion. The 8 bits we refer to represent the storage space for each channel (Red, Green, Blue) of each pixel in the image. When considering all three RGB channels together, the total storage space indeed amounts to 24 bits per pixel. Thus, both descriptions refer to the same concept of image representation, just from slightly different perspectives. We will clarify this in the manuscript to ensure that readers understand the distinction between the per-channel and total bit depth for image representation.
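As a small illustration of the two views (per-channel vs. per-pixel bit depth), and of why a reduced palette can be stored in fewer bits, consider this sketch; the helper names are ours, not from the paper:

```python
import math

def bits_per_pixel_full(bits_per_channel=8, channels=3):
    # 8 bits for each of R, G, B -> 24 bits per pixel in total
    return bits_per_channel * channels

def bits_per_pixel_palette(num_colors):
    # with a K-color palette, each pixel stores only an index into the palette
    return math.ceil(math.log2(num_colors))

print(bits_per_pixel_full())       # 24
print(bits_per_pixel_palette(64))  # 6
```

So the "8-bit" and "24-bit" descriptions refer to the same representation viewed per channel and per pixel, respectively, while a palette of, say, 64 colors needs only 6 bits per pixel.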
Weakness 3: Typo: "> can encoded in fewer bits” should be "can be encoded”
Thank you for pointing out the typo. We will correct this in the revised version of our paper.
Question 1: While watching the condensed images in the Appendices, the CIFAR images are hard to perceptually recognize categories, but easy for Figures 7-9. I’m wondering why this perceptual difference emerges.
Thank you for your observation. We hypothesize that this perceptual difference arises from the resolution difference between CIFAR images (32×32) and the ImageNet subset images (128×128). CIFAR images have a lower resolution, which makes the resulting synthetic images appear blurrier and less distinct, similar to an aliasing effect in low-resolution images. Another contributing factor could be the reduction in the number of colors: with fewer colors, it can be harder for humans to perceptually recognize objects, as color plays a significant role in distinguishing and identifying visual elements. On the other hand, ImageNet images, with their higher resolution, can capture more details, allowing the synthetic images to represent objects more clearly and making their categories easier to recognize. We empirically find that visualizations from other baseline distillation methods, such as DM and TM, show a similar pattern: image resolution impacts the perceptual clarity and recognizability of synthetic images.
Question 2: How did you decide the parameters, alpha, beta, and gamma in the experiments?
Thank you for your question regarding the hyper-parameter settings. We conducted experiments to evaluate the sensitivity of our method to various values of these hyper-parameters. Specifically, we applied our method to distribution matching [1] with 10 IPC to test parameter sensitivity. In these experiments, we varied one parameter while keeping the other two fixed. The results, shown in the tables below, indicate relatively stable performance across a range of α, β, and γ values. We observe that γ has a slightly higher effect than the other two parameters. For instance, when γ is set to 0.5 and 1.25, the test performance is 58.08% and 58.56%, respectively. As γ increases, performance improves and converges to around 60.9%. For the other two parameters, α and β, the sensitivity is lower, with optimal performance observed when both are around 3. This indicates that these parameters can be set within a reasonable range without significantly impacting the results.
| α | 0.3 | 0.75 | 1 | 1.5 | 3.0 | 6.0 |
|---|---|---|---|---|---|---|
| performance (%) | 60.63 | 60.8 | 60.9 | 60.77 | 60.91 | 60.94 |
| β | 0.3 | 0.75 | 1 | 1.5 | 3.0 | 6.0 |
|---|---|---|---|---|---|---|
| performance (%) | 60.94 | 61.42 | 60.9 | 60.78 | 60.9 | 60.95 |
| γ | 0.5 | 1.25 | 2.5 | 3.0 | 5.0 | 10.0 |
|---|---|---|---|---|---|---|
| performance (%) | 58.08 | 58.56 | 60.19 | 60.9 | 60.6 | 60.9 |
Limitation: The evaluation was mainly based on a relatively small number of image datasets. I’m not sure to what extent the condensed images change when applying recent large-scale image datasets.
We appreciate the concern regarding the need for experiments on large-scale datasets. In addition to the experiments conducted on CIFAR10 and CIFAR100, we have applied our method to higher-resolution (128×128) subsets of the ImageNet dataset, such as ImageNette and ImageWoof, as demonstrated in Table 2 of our paper. To further address the need for experiments on datasets with more classes, we conducted additional experiments during the rebuttal period on the Tiny ImageNet dataset, which contains 200 classes of 64×64 images. Please kindly refer to our response to R1W2 (Reviewer 5XJK).
[1] Dataset Condensation with Distribution Matching, Bo Zhao et al.
Dear Reviewer,
Thank you for your detailed and positive feedback on our paper.
We hope we have addressed your comments in our rebuttal and would appreciate any additional insights or discussion you may have. Please kindly let us know if any further clarification is required.
Best regards, Authors
Thank you for your response. My concerns have been addressed. However, due to the NeurIPS discussion rule, I am unable to review the revised main manuscript. I trust that the authors will revise it according to my comments, but I will maintain my current score with a neutral perspective.
This paper introduces a straightforward yet effective dataset distillation method called AutoPalette. The method minimizes color redundancy at both the individual image level and the entire dataset level. At the image level, it trains the palette network by maximizing color loss and palette balance loss, thereby reducing color redundancy in images. At the dataset level, a color-guided initialization strategy is proposed to minimize color redundancy across the entire dataset. Extensive comparative and ablation experiments convincingly demonstrate the approach's effectiveness.
Strengths
- The proposed method outperforms other dataset distillation methods in most tasks, providing a new perspective on dataset distillation.
- The experiments and ablation study seem well done. The paper's experiments are comprehensive, and the results of the ablation studies are convincing.
Weaknesses
- The paper could benefit from a more detailed explanation of the color loss and palette balance loss. It would be helpful to include an explanation of why the palette balance loss might achieve a more balanced color palette.
- The paper does not seem to explain why the similarity between the last layer gradients is measured instead of directly measuring the feature level similarity in the Color Guided Initialization Module.
Questions
- How does the efficiency of this method compare to other methods?
- Why does directly optimizing the task loss lead to assigning pixels to a limited number of color buckets in lines 156-158?
Limitations
The authors have discussed the limitations of their work in Section 5. However, they could provide more detailed descriptions of how these limitations might impact the results.
Weakness 1: The paper could benefit from a more detailed explanation of the color loss and palette balance loss. It would be helpful to include an explanation of why the palette balance loss might achieve a more balanced color palette.
Thank you for your insightful comment. The palette balance loss is designed to encourage the network to distribute pixels uniformly across all colors, maximizing the representational capacity of each color. Specifically, the palette balance loss is calculated as the average entropy of the probability distribution that assigns pixels to color buckets, taken over the spatial dimensions, colors, and channels. When the palette balance loss is minimized, it results in an even distribution of pixels across buckets, implying that the pixel count for each color bucket is approximately equal. This encourages the network to generate a palette with a more balanced color distribution, preventing any particular color from dominating the images.
Additionally, the maximum color loss is computed as the negative sum of the maximum confidences for each color bucket. This encourages each color to be selected by at least one pixel, thereby aiming to diversify the colors within each image. We will provide a more detailed explanation of these concepts in the revised version.
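A minimal NumPy sketch of our reading of the two losses described above; the tensor shape (H, W, K) for the soft pixel-to-bucket assignments and the exact reduction scheme are assumptions, not the paper's implementation:

```python
import numpy as np

def max_color_loss(p):
    # p: (H, W, K) soft assignment of each pixel to K color buckets.
    # Negative sum of the per-bucket maximum confidence: minimizing it
    # pushes every bucket to be strongly selected by at least one pixel.
    flat = p.reshape(-1, p.shape[-1])
    return float(-flat.max(axis=0).sum())

def palette_balance_loss(p, eps=1e-8):
    # Negative entropy of the average bucket-usage distribution:
    # minimized when pixels are spread evenly across all K buckets.
    usage = p.reshape(-1, p.shape[-1]).mean(axis=0)
    return float((usage * np.log(usage + eps)).sum())
```

Under this sketch, a balanced assignment yields a lower balance loss than one that collapses all pixels onto a single bucket, matching the intended behavior.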
Weakness 2: The paper does not seem to explain why the similarity between the last layer gradients is measured instead of directly measuring the feature level similarity in the Color Guided Initialization Module.
Thank you for the insightful question. Similarity measured via the last-layer gradients and similarity measured via features can be largely correlated. The major difference is that the last-layer gradient also captures the joint interaction between the feature space and the label space, while feature similarity mainly focuses on the feature space.
To explore this further, we conducted additional experiments using feature similarity in the graph cut initialization method. We passed the color-reduced images through the models to obtain their feature representations and then computed the cosine similarities among these features. Due to time constraints, we applied our method to distribution matching (DM) [1] with IPC values of 1 and 10. As shown in the table below, using feature similarity achieves 35.36% and 60.9% test performance in the 1 IPC and 10 IPC experiments, respectively.
The results align with our assumption that the two measures are largely correlated. We hypothesize that both methods for computing similarities can lead to good performance, suggesting flexibility in choosing the similarity approach.
| IPC | With Gradient | With Feature |
|---|---|---|
| 1 | 35.5 | 35.36 |
| 10 | 60.9 | 60.9 |
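The feature-similarity variant tested above can be sketched as follows; feature extraction itself is omitted, and the function name and matrix layout are our assumptions:

```python
import numpy as np

def cosine_similarity_matrix(feats):
    # feats: (n, d) feature vectors, e.g. activations of the model's
    # penultimate layer for each color-reduced image
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return normed @ normed.T  # (n, n) symmetric, entries in [-1, 1]
```

The resulting pairwise similarity matrix can then be plugged into the graph cut initialization in place of the gradient-based similarities.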
Question 1: How does the efficiency of this method compare to other methods?
The additional overhead introduced by our color palette network during the forward and backward passes is minimal. Our palette network consists of 3 convolutional layers designed to generate color-reduced synthetic images efficiently. Consequently, the increase in wall-time is marginal compared to vanilla baseline methods such as distribution matching [1] and trajectory matching [2]. We would like to further highlight that AutoPalette can be more efficient than other parameterization methods [3-5], which require reconstructing the distilled images at test time. This efficiency makes AutoPalette a practical choice for applications where computational resources are limited.
Question 2: Why does directly optimizing the task loss lead to assigning pixels to a limited number of color buckets in lines 156-158?
Empirically, we observe that during the early stages of optimization, the palette network tends to assign pixels to only a few color buckets. As the optimization progresses, if it focuses solely on minimizing the task loss (distillation loss), it often neglects the need to constrain the palette network and to maintain a diverse color representation. This leads to underutilization of the available color buckets, resulting in poor generalization capacity.
To address this, we introduce the color maximum loss and palette balance loss as additional constraints on the optimization of the palette network. These losses encourage the network to utilize the full range of color buckets more effectively, leading to a richer and more balanced color representation in the color-reduced images.
[1] Dataset Condensation with Distribution Matching, Bo Zhao et al.
[2] Dataset Distillation by Matching Training Trajectories, George Cazenavette et al.
[3] Dataset Distillation via Factorization, Songhua Liu et al.
[4] Sparse Parameterization for Epitomic Dataset Distillation, Xing Wei & Anjia Cao et al.
[5] Remember the Past: Distilling Datasets into Addressable Memories for Neural Networks, Zhiwei Deng et al.
Thank you for your response. My concerns have been fully resolved, and this is excellent work.
The paper introduces AutoPalette, a novel framework for dataset distillation (DD) that focuses on minimizing color redundancy at both the individual image and overall dataset levels. The authors propose a palette network to dynamically allocate colors from a reduced color space to each pixel, ensuring essential features are preserved. Additionally, a color-guided initialization strategy is developed to minimize redundancy among images, selecting representative images based on information gain. Comprehensive experiments on various datasets demonstrate the superior performance of the proposed color-aware DD compared to existing methods.
Strengths
- Color quantization is an interesting direction for dataset distillation, and the motivation of this paper is interesting.
- The methodology is well-defined, with clear explanations of the palette network and the color-guided initialization strategy.
- The framework is shown to be compatible with other DD methods, indicating its potential for broad application.
Weaknesses
- The paper does not discuss the potential impact of the method on performance for larger datasets beyond CIFAR-10 and CIFAR-100. These two datasets are too small to show the effectiveness of the proposed method.
- There is limited exploration of how the method handles imbalanced datasets or classes with unique color distributions.
Questions
See weakness.
Limitations
See weakness.
Weakness 1: The paper does not discuss the potential impact of the method on performance for larger datasets beyond CIFAR-10 and CIFAR-100. These two datasets are too small to show the effectiveness of the proposed method.
We appreciate the concern regarding the need for experiments on large-scale datasets. In addition to the experiments conducted on CIFAR10 and CIFAR100, we have also applied our method to higher-resolution (128×128) subsets of the ImageNet dataset, such as ImageNette and ImageWoof, as demonstrated in Table 2 of our paper. To further address the need for experiments on datasets with more classes, we conducted additional experiments during the rebuttal period on the Tiny ImageNet dataset, which contains 200 classes of 64×64 images. Please kindly refer to our response to R1W2 (Reviewer 5XJK).
Weakness 2: There is limited exploration of how the method handles imbalanced datasets or classes with unique color distributions.
To evaluate the effectiveness of AutoPalette on imbalanced datasets, we created an imbalanced CIFAR10 dataset following the protocol described in [1]. Specifically, we resampled the CIFAR10 dataset so that the number of samples per class is determined by n_i = n · α^(i/(N−1)), where n is the original number of images per class, α is the scaling factor, i indicates the i-th class, and N is the number of classes. We used distribution matching (DM) [2] as our baseline distillation method. The performance for IPC values of 1 and 10 is shown below.
| Ratio α | Method | IPC 1 | IPC 10 |
|---|---|---|---|
| 0.01 | DM | 25.91 | 48.01 |
| 0.01 | Ours | 35.60 | 59.53 |
| 0.005 | DM | 25.54 | 46.71 |
| 0.005 | Ours | 34.58 | 58.11 |
From the results, we observe that when the scaling factor α is 0.01 (with a minimum of 50 and a maximum of 5000 images per class), the distillation performance remains relatively stable. However, when α is further decreased to 0.005 (with a minimum of 25 images per class), the performance drops slightly. These results demonstrate that our proposed method consistently improves upon the baseline across a broad range of imbalance settings.
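The per-class sample counts in the imbalance protocol can be reproduced with a short helper (the function name is ours); with n = 5000, N = 10 classes, and α = 0.01, the counts decay exponentially from 5000 down to 50:

```python
def long_tail_counts(n=5000, alpha=0.01, num_classes=10):
    # n_i = n * alpha ** (i / (N - 1)): class 0 keeps all n samples,
    # the last class keeps n * alpha samples
    return [int(n * alpha ** (i / (num_classes - 1))) for i in range(num_classes)]

print(long_tail_counts())  # [5000, ..., 50]
```

With alpha = 0.005 the smallest class shrinks to 25 samples, matching the second imbalance setting reported above.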
We agree that addressing imbalanced datasets is an important challenge. We believe that dynamically allocating limited storage resources across classes, rather than using a fixed number of images for each class, could potentially address this issue. We are excited to explore this direction in future work.
[1] Dataset Card for CIFAR-10-LT (Long Tail), Huggingface.
[2] Dataset Condensation with Distribution Matching, Bo Zhao et al.
Dear Reviewer,
Thank you for your detailed review and valuable feedback on our paper. We hope we have addressed your comments in our rebuttal and would appreciate any additional insights or discussion you may have.
We are more than willing to engage in further discussion and address remaining concerns during the discussion period. If our responses have resolved your concerns, we kindly ask you to consider raising the rating.
Best regards,
Authors
This paper introduces AutoPalette, a framework that minimizes color redundancy at the individual image and overall dataset levels. At the image level, the palette networks generate condensed images at a reduced color bit-width, while at the dataset level, a color-guided initialization strategy is proposed. The experiments are done using various datasets and IPCs.
Strengths
- A new direction for exploring DC is proposed.
- AutoPalette explores the possibility of performing DC in a reduced color space. The paper is easy to understand.
Weaknesses
- AutoPalette seems like it is built on top of [1] with DC loss.
- Lack of experiment on large-scale dataset ImageNet-1K.
[1] Learning to Structure an Image with Few Colors, Yunzhong Hou et al.
Questions
- How is the performance of AutoPalette on ImageNet-1K?
- Since the method falls into the parameterization category, given an IPC storage size, how many samples does AutoPalette generate?
- In Table 1, why is AutoPalette inferior to DATM on CIFAR-100 at 50 IPC?
Limitations
N/A
Weakness 1: AutoPalette seems like it is built on top of [1] with DC loss:
Thank you for bringing up this important question! While color reduction plays a significant role in our methodology, our work primarily focuses on addressing two unique challenges inherent in dataset distillation with low IPC (limited synthetic samples). These challenges are fundamentally different from training networks with complete datasets as discussed in [1]. Specifically, the challenges include:
- The optimization process in dataset distillation is extremely unstable, especially when employing a color reduction network. This instability can lead to the optimization process becoming trapped in local optima without capturing global information.
- The color reduction model in [1] tends to select certain colors, potentially resulting in biases towards these colors and capturing spurious features. This bias can hurt the generalization of the trained model.
In contrast, our method addresses these limitations by introducing two losses tailored for the dataset distillation process: (1) a color regularization term to enhance color consistency to reach global optima, and (2) a color balance loss to avoid biased color assignment.
Our ablation studies, as presented in Table 3 of our paper, demonstrate the significance of these loss functions. They are critical for guiding the network towards a more balanced and effective representation of the color space. Without these two loss functions, the test performance drops by 5.06%. Their inclusion facilitates faster and more stable convergence, highlighting the main differences from [1].
Weakness 2 & Question 1: Lack of experiment on large-scale dataset ImageNet-1K:
Thank you for your feedback regarding the importance of conducting experiments on large-scale datasets like ImageNet-1k. We have expanded our analysis beyond CIFAR10 and CIFAR100 by applying our method to higher-resolution subsets of the ImageNet dataset, such as ImageNette and ImageWoof, as shown in Table 2 of our paper.
Conducting comprehensive experiments on the full ImageNet-1K dataset poses significant computational challenges. As highlighted in [6], distilling ImageNet-1K requires 4 NVIDIA A100 GPUs, each with 80 GB of memory. Unfortunately, these requirements do not allow us to run all requested experiments within the rebuttal period. Therefore, to demonstrate the scalability and efficacy of our proposed method on large-scale datasets, we conducted additional experiments during the rebuttal period on the Tiny ImageNet dataset, which contains 200 classes of 64×64 images. We adapted AutoPalette to the distribution matching (DM) [2] approach with Image Per Class (IPC) values of 1 and 10. Our findings show significant improvements in test performance compared to baselines. Specifically, our method improved test performance from 3.9% to 7.02% for IPC 1 and from 12.9% to 29.52% for IPC 10. The table below illustrates the test performance results.
| Method/IPC | 1 | 10 |
|---|---|---|
| DM | 3.9 | 12.9 |
| Ours | 7.02 | 29.52 |
We apologize for not being able to conduct experiments on the full ImageNet-1K dataset at this time. We are eager to explore this for the camera-ready version.
Question 2: Given an IPC storage size, how many samples does AutoPalette generate?
Thank you for the insightful question. With a fixed IPC storage budget, our AutoPalette method achieves a fourfold increase in the number of generated instances compared to the baseline. In comparison, methods such as IDC [3] and HaBa [4] typically generate instances with a fivefold increase, while FReD [5] achieves increases ranging from fourfold to sixteenfold. This demonstrates the efficiency of AutoPalette in optimizing sample generation within the constraints of a given storage size.
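As a back-of-the-envelope illustration of how a reduced per-pixel bit-width translates into more instances under a fixed storage budget (the 6-bit figure below is our assumption for a fourfold gain, not a number stated in the rebuttal):

```python
def instance_multiplier(full_bits=24, reduced_bits=6):
    # a fixed byte budget holds full_bits / reduced_bits times as many
    # pixels, hence that many times as many images of the same size
    return full_bits // reduced_bits

print(instance_multiplier())  # 4
```

Reducing 24-bit pixels to 6-bit palette indices would allow four images to occupy the storage of one full-color image, consistent with the fourfold figure above.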
Question 3: In Table 1, why AutoPalette inferior to DATM on CIFAR-100 at 50 IPC?
We appreciate your question regarding the performance comparison. Our method incorporates the trajectory matching strategy from DATM [6], where the matching steps are gradually increased during the distillation process. However, unlike DATM, we do not utilize the soft-labelling method and instead focus on exploring color features. This difference in approach may contribute to the slight performance discrepancy observed on CIFAR-100 at 50 IPC. We believe that integrating soft-labelling could potentially enhance our method's performance in future work.
[1] Learning to Structure an Image with Few Colors, Yunzhong Hou et al.
[2] Dataset Condensation with Distribution Matching, Bo Zhao et al.
[3] Dataset Condensation via Efficient Synthetic-Data Parameterization, Jang-Hyun Kim et al.
[4] Dataset Distillation via Factorization, Songhua Liu et al.
[5] Frequency Domain-based Dataset Distillation, Donghyeok Shin & Seungjae Shin et al.
[6] Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching, Ziyao Guo & Kai Wang et al.
Dear Reviewer,
Thank you for your detailed review and valuable feedback on our paper. We hope we have addressed your comments in our rebuttal and would appreciate any additional insights or discussion you may have. We are more than willing to engage in further discussion and address remaining concerns during the discussion period.
Thank you again for your time and consideration.
Best regards, Authors
We would like to extend our sincere gratitude to all the reviewers for their time and effort in reviewing our work. We deeply appreciate the insightful suggestions and feedback provided. We also thank the reviewers for acknowledging that 1) our color-oriented redundancy reduction provides a new perspective on dataset distillation (5XJK, uhQ2, Jnju, VepF), 2) the proposed method is effective (Jnju, VepF), and 3) our paper is easy to understand (5XJK) and the methodology is well defined (uhQ2).
Based on the common suggestions, we have conducted additional experiments during the rebuttal period, summarized as below:
- Additional experiments on the large-scale dataset Tiny ImageNet. Please refer to our rebuttals to Reviewer 5XJK Weakness 2, Reviewer uhQ2 Weakness 1, and Reviewer VepF Limitations.
- Impact of the hyperparameters α, β, and γ. Please refer to the rebuttal for Reviewer VepF Question 2.
- Performance on imbalanced datasets. Please refer to the rebuttal for Reviewer uhQ2 Weakness 2.
Please find the point-to-point response in each individual reply. Thank you once again for your valuable feedback and support.
The paper introduces AutoPalette, a dataset distillation method that reduces color redundancy at both the image and dataset levels. Reviewers acknowledge that this approach offers a new perspective on dataset distillation and demonstrates effectiveness in experiments (5XJK, uhQ2, Jnju, VepF). However, concerns are raised about the lack of experiments on larger datasets like ImageNet-1K, which limits the demonstration of the method's scalability and effectiveness (5XJK, uhQ2, VepF). Suggestions include providing more detailed explanations of the color loss and palette balance loss (Jnju), addressing how the method handles imbalanced datasets (uhQ2), and clarifying definitions such as color bits (VepF). Questions about the method's efficiency compared to others are also noted (Jnju).
In the rebuttal, the authors address these concerns by explaining how their method differs from prior work, emphasizing their unique loss functions that improve optimization and generalization. They acknowledge the computational constraints preventing ImageNet-1K experiments but provide additional results on Tiny ImageNet to demonstrate scalability. They clarify that AutoPalette generates four times more instances within a given storage budget and explain performance differences compared to other methods. Overall, the reviewers' concerns were addressed, and the AC recommends acceptance.