Vision-Language Dataset Distillation
Abstract
Reviews and Discussion
This might be the first work to condense images and text together. Building on MTT from the image dataset condensation literature, the paper uses some engineering methods to build the basic framework. The experiments are extensively conducted on COCO and Flickr30K.
Strengths
- Writing is clear and easy to understand.
- The problem is new to me.
- The experiments are extensive.
Weaknesses
- The three coreset-method baselines are too weak. There are too many empirical studies without any theoretical analysis of why these coreset methods are suitable for this task.
- Why use Cosine similarity to evaluate the pairs? Any theoretical analysis?
- Why is only the text encoder backbone frozen, and not the image encoder backbone?
- Why choose the retrieval task? A straightforward task that comes to mind is to use a subset of the CLIP [a] training set to train the CLIP model to similar performance. Can the proposed method do this?
- Why use the Normalizer-Free ResNet (NFNet) (Brock et al., 2021b; Wightman, 2019) as the image backbone? It does not appear to be the best backbone, as shown in Table 9.
- What is the result if the ratio equals 50% in Table 2? If the proposed method can reach or close to the upper bound in Table 3, I would raise my score.
[a] Learning Transferable Visual Models From Natural Language Supervision
Questions
See weaknesses above.
Summary:
Thank you for the insightful comments and questions. We appreciate your recognition of the clarity of our writing and of the extensive experiments. We address the raised concerns below.
W1. Weak baselines and no theoretical analysis.
First, with regard to theoretical analysis, we would like to clarify that dataset distillation is inherently an empirical field, at least currently: a number of methods have been proposed for distilling image classification datasets (Cazenavette et al., 2022, Cui et al., 2022, Deng & Russakovsky, 2022) (published in CVPR, ICML and NeurIPS respectively), all of which rely exclusively on empirical analysis and evaluation. While theoretical grounding would certainly be beneficial to the dataset distillation field, it is outside the scope of our paper. As we note in the abstract, our core contribution is “proposing the first Vision-language Dataset Distillation method” by “implicitly matching the long-range training bi-trajectory of the target vision-language data and the synthetic ones” given three key challenges mentioned in Sec. 1, and verifying the strength of the method empirically.
Second, about coreset baselines, our work extends these methods to vision-language datasets, which introduces new challenges and complexities. The coreset methods we compare against are among the most effective and widely used baselines in dataset distillation [a, Cazenavette et al., CVPR, 2022, Wang et al., CVPR, 2022, Zhao & Bilen, ICLR, 2021b]. Since there are no prior methods for dataset distillation on vision-language datasets, leveraging coreset selection methods allows for a fair comparison between our synthetic distilled dataset and a real dataset subsampled to fit under the same storage budget. In addition, we perform ablation studies (Sec. 4.4) to examine the effects of different parts of our method. We would welcome suggestions for additional baselines or ablation studies. Unfortunately, as mentioned above, theoretical justification is outside the scope of our work – and theoretical analysis by itself would be a substantial contribution to dataset distillation, even on the original image classification task.
W2. The usage of cosine similarity.
We use cosine similarity because it is well-suited for measuring the alignment of high-dimensional data in embedding spaces. In contrastive learning, the goal is to ensure that similar sample pairs are close in the embedding space while dissimilar pairs are far apart. As stated in [b], the cosine loss induces sixth-order dynamics (while the L2 loss induces third-order ones), in which a stable equilibrium dynamically emerges even if there are only collapsed solutions with the given initial parameters. Cosine similarity has been widely used in representation learning with various theoretical interpretations [c, d]. We refer the reviewer to those works for further details.
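For concreteness, below is a minimal sketch of how cosine similarity enters a bidirectional contrastive objective of this form (illustrative PyTorch-style code, not our exact implementation; the temperature value and embedding dimension are placeholders):

```python
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # L2-normalize so that inner products become cosine similarities
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # Pairwise cosine similarities between every image and every text in the batch
    logits = img_emb @ txt_emb.t() / temperature   # shape (B, B)
    targets = torch.arange(logits.size(0))         # matched pairs sit on the diagonal

    # Symmetric image-to-text and text-to-image cross-entropy terms
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage with random 512-d embeddings for a batch of 8 pairs
loss = bidirectional_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```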
W3. The choice of freezing encoders.
Our decision to freeze the text encoder backbone but not the image encoder follows common design choices in the field. This choice is often made to balance computational efficiency with model performance, especially considering the complexity and diversity of visual data compared to textual data. Tsimpoukelli et al. [e] finetune only the vision encoder (NFNet) and generate embeddings that are fed into a frozen pretrained language model. MAGMA [f] also uses NFNet as the vision encoder together with a frozen language model (GPT-Neo). Scialom et al. [g] showed that a learned linear transformation is sufficient for BERT to encode image region representations. Our design choice aligns with established practice in the field, where the image encoder is fine-tuned to better handle the complexity of visual data while the text encoder is kept fixed to maintain computational efficiency and leverage existing language understanding.
W4. Why retrieval? What about distilling the subset of CLIP?
This is an excellent question about using a subset of CLIP for the experiment, and we had considered it. However, even scaling up from image classification to the (arguably simplest) vision-language task of retrieval posed significant challenges (as shown in Tab. 2 and Tab. 5). As we were already tackling an ambitious goal of developing a novel dataset distillation method for vision-language datasets, we decided to keep the evaluation task for now as simple as possible. Additionally, even the subset of CLIP training set could reach a scale of millions (e.g. Conceptual Captions 3M), which demands significant computational resources. Our choice of task and dataset was driven by the desire to introduce and validate our novel dataset distillation method in a context that is both challenging and feasible within our resource constraints.
W5. Why use NFNet? It’s sub-optimal in Tab. 9.
Upon careful review of Tab. 9, there may be a slight misunderstanding. The last line of the result table actually presents NFNet in combination with the CLIP text encoder, underscoring NFNet's effectiveness as the vision backbone. The choice of NFNet as the image backbone was based on its performance and compatibility with our method; it demonstrates the best performance across all vision backbones in our experiments. Beyond this optimal performance, our design choice also aligns with common practices in the multimodal field, where NFNet is widely used [h, i, Alayrac et al., 2022].
W6. Performance reaching upper bound.
Regarding your question about results using 50% of the original data size and reaching the upper bound in Table 3: we acknowledge that reaching the upper bound is the North Star of dataset distillation, and we and other researchers will continue working towards this goal. Distilling with 50% of the original data size is not feasible under our current academic computing setting. Our method represents a step towards this goal, and we continue to work on improving the efficiency and effectiveness of our approach.
We hope these responses address your concerns, and we welcome further discussion and suggestions.
[a] Zhao, Bo, and Hakan Bilen. "Dataset condensation with distribution matching." WACV, 2023.
[b] Bao, Han. "Feature Normalization Prevents Collapse of Non-contrastive Learning Dynamics." arXiv preprint arXiv:2309.16109, 2023.
[c] Tian, Yuandong. "Understanding deep contrastive learning via coordinate-wise optimization." NeurIPS, 2022.
[d] Ren, Yunwei, and Yuanzhi Li. "On the Importance of Contrastive Loss in Multimodal Learning." arXiv preprint arXiv:2304.03717, 2023.
[e] Tsimpoukelli, Maria, et al. "Multimodal few-shot learning with frozen language models." NeurIPS, 2021.
[f] Eichenberg, Constantin, et al. "MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning." EMNLP, 2022.
[g] Scialom, Thomas, et al. "What BERT sees: Cross-modal transfer for visual question generation." Proceedings of the 13th International Conference on Natural Language Generation, 2020.
[h] Merullo, Jack, et al. "Linearly mapping from image to text space." ICLR, 2023.
[i] Gu, Jindong, et al. "Towards robust prompts on vision-language models." arXiv preprint arXiv:2304.08479, 2023.
Thanks for the effort in responding. I value the initial exploration the authors provided in this direction, but I still have concerns regarding W4 and W6.
For W4, the authors didn't show the experiments on CLIP due to resource constraints. For W6, the authors fail to distill with 50% of the original data size.
Therefore, I tend to keep my score.
Thank you for your prompt response! We appreciate your feedback, and would like to provide some new updates regarding W4 & W6:
W6. Upper bound experiment.
We appreciate your question regarding the results when the distilled data ratio equals 50%. Due to constraints in computational resources, we were unable to conduct experiments with 50% of the data. However, we believe the new results below, obtained on Flickr30K with only 10% of the original dataset's size, offer significant insight into the effectiveness of our method: models trained on this distilled data closely approach the upper-bound results, and certain metrics, such as NFNet + CLIP text retrieval (TR) at R@1, reach as high as 98% of the upper bound.
| Result Type | Base Model | Ratio | TR R@1 | TR R@5 | TR R@10 | IR R@1 | IR R@5 | IR R@10 |
|---|---|---|---|---|---|---|---|---|
| Distillation | NFNet + BERT | 10% | 32.1 | 60.0 | 73.2 | 24.1 | 53.9 | 66.5 |
| Upper Bound | NFNet + BERT | 100% | 33.9 | 65.1 | 75.2 | 27.3 | 57.2 | 69.7 |
| Distillation | NFNet + CLIP | 10% | 60.0 | 86.3 | 91.4 | 47.4 | 78.2 | 86.5 |
| Upper Bound | NFNet + CLIP | 100% | 61.2 | 87.5 | 92.8 | 49.8 | 79.8 | 88.3 |
W4. Why retrieval? What about distilling the subset of CLIP?
In addition to our previous reply to W4 regarding the choice of the retrieval task and the feasibility of training on a CLIP subset, we now provide new experimental results and explanations. As stated in the CLIP paper, the method is primarily evaluated via zero-shot ImageNet accuracy. Here, we show the results of training CLIP models from scratch on the Conceptual Captions 3M dataset (the full 400M-pair CLIP training set is not publicly available; CC3M is one of the few accessible subsets). Note that this is still the expert buffer training stage, not the distillation stage, as distillation requires long trajectories stored in the buffer. The results below therefore serve as upper-bound performance.
| Model | Dataset | Zero-shot Classification Acc. |
|---|---|---|
| CLIP ResNet50 | CC3M | 4.95% |
| CLIP ViT small-16 | CC3M | 16.87% |
| CLIP ViT base-16 | CC3M | 17.7% |
Even though this subset is quite large, containing millions of data points, the zero-shot accuracy is not particularly promising. It is important to note that this training required substantial computational resources, involving 8 x A100 GPUs, and lasted approximately one day. Our choice of task and dataset was driven by the desire to introduce and validate our novel dataset distillation method in a context that is both challenging and practical, considering our resource constraints and the availability of datasets.
We hope these responses address your concerns, and we welcome further discussion and suggestions.
The authors propose a multimodal dataset distillation method. Vision-language dataset distillation involves, first, training multiple models on the full dataset using a bidirectional contrastive loss to obtain expert trajectories at various training epochs. Then, a set of student models is trained on a distilled dataset using the same bidirectional contrastive loss, and the dataset is updated based on a bi-trajectory matching loss that measures the alignment of student model parameter trajectories with the expert trajectories. The authors evaluate their method against the closest related work and show a significant improvement.
Strengths
- The paper is very well written and easy to understand. Authors clearly explain their method and provide an intuition behind their method selection.
- Results are significantly better than related methods with fewer examples. Ablation studies show that the multimodal distillation outperforms distillation with a single modality.
Weaknesses
- Storing and training with the trajectory data seems like an expensive process. The addition of multimodal data requires even more resources, such as modality-specific encoders. While I believe that these factors represent significant limitations to the work, I also recognize the substantial contribution it makes to advancing this field.
Questions
- I believe that the sentences generated for the qualitative results are nearest neighbors for real sentences in the dataset. However, is it possible to get the nearest neighbor token from the distilled text dataset? If so, do these tokens actually form sentences that make sense?
Thank you for your insightful review. We appreciate your acknowledgment of the clarity and effectiveness of our method and the significant improvements demonstrated in our results. We address the raised concerns below.
Weakness: Computational Demands.
Regarding the computational demands, the reviewer correctly identified the heavy resources required for trajectory data storage and multimodal distillation. As mentioned in Sec. 4.2, our models, trained at 224 × 224 resolution on a single RTX 3090 GPU with 24GB, require approximately 40 minutes per epoch for 10 epochs. Distillation takes between 6 and 15 GPU hours, depending on factors such as the number of distilled pairs, using an 8-GPU A6000 node. This demonstrates the feasibility of our approach in similar research environments. In comparison, dataset distillation for image classification tasks typically demands less computation due to lower data resolutions (e.g., CIFAR10/100: 32x32 [Cazenavette et al., 2022]; ImageNet1K & Tiny-ImageNet: 64x64 [Cazenavette et al., 2022, Cui et al., 2022]; ImageNet-subset: 128x128 [Cazenavette et al., 2022]). Distillation times range from 83 to 317 minutes for CIFAR10/100 and from 183 to 433 minutes for Tiny ImageNet [Cazenavette et al., 2022]. Note that our work was conducted in an academic lab with resources that, while substantial, are not on par with industry-scale clusters.
Question: Distilling Text at the Token Level.
Our method distills at the sentence-embedding level and is primarily designed around the association between entire images and complete sentences for image-text retrieval, where distilling at the token level may not effectively serve our purpose. While operating at the token level is feasible, it would significantly increase computational cost and storage, which runs counter to the goal of dataset distillation: compact representation. We therefore opted to limit our scope to sentence-level distillation for this work, considering both storage efficiency and the specific requirements of image-text retrieval tasks.
We hope this addresses your concerns and look forward to any further feedback!
This paper explores a new problem, VL dataset distillation, which has not been explored before, and it follows the existing dataset distillation approach to evaluate the performance. The results show that the distilled dataset can outperform the coreset algorithms significantly.
Strengths
- This is the first paper to perform dataset distillation on vision-language datasets.
- Comprehensive experiments are conducted in the paper.
Weaknesses
- The underlying distillation process is the same as MTT, even though the expert model is trained with a bi-directional contrastive loss.
- At the bottom of page 1, the authors mention that it is hard for text data, but in the paper the authors still distill in the continuous space and then simply find the closest embeddings.
Questions
- In the problem formulation, is there any restriction on K? Or is fewer pairs the only goal?
- The symbol notation is not clear. E.g., in Eq. 2, what are * and the hat on theta? In Eq. 3, what is the summation over y'? What is the set? All of y except itself?
- In Table 1, as the authors use BERT-pretrained models, how much of the contribution comes from the text-pretrained model when comparing to conventional dataset distillation?
- As pretrained models are used, what is the dependency on the pretraining dataset (not the model)? E.g., if the image and text encoders are pretrained on particular datasets, what is the performance of a model pretrained on other datasets when trained on the distilled dataset?
- I wonder whether the authors know why the distilled images still look like the original real images. In the MTT paper, the distilled images are very different from the real images (at least visually).
- Regarding the distilled text shown in Fig. 3, it seems that the distilled text can provide a vague description; e.g., in the bottom right image, it changes "two" men to "four" football players. If this algorithm is applied to VQA, won't it provide wrong counting?
Summary:
We appreciate the reviewer's insightful feedback and acknowledgment of our contribution as the first work to explore vision-language dataset distillation. The recognition of our experimental setup and the significant performance improvements our approach achieves are highly encouraging. We address the concerns raised below.
(W1) Method.
Our approach extends the MTT formulation to handle the unique requirements of multimodal tasks, a significant departure from standard dataset distillation approaches that focus solely on single-modality data (Sec. 3.1). Unlike previous dataset distillation settings, vision-language datasets lack a discrete set of classes (Sec. 1). Additionally, the cross-modal relationships between visual and textual elements require a co-distillation approach (Sec. 4.4). The complexity of these high-resolution datasets, combined with more complex models, presents significant challenges (Sec. 1 & 4.2).
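For illustration, a minimal sketch of the MTT-style trajectory matching term that our formulation builds on (variable names are ours and details such as the number of student steps are omitted; the full bi-trajectory objective matches the trajectories of both the vision and the language encoder):

```python
import torch

def trajectory_matching_loss(student_params, expert_start, expert_target):
    # Distance between the student's end-point parameters and the expert's target
    # parameters, normalized by how far the expert itself moved over the segment.
    num = sum(((s - e).pow(2)).sum() for s, e in zip(student_params, expert_target))
    den = sum(((e0 - e1).pow(2)).sum() for e0, e1 in zip(expert_start, expert_target))
    return num / (den + 1e-12)

# Toy usage: parameter lists with two tensors each
p = [torch.randn(4, 4), torch.randn(4)]
loss = trajectory_matching_loss(
    [t + 0.01 for t in p],   # hypothetical student end-point
    p,                       # expert segment start
    [t + 0.02 for t in p],   # expert segment target
)
```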
(W2) Text Representation.
Text embeddings offer a compact representation that aligns with the objectives of dataset distillation. They efficiently capture the essential information of textual data in a condensed format and are a widely accepted choice in NLP [a, b, c, Radford et al., 2021]. They also sidestep issues with gradients and variable length, serving as a strong, standard way to benchmark the task: many distillation methods depend on gradients and can suffer from vanishing or exploding gradients, whereas optimizing stable text-embedding representations bypasses the harder problem of discrete text distillation. Our sentence-level text embeddings provide a uniform, fixed-length representation regardless of the original text length, maintaining a consistent input size without truncation or padding.
Q1: Formulation of K.
In the problem formulation, K is the number of sentences associated with one image in the original dataset, which is fixed by the dataset; K = 5 for both Flickr30K and COCO (as mentioned in Sec. 4.2). For the distilled set, we aim to use one sentence (instead of five) per image, giving a more compact representation from which to learn the vision-language connection. We have modified Sec. 3.1 to clarify this.
Q2: Symbol Notation.
As mentioned in Eqn. 1, \theta^{*} represents the optimal model parameters obtained after training on the entire dataset, while \hat{\theta} denotes the parameters obtained from the distilled dataset. In Eqn. 3, the summation over y' runs over all captions in the batch other than the current y. This is a common notation in contrastive learning: for a given pair (x, y), the similarity is computed with every other y' in the batch (where y' != y) to form negative pairs for the contrastive loss. We have modified Sec. 3.1 and Sec. 3.2 to clarify this.
Q3: Contribution from BERT.
Compared to conventional dataset distillation, which focuses on image classification, pretraining BERT on a diverse language corpus allows for a more efficient distillation process and reduces the size of the distilled dataset. Appendix Tab. 10 demonstrates the importance of pretraining: without it, BERT trained from scratch performs poorly, even with the entire dataset. Pretraining is a common approach to provide a good foundation and starting point.
The core of vision-language dataset distillation is to find the most compact way to learn the joint space of the two domains. While both encoders are pretrained, they are pretrained only on unimodal data (i.e., ImageNet for the image encoder; BooksCorpus and Wikipedia for BERT); they have not been trained on a multimodal embedding space and have had no exposure to the other modality. The distillation process still relies heavily on the interaction between the vision and language data.
Q4: Performance Impact of Pretraining Datasets.
Thank you for raising an important point regarding the dependency of models on the pretraining datasets. Given that most models we use, such as NFNet, typically have only one available pretrained version (pretrained on ImageNet), we are conducting an additional experiment that pretrains NFNet on the PASS dataset [d]. Due to the limited time available during the rebuttal period, we may not complete the pretraining in time, but we will provide an update in the final version of the paper.
Q5: Noticeable Changes in Images.
We have found that increasing the learning rate or distillation time does lead to more noticeable changes in the images within the distilled dataset. We visualized the distilled image at iteration = 7000 and learning rate = 5000 in Fig.4, Appendix Sec. A in the updated paper. However, it's important to note that a higher learning rate or longer distillation time does not necessarily translate to improved performance of the distilled dataset, even if the images appear to deviate more drastically in terms of human perception. Changes in image pixels alone may not reliably predict distillation performance. It is rather a measurement of the strength of distillation. More distorted images suggest uneven pixel updates, while even updates yield results similar to the visualization we provided before (e.g. Fig. 3).
In line with previous studies, we initially expected that more obvious changes in the images would lead to better performance, but our findings suggest a different behavior in vision-language distillation with the trajectory matching framework, reflecting how models capture the vision-language interaction. From a human perception perspective, the distilled images appear to change less than in previous classification works, yet those small update vectors are still meaningful and contain useful information, as opposed to artifacts such as noisy patterns. As indicated by the results, our algorithm achieves a clear and consistent improvement over random baselines. We hope this discussion can also inspire more research on vision-language dataset distillation.
Q6: Distilled Text Descriptions.
Our distillation process is primarily designed to optimize general image-sentence alignment rather than precise detail-level alignment. It is worth noting that the final text visualizations are the nearest neighbors of our distilled text embeddings, and thus may differ slightly from the actual distilled embeddings. If our method were adapted for VQA, the objective function would be different and would shift to preserving the key details important for VQA accuracy, for example, exact object counts.
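For clarity, the visualization step can be thought of as a simple nearest-neighbor lookup against embeddings of the original captions, roughly as sketched below (hypothetical variable names; we assume a precomputed bank of sentence embeddings):

```python
import torch
import torch.nn.functional as F

def nearest_caption(distilled_emb, caption_bank_embs, captions):
    # Cosine similarity between one distilled text embedding and every
    # original caption embedding; return the closest caption string.
    sims = F.normalize(distilled_emb, dim=-1) @ F.normalize(caption_bank_embs, dim=-1).t()
    return captions[sims.argmax().item()]

# Toy usage: one distilled embedding vs. a bank of three caption embeddings
bank = torch.randn(3, 512)
query = torch.randn(1, 512)
print(nearest_caption(query, bank, ["a dog runs", "two men play football", "a red car"]))
```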
We look forward to discussing these points further and appreciate the opportunity to clarify these aspects of our paper.
[a] Li, Xiang, et al. "Diffusion-LM improves controllable text generation." NeurIPS, 2022.
[b] Reimers, Nils, and Iryna Gurevych. "Sentence-BERT: Sentence embeddings using Siamese BERT-networks." EMNLP, 2019.
[c] Gao, Tianyu, Xingcheng Yao, and Danqi Chen. "SimCSE: Simple contrastive learning of sentence embeddings." EMNLP, 2021.
[d] Asano, Yuki M., et al. "PASS: An ImageNet replacement for self-supervised pretraining without humans." NeurIPS, 2021.
The paper introduces a method for distilling vision-language datasets that consists of image-text (caption) pairs.
The method is based on training an expert model on the original dataset and a student model on the distilled dataset (initialized with samples from the original dataset) for a number of epochs, after which the samples in the distilled dataset are updated by back-propagating a loss function that measures the difference between the parameter trajectories of the student and expert models over those epochs.
In the distilled dataset, the images are updated in pixel space, while the text captions are updated in the text encoder's input embedding space.
As no prior dataset distillation methods exist for vision-language data, the paper compares the proposed approach to three coreset selection methods and shows consistent and substantial improvement over all of them.
Strengths
- (S1) The paper contains a good set of experiments. The authors find a way to compare their method against image-only dataset distillation methods (Table 1), which somewhat isolates the impact of the specific model proposed vs. the task of image-text dataset distillation, as opposed to image-label. Additionally, the authors also experiment with distilling only one modality (either only text or only images) (Table 4), which demonstrates the relative impact of each modality and of their combination on the performance. The results contain standard deviation values.
- (S2) The quantitative results demonstrate that the proposed approach consistently and substantially outperforms the alternative coreset selection approaches.
- (S3) The paper is very well-written, and the method well-explained.
Weaknesses
- (W1) The distilled dataset samples shown in the qualitative results (Figure 3) are, in the case of images, not very different from the original images (only augmented with some noisy high-frequency patterns) and, in the case of text, do not consistently appear to be better than the original captions. That raises the question of how robust those distilled datasets are and indicates that maybe the source of effectiveness of distilled datasets is somewhat different from what one would expect, that is, models constructing very informative and representative samples. Instead, the impact appears to come from some artifacts, like the high-frequency patterns discussed by the authors.
- (W2) If I understood correctly, the evaluation of the distilled datasets (image-to-text and text-to-image retrieval) is performed on models of the same architecture as the dataset distillation models. The paper does not seem to evaluate whether the distilled datasets are equally effective for models of architectures different from those used for distilling the datasets. Considering the point raised above (W1), there is a risk that they are not. Images from the distilled datasets could be easily evaluated on different architectures. However, for text, the approach of operating on the input embeddings might not be adaptable in a straightforward way to other models.
Questions
Summary:
Thank you for your insightful review and for acknowledging the contributions of our work, especially the novel setting, the compelling experimental setup and the clear presentation. To address the two noted weaknesses:
(W1) Noticeable Changes in Images.
We have found that increasing the learning rate or distillation time does lead to more noticeable changes in the images within the distilled dataset. We visualized the distilled image at iteration = 7000 and learning rate = 5000 in Fig.4, Appendix Sec. A in the updated paper. However, it's important to note that a higher learning rate or longer distillation time does not necessarily translate to improved performance of the distilled dataset, even if the images appear to deviate more drastically from the human perception perspective. Changes in image pixels alone may not reliably predict distillation performance. It is rather a measurement of the strength of distillation. More distorted images suggest uneven pixel updates, while even updates yield results similar to the visualization we provided before (e.g. Fig. 3).
In line with previous studies, we initially expected that more obvious changes in the images would lead to better performance, but our findings suggest a different behavior in vision-language distillation with the trajectory matching framework, reflecting how models capture the vision-language interaction. From a human perception perspective, the distilled images appear to change less than in previous classification works, yet those small update vectors are still meaningful and contain useful information, as opposed to artifacts such as noisy patterns. As indicated by the results, our algorithm achieves a clear and consistent improvement over random baselines. We hope this discussion can also inspire more research on vision-language dataset distillation.
(W2) Cross-architecture Generalization.
We report the cross-architecture generalization results in the table below. Despite a performance drop when transferring across architectures, as also observed in the existing dataset distillation literature (Cazenavette et al., 2022; Deng & Russakovsky, 2022), the performance of 100 pairs of data distilled with NFNet+CLIP and used to train NF-ResNet50 and NF-RegNet remains significantly higher than the coreset selection baseline. We acknowledge the importance of cross-architecture generalization, but given that this is the first attempt at vision-language dataset distillation, we expect generalizing the distilled data across different architectures to be challenging. For text, cross-architecture transfer could operate on the nearest-neighbor sentences rather than the input embeddings, which could offer a pathway for using distilled text data with different text base models.
| Backbone | TR R@1 | TR R@5 | TR R@10 | IR R@1 | IR R@5 | IR R@10 |
|---|---|---|---|---|---|---|
| NF-ResNet50 | 5.2 | 14.7 | 21.2 | 4.5 | 13.8 | 21.2 |
| NF-RegNet | 3.6 | 9.7 | 15.5 | 2.5 | 8.6 | 14.0 |
We look forward to discussing these points further and appreciate the opportunity to clarify these aspects of our paper.
Dear reviewers,
We would like to express our sincere gratitude to the reviewers for their very insightful and thorough feedback! We are delighted that reviewers found our paper very well written (Reviewers i1Lj, tXui), the method well-explained (Reviewers i1Lj, tXui), the experiments comprehensive (Reviewers MV4s, Evkk), and the work the first to address this novel problem (Reviewers MV4s, Evkk).
In response, we updated our paper and the updates include:
- Analysis of Distilled Images (Appendix Sec. A): we discuss the noticeable changes in images, where we observe more visible changes in the distilled images with longer training time and higher learning rates.
- Matching Upper Bound Performance (Appendix Sec. B).
- Clarifications of a few details in Sec. 3.1 and 3.3.
Below we address the remaining comments from the reviewers.
We again thank the reviewers for their comments and suggestions and hope we have addressed them in the revised paper and in our responses.

The Authors
The manuscript presents a novel framework for dataset distillation, tailored for training vision-language models. This approach ambitiously extends beyond traditional dataset distillation methods used in image classification. The reviewers' opinions are divided, with scores hovering around the borderline (5, 5, 6, 6). They acknowledge the paper's pioneering efforts in tackling a significant challenge. However, there are reservations regarding the magnitude of advancement achieved. These concerns stem from what are perceived as rather intuitive methodologies, the employment of potentially weak baselines, and the constrained scope of experiments, likely due to resource limitations in academic settings. Although the authors' rebuttal has successfully addressed some issues, it hasn't managed to evoke strong advocacy from any reviewer. Consequently, the area chair is inclined to recommend rejection of the paper, considering both the reviews and subsequent discussions.
Why not a higher score
While the problem is quite important, the progress is not significant enough to pass a publication bar at ICLR.
Why not a lower score
N/A
Reject