2.5

/10

withdrawn4 位审稿人

最低1最高5标准差1.7

3.3

置信度

正确性2.0

贡献度2.0

表达1.8

ICLR 2025

Data-Evolution Learning

Peng Sun,Rui Yin,Tianyu Du,Tao Lin

OpenReview PDF

提交: 2024-09-27更新: 2024-11-15

摘要

关键词

Data-centric Learning

评审与讨论

审稿意见

评分: 1置信度: 42024-10-30

This paper introduces a new “data-evolution learning paradigm,” challenging the traditional deep learning approach that only the model should be updated during training. Instead, this paradigm proposes updating both the model and the data to enhance performance. To implement this idea, the paper presents the Data-Evolution Learning Algorithm (DeLA), where data is iteratively updated through model switching. The authors claim that DeLA significantly outperforms existing self-supervised learning (SSL) frameworks and other model-centric approaches, achieving superior results.

优点

This paper offers a new perspective of SSL, where self-supervised learning can be viewed as model-data evolution paradigm, which can explain the SSL's performance.

缺点

Regarding the label is actually a feature map of projection head and predictor (line 918(This paper essentially rebrands several established concepts from self-supervised learning (SSL) approaches and can be seen as a minor modification of the original SSL framework, MoCo [1], which was one of the first SSL methods. The primary difference lies in how the feature map is generated. Specifically:

In MoCo:

$y = \left(\sum_{n=1}^N (1 - \lambda)^n f_n \right)(x)$

where $f_n$ denotes the feature map generated by the $n$ -th iterate of the student model (or a “momentum-updated” encoder). Here, the summation is applied to the model function first, and the combined model output is then applied to the input $x$ .

In DELA:

$y = \sum_{n=1}^N (1 - \lambda)^n f_n(x)$

where each feature map $f_n(x)$ is generated by applying the $n$ -th iterate model directly to the input $x$ , and then the outputs are summed.

This difference in parentheses placement results in effectively the same computation process. Thus, DELA lacks substantial novelty, as it simply repackages MoCo’s concept by slightly modifying the feature generation approach.

Also, in line 918, algorithm section, it says the update loss is just contrastive loss calculating the similarity. This indicates that every structructure is same except the order of parenthesis

Furthermore, this dynamic updating of feature maps could be inefficient from a memory perspective, as it requires continuously storing feature maps on the GPU in addition to the existing MoCo algorithm. In this case, the number of floating-point values would be $O(N \times d)$ , where N is the number of data points, and d is the size of each feature map. This approach offers no computational advantages over MoCo in terms of efficiency.

[1] He, Kaiming, et al. "Momentum contrast for unsupervised visual representation learning." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.

问题

In line 443, the paper claims that DINO trained on ImageNet achieves a performance of 52.2 on ResNet-50. However, according to the official DINO GitHub repository, its performance is actually 75.3% on the same model [1]. How is this discrepancy come from??

[1] https://github.com/facebookresearch/dino

审稿意见

评分: 3置信度: 32024-10-30

This paper presents a novel training paradigm where the data evolves along with the model during the training process, allowing both the training data and the model to be optimized simultaneously. This approach can be applied to datasets containing noisy labels or unlabeled data, enabling gradual performance improvement through mutual feedback between the data and the model. Through this methodology, accelerated model training and enhanced model performance are achievable. Experiments were conducted on CIFAR-10/100 and ImageNet datasets, utilizing RN18, RN50, and ViT-T/16 architectures. When applying DELA, the model demonstrated higher performance compared to using a static dataset, suggesting potential advantages in data and computational efficiency.

优点

The idea of simultaneously optimizing data and the model presented in the paper is highly innovative, and an appropriate methodology is suggested to achieve this.
A versatile methodology is proposed, applicable to both noisy labeled data and unlabeled data.
Compared to existing noisy-label learning methods and self-supervised learning methods, it achieved competitive performance.

缺点

There is a significant performance variation depending on the choice of the prior model. A method for either reducing performance discrepancies or pre-selecting prior models capable of achieving high performance is necessary.
In Table 2, results are reported for experiments using only 20/40/60% of the total data samples, but the intention behind these experiments is unclear. Additionally, on what criteria was the data selected?
While Section 4.2 emphasizes training efficiency, based on the caption in Table 1, this appears to be calculated simply by training steps (if otherwise, please provide the specific criteria used to calculate the training budget). Since DELA requires additional computation costs to evolve data labels, this may not be a fair comparison. A more reliable efficiency metric, such as wall clock time for total training duration, would be beneficial.
Given the nature of the methodology, where the model undergoes iterative self-learning, it seems likely that overfitting could be an issue. An analysis of overfitting would be appreciated.

问题

In the experimental results, ViT-T/16 consistently scores lower than ResNet-18 and ResNet-50 in all cases. Was this experiment conducted accurately? Providing reliable data or sources regarding the performance of ViT-T/16 would be helpful.
The use of the term "Original" may cause confusion. Typically, "Original" suggests ERM training that utilizes the entire dataset and labels. Since it has a different meaning in this paper, perhaps consider changing the term.
Please refer to the identified weaknesses.

审稿意见

评分: 1置信度: 32024-11-01

This paper proposes a new problem setting called data-evolution learning, in which data and models are updated simultaneously, and introduces the data-evolution learning algorithm (DeLA). DeLA updates data labels after updating the model in one step by calculating the exponential moving average of the labels from the previous step and the predicted labels from the training model. Through experiments, the paper shows that DeLA improves the performance of training models and that datasets updated with a model can be transferred to train other models.

优点

The paper proposes a new problem setting called Data-evolution Learning. Focusing on data-centric perspectives is important in the machine learning community.
The paper empirically shows that data-evolution learning has the potential to improve the performance and transferability to other model training.

缺点

The largest concern is that it is unclear why the proposed method provides improved generalization performance. The paper claims that the proposed method optimizes the dataset, but in fact, it does not define an objective function for optimizing the data; rather, it simply updates it with existing labels and pseudo-labels represented by the EMA of the model prediction. Why does this achieve the desirable properties described in Definition 1? The paper attempts to prove that the model and dataset converge at the optimal solution by Theorem 1, but the proof in Appendix A uses a different definition of the pseudo-labels and does not appear to prove the property claimed in Theorem 1. The paper should clarify why the generalization performance could be improved by the proposed method more solid evidence..
The proposed method's design is unconvincing. It restricts the loss function to cosine similarity in updating the model in Eq. (3), but there is no sufficient explanation for this.
The problem setting is ambiguous. In the experiments, the paper uses self-supervised learning (SSL) and weakly supervised learning (WSL) as baselines for comparison. However, the proposed method and these paradigms are orthogonal and should not be compared to each other. Therefore, comparisons with SSL and WSL are not meaningful in asserting the validity of the method.
The paper does not compare to a simple but important baseline; the blending of labels by Eq. (4) is similar to knowledge distillation because knowledge distillation also uses the prediction of a teacher model for training student models. In this sense, the paper should be the most naive baseline for the dataset created by Eq. (4) using the trained teacher model with the original datasets.
The proposed method is not a perfect data-centric approach. Although the paper introduces the proposed method as a data-centric approach, experimental results such as Table 1 confirm that there are model-dependent differences in accuracy. The data-centric approach [a] is a general term for methods that fix the model and improve the data. In this sense, the proposed method is a model-centric approach rather than data-centric.
Figure 1 is misleading because it draws the input sample $x$ as if it is being optimized, but in fact, just data augmentation is applied.
The paper emphasizes the proposed method's efficiency but ignores the cost of updating datasets. Table 1 shows that the optimal prior model for each target dataset and architecture is different, so the proposed method just adds a new hyperparameter, which does not solve the efficiency issue.
The experimental protocol is unclear and unreliable. For example, the SSL experiment in Sec. 4.5 claims to use an unlabeled dataset. However, the DeLA results in Table 4 are the same as the DeLA* results in Table 2, and thus, it is possible that they are comparing results trained on a supervised dataset with the results on the unsupervised dataset.

[a] Zha, Daochen, et al. "Data-centric artificial intelligence: A survey." arXiv preprint arXiv:2303.10158 (2023).

问题

The proposed method assumes that data extensions exist (Eq.(4)). Can this be used with modalities other than images?

审稿意见

评分: 5置信度: 32024-11-04

This paper introduced Data-Evolution Learning, which is a data-centric method for simultaneous evolution both datasets and models during the learning process. The traditional model training method replies on the datasets and might be limited to the dataset quality. The DELA method proposed an iterative framework to optimize the model progressively by using various types of data, including noisy and unlabeled datasets.

优点

This method enhanced the model training efficiency by creating refined datasets. In addition, it achieves comparable results to traditional training model by using prior knowledge and evolving datasets. This method is also performs well across different network structures and shows good results with different type of datasets.

缺点

This method partially relies on leveraging the pre-trained models, which could be restrictive for the model training without relevant prior models/datasets.

The tuning of hyper-parameters seems critical for achieving good results, which might be challenges for reproducing the results on different model structures.

There has no comparison results with other methods on the table.

问题

Since the DELA leverages model priors, does the accuracy significantly influenced if the t-1 model performs low accuracy? If so, how to keep the prior model performs well? 2. As the paper mentioned, this method is able to handle nosy labels, would extreme nosy level labels degrade the quality of evolved datasets? (such as adversarial noises). Are this method can detect and mitigate the extensive nose? 3. Does this method has trade-offs between data efficiency and model performance? If so, are there any scenarios that the static dataset might still be preferable? 4. From the Table 1, we could see the DELA method performs good results compare to the original accuracy. Are there any baseline or state-of-the arts method can be compared with? 5. The paper mentioned that the standard setting of 100 epochs for full evolution is enough. Does it related to other hyper-parameters? Such as batch size, total iteration steps, model size, FLOPs, learning rate, etc. If the dataset is super small, for example, 20 images/class, does the method still works with evolution? What's the limitations of dataset?

撤稿通知

2024-11-15

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.