PaperHub
Score: 5.0/10
Decision: Rejected (3 reviewers)
Ratings: 5, 5, 5 (min 5, max 5, std 0.0)
Confidence: 3.0
ICLR 2024

Swift Sampler: Efficient Learning of Sampler by 10 parameters

Submitted: 2023-09-20 · Updated: 2024-03-26

Abstract

Keywords
Reinforcement learning, AutoML

Reviews and Discussion

Official Review
Rating: 5

The paper introduces an efficient automatic sampler search algorithm, SS, designed to optimize data selection for deep learning model training. While previous methods relied on heuristic guidelines or labor-intensive trials, SS offers a streamlined approach by mapping samplers to a reduced set of hyper-parameters. It quickly gauges sampler quality using an approximated local minimum, making it computationally economical and ideal for vast datasets. Extensive testing shows SS-enhanced sampling delivers notable performance gains, such as a 1.5% improvement on ImageNet, and offers transferability across different neural network architectures.
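The summary describes SS as mapping a sampler to a small set of hyper-parameters (the title suggests around ten) and sampling training data accordingly. As a rough, self-contained sketch of that idea, not the paper's actual formulation, the following parameterizes a sampling distribution over examples with a 10-parameter function of loss rank; the rank-based parameterization and the specific knot interpolation are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sampler_probs(losses, theta):
    """Map per-example losses to sampling probabilities via a small
    parameterized function (illustrative assumption: piecewise-linear
    weights over normalized loss ranks, controlled by ~10 parameters)."""
    # Rank-normalize losses to [0, 1] so the sampler is scale-free.
    ranks = np.argsort(np.argsort(losses)) / (len(losses) - 1)
    # theta gives the (non-negative) sampling weight at each knot.
    knots = np.linspace(0.0, 1.0, len(theta))
    weights = np.interp(ranks, knots, np.maximum(theta, 1e-8))
    return weights / weights.sum()

# A toy dataset: 1000 examples with synthetic per-example "losses".
losses = rng.exponential(scale=1.0, size=1000)

# A 10-parameter sampler that up-weights mid-loss examples.
theta = np.array([0.1, 0.3, 0.6, 1.0, 1.0, 1.0, 1.0, 0.6, 0.3, 0.1])
p = sampler_probs(losses, theta)

# Draw a mini-batch according to the learned sampling distribution.
batch = rng.choice(len(losses), size=64, replace=False, p=p)
```

Because the whole sampler is just `theta`, a black-box optimizer such as Bayesian optimization can search this 10-dimensional space directly, which is the efficiency argument the summary makes.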

Strengths

  1. Data sampling strategy is indeed an important topic and could benefit the community.
  2. The paper is well written and there are rigorous motivations for the choice of the SS method and the algorithm is well presented.

Weaknesses

Despite the good presentation and mathematical formulations, the results of the proposed method are not convincing enough. The reported baseline methods seem to be significantly lower than the common baselines. For example, MobileNetV2 is known to be able to achieve 72% - 74% top-1 accuracy on the ImageNet-1K dataset. The reported value in the paper is 70.4%. In fact, even the improved results of MBV2 with SS is still significantly lower than the common baseline. The same trends hold for all the models reported in the paper. It would be more convincing if the authors could verify the effectiveness of the proposed method on baseline models with "normal" training.

Questions

Please refer to the weakness part.

Comment

We really appreciate your valuable comments, and we sincerely apologize for the late reply, which was due to the time required to conduct the necessary experiments.


Reproduction of the common baseline.

Thanks for your good advice on further improving our work. The gap arises because we used the official training script of ResNet for all experiments, without additional tricks. Also, the baseline performance of ResNet and SE-ResNet in our paper is not worse than that reported in their original papers. To make our results more convincing, we re-ran our MobileNetV2 experiment with the officially released code on both MobileNetV2 and MobileNetV2(1.4) with the expansion ratio 1.4. We also tried SS with the recent Swin-B model on its officially released code. The results below verify that SS does have the ability to improve performance on top of the common baselines:

Model      MobileNetV2   MobileNetV2(1.4)   Swin-B
Baseline   72.0          74.7               83.5
SS         73.4          75.9               84.3
Comment

I thank the authors for their newly added experiments. However, I still have concerns about the accuracy of the baseline methods. The improvements of the proposed methods are obvious in the reproduced model results. However, it is important to see if the proposed methods can be combined with other training tricks to make improvements under stronger training settings. Thus, I decided to keep my original score.

Comment

We sincerely thank you for your insightful comments and appreciate your constructive feedback. We are supplementing the experiment you suggested and hope that our ongoing efforts and the upcoming additional results will address your concerns satisfactorily.

Comment

Thank you for your patience while we supplement our experiments.

To evaluate the performance of SS in a strong setting with advanced tricks, we apply SS to ResNet-50 along with the set of tricks proposed in [1]. The table below verifies that SS can further boost performance even on top of strong tricks.

Model      ResNet-50
Baseline   79.4
SS         80.2

[1]. He, Tong, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. "Bag of tricks for image classification with convolutional neural networks." In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 558-567. 2019.

Official Review
Rating: 5

The purpose of this paper is to create a sampler that can assign appropriate sampling probabilities to training data in order to improve performance. Unlike previous approaches that relied on heuristic rules or expensive learning methods, this paper proposes an automatic and efficient sampler search algorithm called SS. Specifically, SS employs a new formulation to map a sampler to a lower-dimensional space of hyper-parameters and uses an approximated local minimum to quickly evaluate the quality of a sampler. SS can be applied to large-scale datasets with high efficiency and leads to performance gains on various datasets, e.g., CIFAR10, CIFAR100, ImageNet-1k, and YTF.

Strengths

  1. The motivation of this paper is clearly illustrated, and it is convincing. How to efficiently and effectively search for a proper data sampling policy is important.

  2. The solution is reasonable. The proposed inner loop and outer loop pipeline with Dimension Reduction, Smooth the Objective Function, and Local Minima Approximation designs are innovative.

Weaknesses

  1. The outer loop searches for the sampler that has the best score on the validation set, and it seems the final performance is also reported on the validation set. Could this cause the model to overfit the validation set? The paper should also report performance on test sets.

  2. The experimental results of the proposed SS are not impressive. The performance of the reported models is far behind SOTA methods, e.g., Swin Transformer and ConvNeXt. SS should be tried on SOTA methods to make the performance gains more convincing.

  3. More recently proposed works, e.g., automated loss function search (Li et al., 2019a) and augmentation policy search methods (Lin et al., 2019; Tian et al., 2020), should be compared in performance.

Questions

See weakness.

Comment

We really appreciate your valuable comments, and we sincerely apologize for the late reply, which was due to the time required to conduct the necessary experiments.


Overfitting on validation set.

We are sorry for not clarifying the experimental details. In our ImageNet and MS1M experiments, the outer-loop search is performed on the validation set, which is 10% and 15% of the training set of ImageNet and MS1M, respectively. The final performance is reported on the test set.


Experiments with SOTA model.

Thanks for your good advice. To better demonstrate the gain of SS, we tried it on the Swin Transformer. Due to computational cost limitations, we perform SS on Swin-T (4.5 GFLOPS) and evaluate the searched sampler on Swin-B (15.4 GFLOPS). The result is:

Model          Swin-T      Swin-B
Baseline       81.2/95.5   83.5/96.5
SS on Swin-T   82.1/95.6   84.3/96.5

This verifies that the gain of SS is not only convincing but also generalizes across different families of network architectures.


Important related works

Thanks for pointing out the important related works on auto augmentation and auto loss function search. In our opinion, these works and our SS are orthogonal components of AutoML and should be compatible with each other. To verify this, we empirically show that SS can further boost performance on top of the auto augmentation method of Lin et al. (2019), with the results listed below.

Model           ResNet-50   SRNext-101
AutoAug         78.9/94.3   80.7/95.4
SS w/ AutoAug   79.7/94.4   81.4/95.4
Comment

I thank the authors for their response. I have carefully read the rebuttal and the other reviews; the results on stronger baselines make sense, but the improvement is somewhat limited. I agree with the other reviewers that more experimental results are required to make the submission more solid. I will keep my initial rating.

Official Review
Rating: 5

The authors propose an automatic swift sampler search algorithm to explore automatically learning effective samplers. They examine the quality of a sampler at low computational expense by mapping it to a low-dimensional space of hyper-parameters and using an approximated local minimum.

Strengths

  1. The paper introduces a novel formulation that maps a sampler to a low-dimensional space of hyper-parameters.
  2. The paper uses an approximated local minimum to quickly examine the quality of a sampler rather than training from scratch, which is efficient and rarely done in previous works.
  3. The paper designs a transform function to smooth the objective function of the sampler search problem and uses Bayesian optimization as the agent for the search process.
  4. The paper demonstrates the transferability of the learned samplers.

Weaknesses

The paper does not provide enough theoretical analysis or justification for its proposed formulation, transform function, and approximation method.

Can the authors provide a more profound justification?

The paper does not explain why its method can generalize to different tasks and datasets, nor does it provide any experiment comparing with other sampler search methods in terms of complexity and scalability.

Can the authors complete the relevant comparative experiments?

The paper does not conduct ablation studies or sensitivity analysis to show the impact of different components or hyper-parameters of its method.

Can the authors explain the effectiveness of each component separately?

The paper does not report the standard deviation or confidence interval of its experimental results, which makes it hard to assess the statistical significance and robustness of its method.

Questions

see above

Comment

We really appreciate your valuable comments, and we sincerely apologize for the late reply, which was due to the time required to conduct the necessary experiments.


Theoretical justification of the proposed components.

In this paper, we focus on verifying that a family of samplers with a simple form can be combined well with Bayesian optimization to achieve general improvements on various datasets and tasks. Unfortunately, when considering theoretical justification, we find that there are nearly no mathematical tools that relate neural network performance to the sampling function, due to the high complexity of neural networks. In the early stage of this work, we obtained several theoretical conclusions under the assumption of linear regression on a noisily labeled dataset, but we found it hard to extend them to neural networks and more complex tasks. We really hope that stronger mathematical tools can be developed to support our research.


Complexity and scalability.

Compared with the baselines in this paper, the proposed method SS has a natural advantage in scalability. As verified in Table 2, SS can search with a small network and obtain a once-for-all sampler, which means it can be reused in all further training as well as for other network architectures. As for complexity, in the experiments of this paper, the GPU-time cost of SS is 2.3~3.1 times that of training a network from scratch, which is on the same level as the baseline methods.


The impact of different components.

We design ablation experiments in Figure 3 to validate the proposed transform function cgf against cdf, and the choice of Bayesian optimization against a reinforcement learning method. The results reveal that the designed components have clear performance margins over the comparative baselines. For the hyper-parameters of Bayesian optimization, we use the most common parameter combination, which is known to work well in most optimization tasks.
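The ablation above concerns the outer search loop, where Bayesian optimization proposes candidate sampler parameters and a cheap evaluation scores them. As a rough, self-contained sketch of such an outer loop (not the paper's implementation), the following runs a toy Gaussian-process surrogate with an expected-improvement acquisition over a 10-dimensional parameter vector; the quadratic `score` function is a stand-in for the expensive approximated-local-minimum evaluation, and the kernel length scale and candidate counts are arbitrary assumptions:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
DIM = 10  # the ~10 sampler hyper-parameters being searched

def score(theta):
    """Stand-in for the expensive sampler-quality signal; the real
    objective would come from the approximated local minimum."""
    return -np.sum((theta - 0.6) ** 2)

def gp_posterior(X, y, Xq, length=0.3, noise=1e-6):
    """Posterior mean/variance of a zero-mean GP with an RBF kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * length ** 2))
    Kinv = np.linalg.inv(k(X, X) + noise * np.eye(len(X)))
    Ks = k(Xq, X)
    mu = Ks @ Kinv @ y
    var = 1.0 - np.einsum('ij,jk,ik->i', Ks, Kinv, Ks)
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, best):
    """EI acquisition: how much a candidate is expected to beat `best`."""
    sd = np.sqrt(var)
    z = (mu - best) / sd
    cdf = 0.5 * (1 + np.vectorize(math.erf)(z / math.sqrt(2)))
    pdf = np.exp(-z ** 2 / 2) / math.sqrt(2 * math.pi)
    return (mu - best) * cdf + sd * pdf

# A few random initial evaluations, then EI-guided proposals.
X = rng.uniform(size=(5, DIM))
y = np.array([score(x) for x in X])
for _ in range(20):
    cand = rng.uniform(size=(256, DIM))
    mu, var = gp_posterior(X, y, cand)
    nxt = cand[np.argmax(expected_improvement(mu, var, y.max()))]
    X = np.vstack([X, nxt])
    y = np.append(y, score(nxt))
```

After the loop, `X[np.argmax(y)]` is the best sampler parameter vector found; in practice one would use a tuned BO library rather than this minimal surrogate.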


Standard deviation.

Thanks for your valuable advice; we will add the standard deviation of all experiments in the final version. We omitted it in the paper because the standard deviation is quite small (0.04~0.10 over 5 runs). We are sincerely sorry for not being rigorous enough.

Comment

Dear authors, thanks for the rebuttal. My concerns have been partially addressed. I strongly suggest that the authors provide more analysis, experiments, or visualizations of the method, so that readers would be more convinced by this work. I keep my initial score.