PaperHub
Average rating: 6.5/10 · Poster · 4 reviewers
Scores: 6, 6, 6, 8 (min 6, max 8, std 0.9)
Average confidence: 3.5
ICLR 2024

Efficient Backpropagation with Variance Controlled Adaptive Sampling

OpenReview · PDF
Submitted: 2023-09-21 · Updated: 2024-03-08
TL;DR

A self-adaptive sampling method that reduces the compute cost of backpropagation (BP) by up to 73.87%, while matching the performance of exact training under a theoretical guarantee.

Abstract

Keywords
efficient training algorithms, stochastic gradient descent, importance sampling, variance reduction

Reviews and Discussion

Official Review
Rating: 6

This manuscript proposes a sampling algorithm to eliminate computation during forward and/or BP. More specifically, a variance-controlled adaptive sampling (VCAS) method is designed to accelerate BP by computing unbiased stochastic gradients.

The effectiveness of the VCAS is justified by pre-training and fine-tuning tasks in both vision and NLP domains. Some ablation studies are also included to discuss the effects of hyper-parameters.

Strengths

  • This manuscript is well-structured, with a clearly explained methodology section.
  • The manuscript evaluates the effectiveness of the VCAS on pre-training and fine-tuning tasks in both vision and NLP domains.
  • An ablation study is also included.

Weaknesses

  1. Limited literature review. This manuscript did not carefully examine the relevant papers, and some closely related papers were omitted from the discussion, e.g., [1, 2, 3] and the related work in [1].
  2. Limited baseline evaluations. Closely related baseline methods, e.g., [2, 3], should be considered.
  3. Additional hyper-parameters and insufficient ablation studies. The ablation study on one task, i.e., fine-tuning BERT-base on MNLI, is insufficient to justify the insensitivity of these hyper-parameters.
  4. Clarity. Some design choices in Section 5 are just given and have no explanation. For example, how to derive the provided equation from the mentioned zeroth-order algorithm?
  5. The claim on the unbiased stochastic gradient needs to be more careful and some theoretical justifications should be provided.

References

  1. SparseProp: Efficient Sparse Backpropagation for Faster Training of Neural Networks at the Edge
  2. Back Razor: Memory-Efficient Transfer Learning by Self-Sparsified Backpropagation
  3. Sparse weight activation training

Questions

NA

Comment

We'd like to thank the reviewer for acknowledging the effectiveness of our work and for the constructive criticism. Below we address the reviewer's concerns:

Q1. Limited literature review on sparsity-based training methods

Thanks for pointing out relevant works on sparse training. However, we'd like to position our work in another closely related field, online batch selection, which discards less informative data rather than pruning weights or neurons as is typical in sparse training. The main differences are twofold:

  • Wall-clock time reduction: Online batch selection methods can always translate FLOPs reduction into time reduction, since they remove entire data points cleanly. In contrast, few sparse training works report time reduction, usually because of the lack of hardware support, especially for unstructured sparsity.
  • Unbiasedness: Online batch selection usually adopts importance sampling, which guarantees unbiased gradients, whereas the pruning used in sparse training typically cannot. Using unbiased gradients is critical for SGD to converge properly (see the numerical sketch after this list).
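
A minimal numerical sketch of this unbiasedness property (illustrative only: the norm-based keep probabilities, the 0.3 budget, and the 0.1 floor are our assumptions here, not the paper's exact sampler):

```python
# Keep example i with probability p_i and rescale kept rows by 1/p_i; the
# expectation of the sampled gradient then equals the exact gradient.
import numpy as np

rng = np.random.default_rng(0)
grad = rng.normal(size=(512, 8))             # per-example gradient rows
p = np.linalg.norm(grad, axis=1)             # importance ~ gradient norm
p = np.clip(p / p.mean() * 0.3, 0.1, 1.0)    # roughly a 30% keep budget

est = np.zeros_like(grad)
trials = 5000
for _ in range(trials):
    keep = rng.random(len(p)) < p            # Bernoulli keep mask
    est += np.where(keep[:, None], grad / p[:, None], 0.0)
est /= trials

# Mean absolute error is small relative to the gradient scale: unbiased.
print(np.abs(est - grad).mean(), np.abs(grad).mean())
```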

Moreover, sparse training and online batch selection are orthogonal, since they reduce computation along different dimensions (model vs. data), so the speedups could be multiplicative.

Nevertheless, we agree with the reviewer that the paper can be improved with more discussion of sparse training methods. We have added this discussion to the Related Work section.

Q2. Limited baseline evaluations

VCAS can translate FLOPs reduction into time reduction, which is rare among sparse training methods. We compare VCAS with the official codebases of the methods mentioned by the reviewer:

For Back Razor, we tested BERT-large finetuning on MNLI. We find that although Back Razor can reduce FLOPs by up to 90%, it cannot reduce training time due to its unstructured sparsity, and the accuracy drops considerably.

Tab. Comparison of VCAS with Back Razor

| Method | Train Loss | Eval Acc (%) | Train Time (h) | FLOPs Reduction (%) | Time Reduction (%) |
| --- | --- | --- | --- | --- | --- |
| Exact | 0.1439 | 86.58 | 5.48 | - | - |
| Back Razor | 0.3827 | 84.26 | 5.19 | 90.00 | 5.29 |
| VCAS | 0.1619 | 86.63 | 4.44 | 44.17 | 19.03 |

SWAT seems closer to our work, since it conducts fine-grained pruning on weights and activations and implements a structured-sparsity version that uniformly prunes channels.

Since the official codebase of SWAT only supports CNNs, we compare SWAT with VCAS on WideResNet-18 with widen factor k=4 on CIFAR100. We find that VCAS reduces training time while preserving accuracy, whereas SWAT is even slower than exact (dense) training due to its high overhead.

Tab. Comparison of VCAS with SWAT

| Method | Train Loss | Eval Acc (%) | Train Time (h) | FLOPs Reduction (%) | Time Reduction (%) |
| --- | --- | --- | --- | --- | --- |
| Exact | 0.0070 | 73.25 | 4.32 | - | - |
| SWAT | 0.0115 | 72.01 | 4.81 | 50.00 | -11.49 |
| VCAS | 0.0070 | 74.16 | 3.81 | 19.05 | 11.72 |

Q3. Additional hyperparameters and insufficient ablations

First, we'd like to emphasize that we employ the same set of hyperparameters in all our experiments, across pretraining and finetuning and across CV and NLP, and all achieve decent FLOPs reduction and accuracy. We believe this already reflects the insensitivity. As Reviewer MWbo noted under "Strengths": "the ablation studies on a few hyperparameters are very important. I see only very few of the papers in this domain have done this before".

From the experimental perspective, we agree with the reviewer that more experiments help justify our statement, so we have repeated all ablations for BERT-base finetuning on SST-2. Please refer to Appendix A for the results.

Q4. Clarity of design choices in Section 5

Sorry for the confusion. To make our algorithm clearer, we have added an algorithm table at the end of Section 5. We hope it resolves the confusion.

How to derive the provided equation from the mentioned zeroth-order algorithm?

The zeroth-order algorithm is embarrassingly simple: since the variance increases monotonically with the drop ratio, our algorithm simply decreases the drop ratio by a constant if the variance is too large, and increases it if the variance is too small. The only learning signal used is whether the variance exceeds the desired threshold. For a more intuitive view of the update, please refer to the figures in Appendix C. A minimal sketch of this controller is given below.
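
The following sketch illustrates the update rule just described; the function name, the constant step of 0.05, and the clipping to [0, 1] are illustrative assumptions rather than the paper's exact controller:

```python
def update_drop_ratio(drop_ratio, extra_variance, tau, step=0.05):
    """One zeroth-order step: the only signal is variance vs. the threshold tau."""
    if extra_variance > tau:
        drop_ratio -= step   # too much variance: drop less, compute more
    else:
        drop_ratio += step   # variance budget unused: drop more, compute less
    return min(max(drop_ratio, 0.0), 1.0)
```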

Q5. Lack of proof on the unbiasedness of VCAS

Thank you for raising the important question of the unbiasedness of VCAS! We have added our proof in Appendix E.

Proof sketch: since our sampling is unbiased, we only need to show that unbiasedness is preserved through BP. As BP proceeds, linear operations like matmul, convolution, and layer norm obviously preserve unbiasedness. Nonlinear operations like ReLU amount to a Hadamard product with a Jacobian (mask) determined in the forward pass, and thus also preserve unbiasedness. By induction, the gradients computed by VCAS are unbiased; the key step is sketched below.
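
In simplified notation, with a hat denoting a sampled gradient and h^{(l)} the backward map of layer l (linear in its first argument once the forward pass fixes the inputs and Jacobians), the induction step reads:

```latex
\mathbb{E}\big[\widehat{\nabla}_{Z^{(l)}}\big] = \nabla_{Z^{(l)}}
\;\Longrightarrow\;
\mathbb{E}\big[\widehat{\nabla}_{Z^{(l-1)}}\big]
  = \mathbb{E}\Big[h^{(l)}\big(\widehat{\nabla}_{Z^{(l)}};\, Z^{(l-1)}, \theta^{(l)}\big)\Big]
  = h^{(l)}\big(\mathbb{E}\big[\widehat{\nabla}_{Z^{(l)}}\big];\, Z^{(l-1)}, \theta^{(l)}\big)
  = \nabla_{Z^{(l-1)}}.
```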

We sincerely hope our response resolves your concerns, and we would greatly appreciate it if you could reevaluate our work. Thank you!

Comment

The reviewer thanks the authors for providing the feedback.

The response has addressed most of the previous concerns. However, before updating the rating, the reviewer would like to invite authors to answer the following questions:

  • The paper, i.e., Variance Reduction with Sparse Gradients, http://arxiv.org/abs/2001.09623, may achieve a similar effect from a different perspective. It would be great if the authors could comment on this line of research in depth.
Comment

The paper mentioned by the reviewer (hereinafter referred to as Sparse SVRG) belongs to the field of SVRG (Stochastic Variance Reduced Gradient), introduced in 2013 by [1]. It actually cannot "achieve a similar effect with us from a different perspective", for the following reasons:

  • Sparse SVRG is far more computationally expensive than our method: it needs two forward and backward passes in each iteration and additionally computes the average gradient of a very large batch every few iterations.
  • The sparsity introduced by Sparse SVRG is unstructured, which cannot bring a wall-clock time speedup on current hardware, as discussed in our earlier reply.
  • SVRG methods, including Sparse SVRG, fail to outperform SGD and can even increase the variance in modern neural network training, as shown in [2].
  • Our method is orthogonal to Sparse SVRG: we can accelerate Sparse SVRG by replacing the SG (stochastic gradient) with the ASG (approximate stochastic gradient) computed by our method.

Below we provide a detailed comment on this line of research:

SVRG aims at reducing the variance of SGD for efficacy (i.e., lower train loss and higher eval accuracy within the same number of epochs) rather than efficiency (i.e., fewer FLOPs and less time to reach a target performance), which is our goal. In fact, SVRG is computationally much more expensive than SGD: it runs the forward and backward passes twice in every iteration to obtain two sets of gradients, with respect to the current model and a snapshot model. It also computes a full gradient every few iterations by averaging the gradients of all data in vanilla SVRG, or of a very large batch in Sparse SVRG. The standard SVRG estimator is recalled below.
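
For reference, the standard SVRG estimator of [1], with snapshot \tilde{w} and its (full-batch) gradient \tilde{\mu} = \nabla F(\tilde{w}), is:

```latex
g_t \;=\; \nabla f_{i_t}(w_t) \;-\; \nabla f_{i_t}(\tilde{w}) \;+\; \tilde{\mu},
\qquad
\mathbb{E}_{i_t}\big[g_t\big] \;=\; \nabla F(w_t),
```

which makes the two per-example gradient evaluations per step and the periodic full-gradient computation explicit.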

Although SVRG is proven to be better than SGD both theoretically and experimentally on traditional problems like logistic regression, it fails to perform well, and can even increase the variance, when training modern neural networks on harder problems like CIFAR10, as shown in [2]. To the best of our knowledge, the first paper to bring the benefit of SVRG to training neural networks at a practical scale (i.e., ViT on ImageNet) is [3], proposed only this year. So far, SVRG is not widely adopted in practice.

Sparse SVRG improves upon SVRG by introducing sparsity into the gradients to make SVRG "competitive with SGD" (cf. the Conclusion section of Sparse SVRG) in terms of gradient queries. Note that this sparsity is unstructured and cannot bring a wall-clock time reduction on current hardware, as discussed in our earlier reply.

Besides, our method should be orthogonal to SVRG methods, since we only substitute the ASG (approximate stochastic gradient), with controlled variance and reduced computation, for the SG (stochastic gradient), independently of the type of optimizer. Variance reduction algorithms such as SVRG can still work in combination with our method by replacing the SG with an ASG of similar variance.

References

[1] Johnson, Rie, and Tong Zhang. "Accelerating stochastic gradient descent using predictive variance reduction." Advances in Neural Information Processing Systems 26 (2013).

[2] Defazio, Aaron, and Léon Bottou. "On the ineffectiveness of variance reduced optimization for deep learning." Advances in Neural Information Processing Systems 32 (2019).

[3] Yin, Yida, et al. "A Coefficient Makes SVRG Effective." arXiv preprint arXiv:2311.05589 (2023).

Comment

Dear Reviewer,

We hope that our response and revision have adequately addressed your concerns. If our efforts have indeed addressed your concerns, we would be very grateful if you could reconsider our work and possibly adjust the score accordingly. If you have any additional questions or suggestions, we would be happy to have further discussions.

Best regards,

The Authors

Comment

Thank you for providing the explanation.

Though the reviewer might only partially agree with the statements above, he/she is happy to raise the score to 6.

Comment

Thank you so much for raising the score! We believe the theoretical and experimental success of VCAS can bring a new paradigm of variance-controlled sampling to backpropagation acceleration. Thank you!

Official Review
Rating: 6

This paper proposes Variance-Controlled Adaptive Sampling (VCAS), which performs an approximated stochastic gradient with an adaptive sampling rate. Based on the insight that gradients are sparse after learning has progressed to some extent, the authors improve the efficiency of learning by computing only a few selected gradients through adaptive sampling. The proposed method approximates the exact backpropagation values well in BERT and ViT training.

Strengths

VCAS performs backpropagation 20-50% faster, while following the loss curve of true full backpropagation with low variance. VCAS has an adaptive sampling rate, which allows for efficient sample selection based on learning loss and per layer. The idea is simple and highly applicable.

Weaknesses

Comparing accuracy under the same amount of FLOPs reduction makes it difficult to understand its effectiveness compared to a metric like time to reach target accuracy [1]. As a result, it is unknown how VCAS will perform under a 50% or greater reduction.

[1] Mindermann, S., Brauner, J. M., Razzak, M. T., Sharma, M., Kirsch, A., Xu, W., ... & Gal, Y. (2022, June). Prioritized training on points that are learnable, worth learning, and not yet learnt. In International Conference on Machine Learning (pp. 15630-15649). PMLR.

Questions

I would like to see more detail in Table 2 or 3. What is the relationship between Final Eval Accuracy and FLOPs reduction? For example, is the recommended FLOPs reduction ratio for VCAS around 40%?

Comment

We thank reviewer d6vC for assessing our method as simple and highly applicable and for the valuable questions. Below we respond to the questions.

Q1. Relationship between Eval Acc and FLOPs reduction.

Thank you for posing this question; it strikes at the heart of the matter. This is exactly where VCAS differs in essence from previous sampling methods.

With previous sampling methods, one typically first determines a sample ratio (FLOPs reduction ratio) r and then applies the sampling algorithm with it. However, many recent works [1, 2] reflecting on efficient training have pointed out that this pipeline is in fact unfairly trading accuracy for efficiency. The sample ratio r encodes prior knowledge about the upcoming training, telling the model what fraction of the dataset is less informative.

VCAS distinguishes itself from previous sampling methods in that it eliminates the need to determine sample ratios. With VCAS, the variance thresholds determine the proper sample ratios adaptively. To the best of our knowledge, we are the first to perform adaptive unbiased sampling during training while preserving performance under a theoretical guarantee.

Now we can answer your question: there is no direct relation between Eval Acc and FLOPs reduction in VCAS. They are connected through the variance threshold τ in the following way: FLOPs reduction consistently climbs as τ grows, since we can tolerate a more aggressive sampling scheme. As for Eval Acc, when τ is small, like 0.025 in our setting, it barely changes compared with exact training; but as τ grows, the extra sampling variance becomes non-negligible, convergence is affected, and Eval Acc drops.

Q2. More detail in Table 2 or 3.

For Table 2, now moved to Table 5, we vary the variance threshold τ. The FLOPs reduction consistently increases with larger τ. The Eval Acc stays within a 0.1% fluctuation of the exact result when τ ≤ 0.05, meaning the extra variance is no more than 10% and can be neglected, but drops considerably when τ ≥ 0.1, meaning more than 20% extra variance is introduced, which affects convergence.

For Table 3, now moved to Table 7, we vary the update frequency F. As F increases, we conduct fewer updates, so the FLOPs reduction first increases, due to the reduced overhead of the update process, and then decreases, due to an insufficient number of updates to converge to adequate sample ratios. The Eval Acc shows little difference across F, since with a small τ, such as τ = 0.025 here, the result is theoretically guaranteed to be similar to exact training.

Q3. Is the recommended FLOPs reduction ratio for VCAS around 40%? How will VCAS perform under a 50% or greater reduction?

As discussed above, the FLOPs reduction of VCAS is not tuned manually but learned during training. The FLOPs reduction ratio therefore varies among models and datasets and is learned automatically, reflecting how many FLOPs can be "safely" saved under a given variance budget.

Indeed, we can still trade accuracy for efficiency with a much larger variance in VCAS. For example, we tested BERT-large finetuning on MNLI with a large τ = 0.5 that yields a 50% FLOPs reduction, which actually means introducing 100% extra variance. For comparison, we also tested SB and UB with a FLOPs reduction of 50%. The results are as follows:

Table. Different methods with a FLOPs reduction of 50% (* indicates divergence)

| Method | Train Loss | Eval Acc (%) | FLOPs Reduction (%) |
| --- | --- | --- | --- |
| exact | 0.1439 | 86.58 | - |
| SB* | 1.099 | 32.74 | 50.00 |
| UB* | 1.104 | 32.74 | 50.00 |
| VCAS | 0.2633 | 86.01 | 50.05 |

We can see that the performance of VCAS drops noticeably compared to exact training, which is in fact predicted by the large τ value. However, VCAS still performs much better than baselines like SB and UB, both of which fail to converge under this setting, owing to its fine-grained sampling and adaptive sample ratios across different training phases.

Q4. Effectiveness of time to reach target accuracy.

We list the number of steps to reach target accuracy for BERT-base finetuning on MNLI below (exact training reaches a final Eval Acc of 84.33 after 36816 total steps):

Table. Number of steps to reach target accuracy with different methods

| Target Acc (%) | exact | SB | UB | VCAS |
| --- | --- | --- | --- | --- |
| 75 | 1.5k | 3.5k | 1.5k | 1.5k |
| 80 | 4.5k | 9.5k | 5.5k | 4.5k |
| 84 | 20k | not reached | not reached | 19k |

We find that VCAS behaves similarly to exact training and much better than SB and UB.

For more insight into the effectiveness of VCAS, please refer to Fig. 6b, where we present the Eval Acc curves of these methods.

References

[1] "The Efficiency Misnomer." ICLR, 2021.

[2] "No Train No Gain: Revisiting Efficient Training Algorithms for Transformer-based Language Models." arXiv, 2023.

Comment

Dear Reviewer,

We hope that our response and revision have adequately addressed your concerns. If our efforts have indeed addressed your concerns, we would be very grateful if you could reconsider our work and possibly adjust the score accordingly. If you have any additional questions or suggestions, we would be happy to have further discussions.

Best regards,

The Authors

Official Review
Rating: 6

This work introduces a variance-controlled adaptive sampling (VCAS) method for accelerating the back-propagation of deep neural network training. VCAS computes unbiased, variance controlled gradients for both activations and network weights. By sampling both data samples and tokens in each datum in a layer-wise, fine-grained manner, VCAS can drastically reduce the computation in the back-propagation process without introducing too much variance overhead. With the similar FLOPs reduction, VCAS better optimizes the target model compared with prior loss-based and gradient-based sampling methods.

优点

  • This work introduces a fine-grained strategy that 1) increasingly removes data samples when back-propagating to the input layer, and 2) samples tokens in each data sample when computing gradients for network weights. This fine-grained strategy allows high FLOPs reduction with controlled variance.

  • The sampling ratios are adaptively adjusted according to the estimated variance. As training progresses, the network may require changing sampling ratios, and the adaptive sampling ratios can better satisfy this need.

  • Compared with prior methods, the proposed method, VCAS, can better simulate exact back-propagation, leading to better optimized loss and better evaluation performance.

Weaknesses

  • Training time reduction: When comparing with baselines, this work uses FLOPs as the efficiency indicator. However, FLOP reduction may not directly translate into wall-clock time reduction due to various factors like parallel computation efficiency. It is suggested to also list the wall-clock training time of each method for a more straightforward comparison.

  • Insights on sampling ratio updates: In Section 7 this work has discussed the design choices that determine the sampling ratio updates. For better comprehension, it may be useful to include a figure that shows how s and ν_l change as training progresses.

  • Figure/Table clarity: Figure 2 seems to lack some more detailed explanation. In Table 1, it is not clear which numbers should be bolded. For example, for ViT-base fine-tuning on CIFAR-100, UB seems to be highlighted for the highest eval accuracy, but for ViT-large fine-tuning on CIFAR-100, UB seems to be highlighted for the lowest train loss? Also for Table 1, how significant is the relative improvement over baselines?

  • Limitations: It is suggested to include some detailed discussion on the limitations (e.g., applicable model architectures, base optimizer, dataset) of the proposed method. In this paper, only Transformer-based architectures and Adam-like optimization algorithms are tested. It is not clear whether we can extrapolate the conclusion to other settings.

Questions

  • It is not directly clear to me whether the weight gradient sampling of VCAS is applicable to convolution neural networks (CNN). In principle, convolution is yet another linear operator, but I’m not sure how to perform this sampling in a convolutional layer. Similarly, can VCAS be applied when optimizing a recurrent neural network (RNN)?
Comment

We thank reviewer Dd1x for acknowledging the effectiveness of our work and for the insightful questions. Below we respond to the questions. We would highly appreciate it if the reviewer agrees with our responses and considers raising the score. Thank you so much!

Q1. Training time reduction.

Thanks for the suggestion. To compare wall-clock training time, we choose two tasks with the longest training time: BERT-large finetuning on MNLI and ViT-large finetuning on ImageNet. The results are as follows:

Tab. BERT-large finetuning on MNLI

| Method | Train Loss | Eval Acc (%) | Train Time (h) | FLOPs Reduction (%) | Time Reduction (%) |
| --- | --- | --- | --- | --- | --- |
| exact | 0.1439 | 86.58 | 5.478 | - | - |
| SB | 0.2492 | 85.18 | 4.320 | 44.44 | 21.14 |
| UB | 0.2266 | 86.09 | 4.266 | 44.44 | 22.12 |
| VCAS | 0.1619 | 86.63 | 4.437 | 44.17 | 19.00 |

Tab. ViT-large finetuning on ImageNet

| Method | Train Loss | Eval Acc (%) | Train Time (h) | FLOPs Reduction (%) | Time Reduction (%) |
| --- | --- | --- | --- | --- | --- |
| exact | 0.4135 | 82.04 | 52.29 | - | - |
| SB | 0.4637 | 82.21 | 42.56 | 44.44 | 18.61 |
| UB | 0.4242 | 82.21 | 41.92 | 44.44 | 19.83 |
| VCAS | 0.4228 | 82.27 | 41.28 | 49.58 | 21.06 |

We find that VCAS translates FLOPs reduction into wall-clock time reduction as effectively as simpler online batch sampling methods like UB and SB, which drop part of the data all at once, while matching the performance of exact training under a theoretical guarantee.

Our current implementation of VCAS is purely based on Python and PyTorch. The gap between FLOPs reduction and time reduction can be further narrowed by optimizing the implementation with CUDA or Triton.

We have added the wall-clock time results to Section 6.2, thanks again for the suggestion!

Q2. Insights on sampling ratio updates.

Thanks for the constructive suggestion. We've added several figures in Appendix B recording the update process of our sample ratios. For all variance thresholds τ from 0.01 to 0.5, the gradient-norm preservation ratio s first drops rapidly and then slowly. The activation ratios ρ_l gradually decrease during training. The weight ratios ν_l first drop and then fluctuate around values that differ across layers l.

Q3. Figure/Table clarity.

Sorry for the confusion. We have now added more explanation to Fig. 2. The bold formatting in Tab. 1 originally highlighted the highest Eval Acc / (1 − FLOPs reduction), a balance between performance and efficiency. We have now changed it to indicate the best value of each metric for better clarity. In Tab. 1 we have also underlined Eval Acc values that are within 0.1% of exact training for a more intuitive view.

Q4. For Table 1, how significant is the relative improvement over baselines?

As shown in Tab. 1, the FLOPs improvement over UB and SB is relatively marginal under our conservative variance tolerance ratio τ = 0.025. But the accuracy improvement is significant: the train loss and eval accuracy of VCAS are much closer to those of exact training.

However, it should be noted that the usage of VCAS differs from existing sampling baselines like UB and SB. For UB and SB, one needs to carefully preset a sample ratio r and hope for a decent outcome. With VCAS, one just sets a variance tolerance ratio τ far less than 1, and it then adaptively learns the proper fine-grained sample ratios, keeping the training curve close to that of exact training. Therefore, two further important improvements are that VCAS is tuning-free with respect to sample ratios and that its FLOPs reduction is "safer", coming with a theoretical guarantee, compared to other baselines.

Q5. Limitations should be discussed.

Thanks for the constructive advice. We've added a detailed discussion of the limitations of VCAS in Appendix G.

Regarding limitations on applicable model architectures, base optimizers, and datasets: theoretically, there are no such limitations, but the performance of VCAS is closely related to the "redundancy" of the architecture and dataset. Nevertheless, in low-redundancy scenarios VCAS can learn to keep all data to control the variance (i.e., fall back to exact training), so accuracy will not be harmed, whereas other online batch sampling methods are likely to fail.

Comment

Q6. Whether can VCAS apply to other architectures?

Yes. We applied VCAS to train WideResNet-18 on ImageNet with momentum SGD, i.e., we generalized it beyond Transformers and beyond Adam. For CNNs, the weight-gradient sampling part is not applicable, since the "tokens" (i.e., "pixels") in CNNs are not processed individually but are tightly coupled. Therefore, we only apply the activation-gradient sampling and adaptation of VCAS; a minimal sketch of such a sampler is given below.
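
A minimal PyTorch sketch of an activation-gradient sampler (illustrative only, not the authors' implementation: the class name and the single fixed keep_prob are assumptions, whereas VCAS sets the ratios adaptively under its variance control):

```python
import torch

class SampleActGrad(torch.autograd.Function):
    """Identity in the forward pass; in the backward pass, keep each sample's
    activation gradient with probability keep_prob and rescale by 1/keep_prob,
    so the gradient stays unbiased in expectation."""

    @staticmethod
    def forward(ctx, x, keep_prob):
        ctx.keep_prob = keep_prob
        return x.view_as(x)          # identity; view avoids aliasing the input

    @staticmethod
    def backward(ctx, grad_out):
        p = ctx.keep_prob
        keep = (torch.rand(grad_out.shape[0], device=grad_out.device) < p).float()
        scale = (keep / p).view(-1, *([1] * (grad_out.dim() - 1)))
        return grad_out * scale, None    # E[scale] = 1, so unbiased

# Usage: insert between blocks so fewer samples' gradients flow further back,
# e.g. h = SampleActGrad.apply(conv_block(x), 0.5)
```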

We conducted an experiment using 8×3090Ti GPUs to pretrain WideResNet-18 with widen factor k=4 on ImageNet, using SGD with momentum 0.9. The results are as follows:

Table. VCAS performance for training WideResNet-18 on ImageNet

| Method | Train Loss | Eval Acc (%) | Train Time (h) | FLOPs Reduction (%) | Time Reduction (%) |
| --- | --- | --- | --- | --- | --- |
| exact | 1.474 | 75.96 | 21.31 | - | - |
| VCAS | 1.479 | 75.86 | 20.20 | 17.47 | 5.21 |

From the table we can see that VCAS is also capable of accelerating CNNs and works with other optimizers such as momentum SGD. Besides, the multi-GPU setting demonstrates the parallelizability of VCAS.

The results above have been added to Appendix C.

Comment

The authors' detailed responses have addressed most of my previous concerns. Thank you for providing the helpful explanation and additional analysis. I'm now leaning towards accepting this work.

Comment

Thank you so much for raising the score! We believe the theoretical and experimental success of VCAS can bring a new paradigm of variance-controlled sampling to backpropagation acceleration. Thank you!

Official Review
Rating: 8

The authors propose a sampling method for backpropagation with controlled variance and self-adaptive sample ratios, named VCAS. It computes an approximate stochastic gradient by applying fine-grained sampling to gradually remove samples and tokens during backpropagation. VCAS has similar variance and accuracy to exact backpropagation, while appearing to reduce the training cost significantly.

Strengths

  1. I like it very much that the authors test the proposed algorithm on multiple fine-tuning and pre-training tasks in both vision and natural language domains. Typically, papers in this domain would use small scale datasets such as MNIST or CIFAR 10/100.

  2. The ablation studies on a few hyperparameters are very important. I see only very few of the papers in this domain have done this before.

Weaknesses

I think the authors can improve the paper in the following ways:

  1. I believe adding an algorithm table with detailed steps would make the paper more clear.

  2. The authors report the Final Train Loss / Final Eval Acc.(%) / FLOPs reduction ratio(%). However, I'd like to know the actual reduction in training time as these sampling methods might introduce overhead in computation. It would be helpful if the authors can report a time table for training on these datasets.

To be frank, I feel not many papers actually do this but it can be interesting to see that the actual training time might not be reduced at all, or at least not much as expected given a certain sampling ratio.

  3. The paper lacks discussion of related papers. For example, https://arxiv.org/pdf/2104.13114.pdf also considers the importance sampling problem by sampling data points proportionally to the loss, instead of to the norm of the gradient.

For another example, https://arxiv.org/pdf/2306.10728.pdf also proposes adaptively sampling methods for dynamically selecting data points for mini-batch. I'd love to see the authors discussed more about these papers.

  4. Can the authors be more specific in terms of the notation? Adding a table of notation would be very helpful. For example, what is h^{(l)} below:

$$\nabla_{Z^{(l-1)}} = h^{(l)}\!\left(\nabla_{Z^{(l)}};\, Z^{(l-1)}, \theta^{(l)}\right)$$

Questions

  1. Although the proposed VCAS algorithm seems promising compared with SB and UB, I'd like to know the actual reduction in training time as these sampling methods might introduce overhead in computation. It would be helpful if the authors can report a time table for training on these datasets.
Comment

We thank reviewer MWbo for acknowledging the effectiveness of our work and for the constructive questions. Below we respond to the questions.

Q1. Add an algorithm table.

Thank you very much for the constructive advice. We have now added an algorithm table as Algorithm 1 at the end of Section 5. Thank you!

Q2. Report wall-clock time reduction.

Thanks for your insightful advice. As you noted, not many papers report wall-clock time reduction, mainly because wall-clock time is highly dependent on hardware and implementation. FLOPs reduction, however, is a high-level metric that directly reflects the power of the algorithm itself. Each metric has its own pros and cons.

That said, we agree that wall-clock time is indispensable, since it is what we ultimately want to reduce, and FLOPs reduction cannot always be translated into an equal wall-clock time reduction, not only because of the overhead introduced, but also due to parallelizability, hardware friendliness, and compatibility with many other minor but crucial parts of the training procedure.

For this metric we choose the two tasks with the longest training time, which are the most worth accelerating: BERT-large finetuning on MNLI and ViT-large finetuning on ImageNet. The results are as follows:

Tab. BERT-large finetuning on MNLI:

| Method | Train Loss | Eval Acc (%) | Train Time (h) | FLOPs Reduction (%) | Time Reduction (%) |
| --- | --- | --- | --- | --- | --- |
| exact | 0.1439 | 86.58 | 5.478 | - | - |
| SB | 0.2492 | 85.18 | 4.320 | 44.44 | 21.14 |
| UB | 0.2266 | 86.09 | 4.266 | 44.44 | 22.12 |
| VCAS | 0.1619 | 86.63 | 4.437 | 44.17 | 19.00 |

Tab. ViT-large finetuning on ImageNet:

| Method | Train Loss | Eval Acc (%) | Train Time (h) | FLOPs Reduction (%) | Time Reduction (%) |
| --- | --- | --- | --- | --- | --- |
| exact | 0.4135 | 82.04 | 52.29 | - | - |
| SB | 0.4637 | 82.21 | 42.56 | 44.44 | 18.61 |
| UB | 0.4242 | 82.21 | 41.92 | 44.44 | 19.83 |
| VCAS | 0.4228 | 82.27 | 41.28 | 49.58 | 21.06 |

From these tables, we find that VCAS translates FLOPs reduction into wall-clock time reduction as effectively as simpler online batch sampling methods like UB and SB, which drop part of the data all at once, while matching the performance of exact training under a theoretical guarantee.

Our current implementation of VCAS is purely based on Python and PyTorch. The gap between FLOPs reduction and time reduction can be further narrowed by optimizing the implementation with CUDA or Triton.

We have added the wall-clock time results to Section 6.2, thanks again for the suggestion!

Q3. Lack of discussion of related papers.

Thanks for your advice! We have added more discussion of these papers, as well as many others, to Section 2 (Related Work).

Overall, to the best of our knowledge, we address the sampling problem in a new way. VCAS differs from all these sampling algorithms in that it does not need a carefully tuned sample ratio and is always guaranteed to produce a result similar to exact training.

Specifically, for the two methods mentioned by the reviewer: the loss-based method OBFTF samples data by minimizing the distance between the average losses of the whole batch and the sampled batch. Its theoretical guarantee and stability are limited, since it is sensitive to the absolute value of the losses. It is also sensitive to batch size, as it solves a combinatorial optimization problem that explodes when the batch size is large.

The adaptive method AdaSelection is an ensemble of SB, UB, uniform sampling, and other methods, adaptively assigning weights among them. It differs from VCAS in what is adapted: VCAS adaptively finds the proper sample ratios, while AdaSelection still needs a preset sample ratio and merely adapts the weights of each method under that ratio. Thus VCAS is more tuning-free, more effective thanks to its finer sampling granularity, and comes with a stronger theoretical guarantee.

Q4. Clarity of notations.

Sorry for the confusion. h^{(l)} denotes the function that computes the input gradient from the output gradient, the input, and the weight. For example, for linear layers, the input gradient equals the output gradient multiplied by the weight. Similarly, g^{(l)} computes the weight gradient. A concrete instance is given below.
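
Concretely, for a linear layer (row-major convention Z^{(l)} = Z^{(l-1)} θ^{(l)} assumed here for illustration), these maps are:

```latex
\nabla_{Z^{(l-1)}}
  = h^{(l)}\!\big(\nabla_{Z^{(l)}};\, Z^{(l-1)}, \theta^{(l)}\big)
  = \nabla_{Z^{(l)}}\,\theta^{(l)\top},
\qquad
\nabla_{\theta^{(l)}}
  = g^{(l)}\!\big(\nabla_{Z^{(l)}};\, Z^{(l-1)}, \theta^{(l)}\big)
  = Z^{(l-1)\top}\,\nabla_{Z^{(l)}}.
```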

We have now added more explanations of every notation in Sections 4 and 5. We hope this resolves the confusion!

Comment

Dear Reviewer,

We hope that our response and revision have adequately addressed your concerns. If our efforts have indeed addressed your concerns, we would be very grateful if you could reconsider our work and possibly adjust the score accordingly. If you have any additional questions or suggestions, we would be happy to have further discussions.

Best regards,

The Authors

Comment

Thank you for your prompt and detailed responses! The authors addressed my questions well, especially the training time reduction part!

I raise my score.

Comment

Thank you so much for raising the score! We believe the theoretical and experimental success of VCAS can bring a new paradigm of variance-controlled sampling to backpropagation acceleration. Thank you!

AC Meta-Review

In extensive experiments, the paper shows that further approximations of the stochastic gradient can significantly speed up training of deep nets without a loss in accuracy. Specifically, a method called 'Variance Controlled Adaptive Sampling' (VCAS) is proposed which subsamples the weight- and activation-gradients appearing in backpropagation.

The main strength of the paper is the extensive set of experiments running vision transformers and language models on large datasets. These go beyond the usual CIFAR-10/CIFAR-100 datasets used in related works, and demonstrating improved performance at this scale is a significant contribution and of interest to the community.

The main weaknesses (as pointed out by the reviewers) are the clarity (figures, tables, reasoning behind the design-choices) and the literature review / comparison to baseline methods. The proposed method appears to work best for training large transformers on large datasets (which is highly relevant), but for example the ImageNet results are perhaps less convincing.

Overall, the paper makes a solid contribution to improving the training speed of large neural networks. I recommend to accept this paper as a poster.

Why not a higher score

The paper builds mostly on well-known techniques and is in some sense a bit incremental. It is unclear to me at this stage whether the paper can serve as a basis for future innovations or provide deep insights into the training speed of deep neural networks. For an oral or spotlight recommendation, I would have liked to see some of these.

Why not a lower score

The paper is a solid work for improving the training speed of neural networks, and will be of use for the community, especially if the code is released. All reviewers recommended acceptance, and acknowledged the strength of the experiments. Most concerns of the reviewers were addressed in the rebuttal.

Final Decision

Accept (poster)