PaperHub
Overall rating: 5.8 / 10 · Decision: Rejected · 4 reviewers
Individual ratings: 5, 6, 6, 6 (min 5, max 6, std 0.4)
Average confidence: 3.3
ICLR 2024

Efficient Stagewise Pretraining via Progressive Subnetworks

OpenReview · PDF
Submitted: 2023-09-23 · Updated: 2024-02-11
TL;DR

We propose progressive subnetwork training for efficient pre-training of large language models. Improves downstream performance compared to existing stage-wise pre-training methods.

Abstract

Keywords
Stage-wise training · Efficient pre-training · Implicit benefits on downstream performance

Reviews and Discussion

Official Review
Rating: 5

This article proposes a new form of stagewise training, called Random Path Training (RAPTR). Rather than increasing the number of trained layers by stacking them, as in Progressive or Gradual Stacking, the authors propose to randomly sample a subnetwork of the full network during training and progressively increase the subnetwork size (while scaling outputs appropriately), which allows evaluation during training. They show empirical improvement over other stagewise training techniques at the same FLOPs, as well as a theoretical study of how the loss changes across stage transitions.
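For readers unfamiliar with the procedure, a minimal sketch of what such a forward pass might look like is given below. This is our own illustration, not the authors' code: the function name, the toy blocks, and the particular rescaling rule (`L / path_len`) are assumptions, and the paper's exact scaling and sampling details may differ.

```python
import random

def raptr_forward(x, blocks, path_len, scale=None):
    """Apply a randomly sampled 'path' of path_len residual blocks out of len(blocks)."""
    L = len(blocks)
    fixed = {0, L - 1}                        # first and last layers are always active
    candidates = [i for i in range(L) if i not in fixed]
    n_extra = max(path_len - len(fixed), 0)
    active = set(random.sample(candidates, n_extra)) | fixed
    if scale is None:
        scale = L / path_len                  # one plausible rescaling; the paper's exact rule may differ
    for i, block in enumerate(blocks):
        if i in active:
            x = x + scale * block(x)          # residual update on the sampled path
        # skipped blocks act as the identity (their residual branch is dropped)
    return x

# toy usage: 12 "blocks" that are simple scalar maps
blocks = [lambda h, a=0.01 * i: a * h for i in range(12)]
print(raptr_forward(1.0, blocks, path_len=8))
```

In a full RAPTR run, `path_len` would then be increased stage by stage (e.g., 6-8-10-12 for a 12-layer BERT-base, as discussed later in this review).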

[Edit: following my comment, I am increasing my score from a 3 to a 5 for the various improvements and clarifications provided by the authors, and may increase further to a 6 following discussions with the other reviewers.]

Strengths

The article presents a simple and effective stagewise training paradigm, using layer dropping at the start of training. The experiments are clear and show the effectiveness of the method for a given FLOPs budget. The newly proposed rescaling of the outputs seems sound and improves the results. The theoretical analysis of the stability of the network seems interesting.

Weaknesses

Improvement: The experimental results do not show a consistent improvement of the method over the other stacking methods. Tables 1 and 2 show very limited improvement in some cases for the same FLOPs, with schedules that appear to have been chosen by unclear means.

The results in Table 3 are also surprising. RaPTr does show an improvement on Tydi QA and SQuADv2 over Stacking, but not for the other metrics. The alternative schedule is not detailed and merits more introduction. It is impossible to compare Stacking with RaPTr as long as Stacking has not been trained with this alternative schedule, or a similar one. (This could be done by training the entire model for 30K steps, then training only the first layers, and restacking the previously trained layers to extend the model, rather than stacking the same layer.) Otherwise, the improvement due to RaPTr seems relatively small without this schedule. The small variance claimed by the authors is only demonstrated with the alternative schedule and not for RaPTr in general.

Notations: Throughout the article, the notation for the RaPTr schedule is very inconsistent and unclear. The schedule $T^{1:k} \times (p^{1:k}, I^{1:k})$ is never used after its definition, if I'm not mistaken. This is logical, as it is often unnecessary: $I$ is denoted to be the entire set during all of the experiments, and the stage times are always equal. Only $p$ ever changes during the experiments. Similarly, the meaning of $I$ reverses during the paper. At first it defines the set of fixed layers, before seemingly being reversed during the experiments, where $I$ is the entire set $(1, 24)$ (which is also not consistent with the claim that the first and last layers are always fixed). $I_s$ in Algorithm 1 is not defined and just seems to be $I$.

Schedule: The way schedules are chosen is never clear. In the experiments, the whole set of layers is used for $I$, with the same number of steps in every stage. Why is only $p$ varying? In Table 2, the stages are defined as 6-8-10-12 for RaPTr and 6-9-12 for Stacking for BERT-base, and then once again equal for BERT-Large. Why were they different at first?

Section 4.2 is extremely unclear. The normalization is very dependent on the initialization; however, the choice of initialization is never discussed. Similarly, the experimental values during training, compared to initialization, are not discussed. What we are supposed to conclude from Lemma 4.3 is not clear at all. In particular, the link between Section 4.2 and the experimental results in Table 2 is unclear. The goodness of fit of Fig 2.b seems pretty low. The sentence "$\Psi^l$ increases slowly with $l$" is surprising considering that the value decreases until $l = 21$. Many claims do not seem to make sense or are never explained: "suggesting a worst-case bound of $O(L^{-0.88})$"? What does "A simple calculation shows that the gap between a $L-1$ random subnetwork and the full model scales as $O(L^{-c})$ for some $c \geq 0.5$" mean?

Novelty: The technique proposed is very close to other layer-dropping techniques, in particular that of Zhang et al. 2019, which also aims at reducing the FLOPs needed for training. The main differences are that the dropping is done at the start of training rather than the end, and that the rescaling is different.

Figures: The figures either convey very limited information or are very hard to read. Figure 1 could have been replaced by a simple table. Figure 2 is never referred to anywhere. Figure 3 is extremely hard to read and never really explained well.

This article thus brings limited novelty or improvement over other methods and is often very unclear. As I have trouble understanding Section 4.2, I am for now keeping my confidence at 3 in case I am simply misreading this section.

Questions

"The intuition is that since the first and last layers play different roles": This seems logical. This does ask the question of whether not only the first and last but kk first or/and kk lasts layers may need to be fixed. Are the early or later layers more important to be fixed?

I don't really understand why the experiments on BERT are qualified as "short horizon settings".

Errors or remarks:

  • Introduction: "Gradual stacking (Gong et al., 2019) and Progressive stacking (Reddi et al., 2023)", the order is reversed, it should be Progressive for Gong et al. and Gradual for Reddi et al.

  • Sec 2: "the computed based"

  • Figure 1.b. "for fixed set" -> "for the fixed set"

  • Table 2 is quite unclear. The values displayed are not defined clearly (losses?). "lower better loss" means nothing. "Layerdrop" is not defined anywhere in the article. If it is meant to represent the paper method, it is concerning that it uses a name that is already of another method.

  • Table 3: Equiflop is not defined.

  • Sec 4. "L-RaPTr" was not defined until now.

  • "table" and "appendix" should be in upper case. Replace "fig" by "Figure".

  • Sec 4.2 The norm is not defined. Vector/Matrix norm?

  • Why is Section 4.2 referred to several times inside Section 4.2?

  • "O(L≥0.88)" means nothing.

Comment

Clarity: Section 4.2 is extremely unclear.

  • “Normalization is dependent on initialization, however, the choice of initialization is never discussed.”

The initialization has been clearly stated in the statement of Lemma 4.3. We show that under Xavier initialization, commonly used in deep learning, normalization layers and residual connections help stabilize the training of a linear model under RAPTR.

  • “What we are supposed to conclude from Lemma 4.3”. “... the link between Section 4.2. and the experimental results in Table 2 (is unclear)”

Lemma 4.3 aims to understand how stable the loss is at the transition between different stages, which is key to why RAPTR can provide speedups. In general, one could expect the loss to spike, and training could spend additional training steps (FLOPs) to first recover after the stage transition. However, Lemma 4.3 shows that because of the presence of normalization layers and residual connections, the blowup in training loss at the transitions is bounded by $\mathcal{O}(1/\sqrt{L})$ for linear networks. We believe a theoretical understanding of the benefit of RAPTR for training is an important future direction to consider as a follow-up to this work.

  • Many claims do not seem to make sense. “Suggesting a worst-case lower bound of $\mathcal{O}(L^{-0.88})$”?

The lower bound of $\mathcal{O}(L^{-0.88})$ follows from the observation that $\Psi_{\ell}$ varies with $\ell$ as $(L/\ell)^{0.12}$ (please see Figure 2b), while the norm of the output changes linearly with $L$ (please see Figure 2a). Hence, the net difference between the two losses under study in Theorem 4.2 should be bounded by $\mathcal{O}\left(\frac{1}{L} \sum_{\ell} \Psi_{\ell} / \|F\|\right) = \mathcal{O}\left(\frac{1}{L^{1.88}} \cdot \sum_{\ell} 1/\ell^{0.12}\right) = \mathcal{O}(L^{-0.88})$.

  • What does “A simple calculation shows the gap between a $L-1$ random subnetwork and the full model scales as $\mathcal{O}(L^{-c})$” even mean?

We observe that the norm of the output increases as $\Omega(L^{0.88})$ in all settings, while $\Psi_{\ell}$ varies as $(L/\ell)^{0.5}$ with $\ell$. Hence, the net difference between the two losses under study in Theorem 4.2 should be bounded by $\mathcal{O}\left(\frac{1}{L} \sum_{\ell} \Psi_{\ell} / \|F\|\right) = \mathcal{O}\left(\frac{1}{L^{1.38}} \cdot \sum_{\ell} (1/\ell)^{0.5}\right) = \mathcal{O}(1/L^{0.88})$.
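To make the final step of this calculation explicit (our own arithmetic, using only the scalings stated above), the sum over layers can be evaluated as:

```latex
% Our own arithmetic, spelling out the last step of the bound above:
% with \Psi_\ell \approx (L/\ell)^{0.5} and \|F\| = \Omega(L^{0.88}),
\[
\frac{1}{L}\sum_{\ell=1}^{L}\frac{\Psi_\ell}{\|F\|}
  \;\lesssim\; \frac{1}{L^{1.38}}\sum_{\ell=1}^{L}\ell^{-0.5}
  \;=\; \frac{1}{L^{1.38}}\cdot\Theta\!\big(L^{0.5}\big)
  \;=\; \mathcal{O}\!\big(L^{-0.88}\big),
\]
% since \sum_{\ell=1}^{L}\ell^{-0.5} \le 1 + \int_{1}^{L} t^{-0.5}\,dt = 2\sqrt{L} - 1.
```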

We will clarify more on this in the main paper in the revision.

Novelty: “The technique proposed is very close to other layer-dropping techniques”

We would argue that RAPTR (and progressive subnetworks) is different from layer-dropping in motivation & concept, implementation, and effectiveness. Please see the common response for details on differences.

Figures carry minimal information as well.

Thank you for the suggestion. We will add additional details to the plots in the new version.

Questions

Q1: Further ablation studies are needed on which layers need to be fixed during training.

Thank you for the suggestion. We conducted more experiments for the BERT-base, which shows that fixing more layers during training can hurt performance. Fixing the first and last layers seems to be the best. Please refer to Table 6 in the revision.

Q2: Why are experiments on BERT qualified as “short horizon settings”?

We conducted experiments on BERT-base in Table 2 under 3 different settings, short horizon (75k training steps on baseline and 100k training steps on RAPTR and stacking), medium horizon (170k training steps on baseline and 225k training steps on RAPTR and stacking), and longer horizon setting (510k training steps on baseline and 675k training steps on RAPTR and stacking). By short horizon, we mean training for fewer steps (or epochs), motivated by experiments in Kaddour et al.

Errors and typos

Thank you for the suggestions. We have fixed the typos and the errors pointed out by the reviewer in the revision.

References

1: Efficient training of language models using few-shot learning. Reddi et al'23

2: No train no gain: Revisiting efficient training algorithms for transformer-based language models. Kaddour et al.'23

Comment

We hope that our response has addressed all your concerns. Please let us know if you have any more questions or feedback. We are happy to address them.

Comment

We thank the reviewer for their remarks and suggestions. Please find our responses to your questions below.

Weaknesses

Improvement

  • “Results do not show a consistent improvement of the method over the other stacking methods”. "Table 1 and Table 2 show very limited improvement”.

Table 1 aims to compare RAPTR and the baseline models at equal training steps; the aim is to show that RAPTR can achieve similar perplexity and fine-tuning performance as the baseline. Motivated by previous works like [2], we additionally conducted extensive experiments (with learning rate tuning) to compare the baseline and RAPTR at equivalent FLOPs. Even with extensive hyperparameter tuning, RAPTR still performs better than the baseline at shorter training horizons, when the models see only a few epochs of the training data. This observation can be crucial for large language models, which are mostly trained for a single epoch over the data.

  • “Schedules … chosen by unclear means.”

We followed the schedules from [1] and reported the best that we observed during our experiments. We provide a description of the schedules and an ablation study on the effect of schedules on BERT-base in Appendix C of the new version.

  • The set of evaluation tasks in Table 3 isn’t comprehensive enough.

We further perform evaluations on other downstream tasks in Table 11 in the appendix. In summary, RAPTR performs better than baseline models in completion tasks like Lambada, Hellaswag, and Storycloze, in addition to QA tasks like web questions and natural questions. We also provide downstream performance at higher shot in-context settings. On average, RAPTR performs 1-2% better than baseline training while being 20% faster.

  • “RAPTR does show an improvement on Tydi QA and SQuADv2 over Stacking, but not for the other metrics.”

This is not entirely correct. RAPTR without a 30k warmup is also better on TriviaQA and is almost as good on SuperGLUE. We have included more detailed evaluations and find RAPTR to be better than stacking on multiple metrics.

  • “The improvement due to RAPTR seems relatively small without this (initial warmup based) schedule”, “It is impossible to compare Stacking with RAPTR as long as Stacking has not been trained with this alternative schedule”

This is not true; RAPTR without this schedule is already ~4% better on evals compared to Stacking, which is a significant gap. With the alternative schedule, the gap is even larger (5.5%). Thus the alternative schedule is not what drives the improvement over Stacking.

Besides, the schedule proposed by the reviewer for stacking with warmup is RAPTR in disguise, with a different subnetwork schedule (the first stage trains all layers, the second stage only trains the first few layers, and the depth is increased incrementally over stages). At this point, the method is no longer performing “stacking” of layers.

Notation issues

  • “The schedule $T^{1:k} \times (p^{1:k}, I^{1:k})$ is never used”.

This notation is used for the more general definition of RAPTR in Algorithm 1.

  • “$I$ is denoted to be the entire set during all of the experiments, and the stage times are always equal.”, “The definition of $I$ reverses”

This is not true. $I$ is defined as the set $\{1, 24\}$ and not the range $(1, 24)$ in our experiments (see the “Notations for RAPTR and stacking schedules” section).

As for stage times, they need not always be equal. For BERT we use equal schedules, but for the UL2 experiments, as presented in Section B.2, we use a different “proportional” schedule, where we spend time proportional to $i$ in stage $i$. We have added more details in Appendix C in the new version. The only quantity that stays fixed during training is the fixed set $I$, which comprises the first and last layers. However, this set itself can be changed over time to give other RAPTR schedules.

  • "How were schedules selected", "Changes in schedule from table 2 to table 3?"

The choice of schedules is motivated by the schedules in [1]. The stage lengths were chosen to be either equal or proportional, depending on the FLOP counts we need in the end; please see the response to the previous question and Appendix C in the new version for more details. As for the difference between the RAPTR and Stacking schedules in Table 2, we picked the best schedule for each based on Table 1.
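To make the schedule notation concrete, here is an illustrative way (ours, not taken from the paper) to write such a schedule down in code, reading $p$ as the per-stage path length and $I$ as the fixed set $\{1, L\}$; the step counts and the L/2 - 2L/3 - 5L/6 - L pattern follow the conventions described in these responses, and all concrete numbers are for illustration only.

```python
# Illustrative only: one way to spell out a RAPTR schedule in the
# T^{1:k} x (p^{1:k}, I^{1:k}) notation discussed above, for a 24-layer model.
L = 24
fixed_set = {1, L}                                     # first and last layers always trained
path_lengths = [L // 2, 2 * L // 3, 5 * L // 6, L]     # 12, 16, 20, 24

def stage_steps(total_steps, mode="equal"):
    """Split a step budget across stages, either equally or proportionally to the stage index."""
    k = len(path_lengths)
    if mode == "equal":                                # T^i identical in every stage
        return [total_steps // k] * k
    weights = range(1, k + 1)                          # "proportional": time in stage i ~ i
    return [total_steps * w // sum(weights) for w in weights]

schedule = list(zip(stage_steps(100_000, "equal"), path_lengths, [fixed_set] * 4))
# -> [(25000, 12, {1, 24}), (25000, 16, {1, 24}), (25000, 20, {1, 24}), (25000, 24, {1, 24})]
```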

Official Review
Rating: 6

The paper proposes a new stagewise training approach called "progressive subnetwork training" for efficiently pretraining large language models. It introduces an instantiation called Random Path Training (RAPTR) which trains random subsets/paths of layers in the network, progressively increasing the path length over stages. Experiments on BERT and UL2 models show RAPTR can match or improve baseline pretraining loss with 20-33% fewer FLOPs. RAPTR shows improved downstream task performance on UL2, gaining 1-5% on QA/GLUE over baseline and stacking methods.

Strengths

  • The paper is well-written and clearly presented.
  • The paper proposes a novel and intuitive stagewise training paradigm with theoretical motivation, and it achieves strong empirical gains over baseline and stacking methods, especially for short training regimes.
  • The paper provides a theoretical analysis of the stability of subnetwork training in residual networks, and the algorithm is simple to implement on top of standard training. Detailed ablations regarding the fixed layers and scaling have also been presented.

Weaknesses

  • Theoretical results are good but limited to simplified linear residual network settings.
  • The gains in pretraining FLOPs diminish in the asymptotic long-training setting.
  • Downstream task improvements lack sufficient analysis on why RAPTR has implicit biases on different tasks; for example, it seems to hurt multilingual QA performance when adding the 30k warmup.
  • Besides FLOPs, real wall-clock time would also be good to provide, given that in some cases FLOPs disagree with wall-clock time as an efficiency measurement [1].
  • Besides different architectures and objectives (BERT/UL2), whether the proposed method scales with model size and applies to standard LMs is unknown.

[1] Dehghani, Mostafa, et al. "The efficiency misnomer." arXiv preprint arXiv:2110.12894 (2021).

Questions

  • Could the authors elaborate on the real wall-clock time gains in the pretraining experiments?
  • Could the authors explain more about the varied performance gains in Table 3?
  • Besides the perplexity comparison in the ablation studies regarding the fixed layers and scaling, could the authors also provide detailed downstream performance to show whether these design choices affect both pretraining and downstream results?
Comment

We thank the reviewer for their remarks and suggestions. Please find our responses to your questions below.

Questions

Q1: "...elaborate on the real wall-clock time gain with the pretraining experiments"

The reason we reported FLOPs is because wall-clock time can be highly hardware and implementation-dependent. We tested wall-clock times on the machines we used and found that a theoretical FLOPs speedup of 1.33x for BERT-base was equivalent to a wall-clock speedup of 1.26x. For UL2, FLOPs speedup matches wall-clock speedup (due to the increased depth and width of the model), so 1.2x speedup in the paper is also the wall-clock speedup. Please see section D in the appendix in the revision.

Q2: "...explain more about the varied performance gain in Table 3"

We observe that RAPTR improves on some tasks more than others compared to the baseline. The trend is that gain is higher for contextual QA tasks (TydiQA and SQuADv2) and lower for non-contextual QA tasks (TriviaQA). Overall RAPTR improves over the baseline across the board, while requiring ~20% fewer FLOPs. Furthermore, we have added more results on other downstream tasks and higher in-context settings (5-shot) in the revision. RAPTR shows similar improvements over the baseline model in all settings.

Q3. “...show if the design choice affects both pretraining and downstream” (for fixed layers and scaling)

In Table 5 in the appendix of the revision, we conduct an ablation study on different fixed-layer candidate sets for BERT-base training with RAPTR. We observe that fixing the first and last layers helps pre-training loss. Furthermore, differences in pre-training also show up as small differences in fine-tuning performance on different downstream evaluations.

Weaknesses

W1: “Theoretical results are good while limited to simplified linear residual network settings”

Thank you for appreciating the theoretical results. The results aim to understand the importance of layer normalization and residual connection layers in the model, and analysis for Transformers is still in very nascent stages. Thus results in the linear setting allow us to clearly isolate the role of these design choices.

W2: “Gains in pretraining flops diminish in the asymptotic long training setting.”

That is correct; this observation aligns with those made in [1] for BERT training. However, recent language models (like GPT, UL2, and Llama) make only one (or a few) passes over a huge dataset and are not in the “asymptotic” setting. In this setting our paper shows that RAPTR (and even stacking) improves training time by 20% or more, which could be a significant saving when models are trained for a few months.

W3: “Downstream task improvements lack sufficient analysis on why RAPTR has implicit biases on different tasks”, and “hurt the multilingual QA performance when adding the 30k”

This is indeed an interesting question, and we could not find a simple explanation for these findings. Perhaps it deserves a separate exploration like some recent papers that are dedicated to this topic of inductive bias [2, 3].

W4: “...whether the proposed method scales with model sizes and fits to LM is unknown”

In this paper we have tried 3 different model sizes, 100M, 300M, and 1B, and also two kinds of language models: encoder-only MLM and the more recent decoder-only causal LM. In both cases, we find RAPTR provides speedup and downstream benefits. This exploration is more diverse than many prior papers on efficient pretraining.

Our 1B baseline model already demonstrates good downstream 1-shot performance, and thus we considered it an appropriate setting. Exploring scaling behaviors for pretraining beyond 1B parameters can get very expensive. However, if the reviewer believes that this will add more value, we can include some experiments on a 3B scale in the final version.

References

1: No train no gain: Revisiting efficient training algorithms for transformer-based language models. Kaddour et al.'23

2: Understanding contrastive learning requires incorporating inductive biases. Saunshi et al.'22

3: Same pre-training loss, better downstream: Implicit bias matters for language models. Liu et al.'22

Comment

We hope that our response has addressed all your concerns. Please let us know if you have any more questions or feedback. We are happy to address them.

Official Review
Rating: 6

This paper aims to pre-train transformers efficiently and proposes a pre-training framework, Random Path Training (RAPTR). The main idea of RAPTR is to train random subsets of the transformer's layers and progressively increase the number of layers in stages. To further stabilize training, the authors propose several techniques, e.g., scaling the intermediate outputs and fixing the first and last layers. The authors also provide theoretical support for the training stability of RAPTR. Experiments on BERT and UL2 language model pre-training demonstrate the effectiveness of the proposed method. Compared with baselines like progressive stacking and layer drop, RAPTR achieves lower perplexity under the same computation cost.

Strengths

  • The theoretical analysis and experimental results provide useful insights for pre-training large models.
  • The proposed method is simple and effective and can be adapted to many deep neural networks.
  • The paper is clearly written and easy to follow.

Weaknesses

  • The idea of training sub-layers progressively is not novel, which is similar to [1][2].
  • RAPTR introduces many hyper-parameters and it would be difficult to tune these hyper-parameters in the pre-training setup, which would hinder the application of this type of work.

[1] Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

[2] Towards Efficient NLP: A Standard Evaluation and A Strong Baseline

Questions

  • How to determine the hyper-parameters of RAPTR, e.g. the number of stages and the training steps in each stage? Does it affect the learning rate schedule? Is it affected by the model scale?
  • What is the training setup of baselines and the proposed method in Table 2?

Details of Ethics Concerns

N/A

Comment

We thank the reviewer for their comments on the simplicity and effectiveness of the proposed method. Please find our responses to your questions below.

Questions

Q1: “How to determine the hyper-parameters of RAPTR?”

In our explorations, we tried simple and intuitive schedules adapted from stacking approaches, and these already provided good results. It is very likely that tuning the schedule could lead to larger gains. We also explored the effect of different schedules with similar FLOPs in Table 3 and found most of them to be reasonably effective and better than the baseline. The schedule we used for most results, and that we prescribe, is: train an $L$-layer network with an L/2 - 2L/3 - 5L/6 - L schedule for RAPTR, with either a uniform schedule (equal time in each stage) giving ~1.33x speedup or a proportional schedule (time proportional to $i$ in stage $i$) giving ~1.2x speedup. This, in addition to other design choices (like scaling, fixing layers, and full-model training warmup), has been effective for both BERT and UL2.
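As a rough sanity check on the quoted speedups (our own back-of-the-envelope sketch, assuming per-step cost is proportional to the number of active layers and ignoring any fixed embedding/output cost), these two schedules reproduce the ~1.33x and ~1.2x figures:

```python
# Back-of-the-envelope check (our arithmetic, not from the paper): treating per-step
# cost as proportional to the number of active layers, the L/2 - 2L/3 - 5L/6 - L
# schedule gives the quoted speedups relative to always training all L layers.
fractions = [1/2, 2/3, 5/6, 1.0]          # active fraction of layers in each stage

equal_cost = sum(fractions) / len(fractions)                             # 0.75
prop_weights = [1, 2, 3, 4]                                              # time in stage i ~ i
prop_cost = sum(f * w for f, w in zip(fractions, prop_weights)) / sum(prop_weights)  # ~0.83

print(f"equal schedule:        speedup ~ {1 / equal_cost:.2f}x")         # ~1.33x
print(f"proportional schedule: speedup ~ {1 / prop_cost:.2f}x")          # ~1.20x
```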

Q2: “What is the training setup of baselines and the proposed method in Table 2?”.

We follow the same training setup as in Table 1. To equate the FLOP counts of the baseline and the proposed method, we train the proposed methods for longer. For example, when the baseline is trained for 510k steps, stacking and RAPTR (which are supposed to be 1.33x faster) are run for 675k steps ($675\text{k} \approx 510\text{k} \times 4/3$). To ensure a fair comparison, we run each method at different learning rate scales and report the best eval loss among them.

The results show that under extensive hyperparameter tuning which allows fair comparison, RAPTR and stacking are better than the baseline model, especially at a shorter horizon (fewer epochs) settings.

Weaknesses

W1: “The idea of training sub-layers progressively is not novel [1,2]”

We would argue that RAPTR (and progressive subnetworks) is different from layer-dropping [1] in motivation & concept, implementation, and effectiveness. Please see the common response for details on the differences. The motivation of [2] is very different: it is to provide efficiency during inference, not training. The aim of [2] is to train a model on which one can perform early exit for inference speedups. However, such training can possibly degrade the performance of the baseline model.

W2: “RAPTR introduces many hyper-parameters and would be difficult to tune.”

The important hyperparameters introduced by RAPTR are the number of stages, the path lengths, and the number of training steps in each stage. We concur that the number of stages and the path length in each stage are heuristics that need to be decided before training.

We ran some more ablation studies (Table 5 in the revision) on BERT-base, which show that RAPTR is robust to the choice of stage schedule. Since it is computationally expensive to verify this in UL2 settings, we continue with the following design principle: train an $L$-layer network with an L/2 - 2L/3 - 5L/6 - L schedule for RAPTR, with either a uniform schedule (equal time in each stage) giving ~1.33x speedup or a proportional schedule (time proportional to $i$ in stage $i$) giving ~1.2x speedup. For other target speedups, please see Appendix C in the revised version.

References

1: Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

2: Towards Efficient NLP: A Standard Evaluation and A Strong Baseline

Comment

We hope that our response has addressed all your concerns. Please let us know if you have any more questions or feedback. We are happy to address them.

Official Review
Rating: 6

This paper proposes a framework that trains an increasingly larger subset of layers of an LLM with a manual schedule of layer dropout, with the motivation of using less compute to obtain the same pretraining performance. This work shows computational savings and a slight boost in the downstream performance of BERT and UL2 pretraining.

Strengths

RAPTR is a simple and effective method for training progressively larger networks and saving compute. It's also interesting that layer dropout as a form of regularization can improve stability and even downstream performance.

Weaknesses

The BERT baselines seem strong, but it is unclear how competitive the UL2 baselines are. The soundness of the baselines might be important to show, as weak baselines can make any result possible. A reference for these hyperparameters would be helpful.

One unsatisfying aspect of this work is that it does not analyze how the compute-efficient frontier changes with this pretraining procedure. A model sweep would also help quantify the amount of improvement in downstream performance, as we could see the overall trend more clearly.

Questions

.

Comment

We thank the reviewer for their comments and suggestions. Please find our responses below.

How competitive are UL2 baselines? A reference on hyperparameters will be useful.

We use the default settings from [1] and do not optimize the hyperparameters. The dataset mixture rates are taken from [2]. We tuned the learning rate and the learning rate schedule (the square-root decay used in the original paper, and cosine decay).

How does the compute-efficient frontier change with this pre-training procedure? Model sweep will help quantify the amount of improvement in downstream performance.

We are sorry, but we are having difficulty understanding the question. If it pertains to how RAPTR behaves with respect to model scale, scaling behaviors for pretraining beyond 1 billion parameters can become quite costly. In the final version, we may incorporate additional experiments at the 3 billion scale if the reviewer deems the results worthwhile. Investigating the scaling laws for RAPTR could be a follow-up exploration to this research.

[1] UL2: Unifying Language Learning Paradigms. Tay et al.’22

[2] LLaMA: Open and Efficient Foundation Language Models. Touvron et al.’23

Comment

We hope that our response has addressed all your concerns. Please let us know if you have any more questions or feedback. We are happy to address them.

Comment

We thank the reviewers for their time and constructive feedback. We have addressed individual reviewer questions and concerns in separate responses.

We have added a new revision of the paper. We color the differences from the original submission in maroon. The revision mainly consists of additional experiments in the appendix, added to address various concerns. Here is a summary of these experiments:

  • We compare different schedules of RAPTR on BERT-base (Table 5). We observe small gaps in evaluation loss (<0.01) and fine-tuning performance (<0.5%) across the different schedules.

  • We compare different fixed set candidates for BERT-base (Table 6). We show that fixing the first and last layers during training works the best.

  • We also report the true speedups on our hardware, with steps/sec incurred in each stage of training. For UL2, we show that RAPTR achieves the theoretical speedups on our hardware.

  • We compare RAPTR to the progressive layer drop algorithm of [1] (Table 7) on BERT-base and show differences in both evaluation loss (>0.05) and downstream performance (>1%) between the two algorithms.

  • We conduct extensive tests on the trained UL2 models on different downstream tasks (Table 11). We show that RAPTR improves by 1-2% over stacking and baseline runs on average.

Multiple reviewers brought up progressive layer dropping and how RAPTR is similar to it. We would argue that RAPTR (and progressive subnetworks) is different from layer-dropping [1] in motivation & concept, implementation, and effectiveness.

Motivation/conceptual differences:

  • Layer dropping is motivated by redundancies in baseline training (layer similarity) and thus tries to exploit that for efficiency.
  • RAPTR (progressive subnetwork in general) is motivated by stage-wise efficient pretraining that grows from smaller models to large models. This provides a fresh perspective by proposing to train smaller subnetworks within the large model instead. This opens up new possibilities since, as we discuss in the Conclusions, many growing functions can be converted to progressive subnetwork training. The subnetworks we train are based on depth-wise growing and thus pick random subsets of layers.

Implementational differences:

  • Layer-dropping gradually starts dropping more layers as training proceeds and also drops later layers more than earlier layers.
  • RAPTR does the exact opposite; it gradually trains larger and larger subnetworks, and also picks layers almost completely at random (a schematic contrast of the two schedules is sketched after this list).
  • There are other design choices like fixing layers and scaling in RAPTR.
  • Layer-dropping leverages properties of baseline training for efficiency, whereas RAPTR is a different way of training and has different properties, as evident from the inductive bias towards better downstream evals at the same pretraining loss.
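The following is a purely illustrative caricature (ours, not taken from either paper) of the opposite "expected depth" trajectories described in the list above: a layer-dropping-style keep probability that decays over training versus a RAPTR-style path length that grows in stages. The decay rule, step counts, and layer counts are assumptions for illustration only.

```python
# Purely illustrative (not from either paper): opposite expected-depth trajectories.
L, total_steps = 24, 100_000

def layerdrop_expected_depth(step, final_keep=0.5):
    # keep probability decays over training (schematic, not Zhang & He's exact rule)
    keep = 1.0 - (1.0 - final_keep) * step / total_steps
    return keep * L

def raptr_expected_depth(step):
    # stagewise increasing path length: L/2 -> 2L/3 -> 5L/6 -> L, equal stage lengths
    path_lengths = [L // 2, 2 * L // 3, 5 * L // 6, L]
    stage = min(step * len(path_lengths) // total_steps, len(path_lengths) - 1)
    return path_lengths[stage]

for step in (0, 50_000, 99_999):
    print(step, layerdrop_expected_depth(step), raptr_expected_depth(step))
```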

Performance differences:

  • [2] already found progressive layer dropping to be less effective, in fact even slower than baseline training for longer horizons. Our new experiments also verify the same.
  • RAPTR on the other hand can outperform the best stacking mechanism (which [2] found to be the most and only effective efficient training method in their study).
  • We include layer-dropping baselines for BERT in Table 7 of the revision. It is significantly worse than both baseline training and RAPTR (on pretraining loss and downstream evals), which matches findings from [2].

1: Accelerating training of transformer-based language models with progressive layer dropping. Zhang and He et al.'20

2: No train no gain: Revisiting efficient training algorithms for transformer-based language models. Kaddour et al.'23

Comment

We have added more results in Table 8, where we compare RAPTR to the progressive layer drop algorithm of [1] on UL2 models. RAPTR is 3-4% better on downstream tasks on average.

1: Accelerating training of transformer-based language models with progressive layer dropping. Zhang and He et al.'20

AC Meta-Review

This paper proposes a method, RAPTR, to train LLM layers gradually with a schedule (while always keeping the first and last layers). The method is able to retain the same or even better performance on downstream tasks while reducing computation by about 25%. The authors show experimental results on both BERT and UL2 pre-training across 7 datasets.

Strengths:

  1. Simple and effective approach.
  2. Demonstration on two models with several tasks.
  3. A theoretical proof of the stability of layers.

Weaknesses:

  1. Reviewers have a major concern about the method in comparison to progressive layer drop. RAPTR can be viewed as a special instance of progressive layer drop with a particular schedule (always keeping the first and last layers, with a random progressive schedule). The authors did provide justification for the differences in motivation, implementation, and performance. However, this does not convince the reviewers regarding the method. The authors claim that progressive layer drop would not necessarily gain a speedup; however, this claim is inconsistent with the progressive layer drop paper and is not backed by the authors' experiments (Tables 7 & 8).
  2. The paper provides an analysis of linear layers, but it is not clear whether it applies to Transformers. It would be better to provide an analysis of even two layers of Transformers.

The authors may want to expand and highlight the theoretical results in this paper, which will be the main contribution.

Why not a higher score

Reviewers have major concerns regarding the novelty of this method compared to progressive layer drop. The method seems to be a special instance of progressive layer drop. The authors' response does not resolve this concern.

Why not a lower score

N/A

Final Decision

Reject