How to Train Your Multi-Exit Model? Analyzing the Impact of Training Strategies
We identify and address gaps in early-exiting literature by analyzing the weaknesses of commonly used multi-exit model training strategies.
Abstract
Reviews and Discussion
A multi-exit neural network is a model that can exit at different layers. It remains a challenge to find an optimal way to increase the accuracy of exiting at an earlier layer without dropping the accuracy of the last layer. This paper focuses on one angle: what is the best way to train a multi-exit model. The paper found that the best method to get the best of both worlds is the Mixed approach: train the model to the end without early exits, then add early-exit classification heads, referred to as internal classifiers (ICs), and train the full model with the ICs. The paper starts by analyzing the gradient contribution of each early exit, as well as the rank of activations, mode connectivity, and mutual information, to reach insights that support the decision to use Mixed training. The paper compared the Mixed approach with the Disjoint approach (i.e., train the model first, then add ICs and train only the ICs while freezing the model) and the Joint approach (i.e., train both the model and the ICs from scratch), and found Mixed to be the best trade-off between the accuracies of earlier layers and the last layer. It also compared against different variants of loss and gradient scaling and found that Mixed could be better than, or an alternative to, such methods.
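To make the three regimes concrete, here is a minimal runnable sketch in toy PyTorch. This is my own paraphrase of the paper's description, not the authors' code; the architecture, sizes, random data, and the `run_phase` helper are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy multi-exit model: a 4-block MLP backbone with an internal
# classifier (IC) after each of the first three blocks.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(32, 32), nn.ReLU()) for _ in range(4)])
ics = nn.ModuleList([nn.Linear(32, 10) for _ in range(3)])
final_head = nn.Linear(32, 10)

def exit_logits(x):
    """Run the backbone and collect the logits of every exit."""
    logits = []
    for i, block in enumerate(blocks):
        x = block(x)
        if i < len(ics):
            logits.append(ics[i](x))
    logits.append(final_head(x))
    return logits

def run_phase(params, loss_fn, steps=100):
    """One training phase over the given parameter subset."""
    opt = torch.optim.Adam(params, lr=1e-3)
    for _ in range(steps):
        x = torch.randn(64, 32)               # stand-in for a real batch
        y = torch.randint(0, 10, (64,))
        opt.zero_grad()
        loss_fn(x, y).backward()
        opt.step()

backbone_params = list(blocks.parameters()) + list(final_head.parameters())
all_params = backbone_params + list(ics.parameters())
last_exit_loss = lambda x, y: F.cross_entropy(exit_logits(x)[-1], y)
all_exits_loss = lambda x, y: sum(F.cross_entropy(z, y) for z in exit_logits(x))

# Disjoint: train the backbone alone, then only the ICs on frozen features.
run_phase(backbone_params, last_exit_loss)
run_phase(list(ics.parameters()), all_exits_loss)  # backbone gets no updates

# Joint: everything from scratch with the summed multi-exit loss.
#   run_phase(all_params, all_exits_loss)
# Mixed: a backbone-only phase, then joint training of the full model.
#   run_phase(backbone_params, last_exit_loss)
#   run_phase(all_params, all_exits_loss)
# (In a real comparison, each regime would start from a fresh initialization.)
```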
Questions for Authors
- Figure 3: I didn't understand how we can infer from the plots that "Disjoint and mixed regimes produce similar models, while the model trained in joint regime lies in a different basin"
- Table 3: which model are the results for?
- Table 6: Why the difference in accuracy between 25% and 100% compute is 1.02% in accuracy for Mixed 1L here, while in Table 3 the difference is almost 30%?
- Similarly in Table 7: Again, the differences in accuracy between 25% and 100% compute are around 1% or 2% for mixed and joint here.
- Line 622: What is the difference between "Mixed-gradual training" and "Alternating training"?
- Line 666: If theta is the model's weights, what are the numbers x and y?
- Table 15: What is PBEE?
- Table 16: What is GPF?
- Table 17: What is ViT-Entropy?
Claims and Evidence
- The main claim, that Mixed training is better than Joint training for both earlier- and last-layer accuracies, is backed by results from training various architectures, modalities, and datasets.
Methods and Evaluation Criteria
- Authors trained on different architectures (transformers, ResNets, ...), modalities (vision and text) and datasets (CIFAR, ImageNet, STSB, etc.)
Theoretical Claims
The authors didn't really make theoretical claims, but I have some comments about the theoretical explanations of some analyses:
- Line 303: I suggest re-phrasing "representation of easy samples is not complex" to "representation of a subset of the training samples is not complex", and re-wording subsequent references to easy samples accordingly, unless the authors provide a definition of "easy" samples (e.g., those that have low loss) and show that samples meeting such a definition do require processing with fewer layers, as the paragraph claims.
- Line 315: Similarly, I suggest re-phrasing "easy datasets where more samples exit at earlier layers" to "subsets of datasets where more samples exit at earlier layers", and similarly for Line 319.
- Lines 307 to 308: "To describe it in terms of mutual information, the network does not need to reduce the complexity of X to fit the internal representation Z," I am a bit confused. If X is the input, how can the network reduce or change its complexity? Should it only be able to change the internal representation Z, not the input?
Experimental Design and Analysis
Experiments seemed to be fine, but I have a comment on one of the analyses:
Figure 5: According to Figure 2 of this paper ( https://arxiv.org/abs/1909.01380 ), the mutual information between a model's input and intermediate activations should decrease monotonically across layers, but this is not the case in Figure 5 of this paper. Do the authors have a reason why? Moreover, according to the Data Processing Inequality ( https://en.wikipedia.org/wiki/Data_processing_inequality ), after the processing of each layer, the information about the input X in a layer's output should either decrease or stay the same; it cannot increase. Hence, my understanding is that the mutual information should monotonically decrease across layers.
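For reference, the inequality I am invoking is the standard textbook form (my own statement, not a formula from the paper under review):

```latex
% Data Processing Inequality for a feed-forward network, whose
% activations form a Markov chain X -> Z_1 -> Z_2 -> ... -> Z_L:
\[
  I(X; Z_1) \;\ge\; I(X; Z_2) \;\ge\; \dots \;\ge\; I(X; Z_L)
\]
% i.e., the true mutual information between the input and the layer
% activations is non-increasing in depth.
```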
Supplementary Material
Yes. I have read all of the Appendix.
Relation to Prior Literature
While most papers on early exiting focus on a specific technique (e.g., loss scaling) to increase the accuracies of earlier and last layers, they take the choice of whether to train from scratch or fine-tune for granted, and overlook the implications of this decision. This paper focused on that overlooked aspect: whether it is better to train from scratch, train continually, or perform mixed training.
Missing Important References
Not really. I would say that since LLMs (and VLMs) are becoming popular, it would have been useful to discuss some of the results from papers that explored early exit loss for LLMs and (as I suggest in another part of this review) to test the findings on a small LLM:
- EMNLP 2023, "Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding", Sangmin Bae, Jongwoo Ko, Hwanjun Song, Se-Young Yun
- ACL 2024, "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu
Other Strengths and Weaknesses
Strengths:
- "To the best of [the authors'] knowledge, this work is the first to directly compare models trained under different regimes and to provide a detailed analysis of the training dynamics of multi-exit models." If that's true, then publishing this paper will be useful for the community to have a holistic understanding of early exit.
- Tested on different architectures (transformers, ResNets, ...), modalities (vision and text) and datasets (CIFAR, ImageNet, STSB, etc.)
- Useful insights such as "the earlier layers are characterized by higher frequency features while later layers learn low frequency elements. This regularity is disrupted in the case of early exit architectures as the backbone network is given additional classifiers that are placed in earlier parts of the network." and "As placements become less frequent, the difference between joint and mixed regimes becomes less pronounced."
Weaknesses:
- While the paper claims that Mixed training is an alternative to Joint training with loss or gradient scaling, it could be argued that the latter is preferable, as it requires less training time than Mixed training: we only have to train once.
- The claims or benefits of the findings are quite incremental
Other Comments or Suggestions
- For experiments, I recommend adding a more commonly used architecture like GPT2 for LLMs
- Different Early Exit Adapters: What about using an early-exit adapter sub-network like SCAN [1]? In the case of ViT, how about adding 1 or 2 transformer layers before the classification head in the early exit adapter sub-network?
Formatting / Typos:
- Authors have forgotten to change the running title of pages 2 and onwards. It still says "Submission and Formatting Instructions for ICML 2025".
- Figure 2: Please use same axes limits and steps for Figures 2a and 2b to make it easier to compare.
- In several parts of the paper, LaTeX quotes need to be fixed (e.g., line 245)
- In more than one place in the paper (e.g., line 276), the word "Testset" is used while I think it should be "Test set"
[1] NeurIPS 2019, "SCAN: A Scalable Neural Networks Framework Towards Compact and Efficient Models", Linfeng Zhang, Zhanhong Tan, Jiebo Song, Jingwei Chen, Chenglong Bao, Kaisheng Ma
Ethics Review Issues
N/A
We sincerely appreciate the reviewer’s thoughtful evaluation of our work and their recognition of its significance. We apologize for the brevity of our answers, necessitated by the response length limit. If we have resolved the concerns raised, we would be grateful if the reviewer would consider raising their score accordingly.
Training time / passes
We emphasize that the number of training phases does not have to affect training time. As noted in the manuscript, in all of our experiments we used early stopping. We report the resulting number of training epochs for the Tinyimagenet/ResNet50 experiment:
Disjoint: 523 ± 156, Joint: 1610 ± 395, Mixed: 1166 ± 136
In practice, the opposite is true: joint training is slower to converge.
benefits of the findings are quite incremental
In most cases the difference over the joint regime is definitely significant, e.g., 2 percentage points for the ImageNet experiment. For low budgets, the difference over disjoint is immense - and we emphasize that the disjoint regime is also in popular use.
LLMs
Please see the answer to reviewer cDMi.
mutual information should decrease monotonically across layers
The monotonic decrease shown in Shwartz-Ziv & Tishby (2017) and inferred through the Data Processing Inequality assumes exact calculations. In our work, we estimate mutual information using a Monte Carlo method (Kawaguchi et al., 2023) in high-dimensional spaces. This approximation can introduce estimation noise, which may result in non-monotonic fluctuations in the absolute MI values.
However, to further verify the MI effect, we perform an additional experiment on another vision dataset (CIFAR-100) and an NLP one (BERT-B/Newsgroups). In this case, the MI shows a more clearly decreasing tendency. However, what we want to emphasize are the relative relationships between regimes, present in all the figures, which hint that higher mutual information may be an indicator of better performance for the mixed regime (see the linked results).
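As a toy illustration of this estimation-noise effect (a self-contained sketch using a naive histogram plug-in estimator, not the Kawaguchi et al. estimator from our paper): even when the true MI is monotone by the DPI, finite-sample estimates can come out non-monotone.

```python
import numpy as np

rng = np.random.default_rng(0)

def mi_histogram(x, z, bins=30):
    """Naive plug-in MI estimate from a 2D histogram (biased and noisy)."""
    joint, _, _ = np.histogram2d(x, z, bins=bins)
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal of x
    pz = joint.sum(axis=0, keepdims=True)   # marginal of z
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ pz)[nz])).sum())

n = 500                                  # small sample -> visible noise
x = rng.normal(size=n)                   # "input"
z1 = x + 0.5 * rng.normal(size=n)        # "layer-1 activation" (toy)
z2 = z1 + 0.1 * rng.normal(size=n)       # "layer-2 activation" (toy)

# By the DPI, the true values satisfy I(X; Z2) <= I(X; Z1), but with few
# samples and many bins the plug-in estimates can come out non-monotone.
print(mi_histogram(x, z1), mi_histogram(x, z2))
```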
Different Early Exit Adapters:...
As suggested, we run additional experiments where each head has an additional transformer block:
| Regime | 50% | 75% | 100% |
|---|---|---|---|
| disjoint | 51.49 | 55.39 | 56.99 |
| joint | 57.42 | 59.07 | 59.09 |
| mixed | 58.69 | 60.54 | 60.68 |
Due to the size of this head architecture, the model is not able to meet 25% of the original model’s budget, so we omit the first column. We emphasize that the mixed regime performs better, as before.
Figure 3:...
When linearly interpolating the weights between the (models trained with) disjoint and mixed regime, we do not encounter a region of high loss. This means they lie in the same optimization basin. In contrast, when interpolating between the joint regime model and any other model, we do encounter a region of high loss (yellow color).
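A minimal sketch of this interpolation check (illustrative PyTorch, not our exact code; `eval_loss` is an assumed helper that evaluates a model's loss on held-out data):

```python
import copy
import torch

@torch.no_grad()
def interpolation_losses(model_a, model_b, eval_loss, steps=11):
    """Loss along the straight line between two weight vectors.
    A pronounced bump indicates the endpoints sit in different basins;
    a flat path suggests the same basin (linear mode connectivity)."""
    probe = copy.deepcopy(model_a)
    sa, sb = model_a.state_dict(), model_b.state_dict()
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        mixed = {k: (1 - alpha) * sa[k] + alpha * sb[k]
                 if sa[k].is_floating_point() else sa[k]  # skip int buffers
                 for k in sa}
        probe.load_state_dict(mixed)
        losses.append(float(eval_loss(probe)))
    return losses
```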
Table 3…? Table 6: Why the difference in accuracy…is 1.02%…while in Table 3 the difference is almost 30%?
Each dataset and model combination has a different budget-difficulty characteristic (the shape of FLOPs vs accuracy curves, see Figure 1). Tables 6 and 7 were computed for ViT-T architecture and the ImageNette dataset (a 10-class subset of ImageNet, with the same input image size as ImageNet), a combination which turns out to be a relatively easy problem for a model of this size. Table 3 was computed for CIFAR-100/ResNet-34, which turns out to be a relatively harder problem for this architecture.
For comparison, we repeat the experiment from Table 6 for Tinyimagenet below:
| Regime | Head | 25% | 50% | 75% | 100% |
|---|---|---|---|---|---|
| disjoint | 1L | 26.35 | 41.69 | 53.06 | 56.72 |
| disjoint | 2L-1024 | 33.85 | 47.21 | 54.97 | 56.72 |
| disjoint | 2L-2048 | 33.96 | 45.98 | 54.74 | 56.72 |
| joint | 1L | 42.47 | 53.24 | 56.02 | 56.03 |
| joint | 2L-1024 | 45.59 | 55.11 | 57.73 | 57.66 |
| joint | 2L-2048 | 43.60 | 53.89 | 56.94 | 57.00 |
| mixed | 1L | 44.03 | 57.91 | 60.42 | 60.28 |
| mixed | 2L-1024 | 44.92 | 57.11 | 59.32 | 59.22 |
| mixed | 2L-2048 | 44.61 | 56.94 | 60.18 | 60.15 |
…"Mixed-gradual training" and "Alternating training"
"Mixed-gradual" trains the model with n ICs in n phases, each phase including a larger number of ICs, starting from the deepest ones. "Alternating" switches every training step between training with the last IC, and with all the ICs.
Line 666: …x and y?
x and y represent scalar coefficients used to perturb the model parameters θ∗ along randomly chosen directions (δ,η) to visualize the loss landscape around the trained model (Li et al., 2018).
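Formally, the visualized surface is (standard form from Li et al., 2018, reconstructed here for clarity):

```latex
% Loss-landscape slice: \theta^* are the trained weights, \delta and
% \eta are two random (filter-normalized) directions, and (x, y) are
% the scalar coordinates in question.
\[
  f(x, y) = \mathcal{L}\left(\theta^{*} + x\,\delta + y\,\eta\right)
\]
```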
What is…?
- PBEE (Patience-Based Early Exit): an EE method where a sample exits after n consecutive ICs return the same class (Zhou et al., 2020).
- GPF (Global Past-Future): an EE method that incorporates hidden states from the preceding and deeper ICs (Liao et al., 2021).
- Entropy: the exit decision is based on the entropy of the prediction probabilities (Teerapittayanon et al., 2016) rather than the maximum softmax probability.
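Illustrative sketches of the entropy-based and patience-based exit rules (our own simplified code, not the referenced implementations; `logits_per_ic` is an assumed list of per-IC logits for a single sample):

```python
import torch
import torch.nn.functional as F

def entropy_exit(logits_per_ic, threshold):
    """Exit at the first IC whose predictive entropy is below a threshold
    (in the spirit of Teerapittayanon et al., 2016)."""
    for i, logits in enumerate(logits_per_ic):
        p = F.softmax(logits, dim=-1)
        entropy = -(p * p.clamp_min(1e-12).log()).sum()
        if entropy < threshold:
            return i
    return len(logits_per_ic) - 1        # fall through to the final exit

def patience_exit(logits_per_ic, patience):
    """PBEE: exit once `patience` consecutive ICs agree on the predicted
    class (in the spirit of Zhou et al., 2020)."""
    streak, prev = 0, None
    for i, logits in enumerate(logits_per_ic):
        pred = int(logits.argmax())
        streak = streak + 1 if pred == prev else 1
        prev = pred
        if streak >= patience:
            return i
    return len(logits_per_ic) - 1
```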
We sincerely appreciate the reviewer's detailed feedback. We incorporate all suggestions in the current version of our manuscript.
I appreciate the authors' comprehensive response. I have read all the reviewers' feedback and their corresponding rebuttals. I am leaning towards keeping my rating as Weak Accept. The analysis is detailed but the findings are incremental, so a solid accept would be difficult in my humble opinion.
A few comments:
- I recommend adding a comment or footnote to the paper explaining why the mutual information is not monotonically decreasing, as described in Tishby's paper and implied by the Data Processing Inequality
- Regarding the authors' response to one of the other reviewers that the proposed mixed approach is "extremely uncommon", I would like to mention that it is used in LLMs, as in [LayerSkip](https://arxiv.org/abs/2404.16710) and Relaxed Recursive Transformers.
We thank the reviewer for the engagement in the discussion process and the valuable suggestions.
I recommend adding a comment or footnote...
As suggested, we modify our current revision of the paper to explain the non-monotonicity of the Mutual Information results.
Regarding the authors' response to one of the other reviewers that the proposed mixed approach is "extremely uncommon", I would like to mention it is used in LLMs
We agree that in LLM contexts, further pre-training using the same dataset is more likely. However, our work specifically targets setups such as image classification. In that response we stated that fine-tuning on the same dataset as used for pre-training is unusual. Therefore, it is unlikely that a machine learning practitioner or researcher would unwittingly apply the mixed strategy in such tasks.
While the content of the papers cited by the reviewer might suggest that LLMs are the main use-case of early-exiting nowadays, this is not really the case. The early-exit setups we investigate are particularly well-suited for low-power edge-device deployments [1], highlighting the real-world applicability of our findings [2, 3, 4, 5, 6].
as in LayerSkip and Relaxed Recursive Transformers.
LayerSkip [7] proposes a "loss curriculum" (training strategy) that is almost equivalent to the Mixed-gradual strategy from our appendix, with the difference being that they enable the next IC at regular intervals, while we use an early-stopping criterion as an indicator of when to proceed to the next phase. However, this approach is used only in a single experiment of their work, with a different strategy - the rotational curriculum - being used for the other experiments. No explanation is given for why the authors prefer one over the other, and no ablation study for this aspect is presented. The arbitrary adoption of a specific training strategy in multiple prior studies - for instance, in [2, 3, 9, 10, 11] from last year - without sufficient justification was the primary motivation for our work.
While Relaxed Recursive Transformers [8] includes an ablation study on training strategies, this constitutes a relatively minor component of their overall contribution. The transferability of their findings to the settings explored in our work remains uncertain. In contrast, our paper offers a substantially broader and more systematic analysis of the early-exit training strategy. Additionally, we emphasize that according to the ICML 2025 Reviewer Instructions, this work should be regarded as concurrent with ours.
We thank the reviewer for making us aware of these two works. We modify our manuscript to briefly discuss these papers in the related work section.
[1] Matsubara, Yoshitomo, Marco Levorato, and Francesco Restuccia. "Split computing and early exiting for deep learning applications: Survey and research challenges." ACM Computing Surveys 55.5 (2022): 1-30.
[2] Colocrese, Marco, Erdem Koyuncu, and Hulya Seferoglu. "Early-Exit meets Model-Distributed Inference at Edge Networks." 2024 IEEE 30th International Symposium on Local and Metropolitan Area Networks (LANMAN). IEEE, 2024.
[3] Wang, Jingcun, Bing Li, and Grace Li Zhang. "Early-exit with class exclusion for efficient inference of neural networks." 2024 IEEE 6th International Conference on AI Circuits and Systems (AICAS). IEEE, 2024.
[4] Ayyat, Mohammed, Tamer Nadeem, and Bartosz Krawczyk. "ClassyNet: Class-Aware Early-Exit Neural Networks for Edge Devices." IEEE Internet of Things Journal 11.9 (2023): 15113-15127.
[5] Dong, Rongkang, Yuyi Mao, and Jun Zhang. "Resource-constrained edge ai with early exit prediction." Journal of Communications and Information Networks 7.2 (2022): 122-134.
[6] Bajpai, Divya J., Aastha Jaiswal, and Manjesh K. Hanawal. "I-splitee: Image classification in split computing dnns with early exits." ICC 2024-IEEE International Conference on Communications. IEEE, 2024.
[7] Elhoushi, Mostafa, et al. "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.
[8] Bae, Sangmin, et al. "Relaxed recursive transformers: Effective parameter sharing with layer-wise lora." arXiv preprint arXiv:2410.20672 (2024).
[9] KhademSohi, Hossein, et al. "SelfXit: An Unsupervised Early Exit Mechanism for Deep Neural Networks." Transactions on Machine Learning Research.
[10] Jazbec, Metod, et al. "Fast yet safe: Early-exiting with risk control." Advances in Neural Information Processing Systems 37 (2024): 129825-129854.
[11] Meronen, Lassi, et al. "Fixing overconfidence in dynamic neural networks." Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2024.
The authors study different training regimes for early-exit networks (EENNs). To this end, they propose a framework consisting of 4 different metrics (gradient dominance, mode connectivity, numerical rank, mutual information) for studying the training dynamics of EENNs. They use the framework to explore the differences between commonly used training strategies (joint, disjoint) as well as their newly proposed mixed strategy. In the experiments, they show that their mixed strategy performs favourably (in most cases) compared to the joint and disjoint baselines.
Questions for Authors
Questions:
- The fact that the joint strategy leads to suboptimal performance for larger computational budgets reminds me a bit of the "interference" of the earlier classifiers discussed in the MSDNet paper (see the right plot in Figure 3 there). To tackle this, MSDNet introduces architectural changes to the CNN via "dense connectivity". So in some way your mixed strategy aims to do the same, but is more general since it doesn't require architectural modifications. And this is kinda confirmed by your experimental results, since, if I read Table 1 correctly, the mixed strategy improves over the joint strategy the least for the maximal budget on MSDNet
Claims and Evidence
/
Methods and Evaluation Criteria
/
Theoretical Claims
/
Experimental Design and Analysis
/
Supplementary Material
/
Relation to Prior Literature
/
Missing Important References
/
Other Strengths and Weaknesses
Strengths:
- I agree with the authors that studying and understanding the training approaches for EENNs is under-explored and that most papers just do what some of the seminal papers from the past did (MSDNet, SDN, etc.). Hence, I believe this work fills an important gap in the early-exiting literature
- I like the proposed framework for studying the training dynamics. I believe that going beyond just experimentally comparing different training regimes (e.g., via accuracy-FLOPs curves on ImageNet) is valuable and provides more insights into the differences between considered regimes
- The experiments presented support claims made in the paper and show that the mixed strategy might be the optimal one going forward in the early-exit community
Weaknesses:
- While I appreciate the framework presented, I wonder if all 4 metrics are indeed necessary. For example, I feel that the numerical rank is not that informative, as all 4 curves displayed in Figure 4 show quite a similar trend to me.
- While I understand it is not the focus of this work, it would still be valuable to say something about whether the findings on training dynamics presented in this paper translate to early-exit LLMs [1, 2, 3]
[1] Schuster, T., Fisch, A., Gupta, J., Dehghani, M., Bahri, D., Tran, V., Tay, Y. and Metzler, D., 2022. Confident adaptive language modeling. Advances in Neural Information Processing Systems, 35, pp.17456-17472.
[2] Bae, S., Ko, J., Song, H. and Yun, S.Y., 2023. Fast and robust early-exiting framework for autoregressive language models with synchronized parallel decoding. arXiv preprint arXiv:2310.05424.
[3] Chen, Y., Pan, X., Li, Y., Ding, B. and Zhou, J., 2023. Ee-llm: Large-scale training and inference of early-exit large language models with 3d parallelism. arXiv preprint arXiv:2312.04916.
Other Comments or Suggestions
/
We sincerely appreciate and agree with the reviewer’s assessment that our work addresses a meaningful gap in the early-exiting literature.
numerical rank informativeness
Firstly, the key insight of the numerical rank metric is that placing multiple early exits increases the rank, and thereby the expressiveness of the network's representation. The difference in ranks between the plain model (disjoint regime) and the mixed regime in Figure 4a is significant, as it is substantially larger than the standard deviation. In Fig. 4a we show only the mixed result for clarity, but Fig. 4b shows that both the mixed and joint regimes obtain similar ranks (given that both regimes foster enriched representations), and both are higher than the backbone's.
Secondly, the backbone remains unchanged in the disjoint regime, so the numerical rank experiment in Fig. 4a is also valuable because it may explain the inferior performance of early ICs in the disjoint regime. Joint training results in layers with more expressive intermediate representations (higher numerical rank), thus allowing for the coexistence of features relevant to the nearest IC and those relevant to the deepest ICs. On the other hand, a consistently lower rank is obtained for the disjoint regime. As mentioned above, when even an already-trained backbone is allowed to be affected by the ICs' gradients, the numerical rank of its representation rises (see Figure 4a), and so does its performance on lower budgets (see the accuracy results for disjoint vs mixed in any setting).
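As a sketch of how such a metric can be computed (illustrative code, not our exact implementation; the tolerance and normalization in the paper may differ):

```python
import torch

def numerical_rank(activations, rel_tol=1e-3):
    """Numerical rank of an (n_samples x width) activation matrix: the
    number of singular values above a tolerance relative to the largest."""
    s = torch.linalg.svdvals(activations)   # singular values, descending
    return int((s > rel_tol * s[0]).sum())

# Usage: collect intermediate features for a batch and compare regimes.
feats = torch.randn(512, 64)                 # stand-in for real activations
print(numerical_rank(feats))
```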
In the current version of the manuscript we rewrite the explanation of the numerical rank results to make these points more clear to the reader.
While I understand it is not the focus on this work, it would still be valuable to say something about whether the findings on training dynamics presented in this paper translate to early-exit LLMs [1, 2, 3]
We appreciate this insightful suggestion. We agree that exploring early-exit regimes in LLMs is a promising direction. However, we intentionally refrained from discussing generative LLMs, as translating our observations to early-exit LLMs is challenging due to several fundamental differences (we also note that recent early-exiting studies [1, 2] refrain from studying LLMs). Methods like CALM and FREE incorporate intricate confidence mechanisms, state-copying techniques, and synchronization strategies tailored specifically to token-level decisions, which are not directly analogous to the layer-wise early-exiting approach we analyzed. Moreover, EE-LLM emphasizes 3D parallelism and large-scale model training, aspects that differ substantially from our focus on training regimes and layer-wise representation dynamics.
However, as an early effort, we add an experiment on an NLP classification problem using BERT-B trained on the Newsgroups dataset (see the linked results). For the backbone network (BERT-B), the lower MI in earlier layers and higher MI in the rear layers used by the disjoint regime may explain its good performance only at the deeper layers. A slightly increased MI for the mixed regime is further indicative of better performance.
[1] Jazbec, Metod, et al. "Towards anytime classification in early-exit architectures by enforcing conditional monotonicity."
[2] Meronen, Lassi, et al. "Fixing overconfidence in dynamic neural networks."
[3] Matsubara, Yoshitomo, Marco Levorato, and Francesco Restuccia. "Split computing and early exiting for deep learning applications: Survey and research challenges."
The fact that joint strategy leads to suboptimal performance for larger computational budgets reminds me a bit of the "interference" …
We agree that interference between internal classifiers is a central issue - indeed, it is what inspired our gradient dominance metric. Our experiments reveal that even MSDNet's architectural adjustments, including its dense connectivity, do not fully resolve this issue. The same is true for gradient equilibrium from [1], as we show in Section 4.4. In fact, our mixed training regime outperforms these approaches, demonstrating more effective handling of interference and yielding superior performance, particularly under larger computational budgets.
[1] Li, Hao, et al. "Improved techniques for training adaptive deep networks."
The paper presents an enhanced early-exit training approach that combines two phases: initial backbone training followed by full multi-exit network training. This mixed strategy addresses the shortcomings found in both joint and disjoint training methods. While the paper presents its methodology clearly and provides thorough empirical validation, several limitations exist. The authors do not experiment with SOTA settings, and the proposed method, though well-explained, lacks technical innovation - it essentially combines two existing approaches.
Update after rebuttal
During the rebuttal period, the authors have added the FLOPs-accuracy curve, which is a valuable evaluation approach for early-exit models. The authors also clarify the novelty and the relationship between the method and the analysis, which addresses my concerns well. As a result, I raise my rating from Weak Reject to Weak Accept. I hope the authors can also add the FLOPs-accuracy curve (budgeted batch classification, proposed in MSDNet [1] and widely used in follow-up works [2-5]) in the final revision, because it is a very clean way to show the performance of early-exiting networks.
[1] Huang, Gao, et al. "Multi-Scale Dense Networks for Resource Efficient Image Classification." International Conference on Learning Representations. 2018.
[2] Li, Hao, et al. "Improved techniques for training adaptive deep networks." Proceedings of the IEEE/CVF international conference on computer vision. 2019.
[3] Yang, Le, et al. "Resolution adaptive networks for efficient inference." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.
[4] Han, Yizeng, et al. "Learning to weight samples for dynamic early-exiting networks." European conference on computer vision. Cham: Springer Nature Switzerland, 2022.
[5] Han, Yizeng, et al. "Dynamic perceiver for efficient visual recognition." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
Questions for Authors
No.
Claims and Evidence
Yes, they provide much analysis and experiments for their claims.
Methods and Evaluation Criteria
I am afraid not. The results presented in Tables 1-6 are problematic. They provide the x-axis as a ratio, making it hard for the reviewer to know at which FLOPs this performance is achieved. I suggest the authors present results following the literature below:
[1] Han, Yizeng, et al. "Dynamic perceiver for efficient visual recognition." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[2] Wang, Yulin, et al. "Not all images are worth 16x16 words: Dynamic transformers for efficient image recognition." Advances in Neural Information Processing Systems 34 (2021): 11960-11973.
Theoretical Claims
The authors provide some experimental analysis in section 3, but I can hardly find the relationship between them and the proposed method.
Experimental Design and Analysis
I have checked. I think the results presented in Tables 1-6 are problematic. They provide the x-axis as a ratio, making it hard for the reviewer to know at which FLOPs this performance is achieved. I suggest the authors present results following the literature below:
[1] Han, Yizeng, et al. "Dynamic perceiver for efficient visual recognition." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[2] Wang, Yulin, et al. "Not all images are worth 16x16 words: Dynamic transformers for efficient image recognition." Advances in Neural Information Processing Systems 34 (2021): 11960-11973.
Supplementary Material
Yes, the authors provide more results, visualizations and training setting in the supplementary.
Relation to Prior Literature
The work relates to the early-exit and dynamic neural network literature.
Missing Important References
The related work section is enough.
Other Strengths and Weaknesses
Early exiting is a very valuable research topic. But the experiments should report the FLOPs alongside each reported accuracy; 25%, 50%, 75% is very ambiguous.
Other Comments or Suggestions
No.
We thank the reviewer for the effort spent on reviewing our paper and the valuable insights. If we have addressed the remaining concerns, we kindly ask the reviewer to consider raising their score.
The authors do not experiment with SOTA settings. The results presented in Tables 1-6 are problematic. They provide the x-axis as a ratio, making it hard to know at which FLOPs this performance is achieved
In this work, we followed the performance reporting scheme initiated by [1] due to its simplicity and readability (though without the absolute FLOPs values). However, as per the reviewer’s suggestion, we provide the same results as FLOPs vs accuracy plots (similar to the one from Figure 1) in this link. We also include them in the current version of our manuscript.
[1] Kaya et al. "Shallow-deep networks: Understanding and mitigating network overthinking."
We are also not sure whether the reviewer meant this, but we want to emphasize that our paper does not aim to achieve state-of-the-art performance on any setup. Its purpose is to analyze the training dynamics and compare multi-exit model training strategies. We do this by evaluating all the considered regimes on multiple commonly-used architectures, on multiple modalities and datasets, and for multiple early-exiting approaches. Moreover, most of the results we present were computed for networks trained from scratch, with three different PRNG seeds. Our limited computational resources prevent us from running large SOTA models. Nevertheless, to demonstrate that our findings scale to larger models, we present the results of the ImageNet-1k experiment for the ViT-S variant of the vision transformer architecture in the link provided (Figure 11):
| Regime | 25% | 50% | 75% | 100% |
|---|---|---|---|---|
| Disjoint | 10.23 | 33.91 | 68.02 | 78.38 |
| Mixed | 49.50 | 75.17 | 78.28 | 78.33 |
| Joint | 50.10 | 73.99 | 76.38 | 76.44 |
We can see that - as before - the disjoint regime is still significantly inferior on lower budgets, while the joint regime exhibits a noticeable performance gap on higher budgets.
the proposed method, though well-explained, lacks technical innovation - essentially combining two existing approaches.
We thank the reviewer for their feedback. Our mixed training strategy is intentionally technically simple, and we consider this an advantage rather than a weakness of our work. We believe that clarity and effectiveness should take priority over unnecessary complexity. Rather than adding complexity for its own sake, we focus on a practical, well-reasoned method that consistently outperforms existing approaches. Moreover, technically simple methods are easy to implement and, as a consequence, are more likely to be widely adopted by the community. We emphasize that the primary goal of our work is to systematically analyze early-exit training strategies. Since the existing strategies are also simple, it is important to first analyze their limitations before considering more complex alternatives. Our paper demonstrates the weaknesses of the two commonly used training strategies and highlights the large impact of this previously overlooked aspect. The proposed "mixed" approach is the most straightforward way to alleviate the issues of the joint and disjoint regimes. While technically simple, it is not necessarily obvious, as it was not used in any prior work.
The authors provide some experimental analysis in section 3, but I can hardly find the relationship between them and the proposed method.
We appreciate the reviewer’s feedback and would like to clarify how the analyses in Section 3 are directly connected to the proposed mixed training strategy. Specifically, the experimental metrics - gradient dominance, mode connectivity, numerical rank, and mutual information - were chosen to reveal distinct aspects of the training dynamics under different regimes. For example, the gradient dominance analysis shows that the mixed training regime shifts the optimization focus toward later exits. This shift directly correlates with improved performance under higher computational budgets, as later exits become more robust. Please note the relation to the gradient equilibrium experiments in Sec. 4.4: the gradient dominance results may explain the need to rescale the rear exits when training in the joint regime, while the mixed regime obviates the need to apply gradient equilibrium. Mutual information and numerical rank show further aspects of why the performance of the presented regimes may differ (e.g., for early vs rear exits). Finally, mode connectivity reveals that the mixed and disjoint regimes produce very similar solutions, despite having significantly different performance on low computational budgets.
This submission analyses different training strategies for early-exit models, namely disjoint (frozen backbone), joint (end-to-end), and the proposed mixed (backbone pretraining + joint) approach. Several metrics are proposed, including gradient dominance, mode connectivity, numerical rank, and mutual information, each capturing a different angle of the training dynamics.
Questions for Authors
Please consider conducting an ablation on the number and placement of ICs, and how they affect the findings of the discussed analysis.
Claims and Evidence
In my opinion, there is limited evidence for some of the claims made in the submission. This is because the joint training scheme can be severely affected by factors such as the number and placement of early exits, which have not been adequately considered in the submission.
Methods and Evaluation Criteria
In my opinion, the proposed methods and evaluation criteria are meaningful for the examined problem.
Theoretical Claims
Not applicable.
Experimental Design and Analysis
In my opinion, the experimental analysis considers numerous backbone models and datasets, but not adequate variations in the configuration of the early-exit models.
Supplementary Material
I have read and fully considered the provided appendix in my review. I have not reviewed the provided codebase.
Relation to Prior Literature
In my opinion, the analysis conducted in this manuscript offers numerous insights for the training of early-exit models for CV and NLP. The proposed "mixed" approach lacks novelty, not because of its simplicity, but mostly because starting with an ImageNet-pretrained backbone can be considered common practice for ML practitioners deploying multi-exit models.
Missing Important References
In my opinion, relevant literature has been adequately cited.
Other Strengths and Weaknesses
Strengths:
- Overall the paper is well written and easy to follow; and studies an interesting problem.
- The conducted empirical analysis and proposed metrics are insightful and offer numerous findings that can be adopted to guide future research in the field.
Comments:
- The findings of this analysis, however, may not generalise across different configurations of early-exit models, as the number and placement of ICs affect the training dynamics of end-to-end (joint) approaches, as does the architecture of the ICs, which has been discussed in the manuscript.
- The proposed mixed training approach essentially comprises backbone pre-training followed by joint training, and cannot be considered novel as it is common practice in ML deployment. Nonetheless, the methods discussed in the appendix and their comparison to traditional techniques comprise a more interesting discussion.
Other Comments or Suggestions
I would suggest an alternative presentation, where "mixed" training is not posed as a novel contribution of this work. Instead the comparative analysis of different early-exit model training regimes could potentially be the main contribution of this work.
Post-rebuttal edit: Provisionally increasing my score from WR to WA, having considered the authors' rebuttal and other reviewers' comments.
We thank the reviewer for assessing our work and recognizing the importance of this previously overlooked aspect. We hope that the answers below adequately address the reviewer's questions and concerns. If that is the case, we kindly ask for a reconsideration of the score.
number and placement of early exits, which have not been adequately considered in the submission … but not adequate variations in the configuration of the early-exit models … consider conducting an ablation on the number and placement of ICs
In Appendix B we provide this kind of an analysis for our ViT-T model trained on the ImageNette dataset. These results show that our findings generalize to different placement densities. However, to make our case even stronger, we rerun those experiments on Tinyimagenet/ResNet-50 and include additional placement schemes (Dense-Sparse places ICs at blocks: [1, 2, 3, 4, 5, 6, 7, 11], Sparse-Dense places ICs at blocks: [1, 4, 8, 9, 10, 11, 12, 13, 14]). We present the results below:
| Placement | Regime | 25% | 50% | 75% | 100% |
|---|---|---|---|---|---|
| Every-1 | disjoint | 38.92 | 49.25 | 60.10 | 65.76 |
| Every-1 | joint | 52.20 | 62.49 | 65.52 | 65.59 |
| Every-1 | mixed | 52.22 | 63.03 | 67.21 | 67.35 |
| Every-2 | disjoint | 37.34 | 48.03 | 60.34 | 65.65 |
| Every-2 | joint | 51.81 | 62.60 | 65.55 | 65.38 |
| Every-2 | mixed | 52.41 | 63.19 | 67.14 | 67.33 |
| Every-3 | disjoint | - | 47.91 | 60.95 | 65.77 |
| Every-3 | joint | - | 63.33 | 67.22 | 67.21 |
| Every-3 | mixed | - | 62.52 | 66.72 | 66.71 |
| Every-4 | disjoint | - | 41.32 | 57.78 | 65.78 |
| Every-4 | joint | - | 62.54 | 66.30 | 66.27 |
| Every-4 | mixed | - | 62.07 | 67.08 | 67.14 |
| Every-5 | disjoint | - | 39.85 | 56.20 | 65.72 |
| Every-5 | joint | - | 61.10 | 65.61 | 65.79 |
| Every-5 | mixed | - | 61.95 | 67.33 | 67.40 |
| Dense-Sparse | disjoint | 38.47 | 50.64 | 62.04 | 65.74 |
| Dense-Sparse | joint | 53.14 | 62.23 | 64.76 | 64.93 |
| Dense-Sparse | mixed | 53.48 | 63.17 | 66.24 | 66.27 |
| Sparse-Dense | disjoint | 37.12 | 47.03 | 59.79 | 65.68 |
| Sparse-Dense | joint | 50.47 | 61.19 | 65.36 | 65.42 |
| Sparse-Dense | mixed | 51.19 | 62.27 | 67.26 | 67.47 |
The results are consistent with the main findings of the paper, that is:
- The mixed regime still presents generally better performance over the joint regime.
- The disjoint regime is still inadequate for low budgets.
The proposed "mixed" approach lacks novelty, … starting with an ImageNet-pretrained backbone can be considered common practice… …essentially comprises a backbone pre-training followed by joint training, and cannot be considered novel as it is common practice in ML deployment…
We emphasize that the proposed “mixed” approach is not in common use. The common practice of using pre-trained models for transfer learning is independent of the choice of the training regime. Starting with a model pre-trained on dataset A does not preclude us from fine-tuning on dataset B with different training regimes.
In particular, let’s assume that an ML practitioner starts with a backbone model (e.g., ViT-B) pretrained on dataset A (e.g., ImageNet-1k). Their aim is a multi-exit model that performs well on dataset B (e.g., CIFAR-100). If they take the backbone, attach classification heads, and train (fine-tune) everything jointly on dataset B, then that is still joint training (fine-tuning) according to our terminology.
To perform mixed-regime training in such a setting, we first fine-tune (on dataset B) the backbone only, and only then proceed to fine-tune everything together. In the paper, we compare all regimes in the transfer learning setting and present the results in Table 4. In this setting, the behaviour of each regime and the findings are the same as in the other experiments - both joint and disjoint training display a significant performance gap on some budgets.
If A == B, then taking a pre-trained model, attaching ICs, and training jointly would indeed be equivalent to our “mixed” approach. However, we argue that such cases are extremely uncommon.
Again, we apologize for the misunderstanding, which resulted from our insufficiently thorough description of the transfer learning experiment. In our current version of the manuscript we modify this section to present the pre-trained setup more clearly.
…alternative presentation…
We thank the reviewer for this valuable suggestion regarding the positioning of our contribution. Although it may not have been emphasized enough, our goal in this work is to provide a comprehensive comparative analysis of different early-exit training regimes, and investigate their training dynamics and implications. In the paper, we point out scenarios where the joint or even disjoint training regimes can be more suitable or advantageous. For instance, we observe that the joint regime may be preferable at very low computational budgets (where early classifiers dominate), and the disjoint regime can show good performance when the backbone model is already well-trained or fixed and further training resources are limited. Nevertheless, our empirical results consistently suggest that, in most practical scenarios and computational budgets, the mixed regime tends to outperform the others, hence our emphasis on highlighting its benefits.
Thank you for the detailed replies on the raised concerns. Although I remain skeptical about the novelty of the proposed "mixed" training approach, I believe that the comparative analysis between different training methods for early-exit models is indeed quite broad to drive robust conclusions, offering useful insights to practitioners and researchers. As such, I am provisionally increasing my score to WA, pending the reviewer discussion.
This paper investigates different training strategies for multi-exit neural networks, identifying limitations in commonly used joint and disjoint approaches. The authors propose a set of analytical metrics (gradient dominance, mode connectivity, numerical rank, mutual information) to understand training dynamics and introduce a "mixed" training strategy (pre-training the backbone, then jointly training the full network) which empirically outperforms the standard methods across various settings.
Reviewers generally agreed that the paper addresses an important and under-explored area in early-exit model training (kw5N, CDMi, gPtc) and that the comparative analysis, along with the proposed diagnostic metrics, provides valuable insights (kw5N, CDMi, DJjQ). The empirical evaluation was found to be comprehensive, covering multiple architectures, datasets, and exit configurations, particularly after the authors provided additional results during rebuttal (kw5N, gPtc, DJjQ). While the novelty of the proposed "mixed" strategy itself was questioned by some reviewers (kw5N, DJjQ), who noted similarities to standard pre-training practices, the authors mostly clarified the distinction between mixed pre-training and fine-tuning on the target dataset. Ultimately, reviewers acknowledged that the main contribution lies in the thorough comparative study and analysis, which fills a gap in the literature. Initial concerns regarding evaluation (e.g., the missing FLOPs-accuracy curves raised by DJjQ) and experimental variations (kw5N) were mostly addressed in the rebuttal. Another concern was the lack of language generation evaluations (cDMi). The authors claimed that this is beyond the scope of this paper; however, the claims in the title, abstract, etc. are general and do not clarify the focus on classification models. Given the wide use of LLMs these days, it would have been stronger to include such studies, or at least to discuss how this work could translate to that domain.