PaperHub

Rating: 5.3/10 (Poster; 4 reviewers; lowest 4, highest 6, standard deviation 0.8)
Individual ratings: 4, 6, 6, 5
Confidence: 3.5 · Correctness: 2.5 · Contribution: 2.3 · Presentation: 2.8
NeurIPS 2024

TinyTTA: Efficient Test-time Adaptation via Early-exit Ensembles on Edge Devices

OpenReview | PDF
Submitted: 2024-05-14 · Updated: 2025-01-14
TL;DR

Efficient Test-time Adaptation Framework for Microcontrollers along with an MCU TTA library.

Abstract

Keywords

Test-time adaptation, efficiency, edge device, microcontroller

Reviews and Discussion

Official Review (Rating: 4)

The paper "TinyTTA: Efficient Test-time Adaptation via Early-exit Ensembles on Edge Devices" presents a combination of test-time Adaptation of pre-trained models and early-exist strategies for the efficient inference of Deep Learning models on edge devices. The authors introduce the ideas of test-time Adaptation (i.e., some parameters of the model are changed during deployment by, e.g., entropy minimization) and model splitting by introducing "early exists" nodes in the model. Most notably, the authors introduce a novel weight normalization that does not rely on batch statistics (e.g. BatchNorm) but re-weights and re-scales the weights of the model directly. Finally, the authors briefly present their novel inference engine, TinaTTA, which leverages the author's ideas into a usable framework that is used for the experiments. The authors showcase some impressive results on established image datasets (CIFAR10C, CIFAR100C, OfficeHome, PACS) and known small devices (Raspberry et al. two and a dual-core embedded Arduino). In particular, the present method has low latency and low energy consumption while often offering a much better accuracy than existing methods.

Strengths

  • The overall idea of bringing test-time adaptation via early exits to small devices is good
  • The TinyTTA engine and the experimental evaluation seem impressive and well done

Weaknesses

Overall, this is a very engineering-focused paper without much methodological impact. Therefore, it can be questioned whether it fits the NeurIPS conference. More specifically:

  • The authors propose self-ensembling, which basically means splitting the network into smaller submodules. The idea is very straightforward and not really studied in depth in the paper. It is unclear how to split the network (the authors mention using activations as a guiding tool, but it is unclear how exactly they do it). Similar ideas have already been discussed in the literature (see next comment) but are not mentioned in the paper.
  • The authors propose early exits, which are not a novel technique but are already well known in the literature [1,2,3]. As far as I remember, this technique has been used since YOLOv5 (although mainly for improved training, and this might be debatable) [4]. Unfortunately, the Related Work section focuses on test-time adaptation (which is good; I learned a lot here, thanks!) but misses the related work on early-exit networks. Hence, it is unclear to me to what extent self-ensembling and early exits are really new or just a re-branding of existing methods.
  • While the evaluation is generally good, it does not highlight the memory overhead of introducing early exits. At the very least, the authors have to introduce a classification head for each exit, which requires additional parameters. I could not find a mention of this in the paper, but maybe I missed it (see my question).
  • Minor issues with the paper
    • Citations can be improved: there are a few arXiv papers that have already been published. I suggest using DBLP for high-quality references (note: there might be more than these three; I stopped checking after three):
      • Tent: Fully Test-time Adaptation by Entropy Minimization is an ICLR 2021 paper I think
      • Towards Stable Test-Time Adaptation in Dynamic Wild World is an ICLR 2023 paper I think
      • RobustBench: a standardized adversarial robustness benchmark is a NeurIPs 2021 Paper I think
    • The authors mention data distributional shifts multiple times in the paper. I suggest clarifying this a bit more, since it is unclear to me whether the shift is in the label space or the data space.
    • In section 3.1, the sentence "[...] and approximating each submodule with the full model's capabilities" is unclear. See my question.
    • In section 3.1, the merging of subsequent layers into submodules is not really clear. See my question.
    • The authors mention that "activations consume much more memory compared to weights", which is also unclear to me. See my question.

[1] Why should we add early exits to neural networks? by Scardapane et al., 2020. https://arxiv.org/abs/2004.12814

[2] BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks by Teerapittayanon et al., 2017. https://arxiv.org/abs/1709.01686

[3] T-RECX: Tiny-Resource Efficient Convolutional neural networks with early-eXit by Ghanathe and Wilton, 2023. https://arxiv.org/pdf/2207.06613

[4] TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios by Zhu et al., 2021

Questions

  1. What is the overhead in memory/number of parameters by adding new prediction heads? Did you explore this?
  2. How did you analyze the activations for grouping the models into sub-modules? Did you apply some principle here, or was it an ad-hoc grouping?
  3. What do you mean when you write, "activations consume much more memory compared to weights"?

Limitations

The authors discuss limitations in the evaluation of their method (image data only, only one MCU), but the evaluation is generally well-done and, hence, quite thorough.

Author Response

Thanks for all your positive comments. Please see below our responses to the specific weaknesses and questions.

Q1: Unclear how to use activation memory to split the network. How did you analyze the activations for grouping the models into sub-modules? Did you apply some principle here, or was it an ad-hoc grouping?

A1: Per-layer memory profiling is based on TinyTL [1], as discussed in lines 141-148 and 253-254. The grouping is done post hoc by visualizing the similar memory usage of adjacent layers, as discussed in line 146. We will further elaborate on this in Appendix D.
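To illustrate one way such per-layer profiling and grouping could be carried out (a hedged sketch only; the paper's actual profiling follows TinyTL and the grouping was done by visual inspection), forward hooks can record each layer's activation size, and adjacent layers with similar footprints can then be grouped:

```python
import torch

def profile_activation_memory(model, example_input):
    """Return a list of (layer_name, activation_bytes) for leaf modules."""
    sizes, hooks = [], []

    def make_hook(name):
        def hook(_module, _inp, out):
            if torch.is_tensor(out):
                sizes.append((name, out.numel() * out.element_size()))
        return hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:            # leaf layers only
            hooks.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(example_input)
    for h in hooks:
        h.remove()
    return sizes

def group_similar_layers(sizes, rel_tol=0.5):
    """Greedily group adjacent layers whose activation sizes are within rel_tol."""
    groups, current = [], [sizes[0]]
    for name, sz in sizes[1:]:
        if abs(sz - current[-1][1]) <= rel_tol * max(sz, current[-1][1]):
            current.append((name, sz))
        else:
            groups.append(current)
            current = [(name, sz)]
    groups.append(current)
    return groups
```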

Q2: The proposed early exits are not a novel technique but are well known in the literature. To what extent are self-ensembling and early exits really new rather than a re-branding of existing methods?

A2: We acknowledge that early-exit networks have been explored in the literature, primarily for inference efficiency. However, they have never been explored under data distribution shifts and for adaptation purposes. Our work is the first to introduce early exits for on-device test-time adaptation in edge devices, enhancing both inference efficiency and model accuracy with data distributional shifts. Specifically, our approach innovatively integrates weight standardization (WS) within the heads of early exits and designs these heads for subsequent modules. This ensures adaptation relies solely on WS in the heads, maintaining both inference speed and model accuracy with data distributional shifts. Moreover, our approach uniquely combines self-ensembling (frozen after fine-tuning), early exits, and WS normalization. This innovative design significantly improves both inference speed and accuracy. Additionally, our method brings novelty in hardware deployment by introducing the TinyTTA Engine, enabling the first on-device test-time adaptation with unlabeled data.

We will include more references on early-exit networks in the Related Work section and further clarify this novelty in Section 3.2.

Q3: While the evaluation is generally good, it does not highlight the memory overhead of introducing early exits.

A3: We have now conducted experiments to explore the overhead in memory and in the number of parameters from adding new prediction heads. The results are below:

| Dataset | Model | 1st Exit Memory (MB) | 1st Exit Params (K) | 2nd Exit Memory (MB) | 2nd Exit Params (K) | 3rd Exit Memory (MB) | 3rd Exit Params (K) |
|---|---|---|---|---|---|---|---|
| CIFAR10C | MobileNet | 0.0103 | 2.70 | 0.0323 | 8.46 | 0.0660 | 17.29 |
| CIFAR100C | MobileNet | 0.0436 | 11.43 | 0.0985 | 25.83 | 0.1652 | 43.30 |
| OfficeHome & PACS | MobileNet | 0.0306 | 8.03 | 0.0728 | 19.07 | 0.1266 | 33.19 |
| CIFAR10C | EfficientNet | 0.0198 | 5.19 | 0.1685 | 44.17 | 0.5230 | 137.10 |
| CIFAR100C | EfficientNet | 0.0696 | 18.24 | 0.3336 | 87.46 | 0.7540 | 197.67 |
| OfficeHome & PACS | EfficientNet | 0.0502 | 13.17 | 0.2694 | 70.62 | 0.6642 | 174.11 |
| CIFAR10C | RegNet | 0.0957 | 25.09 | 0.5349 | 140.22 | 0.5349 | 140.22 |
| CIFAR100C | RegNet | 0.1482 | 38.86 | 0.6616 | 173.43 | 0.6616 | 173.43 |
| OfficeHome & PACS | RegNet | 0.1278 | 33.51 | 0.6123 | 160.51 | 0.6123 | 160.51 |

According to the results of the above table, we observe the following:

  • Adding new prediction heads results in a minimal memory increase. For instance, MobileNet shows an overhead ranging from 0.01 MB to 0.17 MB (with parameters ranging from 2.7K to 43.30K), which is negligible compared to the 512 MB MPU memory.
  • Different models exhibit varying overheads. EfficientNet shows the highest increase in both memory (up to 0.75 MB) and parameters compared to MobileNet and RegNet.
  • Overall, compared to the 512 MB MPU memory deployed, the overhead remains relatively small. We will extend this discussion in Appendix D.

Q4: Citations can be improved: there are a few arXiv papers that have already been published.

A4: We will thoroughly review our paper and update our citations using DBLP to ensure high-quality references.

Q5: The authors mention data distributional shifts multiple times in the paper. I suggest clarifying this a bit more, since it is unclear whether the shift is in the label space or the data space.

A5: The shift occurs in the data space. As illustrated in Fig. 2, for an image classification task with an image as the input, the leftmost cat image shows a small distribution shift, as it still clearly depicts the cat; in this case, the severity is 1. Gradually, in the second and third images, the cat’s image becomes noisier and harder to discern with the human eye, corresponding to severities of 3 and 5, respectively. We will further clarify these distributional shifts in the data space in Section 3.
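As a rough illustration of how a severity level maps to corruption strength in the data space (an assumed Gaussian-noise example for illustration only, not the exact CIFAR-10-C corruption recipe):

```python
import torch

def corrupt(image, severity):
    """Illustrative corruption: higher severity -> stronger additive noise.

    The sigma values are assumptions for illustration, not CIFAR-10-C's recipe.
    """
    sigma = {1: 0.04, 3: 0.12, 5: 0.26}[severity]
    return (image + sigma * torch.randn_like(image)).clamp(0.0, 1.0)
```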

Q6: What do you mean when you write, "activations consume much more memory compared to weights"?

A6: Activations consume much more memory than weights in neural networks because the activations of every layer need to be stored during backpropagation, especially when using large batch sizes and deep networks. For instance, consider a single CNN layer with an input image (e.g., CIFAR100C) of size 224x224x3 and a batch size of 1. The input activations alone would be 1 (batch size) * 224 * 224 * 3 (channels) ≈ 150 thousand values, and the output activations after a convolutional layer with 64 filters would be 224 * 224 * 64 (channels/filters) ≈ 3.2 million values. In contrast, the convolutional layer weights would only be 64 (filters) * 3 * 3 * 3 (kernel size) ≈ 1,728 values. This shows that the sheer volume of activations, especially in layers producing large feature maps, leads to significantly higher memory consumption compared to the relatively small number of weights. We will clarify this in Appendix A.
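The element counts above can be reproduced with a few lines (the layer shape and "same" padding are the assumptions stated in the example):

```python
# Quick element-count comparison for the example layer described above
# (224x224x3 input, 3x3 conv with 64 filters, stride 1, same padding, batch size 1).
batch, h, w, c_in, c_out, k = 1, 224, 224, 3, 64, 3

input_activations  = batch * h * w * c_in     # 150,528 values (~0.15 M)
output_activations = batch * h * w * c_out    # 3,211,264 values (~3.2 M)
conv_weights       = c_out * c_in * k * k     # 1,728 values

print(input_activations, output_activations, conv_weights)
```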

[1] TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning. NeurIPS 2020

Comment

Dear reviewer. Thank you for reading our rebuttal! We believe that our response addresses your raised weaknesses (more justification and ablation study) and questions (more experimental results). If you agree that our response addresses the weaknesses and questions, please consider raising your score. If you have any outstanding concerns, please let us know so that we can do our best to address them.

Official Review (Rating: 6)

The paper "TinyTTA: Efficient Test-time Adaptation via Early-exit Ensembles on Edge Devices" presents a framework designed to enable test-time adaptation (TTA) on resource-constrained IoT devices. TinyTTA utilizes a self-ensemble and batch-agnostic early-exit strategy to adapt to data distribution shifts efficiently with smaller batch sizes, thereby reducing memory usage and improving latency. The approach is validated on Raspberry Pi Zero 2W and STM32H747, demonstrating significant improvements in accuracy, memory efficiency.

Strengths

  1. The authors address the challenges of deploying TTA on devices with stringent memory and computational constraints, which is a less explored area.
  2. The use of self-ensemble networks and early-exit strategies for TTA on edge devices allows for adaptive inference based on confidence levels, optimizing both memory usage and computational efficiency.
  3. Uses Weight Standardization (WS) as a substitute for traditional normalization layers, specifically tailored for microcontrollers with strict memory limitations.
  4. The paper provides a comprehensive evaluation of the framework across multiple datasets and compares its performance with several state-of-the-art methods. The evaluation covers all important metrics such as accuracy, memory usage, and latency.
  5. The paper is well-structured and clearly written, explaining all concepts and methodologies. Key concepts such as self-ensemble networks, early-exit strategies, and weight standardization are explained with sufficient clarity.
  6. Introduces the TinyTTA Engine, an MCU library that enables on-device TTA.

Weaknesses

main:

  • The paper only presents results on vision data; what about other types of data?
  • Implementation details needed for reproducibility are missing, and the code is not provided. E.g., there are no details about how submodules are created for the models used in the paper, nor the hyperparameters used during training, which hinders reproducibility.
  • While the results of the experiments are effectively demonstrated in the figures, the readability of the figures can be improved. Specifically, Figure 5 is hard to read. The caption of this figure is also not explanatory enough. Consider briefly summarizing the results in the caption.
  • The comparison was only limited to other TTA methods. Including non-TTA methods might provide a baseline to better understand the advantages of implementing TTA in resource-constrained environments
  • The experiments primarily focus on a specific type of MCU (STM32H747) and one MPU (Raspberry Pi Zero 2W). Is TinyTTA generalizable across other platforms as well?
  • Section 3.1, self-ensemble network, second paragraph: "(iii) certain groups of adjacent layers, specifically layers 2-15, 16-28, 29-44, and 45-52, show similar sizes of activations. Based on this analysis, we group layers with similar memory usage into submodules for subsequent early exits to improve memory usage: i.e., layers 0-15 for submodule 1, 16-28 for submodule 2, 29-44 for submodule 3, and 45-52 for submodule 4." This is:
    • very model-specific: does this generalize to other models?
    • overly detailed: please move such details to a table.
  • Section 3.1. second paragraph and also Figure 3: this is well known and, for example, also stated in the MCUNet paper.
  • Details on the early exits are not clear: which layers, etc., do the authors use?
  • The ablation study is incomplete: needs to also show early exit with and without model updates.
  • Table 1 should get another line: "Inference-only with EE".
  • The paper does not explore the sensitivity of TinyTTA to various hyperparameters.
  • The paper lacks a detailed analysis of adaptation time and energy consumption.
  • The paper does not discuss the challenges and trade-offs involved in deploying the framework.

minor:

  • last paragraph of the introduction: 512 KB of SRAM stated twice
  • Section 3.2 Early Exits: "For a given pre-trained model ," -> remove space before comma
  • Figure 5: I suggest using the same color for the same models, e.g., a similar color for MCUNet and MCUNet+TinyTTA
  • Figure 6: font size of the figures is too small
  • Figure 7: please add units % and KB

Questions

  • Could you possibly add more details about the computational overhead of self-ensembles and early exits?
  • What is the motivation for using entropy thresholding in early exit?
  • While you evaluated TinyTTA on four benchmark corruption datasets, these are based on synthetic noise. How does TinyTTA perform on datasets with real-world distribution shifts or noise?
  • While you mention improvements in energy efficiency, a detailed analysis is not provided. Can you provide a breakdown of energy consumption for different components of TinyTTA and compare it with baseline methods?
  • How do the optimal entropy thresholds in Appendix G differ, and how do they impact the performance of the system?

Limitations

The authors provided a limitations section, and to the best of my understanding, the paper does not have any negative societal impacts.

Author Response

Thanks for all your positive comments. Please see below our responses.

Q1: Other types of data beyond vision data?

A1: We tested on the Musan Keywords Spotting audio dataset [2], which includes 35 speech commands with real-world noises, using a pretrained MicroNets model [1] (86% accuracy on Speech Commands V2). The results are below:

| Method | Accuracy |
|---|---|
| No Adaptation | 0.53 |
| CoTTA | 0.21 |
| TENT (Finetune) | 0.05 |
| TENT (Modulating) | 0.11 |
| EATA | 0.07 |
| ECoTTA | 0.23 |
| TinyTTA (Ours) | 0.61 |

We observed:

  • A 33% performance drop in the pretrained model under distribution shifts.
  • TinyTTA achieved the highest accuracy of 0.61 (8% improvement).

Q2: Implementation Details.

A2: As stated in line 683, we will provide the source code of the TinyTTA Engine. We will further elaborate in Appendix C.

Q3: Non-TTA methods?

A3: We implemented a new baseline using Test-Time Training (TTT) [3] with self-supervised rotation classification, using SGD (momentum 0.9, lr 1e-5), an augmentation size of 20, and a batch size of 1. Memory is measured on a Raspberry Pi Zero 2W.

| Method | Model | Acc. CIFAR10C ↑ | Acc. CIFAR100C ↑ | Acc. OfficeHome ↑ | Acc. PACS ↑ | Mem. CIFAR10C (MB) ↓ | Mem. CIFAR100C (MB) ↓ | Mem. OfficeHome (MB) ↓ | Mem. PACS (MB) ↓ |
|---|---|---|---|---|---|---|---|---|---|
| TTT | MCUNet | 0.16 | 0.05 | 0.07 | 0.07 | 0.41 | 1.35 | 1.32 | 1.33 |
| TTT | EfficientNet | 0.18 | 0.06 | 0.06 | 0.12 | 12.81 | 37.43 | 37.21 | 37.71 |
| TTT | MobileNet | 0.17 | 0.07 | 0.08 | 0.10 | 11.43 | 36.27 | 36.71 | 36.58 |
| TTT | RegNet | 0.15 | 0.12 | 0.06 | 0.07 | 12.28 | 14.33 | 14.45 | 14.20 |
| TinyTTA (Ours) | MCUNet | 0.64 | 0.52 | 0.58 | 0.64 | 0.2 | 0.73 | 0.71 | 0.72 |
| TinyTTA (Ours) | EfficientNet | 0.68 | 0.53 | 0.62 | 0.66 | 5.65 | 16.94 | 16.97 | 16.91 |
| TinyTTA (Ours) | MobileNet | 0.65 | 0.53 | 0.60 | 0.63 | 5.58 | 16.74 | 16.79 | 16.77 |
| TinyTTA (Ours) | RegNet | 0.64 | 0.51 | 0.54 | 0.57 | 6.13 | 6.25 | 6.28 | 6.22 |

  • TinyTTA outperforms TTT by ~50% across four datasets and models.
  • TTT requires, on average, double the memory compared to TinyTTA.

Q4: The experiments focus on a specific type of MCU and MPU. Is TinyTTA generalizable to other chips?

A4: Yes, we selected these two platforms as they represent a range of devices. Specifically, since our MCU is based on Cortex-M processors, TinyTTA can be directly deployed on ARM Cortex-M-based MCUs. We also chose one of the smallest Raspberry Pis to ensure that other Pis can also run TinyTTA. Further clarification will be provided in Appendix C.1.

Q5: Ablation study of early exit w/ and w/o model updates.

A5: The accuracy of the model w/o updates is discussed in Fig. 5. The memory usage (MB) w/ and w/o updates is shown below.

| Model | CIFAR10C w/o | CIFAR10C w/ | CIFAR100C w/o | CIFAR100C w/ | OfficeHome w/o | OfficeHome w/ | PACS w/o | PACS w/ |
|---|---|---|---|---|---|---|---|---|
| MCUNet | 0.189 | 0.2 | 0.726 | 0.73 | 0.69 | 0.71 | 0.711 | 0.72 |
| EfficientNet | 4.94 | 5.65 | 15.78 | 16.94 | 15.99 | 16.97 | 15.93 | 16.91 |
| MobileNet | 5.47 | 5.58 | 16.43 | 16.74 | 16.56 | 16.79 | 16.54 | 16.76 |
| RegNet | 5.81 | 6.18 | 4.44 | 6.25 | 4.47 | 6.28 | 4.41 | 6.22 |

  • The memory usage w/ and w/o adaptation is very similar.
  • TinyTTA generally requires limited memory to perform on-device TTA. We will update Fig. 7 and Section 5.4.

Q6: Computational overhead of self-ensembles and early exits?

A6: Self-ensembles are conducted offline (line 465). We analysed the overhead in memory and the number of parameters of the early exits:

| Model | Dataset | 1st Exit Memory (MB) | 1st Exit Params (K) | 2nd Exit Memory (MB) | 2nd Exit Params (K) | 3rd Exit Memory (MB) | 3rd Exit Params (K) |
|---|---|---|---|---|---|---|---|
| MobileNet | CIFAR10C | 0.01 | 2.70 | 0.03 | 8.46 | 0.07 | 17.29 |
| MobileNet | CIFAR100C | 0.04 | 11.43 | 0.10 | 25.83 | 0.17 | 43.30 |
| MobileNet | OfficeHome & PACS | 0.03 | 8.03 | 0.07 | 19.07 | 0.13 | 33.19 |
| EfficientNet | CIFAR10C | 0.02 | 5.19 | 0.17 | 44.17 | 0.52 | 137.10 |
| EfficientNet | CIFAR100C | 0.07 | 18.24 | 0.33 | 87.46 | 0.75 | 197.67 |
| EfficientNet | OfficeHome & PACS | 0.05 | 13.17 | 0.27 | 70.62 | 0.66 | 174.11 |
| RegNet | CIFAR10C | 0.10 | 25.09 | 0.53 | 140.22 | 0.53 | 140.22 |
| RegNet | CIFAR100C | 0.15 | 38.86 | 0.66 | 173.43 | 0.66 | 173.43 |
| RegNet | OfficeHome & PACS | 0.13 | 33.51 | 0.61 | 160.51 | 0.61 | 160.51 |

Adding new prediction heads results in minimal memory increase, e.g., MobileNet's overhead is 0.01 MB to 0.17 MB, negligible compared to 512 MB MPU memory. We will extend this in Appendix D.

Q7: Motivation for using entropy thresholding? Sensitivity of hyperparameters?

A7: Entropy thresholding, as used in many TTA methods like [4], avoids high-entropy, less reliable samples to maintain TTA performance. Hyperparameters, including the entropy threshold, are determined post hoc [3] and discussed in Appendix G. We will add a table of layer exits and other hyperparameters in Appendix D.
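For concreteness, entropy-thresholded early exiting at inference time can be sketched as below (the module and threshold names are illustrative assumptions, not the TinyTTA Engine API):

```python
import torch

def early_exit_inference(submodules, exit_heads, x, entropy_thresholds):
    """Run submodules sequentially; return the first exit whose prediction
    entropy falls below its threshold, otherwise the last exit's output."""
    feats, logits = x, None
    for stage, (block, head, tau) in enumerate(
            zip(submodules, exit_heads, entropy_thresholds)):
        feats = block(feats)
        logits = head(feats)
        probs = logits.softmax(dim=1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1).mean()
        if entropy < tau:              # confident enough: exit early
            return logits, stage
    return logits, len(submodules) - 1
```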

Q8: Energy consumption compared with baseline methods?

A8: We have now compared latency and energy consumption on the Raspberry Pi Zero 2W MPU using CIFAR10C:

CIFAR10C (50,000 images):

| Method | Latency (seconds) | Energy (Wh) |
|---|---|---|
| CoTTA | 312,500 | 173.61 |
| TENT (Finetune) | 25,500 | 14.17 |
| TENT (Modulating) | 25,500 | 14.17 |
| EATA | 12,500 | 6.94 |
| ECoTTA | 18,850 | 10.47 |
| TinyTTA (Ours) | 11,000 | 6.11 |

  • TinyTTA has an inference time of 0.22 seconds per sample and energy consumption of 0.122 mWh, showing high efficiency.
  • TinyTTA outperforms baselines, reducing latency by 12% (1500 seconds) and energy consumption by 12% (0.83 Wh) compared to EATA.

[1] MicroNets. MLSys 2021
[2] Importantaug. ICASSP 2022
[3] Test-Time Training with Self-Supervision for Generalization under Distribution Shifts. ICML 2020
[4] Efficient Test-Time Model Adaptation without Forgetting. ICML 2022

Comment

Thanks for the detailed rebuttal and new results. I do not have any further questions at this point.

Comment

Dear reviewer. Thank you for reading our rebuttal! We believe that our response addresses your raised weaknesses (more justification and ablation study) and questions (more experimental results). If you agree that our response addresses the weaknesses and questions, please consider raising your score. If you have any outstanding concerns, please let us know so that we can do our best to address them.

Comment

Thank you for your detailed replies to my questions and also the questions of the other reviewers. After reviewing everything and taking into account the other reviews, I stand by my original assessment of the paper.

Comment

Dear Reviewer. Thank you so much again for your time and effort in thoroughly reviewing our work/rebuttal and your response! We are glad that our response addressed your questions properly. In our final draft, we will update our paper based on your feedback and our rebuttal.

Sincerely, The Authors

Official Review (Rating: 6)

This work presents a test-time adaptation framework for tiny deep neural networks. Specifically, the proposed framework partitions a specific model based on the memory usage of each layer, clusters adjacent layers with similar memory usage into a submodule, and adds an early exit header for each module. To avoid using batch normalization, the authors adopt weight standardization for the early exit header layer. Only the early exit header layer and the corresponding weight standardization parameters are updated during test-time adaptation. The authors also developed an MCU library to support the aforementioned test-time adaptation. The framework can support low-end IoT devices with only 512KB of memory.

Strengths

  1. Real device deployment: It is great to see that the proposed framework can facilitate the deployment of tiny neural networks on low-end IoT devices with only 512 KB of memory.

  2. Well-motivated: The analysis of the memory usage of existing test-time adaptation techniques clearly highlights the drawbacks of previous methods, making the idea of partitioning models based on memory usage quite straightforward.

  3. Impressive results: The significantly better accuracy versus memory usage compared to baseline test-time adaptation on four different models is very impressive.

Weaknesses

  1. Lack of discussion on design choices: For example, why can't the "fine-tune bias only" technique from TinyTL [28] be used in test-time adaptation? What is its performance compared to only fine-tuning the early exit header proposed in this work? Why is there a "lack of support for normalization layers on MCUs"? Since the authors have developed their own MCU library, why can't the normalization layer be added to the library?

  2. Limited experiments: As the authors mentioned in the Conclusion section, this work only targets image data. Thus, it is unclear whether the design in this work can be generalized to other applications. For example, will only updating the early exit header be sufficient for other applications?

  3. Lack of details on the TinyTTA Engine: Since the algorithm is not entirely novel (i.e., adding multiple early exit headers and weight standardization are not proposed by the authors, but the authors may be the first to use them in test-time adaptation for tiny models), the TinyTTA Engine itself seems to be the key factor ensuring these techniques work efficiently and effectively on real devices. More details and insights from the implementation of the TinyTTA Engine would be greatly appreciated by the community.

Questions

  1. Will you open-source the TinyTTA engine?

  2. Is there any real-world case to show that updating the model via backpropagation, instead of simply switching some modes, makes a significant difference for tiny models?

Limitations

The author discussed the limitations but did not address the potential negative societal impact of their work. This should be fine because, in my opinion, this work does not have any negative societal impact.

Author Response

Thanks for all your positive comments. Please see below our responses to the specific weaknesses and questions.

Q1: Lack of discussion on design choices: why can't the "fine-tune bias only" technique from TinyTL [28] be used in test-time adaptation? What is its performance compared to only fine-tuning the early exit header proposed in this work?

A1: TinyTL [28] needs labels for incoming samples to adapt the model, whereas TTA typically handles situations where only samples are available without labels. This is more practical in real-world settings, as providing labels is often challenging due to unforeseen noise and environmental factors. We have now conducted a "fine-tune bias only" experiment via entropy minimization and compared it with adjusting the exits. Specifically, during TTA, only the biases are allowed to be updated. The results are below:

| Method | CIFAR10C | CIFAR100C | OfficeHome | PACS |
|---|---|---|---|---|
| Bias only-MCUNet | 0.15 | 0.09 | 0.11 | 0.07 |
| Exits-MCUNet | 0.64 | 0.52 | 0.58 | 0.64 |
| Bias only-EfficientNet | 0.13 | 0.15 | 0.18 | 0.09 |
| Exits-EfficientNet | 0.68 | 0.53 | 0.62 | 0.66 |
| Bias only-MobileNet | 0.16 | 0.13 | 0.15 | 0.11 |
| Exits-MobileNet | 0.65 | 0.53 | 0.60 | 0.63 |
| Bias only-RegNet | 0.18 | 0.13 | 0.19 | 0.17 |
| Exits-RegNet | 0.64 | 0.51 | 0.54 | 0.57 |

Based on the experiment, we can observe that:
(1) Adjusting bias alone could not achieve reliable TTA performance.
(2) TinyTTA is relatively stable across four datasets and models.
We consider that the results are related to the characteristics of TTA, which aims to align data distribution shifts by adjusting the mean and variance. Adjusting the bias alone is insufficient to maintain reliable performance across different datasets and conditions. We will incorporate these new results in Appendix E under a new heading: "Comparison with Updating Bias Only."

Q2: Why is there a "lack of support for normalization layers on MCUs"? Since the authors have developed their own MCU library, why can't the normalization layer be added to the library?

A2: Normalization layers are designed to work with mini-batches of data. However, due to limited memory resources, MCUs typically only allow a batch size of one. As such, normalization layers can technically be added to libraries for MCUs. In practice, the normalization layer and the convolutional layer operations are combined into a single convolutional layer operation in order to save computation and memory, as in [1,2]. We will emphasize this further in the paper in Appendix A, "Modulating and Finetune TTA."
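For reference, the conv-norm fusion mentioned here (as in [1,2]) folds the running statistics and affine parameters of an inference-mode BatchNorm into the preceding convolution. A minimal sketch, assuming a Conv2d directly followed by an affine BatchNorm2d (not the TinyTTA Engine's actual implementation):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an inference-mode BatchNorm2d into the preceding Conv2d."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std                               # per-output-channel scale
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused
```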

Q3: Limited experiments: As the authors mentioned in the Conclusion section, this work only targets image data. Thus, it is unclear whether the design in this work can be generalized to other applications. For example, will only updating the early exit header be sufficient for other applications?

A3: We conducted an experiment on a different data modality with real-world distribution shift, namely audio, using a pretrained MicroNets model [1] trained on Speech Commands V2 with 86% accuracy. This dataset contains 35 keywords such as "yes," "no," "forward," etc. We tested the model on the Musan Keywords Spotting test dataset [2], which includes 35 speech commands under real-world noises such as dial tones, fax machine noises, car idling, thunder, wind, footsteps, rain, and animal noises. The setting aims to adapt the pretrained speech command model to real-world scenarios. TinyTTA parameters are: learning rate (lr) = 1e-5, batch size of 1, the SGD optimizer with a momentum of 0.9, and self-ensemble early-exit layers at [3, 5, 7]. The results are as follows:

| Method | Accuracy |
|---|---|
| No Adaptation | 0.53 |
| CoTTA | 0.21 |
| TENT (Finetuning) | 0.05 |
| TENT (Modulating) | 0.11 |
| EATA | 0.07 |
| ECoTTA | 0.23 |
| TinyTTA (Ours) | 0.61 |

Based on the experiment, we can observe that:
(1) The pretrained model could experience a performance drop of 33% in distribution shift settings.
(2) TinyTTA improved accuracy by 8% over the baseline, showing strong resilience to various noises.
(3) TinyTTA achieved the highest accuracy of 0.61, significantly outperforming other methods (the highest among the state-of-the-art baselines is ECoTTA with 0.23).

Q4: Lack of details on the TinyTTA Engine: Since the algorithm is not entirely novel.

A4: Our approach uniquely combines self-ensembling (frozen after fine-tuning), early exits, and WS normalization. This innovative design significantly improves both inference speed and accuracy. Additionally, our method brings novelty in hardware deployment by introducing the TinyTTA Engine, enabling the first on-device test-time adaptation with unlabeled data. We will provide more details in Appendix C including details of backpropagation, operators, layerwise update strategy, and dynamic memory allocation.

Q5: Will you open-source the TinyTTA engine?

A5: Yes. As stated in line 683, the TinyTTA Engine code will be made fully publicly available upon acceptance.

Q6: Is there any real-world case to show that updating the model via backpropagation, instead of simply switching some modes, makes a significant difference for tiny models?

A6: In realistic scenarios of TTA, we do not have knowledge of the given target domain. Hence, it is difficult to switch to the right model suitable for the target domain. Additionally, note that our target hardware consists of MCUs with extremely limited storage, typically at most 1 MB. Even if we have knowledge of the target domain for a stream of data, we can only store up to 2-3 tiny models (refer to Table 1 for the memory and storage requirements of a single model). We will clarify this in Appendix A.

[1] Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. CVPR 2018
[2] TensorFlow Lite Micro. MLSys 2021.

Comment

Dear reviewer. Thank you for reading our rebuttal! We believe that our response addresses your raised weaknesses (more justification and ablation study) and questions (more experimental results). If you agree that our response addresses the weaknesses and questions, please consider raising your score. If you have any outstanding concerns, please let us know so that we can do our best to address them.

Comment

Thank you to the authors for the detailed response and for the efforts in conducting the additional experiments.

I have revised my evaluation.

Comment

Dear reviewer. Thank you so much again for your time and effort in thoroughly reviewing our work/rebuttal and your response! We are glad that our response addressed your questions properly. In our final draft, we will update our paper based on your feedback and our rebuttal.

Sincerely, The Authors

Official Review (Rating: 5)

In this work, the authors focus on enabling test-time adaptation on resource-limited edge devices. To achieve that, the authors first train a self-ensemble network in which the sub-networks are partitioned according to memory usage. After that, the authors further adopt WS normalization to improve adaptation capacity with a batch size of one. In the experiments, the proposed method shows better accuracy with higher efficiency compared to prior test-time adaptation works.

Strengths

  1. Although self-ensemble learning and WS normalization are not new, this appears to be their first use for on-device test-time adaptation.

  2. It is interesting to see that adopting WS normalization can achieve better accuracy compared to other normalization layers in the batch-size-1 setting.

Weaknesses

  1. The motivation for modulating the pre-trained model according to memory usage is confusing.

  2. The on-device training process is unclear.

  3. The paper writing needs to be improved.

Questions

  1. The motivation for modulating the pre-trained model according to memory usage is confusing. Fig. 3 shows that the initial layers occupy much more memory than the later layers. However, the authors group the first several layers into the first submodule, which is always activated during on-device test-time adaptation, leading to less memory reduction. Would it be better to avoid passing through some of the initial layers and activate more of the later layers?

  2. One main concern is that the training process that happens on-device is unclear.

    (1) Self-ensemble learning has a higher training cost. In line 206, the authors mention that "After training, only the submodules and early exits will be deployed on MCUs." It seems that the proposed method first does self-ensemble learning offline, which incurs extra training cost.

    (2) Since the work targets test-time adaptation, where the source data should be unavailable, how do the authors train the self-ensemble networks?

    (3) In line 220, the authors say "The cornerstone of TinyTTA lies in its ability to perform backpropagation on-device". What is the difference compared to [1]? Which part of the model will be trained on-device?

    [1] On-device Training Under 256KB Memory. NeurIPS 2022

  3. Fig. 3 is unclear. From my understanding, modulation aims to partition the entire model into different groups based on activation memory. Why does the activation memory change per layer, and why is there no weight memory, compared to "fine-tuning per-layer memory usage"?

  4. Writing needs to be improved:

    (1) In line 97, "practical" should be "impractical"

    (2) Line 170 is incomplete

    (3) In line 175, the authors say "to entirely omit usage of normalization layers and .... on MCUs". Isn't WS normalization a normalization layer used on MCUs?

Limitations

Yes

Author Response

We appreciate the insightful comments. Please see below our responses.

Q1: In Fig. 3, the initial layers occupy much more memory than the later layers. Would it be better to avoid passing through some of the initial layers and activate more of the later layers?

A1: Memory usage primarily concerns activations, which store the outputs at each layer. This storage is essential during backpropagation, as computing gradients requires retaining both the outputs and gradient values for each node in each layer. However, we only update the Weight Standardization (WS) layer in the heads at early exits and freeze the submodules after self-ensembling (cf. Fig 2). This process does not require backpropagation to the submodules, thus eliminating the need for activation memory. Consequently, it does not increase memory usage.

Additionally, avoiding the initial layers is not feasible because these layers capture crucial information from the input. Skipping them would result in incomplete information being learned, leading to degraded performance. We will further clarify this in Appendix A, “Modulating and Finetune TTA”.
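A minimal sketch of this setup (the names and the per-exit loss aggregation are assumptions, not the TinyTTA implementation): the frozen submodules run under no_grad, so their activations are not retained for backpropagation, and gradients exist only inside the small exit heads.

```python
import torch

def adapt_exit_heads_only(submodules, exit_heads, x, optimizer):
    """Entropy-minimization update that touches only the exit heads.

    Frozen submodules run under no_grad (no backbone activations stored);
    the optimizer is assumed to cover only the exit-head parameters.
    """
    losses, feats = [], x
    for block, head in zip(submodules, exit_heads):
        with torch.no_grad():          # frozen submodule: no activation storage
            feats = block(feats)
        logits = head(feats)           # trainable head operates on detached features
        probs = logits.softmax(dim=1)
        losses.append(-(probs * probs.clamp_min(1e-12).log()).sum(dim=1).mean())
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```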

Q2: The authors mention "After training, only the submodules and early exits will be deployed on MCUs." It seems that the method first does self-ensemble learning offline, which incurs extra training cost.

A2: The self-ensemble model is trained offline, so there is no additional training cost on the device. The training procedure is discussed in Appendix C.2. We will include a heading for the first paragraph as 'Pre-training of Self-ensemble' and further clarify this point.

Q3: Since the source data should be unavailable, how do the authors train the self-ensemble networks?

A3: The self-ensemble networks are trained offline, so we still use the source data, the same as related work in the literature [2]. However, once deployed on the device, the adaptation using TinyTTA does not have access to the source data, adhering to the standard TTA pipeline and the same configuration as in [2]. We will clarify this point in Section 3.1.

Q4: What is the difference compared to [1]?

A4: We discussed [1] TinyEngine in Appendix C.4. Specifically, TinyEngine focuses on on-device training (with labeled data), whereas TinyTTA Engine (ours) focuses on on-device test-time adaptation (with unlabeled data). TinyEngine pre-determines the layers and channels to be updated before deployment statically (i.e., as a binary file), executing these updates at runtime. In comparison, TinyTTA is dynamic during inference towards exits in submodules, enabling the exiting of high-entropy samples for reliable TTA. To enable TinyEngine for TTA, the only viable solution is using TENT [3] to finetune with entropy minimization. To this end, we compared TinyEngine (dubbed as TE) using TENT on a Raspberry Pi Zero 2W, using batch size 1, with TinyTTA (ours) in terms of accuracy. All configurations are the same as in Appendices B and C. The results are below:

| Method | CIFAR10C | CIFAR100C | OfficeHome | PACS |
|---|---|---|---|---|
| TE-MCUNet | 0.13 | 0.06 | 0.07 | 0.06 |
| TinyTTA (Ours)-MCUNet | 0.64 | 0.52 | 0.58 | 0.64 |
| TE-EfficientNet | 0.19 | 0.11 | 0.09 | 0.07 |
| TinyTTA (Ours)-EfficientNet | 0.68 | 0.53 | 0.62 | 0.66 |
| TE-MobileNet | 0.18 | 0.05 | 0.05 | 0.06 |
| TinyTTA (Ours)-MobileNet | 0.65 | 0.53 | 0.60 | 0.63 |
| TE-RegNet | 0.15 | 0.12 | 0.07 | 0.08 |
| TinyTTA (Ours)-RegNet | 0.64 | 0.51 | 0.54 | 0.57 |

We can make the following observations:

  • Powered by the TinyTTA Engine, TinyTTA generally performs stably as it allows for dynamically exiting high-entropy samples.
  • TE is unable to perform stable TTA across all datasets. We will update the paper in Appendix C.4.

Q5: Which part of the model will be trained on device?

A5: After deployment, as shown in Fig. 2, only the exits will be updated on-device, while the remaining parts of the model are frozen, ensuring both high TTA accuracy and low memory usage.

Q6: Why does activation memory change per layer, and no weight memory in Fig 3, compared to "fine-tuning per-layer memory usage"?

A6: The primary memory usage for activations is determined by the size of the last output tensor of each layer, essentially storing each layer’s outputs. Since each layer’s output shape is different, their activation memory will accordingly be different. The weight memory is very small (a few KBs) compared to the activation memory. The weight memory usage for modulating TTA, i.e., the change of two parameters, Scale (γ) and Shift (β), is relatively small and not visible in Fig 3.

Consider a single CNN layer with an input image size of 224x224x3 and a batch size of 1. The input activations alone would be 1 (batch size) * 224 * 224 * 3 (channels) ≈ 150 thousand values, and the output activations after a convolutional layer with 64 filters would be 224 * 224 * 64 (channels/filters) ≈ 3.2 million values. In contrast, the convolutional layer weights would only be 64 (filters) * 3 * 3 * 3 (kernel size) ≈ 1,728 values. This indicates that the sheer volume of activations, especially in layers producing large feature maps, leads to significantly higher memory consumption compared to the relatively small number of weights. We will clarify this in Appendix A.

Q7: The authors said "to entirely omit usage of normalization layers ...". Isn't WS normalization a normalization layer?

A7: We discussed how to deploy Weight Standardization (WS) normalization in lines 189-190 and Fig 4. Specifically, WS will be applied within the CNN exit layer (i.e., a new CNN layer which was introduced by TinyTTA Engine during deployment) to avoid using batch normalization layers. We will further clarify this in Section 3.3.
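For concreteness, weight standardization re-centers and re-scales the convolution kernel on the fly, so no batch statistics are required. A minimal sketch of a weight-standardized convolution (illustrative only, not the TinyTTA Engine implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d whose weights are standardized per output channel on every forward."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        w = (w - mean) / std                      # re-center and re-scale the kernel
        return F.conv2d(x, w, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```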

[1] On-device Training Under 256KB Memory. NeurIPS 2022
[2] EcoTTA: Memory-Efficient Continual Test-time Adaptation via Self-distilled Regularization, CVPR 2023
[3] Tent: Fully test-time adaptation by entropy minimization, ICLR 2021

Comment

Dear reviewer. Thank you for reading our rebuttal! We believe that our response addresses your raised weaknesses (more justification and ablation study) and questions (more experimental results). If you agree that our response addresses the weaknesses and questions, please consider raising your score. If you have any outstanding concerns, please let us know so that we can do our best to address them.

Comment

Thank you for your thoughtful response. The authors have addressed my concerns, and I raised my score accordingly.

Comment

Dear reviewer. Thank you so much again for your time and effort in thoroughly reviewing our work/rebuttal and your response! We are glad that our response addressed your questions properly. In our final draft, we will update our paper based on your feedback and our rebuttal.

Sincerely, The Authors

Author Response

Dear reviewers and meta reviewers,

We appreciate all the positive comments on our work:

  • Reviewer yW6M: First use of WS and self-ensembling for on-device test-time adaptation.
  • Reviewer DHsU: Well-motivated, impressive results, and real device deployment on low-end IoT devices with only 512 KB of memory.
  • Reviewer 2zcM: Comprehensive evaluation, well-structured and written, and novel TinyTTA Engine.
  • Reviewer bEbZ: First TTA for small devices; TinyTTA Engine is impressive and well evaluated.

We have addressed all the comments by providing more clarifications and new results:

  • Reviewer yW6M: We have clarified memory usage in the initial layers, self-ensemble training, the components updated on-device, activation memory, and WS normalization. New experiment: comparison with on-device training.
  • Reviewer DHsU: We have clarified fine-tuning bias only, normalization layers on MCUs, the TinyTTA Engine, and switching between models instead of TTA. New experiments: a real-world different data modality; comparison with "fine-tune bias only."
  • Reviewer 2zcM: We have clarified implementation details and the motivation for entropy thresholding. New experiments: a real-world different data modality, a new non-TTA baseline, and an ablation study of early exits.
  • Reviewer bEbZ: We have clarified the principle for grouping activations, the novelty of exits, citations, and activation memory. New experiment: memory of exits.

Detailed Q&As are listed below. We look forward to further discussions and feedback.

Comment

Dear reviewers,

Authors submitted rebuttals, which should be visible to you now.

Please read the rebuttals carefully and start discussions with the authors now.

The reviewer-author discussion period will end on August 13, 2024. Since authors usually need time to prepare for their responses, your quickest response would be very appreciated.

In case you requested additional experiments or analysis and the authors provided them in the rebuttal, please pay extra attention to the results.

Thank you,
Your AC

Comment

Dear reviewers,

This is the final reminder for the reviewer-author discussions. It will end on August 13 11:59 AoE, and then we will start AC-reviewer discussions.

If you have already concluded the discussions with the authors, thank you so much!

If you have not responded to the author rebuttal yet, please do so immediately. We have been waiting for your response.

Best,
Your AC

Comment

Dear reviewers and meta reviewers:

Hope this message finds you well.

We have answered your questions and provided new experiments as required. As the discussion period will end in less than two days, we kindly ask if you have any further concerns or questions that we might be able to address.

Thanks!

Best regards,
Authors

Final Decision

We received author rebuttals, and all reviewers discussed the paper based on the rebuttals. This paper discusses a test-time adaptation approach (self-ensemble and batch-agnostic early-exit strategy), using resource-constrained edge devices such as Raspberry Pi Zero 2W and STM32H747. Considering the reviews and author-reviewer discussions, I recommend accepting this paper for NeurIPS'24.

For positive aspects of this work, I agree with many of the reviewers' comments, such as the performance evaluations for TTA on extremely resource-constrained devices, the offering of an MCU library, and the improved tradeoff between model accuracy and memory usage.

For downsides, Reviewer yW6M's concerns are mainly about writing quality. Reviewer DHsU pointed out another weakness (evaluations solely on image data). Those drawbacks seem to have been addressed through the discussions, though the authors should still address them in the paper. Reviewer 2zcM shared similar concerns. From their discussion with the authors, it seems that the authors' rebuttal addressed their questions, and they did not disagree with the answers. Last but not least, the weaknesses raised by Reviewer bEbZ are fair points, and in fact this paper may look engineering-focused. I see that the key contribution of this work is not only introducing on-device TTA for extremely resource-constrained devices but also offering the library, which is a unique contribution and of interest to the community.

Overall, I think that the advantages of this work slightly outweigh the disadvantages. In other words, my recommendation for acceptance is conditioned on 1) the revisions that the authors promised in their rebuttal and 2) the public release of the library with detailed documentation for the community. If the authors are not ready to release the library with proper documentation, I suggest that they start working on the documentation as soon as possible.