DEX: Data Channel Extension for Efficient CNN Inference on Tiny AI Accelerators
We propose Data Channel EXtension (DEX) to improve CNN accuracy on tiny AI accelerators by using patch-wise sampling and channel-wise stacking, boosting accuracy by 3.5% without increasing inference latency.
Reviews and Discussion
The paper proposes a novel method to address the accuracy degradation caused by downsampling on small AI processors. The authors observed that the input layer often has a small number of channels, leading to underutilization of the processors. To mitigate this issue, they introduce a technique involving patch-wise even sampling and channel-wise stacking. This method incorporates additional spatial information, thereby improving accuracy while efficiently using processing resources that would otherwise be wasted.
Strengths
- The motivation for the proposed method is clear and well-justified.
- The idea of sampling and channel-wise stacking is both simple and effective, demonstrating a practical solution to a common problem in small AI processors.
- The paper provides well-defined baselines, including normal downsampling and CoordConv, and offers comparisons with various data channel extension approaches.
- The authors conduct a thorough sensitivity analysis, evaluating accuracy, model size, and latency across different channel sizes.
- The method shows low latency and minimal model size overhead, making it a promising solution for improving accuracy without significant performance trade-offs.
Weaknesses
- Further comparisons with a broader range of more complex models could strengthen the evaluation.
- There could be more discussion of potential limitations and scenarios where the method might be less effective.
Questions
- How does the proposed method perform on more complex models and larger datasets beyond the scope of the current evaluation?
- Are there specific scenarios or types of models where this method might be less effective?
Limitations
The authors discussed the limitations concerning small models and acknowledged the potential negative societal impact due to the increased use of computational resources to improve accuracy. In my opinion, their discussion is sufficient and adequate.
We sincerely appreciate your time and effort in providing us with positive and thoughtful comments. We respond to your questions in what follows. Please also refer to the global response we posted together.
Question 1 & Weakness 1) How does the proposed method perform on more complex models and larger datasets beyond the scope of the current evaluation?
Thank you for the suggestion. We agree that evaluating the proposed method on more complex models and larger datasets would be beneficial. However, tiny AI accelerators currently face memory and architecture limitations that restrict support for larger and more complex models. For instance, WideNet already utilizes 70% of the weight memory (432KB). Given that the MAX78000/MAX78002 are the only available platforms with disclosed hardware details and open-source tools, we plan to extend our idea to a wider variety of models on more computationally capable tiny AI accelerators in the future.
Additionally, during the rebuttal period, we conducted an experiment to see whether DEX is applicable to another task, face detection, using the VGGFace2 dataset. The results demonstrated the effectiveness of DEX over downsampling, with mAP improving from 0.65347 to 0.69307. Please refer to the response to Reviewer k2BH, Question 3 for further details.
Question 2 & Weakness 2) Are there specific scenarios or types of models where this method might be less effective?
Thank you for the question. We think DEX might be less effective in certain tasks where incorporating more pixel information is not beneficial. For instance, DEX might be less effective in scene categorization where the overall structure and composition of the scene are more important than the detailed textures or pixel-level variations, such as determining if an image is an indoor or outdoor scene. In those cases, alternative data extension strategies might be used instead of patch-wise even sampling to utilize the additional channel budget.
We will incorporate this discussion in our final draft.
Thanks for the additional experiments on the face detection task. I will keep my score.
Thank you for your response to our rebuttal! Thank you again for the positive review and valuable comments.
Best,
Authors
This paper introduces DEX, a novel technique designed to enhance the efficiency of DNN inference on resource-constrained tiny AI accelerators by extending the data channels. This approach aims to improve both resource utilization and inference accuracy by incorporating additional spatial information from the original image through patch-wise even sampling and channel-wise stacking. The authors identify that the limited memory budget on resource-constrained tiny AI accelerators requires downscaling the input image, which can degrade model quality and underutilize resources. By extending the data channel, the proposed method retains more information from the input image, thereby improving inference accuracy and maximizing resource utilization without increasing inference latency. Evaluations on real tiny AI accelerator devices demonstrate a 3.1% accuracy improvement with no additional inference latency.
Strengths
- The paper identifies the accuracy degradation and resource under-utilization issues for DNN inference on tiny AI accelerators and proposes a simple yet effective solution that can improve inference accuracy without additional inference latency.
- The evaluations are conducted on real hardware devices.
- The paper is well-written with the motivation, methodology and results being presented in a clear and logical manner.
Weaknesses
While the paper has notable strengths, several areas could be improved:
- End-to-end Performance: The impact of proposed DEX on the end-to-end latency and throughput is not evaluated. Although the authors claim no increase in inference latency, the overhead of the channel expansion from input RGB images (including several preprocessing steps) needs to be studied.
- Power/Energy Measurement: The paper does not include the analysis of power consumption and energy efficiency with DEX, which are crucial considerations for tiny AI accelerators.
- Scope of Models: The study is limited to classification DNN models. The impact on other tasks, such as object detection, face recognition, and more complex applications, should be explored.
Questions
- What is the overhead of the channel expansion process in terms of computational and memory resources? How does this impact end-to-end inference performance? And on which hardware does such pre-processing happen?
- How does the power consumption vary when using the proposed technique?
- Can you comment on the applicability of DEX to other DNN tasks, such as object detection or natural language processing, which are also common tasks for tiny devices? Will DEX still be effective in improving inference accuracy for those tasks?
Limitations
The authors have acknowledged the limitations regarding the exploration of larger models. However, they have not addressed the broader applicability of DEX to other tasks beyond classification. It would be beneficial to include evaluations on other applications such as object detection, segmentation, and more to demonstrate the generalizability of the proposed method.
We sincerely appreciate your time and effort in providing us with positive and thoughtful comments. We respond to your questions in what follows. Please also refer to the global response we posted together.
Question 1 & Weakness 1) What is the overhead of the channel expansion process in terms of computational and memory resources? How does this impact end-to-end inference performance? And on which hardware does such pre-processing happen?
Thank you for the question. While our work currently focuses on AI accelerators—specifically their utilization and inference latency—considering data processing overhead is an important discussion for practical deployment. Note that the overhead and impact of data processing depend on the target application scenario and benchmark setup.
Overhead of the channel expansion and hardware: The latency of the channel expansion process depends on the processor's computational capability. During our evaluation, we pre-processed data on a powerful server, and thus the data processing overhead was negligible.
We additionally conducted data processing on the ultra-low-power MCU processor on the board (Arm Cortex-M4) to understand the data processing overhead on less-capable devices. We measured the overhead of applying DEX to expand channels from a 3x224x224 image (a typical size for ImageNet) to 64x32x32 (the highest channel expansion used in our accelerators) on the Arm Cortex-M4 (120MHz).
This process took 2.2 ms on the Arm Cortex-M4. In terms of memory, it consumed an additional 62KB of SRAM (64x32x32 bytes - 3x32x32 bytes) on the processor. However, since DEX extends data only to a size that the data memory in the AI accelerator can accommodate, this additional memory is not an issue from the AI accelerator's perspective.
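For illustration, below is a minimal NumPy sketch of the measured channel expansion (a simplified host-side version; the offset-selection details and our on-device implementation may differ):

```python
import numpy as np

def dex_expand(img, out_ch=64, out_hw=32):
    """Patch-wise even sampling + channel-wise stacking (illustrative sketch)."""
    c, h, w = img.shape                       # e.g., (3, 224, 224)
    ph, pw = h // out_hw, w // out_hw         # patch size per output pixel, e.g., 7x7
    n_off = int(np.ceil(out_ch / c))          # sampling positions needed, e.g., 22
    # Evenly spaced sampling positions inside each patch (flattened patch index).
    flat = np.linspace(0, ph * pw - 1, n_off).round().astype(int)

    out = np.empty((out_ch, out_hw, out_hw), dtype=img.dtype)
    k = 0
    for f in flat:                            # one offset -> one sub-sampled image
        dy, dx = int(f) // pw, int(f) % pw
        sub = img[:, dy::ph, dx::pw][:, :out_hw, :out_hw]
        for ci in range(c):                   # channel-wise stacking
            if k == out_ch:
                break
            out[k] = sub[ci]
            k += 1
    return out

x = (np.random.rand(3, 224, 224) * 255).astype(np.uint8)
print(dex_expand(x).shape)  # (64, 32, 32)
```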
Impact on end-to-end inference performance: Note that the MCU processor and the AI accelerator are independent processing components that run in parallel. This means that if the inference latency on the accelerator is higher than the data processing latency, data can be pre-processed for the next inference during the current inference, and thus the data processing latency can be hidden. In our setup, the inference latency of EfficientNet (11.7ms) exceeds the data processing latency of 2.2ms, so the inference throughput remains the same under continuous inference.
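Conceptually, the overlap looks like the following host-side sketch (illustrative only; `preprocess` and `run_inference` are hypothetical stand-ins for the MCU-side DEX expansion and the accelerator inference call):

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(frame):
    """DEX channel expansion (~2.2 ms on the Cortex-M4 in our measurement)."""
    ...

def run_inference(tensor):
    """Accelerator inference (~11.7 ms for EfficientNet in our measurement)."""
    ...

def pipelined_inference(frames):
    """Prepare frame n+1 while frame n is running on the accelerator."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(preprocess, frames[0])
        for nxt in frames[1:]:
            current = pending.result()              # expanded tensor for this frame
            pending = pool.submit(preprocess, nxt)  # overlaps with the inference below
            results.append(run_inference(current))
        results.append(run_inference(pending.result()))
    return results
```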
However, this depends on the scenario: the end-to-end impact of data processing latency varies with the processor's computational capability, the dimensions of the data, and the size of the channel expansion. For instance, in scenarios where data is processed and transferred by machines more capable than the MCU on the tiny AI accelerator board (e.g., cloud servers, smartphones, etc.), the impact of data processing can be even more negligible.
We will incorporate this into our final draft.
Question 2 & Weakness 2) How does the power consumption vary when using the proposed technique?
Thank you for the question. We measured the power by varying the size of the channel extension with a Monsoon Power Monitor. The results are as follows:
| Model | Chan=3 | Chan=6 | Chan=18 | Chan=36 | Chan=64 |
|---|---|---|---|---|---|
| SimpleNet | 53.82 | 53.85 | 58.21 | 61.42 | 68.97 |
| WideNet | 60.74 | 61.37 | 63.76 | 67.92 | 77.14 |
All numbers are in milliwatts (mW).
As the number of channels increased, power consumption increased accordingly. This is because a higher number of channels uses more processors in the AI accelerator, leading to increased power consumption.
We will incorporate this into our final draft.
Question 3 & Weakness 3) Can you comment on the applicability of DEX to other DNN tasks, such as object detection or natural language processing, which are also common tasks for tiny devices? Will DEX still be effective in improving inference accuracy for those tasks?
Thank you for the question. The core idea of this work is to utilize additional channels for extra inputs to improve task accuracy. We validated the generalizability of our method across 16 cases (four datasets and four models). We believe this idea would still be beneficial for other tasks whose inputs have a small number of channels, such as RGB images.
During the rebuttal, we conducted an experiment to see whether our idea generalizes to a face detection task. Specifically, we used the VGGFace2 dataset and the Tiny Single-Shot Detection (Tiny SSD) model [r1]. Due to computational complexity and the limited time during the rebuttal period, we downsized the dataset by taking the first 100 identities, resulting in 33K training samples and 2K test samples, and trained for 50 epochs. We used the Adam optimizer with a fixed learning rate of 0.001.
Here is the result:
| Method | Channel Size | mAP |
|---|---|---|
| Downsampling | 3 | 0.65347 |
| DEX | 18 | 0.68317 |
| DEX | 64 | 0.69307 |
The results show that mAP (mean Average Precision) improved with DEX compared to Downsampling, illustrating that DEX works for a face detection task.
Nevertheless, we acknowledge that DEX might be less effective in tasks where incorporating additional pixel-level information does not significantly improve performance. A thorough evaluation is necessary to verify this for different tasks.
We will incorporate this in the final draft.
[r1] Tiny SSD: A Tiny Single-shot Detection Deep Convolutional Neural Network for Real-time Embedded Object Detection
Recent advancements in tiny ML accelerators, such as the MAX78000 and MAX78002, have significantly boosted hardware processing power. On one hand, these accelerators feature 64 parallel processors with per-processor memory instances, enhancing CNN inference speed compared to traditional MCUs. On the other hand, downsampling of input images due to limited data memory can lead to accuracy degradation. To address this, this paper proposes DEX, which integrates patch-wise even sampling and channel-wise stacking to incorporate spatial information from original images into input images. Evaluation results demonstrate that DEX improves accuracy without introducing additional latency.
Strengths
- This paper presents a simple yet compelling idea to tackle CNN inference on a specific tiny ML accelerator.
- Figures, such as Figure 5, clearly illustrate the DEX procedure.
- The analysis in the paper provides a clear understanding of how DEX operates.
Weaknesses
- This approach appears suitable only for specific small devices.
- Some procedures are unclear and require further clarification. Detailed questions are listed below.
- Limiting the approach to processing only the first layer for simplicity may be a limitation of this work.
Questions
- Lines 111 to 112 mention that 512KB memory is divided into 64 segments, giving each core an 8KB dedicated memory instance (as shown in Figure 1). Is this scenario realistic? In other words, does each core physically have its own private 8KB memory? Typically, all processors share the same 512KB memory. Moreover, the Block Diagram from the MAX78000 specification in Ref [34] does not indicate these dedicated memory instances.
- Line 114 states that an image with an input shape of 3x224x224 may not fit MAX78000 even with Q7 format due to memory limitations for each channel. However, the MAX78000 product datasheet mentions "input image size up to 1024x1024 pixels" (Ref [34]). Could this be double-checked?
- What if the channels of the intermediate layers are significantly fewer than the number of cores? Would it still be feasible or straightforward to extend DEX to those layers?
- This work is similar to data augmentation but is designed for specific devices. How important or valuable is this hardware device in the context of TinyML? Even for the MAX78000/MAX78002, the contribution seems limited since DEX has only been applied to the first layer, and the accuracy improvement appears to be limited.
Limitations
N/A
We sincerely appreciate your time and effort in providing us with positive and thoughtful comments. We respond to your questions in what follows. Please also refer to the global response we posted together.
Weakness 1) This approach appears suitable only for specific small devices.
Please see our response to Question 4 below.
Weakness 2) Some procedures are unclear and require further clarification. Detailed questions are listed below.
Thank you for pointing this out. See our response below. We will incorporate the changes.
Weakness 3) Limiting the approach to processing only the first layer for simplicity may be a limitation of this work.
Please see our response to Question 3 below.
Question 1) Lines 111 to 112 mention that 512KB memory is divided into 64 segments, giving each core an 8KB dedicated memory instance (as shown in Figure 1). Is this scenario realistic? In other words, does each core physically have its own private 8KB memory? Typically, all processors share the same 512KB memory. Moreover, the Block Diagram from the MAX78000 specification in Ref [34] does not indicate these dedicated memory instances.
The cores do not share a single 512KB memory. This is due to the per-processor memory instance design of tiny AI accelerators, which allows for rapid data access and parallelization. The block diagram in Ref [34] provides only an abstract view of the memory architecture and does not illustrate this detail; for a more detailed explanation, please refer to the MAX78000 User Guide. To be precise, four processors share one data memory instance, as noted in our draft (lines 524-526). Nevertheless, parallelization occurs at the channel level, so the data memory can be viewed as 64 segments in terms of parallelization.
Question 2) Line 114 states that an image with an input shape of 3x224x224 may not fit MAX78000 even with Q7 format due to memory limitations for each channel. However, the MAX78000 product datasheet mentions "input image size up to 1024x1024 pixels" (Ref [34]). Could this be double-checked?
Note that a 1024 x 1024 pixel image (roughly 1048KB at 1 byte per pixel) does not fit within the 512KB data memory. The MAX78000 datasheet description, “Programmable Input Image Size up to 1024 x 1024 pixels,” is feasible using its “streaming mode.” According to the official documentation, “Streaming allows input data dimensions that exceed the available per-channel data memory in the accelerator.” This mode leverages special hardware support, such as the streaming queue in the MAX78000. However, this comes at the cost of increased inference latency. We did not cover this in the paper as it is a special implementation specific to the hardware and may not generalize to other types of hardware. The focus of our analysis was on the standard operation mode.
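As a quick sanity check of these memory budgets (a rough back-of-the-envelope calculation assuming 1 byte per pixel in Q7 format and the 64-way per-channel view discussed in Question 1):

```python
TOTAL_DATA_MEM = 512 * 1024             # 512KB data memory on the MAX78000
PER_CHANNEL_MEM = TOTAL_DATA_MEM // 64  # 8KB per channel in the 64-way parallel view

print(224 * 224)    # 50,176 B (~49KB) per 224x224 channel -> exceeds the 8KB/channel budget
print(1024 * 1024)  # 1,048,576 B (~1MB) for a 1024x1024 image -> exceeds the 512KB total
print(32 * 32)      # 1,024 B (1KB) per 32x32 channel -> fits comfortably
```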
Question 3) What if the channels of the intermediate layers are significantly fewer than the number of cores? Would it still be feasible or straightforward to extend DEX to those layers?
Yes, it is both possible and straightforward to extend DEX to those layers by modifying the output channel size of those layers. In this work, we focused on modifying the first CNN layer due to simplicity, effectiveness, and memory constraints. The first layer, representing image data in three channels (RGB), has the most unused processors after initial data assignment. Extending channels at the first layer significantly increases data utilization with minimal impact on model size. This approach aligns with the design of weight memory in tiny AI accelerators, which maximizes model capacity by collective use across processors. We discussed this in Lines 317-322 in our manuscript.
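As a hypothetical PyTorch sketch of what this would look like (not from the paper; layer sizes are illustrative), widening an intermediate layer simply means increasing its output channels toward the number of parallel processors, with the following layer's input channels adjusted to match:

```python
import torch.nn as nn

# Before: only 24 of the 64 parallel processors receive an output channel.
narrow = nn.Sequential(
    nn.Conv2d(16, 24, kernel_size=3, padding=1),
    nn.Conv2d(24, 32, kernel_size=3, padding=1),
)

# After: the intermediate layer's output channels match the processor count (64),
# and the next layer's input channels are adjusted accordingly.
wide = nn.Sequential(
    nn.Conv2d(16, 64, kernel_size=3, padding=1),
    nn.Conv2d(64, 32, kernel_size=3, padding=1),
)
```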
Question 4) This work is similar to data augmentation but is designed for specific devices. How important or valuable is this hardware device in the context of TinyML? Even for the MAX78000/MAX78002, the contribution seems limited since DEX has only been applied to the first layer, and the accuracy improvement appears to be limited.
While our solution might look similar to data augmentation, it is specifically designed for tiny AI accelerators to maximize both processor utilization and accuracy improvement. We strongly believe that these tiny AI accelerators are crucial platforms in the context of tinyML. The advent of tiny AI accelerators is bringing AI closer to us than ever before, offering reduced latency, low power cost, and improved privacy. Accelerators with small form factors are increasingly being integrated into wearable devices, e.g., smart earbuds, patches, watches, glasses, wristbands, and shoes [r1, r2, r3, r4].
In this paper, we focus on the MAX78000 and MAX78002 since they are the most widely used tiny AI accelerator research platforms [1, 6, 13, 39, 40, 43] thanks to their disclosed hardware details and open-source tools, enabling in-depth analysis and modification of their operations. These tiny AI accelerators are common research platforms, similar to the STM32 series in MCU research and the NVIDIA Jetson series in edge AI research.
In that context, we believe DEX is an important step in utilizing tiny AI accelerators within TinyML by improving accuracy without sacrificing inference latency. We identify inefficiencies in these accelerators and enhance accuracy through a novel data extension algorithm. We believe that an average 3.1%p accuracy improvement is meaningful, especially for resource-constrained tiny devices, and we focused on the first layer for the reasons explained in our response to Question 3.
[r1] Ananta Narayanan Balaji and Li-Shiuan Peh. 2023. AI-On-Skin: Towards Enabling Fast and Scalable On-body AI Inference for Wearable On-Skin Interfaces. Proceedings of the ACM on Human-Computer Interaction 7, EICS (2023), 1–34.
[r2] OmniBuds - Sensory Earables powered by AI accelerators.
[r3] Hearables with GAP9. TWS Processor with GAP9.
[r4] Shift Moonwalkers.
Thanks for the careful response. I still feel positive about this paper.
We sincerely appreciate your response. Thank you once again for your positive feedback and valuable suggestions, as well as for recognizing the value of our contributions. We are happy to continue the discussion if you have further questions.
Best,
Authors
This paper indicates that current AI accelerators with limited data memory often require downsampling input images, which leads to reduced accuracy. Therefore, the proposed Data Channel EXtension (DEX) includes additional spatial information from original images as informative input through two procedures: patch-wise even sampling and channel-wise stacking. This effectively extends data across input channels. As a result, DEX enables parallel execution without increasing inference latency. The numerical experiments consistently show improved model performance on four datasets.
Strengths
• The proposed method is easy to understand, with clearly written paragraphs and well-organized sections.
• The experiments conducted demonstrate the effectiveness of the proposed method.
Weaknesses
- The proposed data channel extension relies on the assumption that only a limited number of processors tied to memory instances are utilized while the remaining processors stay idle. However, it is not guaranteed that this assumption, which is needed to trigger the proposed method, always holds.
- The proposed method is as simple as an implementation trick; hence, the technical contribution is limited.
- The compared channel extension methods are all proposed by the authors and hence fail to provide a fair comparison.
- It would be interesting to see what performance patch-wise random sampling could achieve.
Questions
- The primary concern of this paper is that the proposed image sampling approach is too simple to make a significant technical contribution. Moreover, the proposed method relies on the assumption of having unused per-processor memory instances to initiate the sampling process, a condition that may not always be met.
Limitations
Included
We sincerely appreciate your time and effort in providing us with thoughtful comments. We respond to your questions in what follows. Please also refer to the global response we posted together.
Weakness 1) The proposed data channel extension requires the assumption that only a limited number of processors tied to memory instances are utilized while the remaining processors remain idle. However, it is not ensured that such an assumption is always true to trigger the proposed method.
We acknowledge that our work targets tiny AI accelerators that feature parallel processors and per-processor memory instances for rapid data access and parallelization, as we described in Section 2. Tiny AI accelerators with these hardware-level optimizations are crucial for performance improvement compared to traditional MCUs. We believe these tiny AI accelerators will be widely adopted in various small devices, such as recent AI-capable smart earbuds, patches, watches, glasses, wristbands, and shoes [r1, r2, r3, r4].
Given the extensive tinyML research on the STM32 MCU series and edge AI work on the NVIDIA Jetson series in the past years, we believe these tiny AI accelerators will become a key enabling force for true on-device AI in tiny devices such as wearables. In this context, we focus on the tiny AI accelerator platforms (MAX78000 and MAX78002) since they are not only the most widely used tiny AI accelerator research platforms [1, 6, 13, 39, 40, 43] but also feature these hardware-level optimizations. Our insights into utilizing parallel processors for accuracy improvement without sacrificing inference latency will remain valuable as long as future AI platforms continue to incorporate these tiny AI accelerators.
[r1] Ananta Narayanan Balaji and Li-Shiuan Peh. 2023. AI-On-Skin: Towards Enabling Fast and Scalable On-body AI Inference for Wearable On-Skin Interfaces. Proceedings of the ACM on Human-Computer Interaction 7, EICS (2023), 1–34.
[r2] OmniBuds - Sensory Earables powered by AI accelerators.
[r3] Hearables with GAP9. TWS Processor with GAP9.
[r4] Shift Moonwalkers.
Weaknesses 2, 3 & 4) The proposed method is as simple as an implementation trick; hence, the technical contribution is limited.
- The compared channel extension methods are all proposed by the authors and hence failed to show a fair comparison.
- It is curious what the performance of patch-wise random sampling could achieve.
Simplicity of the method: While the proposed method might seem simple, we provide an in-depth analysis of its rationale, impact, utilization, and constraints in Section 3. Also, this simplicity allows our solution to be generally applicable to various types of models and AI accelerators. We would like to mention that the other reviewers pointed out our simple and effective solution as a strength of our paper (UAVo: “simple yet compelling idea”; k2BH: “simple yet effective solution”; Zbte: “both simple and effective”). Our approach is novel, specifically designed for emerging tiny AI accelerators, and we have shown its effectiveness. We believe many impactful papers, especially in the field of AI/ML, present simple yet effective solutions.
Baselines: This area has hardly been explored, as tiny AI accelerators are new platforms, resulting in few existing baselines. In our original submission, we did conduct a comparative study with existing channel manipulation methods from prior art, such as Downsampling, CoordConv, and CoordConv (r), in Tables 1 and 2. While these baselines were not originally designed for our target platforms, we believe they provide meaningful comparisons that validate our design rationale—data channel extension to achieve accuracy improvement without extra latency. In addition, due to the lack of proper baselines in the literature, we compared in Table 4 against other possible channel extension strategies that we devised ourselves (except for Downsampling, which is widely used). We believe this is a fair comparison incorporating existing approaches and possible alternatives.
Patch-wise random sampling: Thank you for suggesting a comparison with patch-wise random sampling. Following your suggestion, we measured its performance and integrated it into Table 4, as shown below. DEX's data extension algorithm outperformed the baselines, including patch-wise random sampling. We will incorporate this into our final draft.
| Method | InputChan | InfoRatio (X) | Accuracy |
|---|---|---|---|
| Downsampling | 3 | 1.0 | 57.8 ± 1.2 |
| Repetition | 64 | 1.0 | 56.3 ± 0.8 |
| Rotation | 64 | 1.0 | 55.7 ± 0.6 |
| Tile per channel | 64 | 21.3 | 39.3 ± 0.9 |
| Patch-wise seq. | 64 | 21.3 | 60.4 ± 1.5 |
| Patch-wise random sampling | 64 | 21.3 | 60.4 ± 1.0 |
| DEX | 64 | 21.3 | 61.4 ± 0.6 |
Question 1) The primary concern of this paper is that the proposed image sampling approach is too simple to make a significant technical contribution. Moreover, the proposed method relies on the assumption of having unused per-processor memory instances to initiate the sampling process, a condition that may not always be met.
Please see our responses to Weakness 1 and 2 above.
Global Response
Dear Reviewers,
We appreciate all of you for your positive reviews and for highlighting the strengths of our work:
ySD7: (1) Easy to understand, (2) clearly written, (3) well-organized, and (4) demonstrates the effectiveness of the proposed method.
UAVo: (1) Presents a simple yet compelling idea, (2) clear illustrations, and (3) in-depth analysis.
k2BH: (1) Identifies important issues in tiny AI accelerators, (2) a simple yet effective solution, (3) evaluation with real devices, (4) well-written, and (5) clear and logical motivation and methodology.
Zbte: (1) Well-justified motivation, (2) both simple and effective, demonstrating a practical solution, (3) well-defined baselines, (4) thorough sensitivity analysis, and (5) low latency with minimal model size overhead.
We also sincerely thank the reviewers for their constructive comments to improve our work. We have addressed all the questions from reviewers with clarifications and new experiments during this rebuttal period. We summarize how we addressed the reviewers’ main questions as follows:
ySD7:
- We clarified the assumption and highlighted its importance in tiny AI accelerators.
- We clarified our technical contribution.
- We clarified that we compared with existing baselines.
- We conducted an experiment to compare with patch-wise random sampling.
UAVo:
- We clarified the memory architecture of the tiny AI accelerators.
- We clarified the input size limitation.
- We discussed the possibility of applying DEX to intermediate layers.
- We highlighted the importance of the tiny AI accelerator platforms and the significance of the result.
k2BH:
- We measured the overhead of data processing with Arm Cortex-M4.
- We measured the power consumption using a Monsoon Power Monitor.
- We conducted an experiment on a face detection task.
Zbte:
- We clarified the scope of the experiments and conducted an experiment on a face detection task.
- We discussed scenarios where DEX might be less effective.
We will carefully incorporate these points and our responses into our final draft. Thank you once again for your valuable feedback and suggestions.
Sincerely,
Authors
Dear reviewers,
The discussion with the authors is closing soon. Please review the rebuttal to see whether the authors have addressed your concerns, and acknowledge to the authors that you have read their response. Thanks.
This paper presents a method to improve the efficiency of running CNN models on tiny AI accelerators. Instead of using direct downsampling, the method involves extending the input data channels to incorporate additional image information into unused data memory and processors. The approach was verified on two tiny AI accelerator platforms with popular vision models and benchmarks, showing enhanced accuracy and efficiency.
The paper is well-written and easy to follow. The evaluation is thorough, and the results show the effectiveness of the proposed method. During the rebuttal, the authors addressed reviewers’ questions in detail, providing additional data and clarifications. The paper received generally favorable reviews from all reviewers.
However, the reviewers noted that the method is quite simple, and its technical contribution is somewhat limited. Additionally, the technique is mainly applicable to specific tiny accelerators. Nevertheless, considering the method’s effectiveness, the comprehensive analysis and evaluation, and the significance of tiny AI accelerators for deploying CNN models, the paper is recommended for borderline acceptance.