AegisGuard: RL-Guided Adapter Tuning for TEE-Based Efficient & Secure On-Device Inference
We propose a novel fine-tuning framework that uses reinforcement learning and adapter compression to selectively shield model layers in TEEs, achieving strong model protection with 2–3× efficiency gains and no accuracy loss.
Abstract
Reviews and Discussion
This paper introduces AegisGuard, a fine-tuning and deployment framework that selectively shields the most sensitive adapters inside the TEE while offloading the rest to the GPU, balancing security and efficiency. The main contributions are: 1. the proposed solution reduces the communication workload between the TEE and the GPU for TSDP solutions through an RL-based sensitivity measurement approach for on-device LM inference; 2. the framework compresses intermediate feature maps to reduce communication cost after deployment. The authors provide comprehensive experiments on deployed LMs, demonstrating the inference efficiency, robust security against model stealing attacks, and minimal accuracy loss of the proposed AegisGuard.
Strengths and Weaknesses
Strengths
- This paper presents a fine-tuning and deployment framework that has demonstrated both efficiency and effectiveness. To reduce communication cost, the framework adopts a selective shielding strategy, where only the most sensitive adapters are executed inside the TEE, while less sensitive ones are offloaded to the untrusted GPU. Furthermore, structural pruning is introduced to further lower TEE computation and data transfer costs.
- The submission demonstrates a strong technical foundation with a robust experimental design and clear evidence supporting the claims presented in the paper. The experiments include a comprehensive evaluation for the proposed framework.
- The paper is well-structured and clearly written, making it accessible to a broader audience.
Weaknesses
- Beyond the efficiency of the selective shielding strategy, a deeper investigation of the impact of the shielded adapter compression component would make the design more convincing, including the tradeoff between accuracy and latency and how much this specific part contributes on its own.
Questions
- Is there latency overhead for the sparse matrix operations after structural pruning? What is the tradeoff and benefit of the compression?
Limitations
yes
Final Justification
With the rebuttal and discussions, I will keep the initial rating as my recommendation.
Format Issues
None
We sincerely thank Reviewer PYU5 for the valuable feedback on our paper. Our responses to the comments are below.
Question 1: Is there latency overhead for the sparse matrix operations after structural pruning?
Answer 1: No, there is no significant latency overhead introduced by sparse matrix operations in our method. This is because we apply structural pruning, specifically by removing attention heads, rather than performing unstructured pruning at the level of individual elements. As a result, the remaining weights can still be represented and computed using dense matrix operations, without introducing sparsity. Therefore, our method preserves the efficiency of dense computation and does not suffer from the irregular memory access patterns or hardware inefficiencies commonly associated with sparse operations.
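To illustrate why head-level structural pruning keeps computation dense, here is a minimal PyTorch sketch (the fused-QKV layout, function, and variable names are illustrative assumptions, not the paper's code): removing whole heads simply slices blocks out of a dense projection matrix, leaving a smaller but still dense matrix.

```python
import torch

# Illustrative sketch only: structural pruning of attention heads slices out
# whole per-head blocks from the dense projection matrix, so the surviving
# weights remain a dense matrix and need no sparse kernels.
def prune_heads(qkv_weight: torch.Tensor, num_heads: int, keep_heads: list) -> torch.Tensor:
    # qkv_weight: (3 * num_heads * head_dim, hidden_dim) for a fused QKV projection
    head_dim = qkv_weight.shape[0] // (3 * num_heads)
    blocks = qkv_weight.view(3, num_heads, head_dim, -1)  # split into per-head blocks
    kept = blocks[:, keep_heads, :, :]                    # drop whole heads, not single weights
    return kept.reshape(3 * len(keep_heads) * head_dim, -1).contiguous()

w = torch.randn(3 * 12 * 64, 768)                                    # ViT-Base-like fused QKV weight
w_pruned = prune_heads(w, num_heads=12, keep_heads=list(range(8)))   # keep 8 of 12 heads
print(w_pruned.shape)                                                # torch.Size([1536, 768]) -- smaller but dense
```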
Question 2: What is the tradeoff and benefit of the compression?
Answer 2: The tradeoff of compression is that it requires a longer fine-tuning time before model deployment, approximately 1.5× the cost of normal fine-tuning without compression. However, this process occurs only once and does not introduce recurring overhead after deployment; it adds no overhead to the online inference process.
The benefit is a lower computation workload in the TEE during the online phase. We provide a breakdown of inference latency components (GPU, TEE, and Transfer) in Table 1 of Section 4.2. For example, in the case of ViT-Base, the GPU processing time is 5.94 ms, TEE execution takes 4.31 ms, and data transfer costs 4.89 ms per batch. These results confirm that, compared to Shield-LoRA, AegisGuard achieves 2–3× reductions in TEE execution and transfer overhead. Additionally, the TEE computation cost reduction from pruning alone is presented separately below. The experiments are conducted on the ViT-Base model due to rebuttal time constraints; they measure the TEE execution time (averaged over 10 batches) under different pruning ratios, demonstrating how pruning directly reduces the secure enclave's computation cost. We will include this ablation study in the appendix of the next version.
| Pruning Ratio (%) | TEE Execution Time (ms) | Relative Reduction |
|---|---|---|
| 0% | 6.01 | 1.00× |
| 20% | 4.82 | 1.25× |
| 30% | 4.31 | 1.39× |
| 50% | 3.26 | 1.84× |
Dear Reviewer PYU5,
Thank you again for your time and effort in reviewing our paper.
We would appreciate any further comments or clarification you might have, if time permits. We understand you may have many papers to review, but any follow-up feedback would be very helpful and appreciated.
Best regards
This paper proposes AegisGuard, a framework that enhances the security and efficiency of deploying large LoRA-adapted models on user devices by selectively shielding only the most privacy-sensitive LoRA adapters inside TEEs while offloading less sensitive adapters to the GPU. Specifically, it introduces a reinforcement learning-based sensitivity measurement that uses Gaussian noise to rank adapter sensitivity to model stealing attacks, and a shielded adapter compression (SAC) module to prune sensitive adapters, reducing computation and communication costs. Experiments on LLaMA-7B and ViT show that AegisGuard maintains black-box-level security while significantly reducing inference latency and lowering TEE memory usage.
Strengths and Weaknesses
Strengths
+) The paper is well-written and easy to follow
+) The overall design is technically sound and reasonable to me.
+) It is a good idea to co-design the LoRA layers and the secure inference platform.
Weaknesses
-) Limited scope to adapter-based fine-tuning, specifically LoRA.
-) The sensitivity estimation method may not be robust.
-) Some evaluation results are not clear to me (see questions)
Questions
a) My main concern is the robustness of the layer sensitivity estimation, which is the core of the proposed method. The intuition is that 'perturbing sensitive parameters leads to a larger loss perturbation'; however, doesn't this also depend on the chosen input? How can we show the proposed method is robust to the chosen samples?
b) In Table 2, why ViT-Base C100 Shield-LoRA (19.25) was not the best (colored rectangle) but AegisGuard (22.67)? Same questions for LLaMA-7B.
Limitations
Limited scope to adapter-based LoRA fine-tuning might be a limitation.
Final Justification
After reading the rebuttal, I maintain my rating of weak accept.
Format Issues
N/A
We thank Reviewer RwdN for the valuable feedback on our paper. Our responses to the comments are below.
Question 1: The robustness of the layer sensitivity estimation.
Answer 1: Our layer sensitivity estimation method is robust to the choice of input samples, as shown by its convergence behavior, its stability across random seeds, its consistent performance on datasets from different domains, and its design based on parameter-space perturbations. We support this claim as follows:
- The entire RL-guided fine-tuning is a convergence process that runs over many random batches, not a single sample. The overall objective of our method is to fine-tune a large model so that it converges to an optimal loss with a fixed number of layers. Note that during our RL-guided fine-tuning process, the model is exposed not just to a single random batch but to multiple random batches throughout training. Additionally, our weighted layer selection sampling mechanism is designed to differentiate layer-wise sensitivity, with more sensitive layers (those with higher sensitivity scores) being fine-tuned more frequently. Therefore, if our estimation were highly sensitive to the batch composition, the selected sensitive layers would fluctuate throughout training, leading to instability or divergence. In contrast, we observe that the final adapter selection converges, resulting in successful downstream task performance. This confirms that sample-specific variance is smoothed out through repeated exposure and weighted selection during RL optimization.
- Layer sensitivity is stable and converges across different datasets. The results presented in Appendix D.3, Figure 6, also demonstrate this point. The sensitive layer selection process is stable and shows convergence in experiments conducted on domain-specific tasks across different random seeds. This shows that our method is robust to different task domains.
- Our perturbation is injected into LoRA weights rather than inputs, and it is larger than the input perturbations used in previous works. Since the core design concept of our method is to disrupt the impact of specific LoRA layer weights rather than to subtly perturb inputs for training robustness, we can inject large noise without being constrained by concerns about model performance. This observation is drawn from the analysis of weight magnitude distribution presented in Figure 5 of the Appendix. LoRA is designed to introduce small, low-rank updates to the base model, and its parameter magnitudes are significantly smaller. This ensures that our perturbation can be large enough to effectively diminish the contribution of the fine-tuned LoRA layer while preserving the functionality of the original base model. Thus, the influence of the perturbed layer can be isolated and evaluated robustly, without being influenced by the chosen samples (a minimal sketch of this perturbation-based estimation follows this list).
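The following is a minimal sketch of the perturbation-based sensitivity estimation described above (the function signature, the per-layer parameter grouping, and the noise scale `sigma` are illustrative assumptions; the paper's RL-guided schedule is not reproduced here): Gaussian noise is added to one LoRA layer's weights at a time, and the resulting loss increase serves as that layer's sensitivity score.

```python
import torch

# Illustrative sketch only: perturb one LoRA layer's weights with Gaussian
# noise and measure the loss increase on a batch; a larger increase suggests
# a more sensitive (and thus more shielding-worthy) layer.
@torch.no_grad()
def layer_sensitivity(model, lora_params_by_layer, batch, loss_fn, sigma=0.1):
    inputs, labels = batch
    base_loss = loss_fn(model(inputs), labels).item()
    scores = {}
    for name, params in lora_params_by_layer.items():
        originals = [p.clone() for p in params]
        for p in params:                               # perturb only this layer's LoRA weights
            p.add_(sigma * torch.randn_like(p))
        perturbed_loss = loss_fn(model(inputs), labels).item()
        scores[name] = perturbed_loss - base_loss      # sensitivity proxy
        for p, orig in zip(params, originals):         # restore the original weights
            p.copy_(orig)
    return scores
```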
Question 2: In Table 2, why ViT-Base C100 Shield-LoRA (19.25) was not the best (colored rectangle) but AegisGuard (22.67)? Same questions for LLaMA-7B.
Answer 2: In Table 2, Shield-LoRA is not a baseline defense solution, so we do not highlight Shield-LoRA in all settings. It serves only as a reference point (the security upper bound of the black-box setting) for comparison. The colored rectangle marks only the best-performing defense among the baselines and the proposed solution.
Limitation 3: The limited scope of adapter-based LoRA fine-tuning.
Answer 3: We focus on LoRA because it is currently the most commonly used parameter-efficient fine-tuning method. Especially in the privacy field, many works [a][b] design their privacy methods around LoRA-based fine-tuning scenarios. Additionally, many industrial on-device practices, such as those from Apple [d] and NVIDIA [e], have adopted LoRA as a mainstream fine-tuning method. For example, Apple developed a new framework using LoRA adapters that incorporates a mixed 2-bit and 4-bit configuration strategy to maintain model quality. NVIDIA's NeMo framework mainly supports dynamic multi-LoRA inference, enabling simultaneous inference requests with different LoRA models.
Besides, compared with standard adapters, LoRA contains fewer trainable parameters and is thus more compatible with TEE constraints (memory and computation constraints [c]). When we adopt a TEE to provide black-box-level security for LLMs fine-tuned with private or sensitive data, efficiency is also a key objective. LoRA achieves this by significantly reducing the number of trainable parameters through low-rank matrix decomposition, compared with traditional adapter-based techniques. It not only improves computational efficiency but also achieves strong performance in modern applications, making it well suited for TEE-based secure fine-tuning. As a result, in our setting (where resource efficiency and security are critical), LoRA is a practical choice.
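As a generic illustration of this parameter reduction (the dimensions and rank below are typical ViT-Base-like values, not figures from the paper), the low-rank decomposition trains r·(d + k) parameters instead of d·k:

```python
# Illustrative parameter count only: for a d x k weight matrix, full fine-tuning
# trains d*k parameters, while a rank-r LoRA update trains only r*(d + k).
d, k, r = 768, 768, 8          # a ViT-Base-like projection with a typical LoRA rank
full = d * k                   # 589,824 trainable parameters
lora = r * (d + k)             # 12,288 trainable parameters (~2% of full)
print(full, lora, lora / full)
```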
[a]. Sun, Youbang, et al. "Improving LoRA in Privacy-preserving Federated Learning." The Twelfth International Conference on Learning Representations (ICLR 2024).
[b].Luo, Zihao, et al. "Privacy-Preserving Low-Rank Adaptation Against Membership Inference Attacks for Latent Diffusion Models." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 39. No. 6. 2025.
[c].Sun, Zhichuang, et al. "Shadownet: A secure and efficient on-device model inference system for convolutional neural networks." 2023 IEEE Symposium on Security and Privacy (SP). IEEE, 2023.
[d].Apple. "Introducing Apple’s On-Device and Server Foundation Models"
[e].NVIDIA. "Parameter-Efficient Fine-Tuning with NVIDIA NIM for LLMs"
This paper proposes AegisGuard, a fine-tuning framework that balances efficiency and security. Unlike traditional approaches, which guard LoRA adapters in the TEE and incur extra latency, AegisGuard uses an RL-based sensitivity measurement to identify which adapters are more sensitive, and reduces TEE execution overhead via pruning and compression. The evaluation shows large reductions in latency and memory while maintaining model accuracy as well as resilience to model stealing.
Strengths and Weaknesses
+ Neat idea and great execution
+ TEE execution overhead is a pain point to solve, and the solution is simple yet elegant
+ Nice evaluation on inference latency, defense effectiveness, model accuracy, as well as memory results in appendix
- From Figure 1, transfer latency seems to be the biggest bottleneck. In the evaluation results, it would have been nice to articulate the reduction, breaking down how much is attributed to transfer, how much to TEE execution, etc. The TEE execution reduction should also highlight the effect of pruning. Combining both, a comprehensive ablation study on latency would serve the paper well.
- There are many approaches in analyzing layer/adapter sensitivity. It would have been nice to have a discussion of the design philosophy of the chosen approach. It would also be nice to have a comparison, just on this component, with other selection approaches.
- Format: The font in Figure 1 is too small. In Figure 2, the flow of the upper half is a bit messy; it would be helpful to rearrange the layout for better readability. In Table 1, please add units (i.e., ms vs. sec) to avoid confusion.
Questions
How is the impact of pruning on accuracy compensated by the next step of fine-tuning? This part is a bit counterintuitive. If it is because pruning loses very little, then where is the limit? How much more pruning can be done before it actually affects downstream accuracy? This is a complex interplay with the selection of adapters. Unfortunately, the current evaluation, even in the appendix, does not capture this interplay well.
Limitations
Yes
Final Justification
The author response addressed most of the concerns. The reviewer maintains a favorable position on this submission.
Format Issues
N/A
We thank Reviewer 16rH for the valuable feedback on our paper. Our responses to the comments are below.
Question 1: How is the impact of pruning on accuracy compensated by the next step of fine-tuning? If it is because pruning loses very little, then where is the limit? How much more pruning can be done before it actually affects downstream accuracy?
Answer 1: The accuracy drop from pruning can be compensated by the subsequent fine-tuning because we adopt a mixed strategy of pruning and fine-tuning. Pruning is applied periodically during the fine-tuning process, and the pruned weights are permanently removed (i.e., not reconnected). After each pruning step, the remaining weights continue to be updated, progressively learning more task-relevant information during subsequent training. This allows the model to gradually adapt to the reduced parameter space and partially recover accuracy.
In contrast to traditional structural pruning methods that prune uniformly across all layers [a], our method adopts a dynamic and selective pruning strategy, focusing primarily on the shielded layers, i.e., those that execute inside the TEE. Consequently, the total number of pruned parameters is relatively small compared to conventional pruning methods, which helps preserve downstream task accuracy.
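A minimal sketch of this interleaved pruning-and-fine-tuning loop is shown below (the pruning interval, per-step ratio, and the `prune_fn` callback are illustrative assumptions, not the paper's exact schedule): heads are removed permanently at fixed intervals, and later steps keep updating the surviving weights so accuracy is gradually recovered.

```python
# Illustrative sketch only: fine-tuning interleaved with periodic, permanent
# structural pruning restricted to the shielded (TEE-resident) layers.
def finetune_with_periodic_pruning(model, dataloader, optimizer, loss_fn,
                                   shielded_layers, prune_fn,
                                   prune_every=500, ratio_per_step=0.1):
    for step, (inputs, labels) in enumerate(dataloader):
        loss = loss_fn(model(inputs), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step > 0 and step % prune_every == 0:
            # Permanently remove a small fraction of heads, but only in the
            # shielded layers; the rest of the model is untouched, and the
            # remaining weights keep adapting in subsequent steps.
            for layer in shielded_layers:
                prune_fn(layer, ratio=ratio_per_step)   # caller-supplied pruning routine
    return model
```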
To better understand the pruning limits, we conduct additional experiments under various pruning ratios (e.g., 20%, 30%, and 50%) on ViT-Base and report the corresponding interplay and downstream accuracy.
| Model | Variant | CIFAR-10 | CIFAR-100 | UTKFace | MNIST | GTSRB | Sun397 | Avg. |
|---|---|---|---|---|---|---|---|---|
| ViT-Base | Shield-LoRA (baseline) | 97.47 | 85.96 | 75.03 | 98.90 | 93.35 | 53.37 | 84.01 |
| ViT-Base | AegisGuard (20%) | 97.83 | 86.20 | 75.21 | 98.68 | 93.51 | 53.54 | 84.16 |
| ViT-Base | AegisGuard (30%) | 97.97 | 86.07 | 75.01 | 98.25 | 93.20 | 53.76 | 84.04 |
| ViT-Base | AegisGuard (50%) | 95.21 | 79.84 | 70.27 | 95.02 | 89.65 | 50.43 | 80.07 |
As shown above, AegisGuard's performance remains stable at 20% and 30% pruning ratios compared with the baseline, but it shows an obvious degradation (around a 4% drop) at a 50% pruning ratio. These results help delineate the boundary of effective pruning in our setting.
Weakness 2: Ablation study of latency overhead.
Answer 2: We provide a breakdown of inference latency components (GPU, TEE, and Transfer) in Table 1 of Section 4.2. For example, in the case of ViT-Base, the GPU processing time is 5.94 ms, TEE execution takes 4.31 ms, and data transfer costs 4.89 ms per batch. These results confirm that, compared to Shield-LoRA, AegisGuard achieves 2–3× reductions in TEE execution and transfer overhead.
To further quantify the effect of pruning within the TEE, we include an ablation study, shown below, on the execution cost of ViT-Base. This experiment measures the TEE execution time (averaged over 10 batches) under different pruning ratios, demonstrating how pruning directly reduces the secure enclave's computation cost. We will include this ablation study in the appendix of the next version.
| Pruning Ratio (%) | TEE Execution Time (ms) | Relative Reduction |
|---|---|---|
| 0% | 6.01 | 1.00× |
| 20% | 4.82 | 1.25× |
| 30% | 4.31 | 1.39× |
| 50% | 3.26 | 1.84× |
Weakness 3: The discussion of the design philosophy of the chosen approach compared to other methods.
Answer 3: One representative approach is gradient-based methods [b][c], but they cannot be applied to our scenario to estimate layer/adapter sensitivity. This is because these methods estimate sensitivity from layer/adapter gradients and generally assume that all layers are updated simultaneously during training.
In our setting, we adopt a dynamic fine-tuning strategy in which only a subset of layers/adapters is activated and updated at each training step, while the rest are frozen. This strategy identifies which layers/adapters are more sensitive and can better absorb task-specific features. As a result, gradient-based sensitivity estimation becomes infeasible, since gradients are not available for the frozen layers at any estimation step.
To address this limitation, we propose a perturbation-based sensitivity estimation method. This approach estimates a layer's influence by observing the loss change when a perturbation is applied to its parameters. It is straightforward to implement in the dynamic fine-tuning setting and computationally efficient. Additionally, to mitigate the variance introduced by single-batch samples, we perform sensitivity estimation at fixed intervals (e.g., every 20 steps), ensuring more stable and robust evaluation. Overall, our design philosophy prioritizes compatibility with our dynamic fine-tuning strategy.
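For illustration, a minimal sketch of sensitivity-weighted layer selection is given below (the softmax/temperature parameterization and function names are assumptions for the example; the paper's RL policy and score updates are not reproduced): layers with higher sensitivity scores are sampled, and hence activated for fine-tuning, more often.

```python
import torch

# Illustrative sketch only: sample the subset of layers to activate at a
# training step, weighted by their current sensitivity scores.
def sample_active_layers(sensitivity_scores: dict, k: int, temperature: float = 1.0):
    names = list(sensitivity_scores)
    logits = torch.tensor([sensitivity_scores[n] for n in names]) / temperature
    probs = torch.softmax(logits, dim=0)                 # higher score -> higher probability
    idx = torch.multinomial(probs, num_samples=k, replacement=False)
    return [names[i] for i in idx]

# e.g., sample_active_layers({"layer0": 0.2, "layer5": 1.3, "layer11": 0.9}, k=2)
```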
Weakness 4: Format issues
Answer 4: Thank you for pointing out the format issues. We will revise the formatting of fonts, as well as improve the layout and flow of figures and tables to enhance the readability and presentation quality in the next version.
[a]. Ma, Xinyin, Gongfan Fang, and Xinchao Wang. "LLM-Pruner: On the Structural Pruning of Large Language Models." Advances in Neural Information Processing Systems 36 (2023): 21702-21720.
[b]. Sahoo, Sabyasachi, et al. "A layer selection approach to test time adaptation." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 39. No. 19. 2025.
[c]. Lee, Yoonho, et al. "Surgical Fine-Tuning Improves Adaptation to Distribution Shifts." The Eleventh International Conference on Learning Representations.
Dear Reviewer 16rH,
Thank you again for your time and effort in reviewing our paper.
We would appreciate any further comments or clarification you might have, if time permits. We understand you may have many papers to review, but any follow-up feedback would be very helpful and appreciated.
Best regards
This manuscript proposes AegisGuard, a lightweight mechanism for protecting on-device models in TEEs against model stealing attacks. The claimed contributions are: (1) selective protection based on an RL-based layer-wise sensitivity measure, and (2) model compression techniques tailored to adapters deployed within TEEs. Empirical evaluation shows that AegisGuard is resilient to black-box attacks while reducing inference latency by up to 4×.
The AC therefore recommends "accept as a poster".
All reviewers were positive about AegisGuard, particularly its focus on addressing the inference latency of existing protection mechanisms and its comprehensive evaluation. The manuscript is also clearly written and easy to follow. The AC agrees: this work demonstrates how an ML-level framework can complement system-level protection to mitigate latency challenges.
The AC also agrees with several limitations raised during review. (1) The impact of the proposed protection mechanism is somewhat limited, as the solution focuses primarily on LoRA. The contribution would have been stronger had it extended to full fine-tuning, which is standard practice when constructing chat models. (2) The paper lacks a comparison or discussion of alternative approaches to estimating sensitivity (or alternative metrics for evaluating layer vulnerability to model stealing attacks). These issues do not block acceptance, but they suggest that the impact and audience for this work may be narrower.
Revision guidance. For the camera-ready version, the AC recommends including any additional experiments in the Appendix, while restricting main-body revisions to minimal changes (e.g., typos, clarifications) so the scope does not exceed shepherding.