CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment
Abstract
Reviews and Discussion
CoreGuard protects edge-deployed LLMs using a permutation-based locking mechanism. The method applies row permutations to input-processing layers and column permutations to output-processing layers. A TEE provides initial authorization at a middle layer, then the permutation structure propagates this authorization through subsequent layers. This reduces TEE-GPU communication overhead compared to existing methods.
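To make the mechanism concrete, here is a minimal numerical sketch of the permutation identity that the propagation relies on (my own illustration, not code from the paper): row-permuting one linear layer and column-permuting the next with the same secret permutation leaves the composed function unchanged, because an elementwise nonlinearity commutes with permutations.

```python
# Minimal numerical check of the permutation identity behind the propagation
# (illustrative sketch, not the paper's implementation).
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 8, 16, 4
x  = rng.normal(size=d_in)
W1 = rng.normal(size=(d_hid, d_in))    # layer whose rows get permuted
W2 = rng.normal(size=(d_out, d_hid))   # layer whose columns get permuted

perm = rng.permutation(d_hid)          # secret permutation (held by the TEE)
W1_locked = W1[perm, :]                # row permutation
W2_locked = W2[:, perm]                # matching column permutation

relu = lambda v: np.maximum(v, 0.0)

y_plain  = W2 @ relu(W1 @ x)
y_locked = W2_locked @ relu(W1_locked @ x)   # permutations cancel across the pair

print(np.allclose(y_plain, y_locked))        # True
```

Mixing a locked layer with an unlocked one (or with a layer locked under a different permutation) breaks this cancellation, and, as I understand the paper, the initial feature permutation plus one-time pad is supplied only by the TEE, so the chain of cancellations cannot be started without authorization.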
Strengths and Weaknesses
Strengths:
- The topic is important and the idea is interesting and novel, with security analysis provided.
- The evaluation is solid, with four baselines on four tasks, and also includes an adaptive defense.
Weaknesses:
- TEE oracle attack: The most practical attack is for adversaries to fine-tune around the TEE by using it as a black box. Attackers can train layers 1-3 and 4+ while treating TEE authorization as just another frozen layer in their training pipeline. Evaluating the defense's effectiveness against this attack would help establish the robustness of the proposed method.
- Security analysis gaps: The claimed connection to the Learning With Errors problem lacks a rigorous mathematical proof in Appendix C.
- Limited architectural scope: The method only works for transformer architectures; it is unclear how it extends to CNNs or other model types.
- Presentation issues: Tables 2 and 3 have extremely small, unreadable fonts.
Questions
- How do you prevent attackers from using the TEE as an oracle during fine-tuning rather than trying to extract permutation matrices?
- Can you provide a more rigorous proof of the LWE hardness connection in Appendix C?
- How does the method extend beyond transformer architectures?
Limitations
Yes
Final Justification
Thank you for the clarification. Your refined threat model definition makes sense - focusing specifically on preventing off-device model extraction and redistribution rather than all forms of on-device misuse.
I understand that "foundational capability stealing" in your context refers to extracting the model to run independently without TEE authorization, and that on-device fine-tuning (while remaining TEE-dependent) falls outside your protection scope.
This clarification resolves my main concern. Your method successfully achieves its stated objective of preventing off-device model theft, and the experimental evaluation appropriately targets this threat. The suggestion of access control mechanisms for limiting query frequency also demonstrates awareness of potential on-device misuse scenarios.
Two minor suggestions for the final version:
- Clarify threat model scope early: The refined definition of "foundational capability stealing" as specifically referring to off-device extraction would help readers understand the protection boundaries from the introduction.
- Table readability: Tables 2 and 3 remain difficult to read due to small font sizes.
Formatting Concerns
The table caption font seems smaller than the standard format.
Dear Reviewer,
Thank you very much for your valuable feedback and thoughtful comments. We appreciate the time and effort you have dedicated to reviewing our paper. In the following, we will address each of your concerns point by point.
On TEE Oracle Attacks -- We fully acknowledge your concern. However, we would like to clarify that attackers cannot perform such a TEE oracle attack on CoreGuard. First, the TEE participates in the forward pass but does not support gradient propagation during backpropagation, ensuring that attackers cannot optimize the model through it. Second, even if attackers attempt to bypass the TEE by training only part of the model—such as freezing layers before or after the TEE and reinitializing the remaining parameters—this remains infeasible. Because CoreGuard places the authorization point in the middle of the network, attackers would have to retrain at least half of the model’s parameters, which is unrealistic for an attacker.
Rigorous Proof of the Matrix-LWE Hardness Connection -- We appreciate the reviewer’s feedback and agree that the connection to the Learning With Errors (LWE) problem can benefit from a more rigorous exposition. Below, we provide a refined justification that formalizes the reduction to a Matrix-LWE instance.
Assume the attacker is given: an approximate estimate $\hat{W}$ of the true matrix $W$, satisfying $\hat{W} = W + \Delta$ for some small perturbation $\Delta$; a known output matrix $Y = WX$; and the recovery target $X$. What the attacker can actually construct is $Y = \hat{W}X + E$, where $E = -\Delta X$. In other words, the attacker can only observe $(\hat{W},\, Y = \hat{W}X + E)$, which exactly corresponds to the form of a Matrix Learning With Errors (Matrix-LWE) problem.
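For concreteness, the instantiation can be displayed as follows (the symbols are the illustrative ones used above, not necessarily the exact notation of Appendix C):

```latex
% Standard Matrix-LWE: given (A, B) with B = A S + E for a secret S and
% small-norm noise E, recover S. The attacker's view instantiates this as:
\[
  \underbrace{Y}_{B} \;=\; \underbrace{\hat{W}}_{A}\,\underbrace{X}_{S}
  \;+\; \underbrace{(-\Delta X)}_{E},
  \qquad \text{with only } (\hat{W},\, Y) \text{ observed.}
\]
```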
A formal proof of this reduction will be provided in a subsequent section of the paper.
Applicability of CoreGuard to Non-Transformer Models -- While our implementation focuses on Transformer models, the design of CoreGuard itself is inherently architecture-agnostic, as it builds upon the three core design principles that each independently exhibit compatibility across diverse architectures. First, from the protection perspective, CoreGuard introduces structure-aware defenses by embedding permutation-based transformations into linear mapping layers, which serve to disrupt potential low-loss merging paths between models. This mechanism targets the mathematical structure of linear layers rather than relying on Transformer-specific components. Since linear transformation layers are ubiquitous in neural networks—including convolution layers in CNNs and fully connected layers in MLPs—our protection approach applies readily to those architectures as well. Second, from the authorization perspective, the generation and injection of both the permutation matrices and OTP noise are handled entirely by the TEE. These operations depend solely on the TEE’s capability to manage secure randomness and authorize parameter access. The architecture of the underlying model does not affect the authorization process, regardless of whether the model is a Transformer, CNN, or any other architecture. Third, from the perspective of authorization propagation, CoreGuard ensures that downstream layers can only be unlocked if upstream layers have been correctly authorized. This function operates on the local behavior of each linear transformation layer and does not require any specific structural assumption about the model. Thus, the propagation rule is universally applicable to any architecture that contains linear transformation layers.
We take a standard CNN like ResNet as an example. CoreGuard protects the convolutional kernels by permuting the input channels of the kernel, effectively locking it. To ensure proper operation, the input feature map must be similarly permuted to match the kernel's new channel order (i.e., it must be authorized). Additionally, the output channels of the convolutional kernel are permuted to ensure that the output channels correspond to the input channels of the next kernel. Once the first layer is authorized with the correct input permutation, the authorization propagates automatically to subsequent layers (i.e., authorization propagation). This approach works seamlessly with activation functions like ReLU, as they only process the values of the elements and are independent of their positions. In this way, CoreGuard provides model parameter protection and ensures the model can only operate under authorization.
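As an illustration (not our actual implementation), the following sketch checks this channel-permutation behavior on two convolution layers: the input channels of the first kernel are permuted so that only a correspondingly permuted ("authorized") input works, and its output channels are permuted to match the input channels of the next kernel.

```python
# Illustrative sketch of the ResNet-style example above (not the paper's code).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x  = torch.randn(1, 3, 16, 16)               # input feature map
w1 = torch.randn(8, 3, 3, 3)                 # conv1: 3 -> 8 channels
w2 = torch.randn(4, 8, 3, 3)                 # conv2: 8 -> 4 channels

perm_in  = torch.randperm(3)                 # permutation applied by the TEE to the input
perm_mid = torch.randperm(8)                 # hidden-channel permutation between conv1/conv2

w1_locked = w1[perm_mid][:, perm_in]         # lock conv1 input and output channels
w2_locked = w2[:, perm_mid]                  # conv2 expects the permuted channel order

def run(wa, wb, inp):
    h = F.relu(F.conv2d(inp, wa, padding=1)) # ReLU is elementwise, so channel order is preserved
    return F.conv2d(h, wb, padding=1)

y_ref    = run(w1, w2, x)                            # original, unprotected model
y_auth   = run(w1_locked, w2_locked, x[:, perm_in])  # authorized: input channels permuted
y_unauth = run(w1_locked, w2_locked, x)              # unauthorized: raw input, wrong channel order

print(torch.allclose(y_ref, y_auth, atol=1e-4))      # True  -> authorization propagates
print(torch.allclose(y_ref, y_unauth, atol=1e-4))    # False -> locked weights alone are useless
```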
Once again, thank you for your constructive feedback. We look forward to any further suggestions you may have.
Thank you for your detailed response. However, I believe there may be a misunderstanding regarding the TEE oracle attack.
You mention that "TEE participates in the forward pass but does not support gradient propagation during backpropagation" as a defense. However, the attack I described does not require gradients through the TEE. The attacker can treat the TEE as a fixed black-box function during training. Specifically, during the forward pass, they send features through the TEE to get authorized outputs. During the backward pass, they simply skip the TEE and compute gradients only for layers 1-3 and 4+. This is standard practice in machine learning when dealing with non-differentiable components.
The attack flow would be: Input → Layers 1-3 → TEE (black box) → Layers 4+ → Loss, then backpropagate gradients only through layers 1-3 and 4+. The TEE essentially becomes just another preprocessing step in their training pipeline.
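Concretely, the training loop I have in mind looks like the sketch below, where `tee_authorize` is a stand-in for the black-box TEE call (stubbed here; it is not an actual API of the paper) and a straight-through estimator carries gradients around the non-differentiable step:

```python
# Sketch of the oracle-style fine-tuning loop described above (hypothetical).
import torch
import torch.nn as nn

d = 64
early  = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))  # "layers 1-3"
late   = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 2))  # "layers 4+"
opt    = torch.optim.Adam(list(early.parameters()) + list(late.parameters()), lr=1e-3)
lossfn = nn.CrossEntropyLoss()

def tee_authorize(h):
    # Placeholder for the opaque TEE call; returns a tensor with no autograd
    # history, as a real out-of-process TEE round-trip would.
    return h.detach().clone()

for step in range(100):
    x = torch.randn(32, d)
    y = torch.randint(0, 2, (32,))

    h = early(x)
    h_auth = tee_authorize(h)
    # Straight-through: forward uses the TEE output, backward treats it as identity.
    h_st = h + (h_auth - h).detach()

    loss = lossfn(late(h_st), y)
    opt.zero_grad()
    loss.backward()        # gradients reach `early` and `late`, never the TEE
    opt.step()
```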
Your point about retraining half the model being "unrealistic" also seems questionable, as fine-tuning large portions of models is common practice with techniques like LoRA and full parameter fine-tuning.
Could you clarify how CoreGuard prevents this specific attack scenario where the TEE is used as an oracle rather than being reverse-engineered?
Your explanations for the LWE connection and non-transformer applicability are helpful and address those concerns well.
Thank you for your clear follow-up and for raising these specific concerns about the TEE oracle attack. Below, we provide further clarification:
On TEE Oracle Fine-tuning -- First, we acknowledge your point: even without gradient flow through the TEE, attackers may still attempt fine-tuning using more advanced methods such as black-box adaptation or reinforcement learning. However, we would like to clarify that even if such fine-tuning succeeds, it does not constitute model theft or misuse, as the model after fine-tuning remains strictly dependent on TEE authorization for inference. Specifically, CoreGuard aims to prevent the unauthorized extraction and redistribution of proprietary model functionality. The critical requirement is that attackers must not be able to replicate or redeploy the model’s capabilities outside of the TEE. In the oracle-style attack described, the attacker does not gain access to the core logic or internal parameters protected by the TEE. Every inference still depends on access to the TEE, meaning the model remains tied to the authorized hardware environment. This prevents offline use, re-hosting, or redistribution of model capabilities—all of which are central to our protection goals. In this sense, the attack does not undermine the intended security guarantees of CoreGuard. Instead, it highlights the effectiveness of our structural binding: the model’s critical reasoning path remains inseparable from the TEE authorization, rendering black-box adaptation dependent and non-replicable.
Clarification on Fine-tuning -- Regarding fine-tuning, we conducted extensive experiments and confirmed that without TEE authorization, attackers are unable to recover the model’s performance through fine-tuning. Specifically, as shown in Section 4.2, we evaluated a wide range of attack settings—including LoRA and full parameter fine-tuning (FFT)—and observed that models fine-tuned on the protected parameters remained nearly unusable.
We hope this response clarifies your concerns. If there is anything further we can clarify, please let us know.
Thank you for the clarification. Your reframing of the security goals is reasonable - if the primary objective is preventing model redistribution and offline deployment, then the TEE oracle attack indeed does not violate those goals since the fine-tuned model remains bound to the TEE-enabled device.
However, I believe there are still some important concerns to consider:
- On-device misuse: While the attacker cannot redistribute the model, they can still exploit the proprietary model's capabilities on the original device for commercial gain, competitive advantage, or unauthorized applications. This may still constitute "foundational capability stealing" as mentioned in your problem statement.
- Experimental evaluation gap: Your experiments in Section 4.2 evaluate fine-tuning WITHOUT TEE authorization, but the oracle attack I described uses TEE authorization DURING the fine-tuning process. The attacker would have access to properly authorized features for training their downstream tasks. Have you specifically evaluated this scenario where attackers fine-tune while using TEE as an oracle?
- Threat model consistency: Your introduction mentions protecting against attackers who "fine-tune the partially recovered model to exploit its embedded knowledge and strong generalization capabilities for new tasks." This seems to align more with the oracle attack scenario than with the model extraction attacks you primarily evaluate against.
Could you clarify whether preventing on-device misuse (even if device-bound) is within your security objectives? And have you evaluated the oracle fine-tuning scenario where TEE authorization is available during training?
Thank you for your insightful comments regarding our threat model and the on-device fine-tuning scenario.
As described in our Threat Model section, our primary objective is to prevent running the model outside the deployed device. Based on this, "foundational capability stealing" in our paper refers to an attack that recovers the model’s downstream capabilities after it is removed from the TEE and executed off-device.
Therefore, on-device fine-tuning is not a threat that our paper tries to address, as it does not enable the attacker to bypass the TEE. We emphasize that, under this defined boundary, as long as the attacker does not attempt to bypass the TEE’s authorization, any usage—including task-specific fine-tuning—does not violate our protection objectives. Such usage is analogous to better prompt engineering or integrating the model into a RAG pipeline, rather than constituting unauthorized access.
Since the oracle fine-tuning scenario is not a threat that our paper aims to solve, we did not include it in our evaluation. However, as discussed in the previous response, even if such an approach were successful, our method also enables a potential mitigation. For example, an access control mechanism can be deployed within the TEE to limit both the total number and the frequency of authorized queries—sufficient for inference but inadequate for training.
We will revise the threat model section to avoid any potential misunderstanding.
We appreciate your feedback. Please let us know if there are any further questions or concerns.
Thank you for addressing the LWE proof and architectural concerns. However, the TEE oracle attack remains unaddressed. Your responses discuss different issues (CPU vs GPU TEE, gradient flow) but don't address the core attack: fine-tuning layers 1-3 and 4+ while using TEE authorization during training (treating TEE as a black-box oracle, not trying to reverse-engineer it).
It would be helpful to provide experimental evaluation of this specific scenario: attacker fine-tunes the model WITH TEE authorization available during training. If this attack succeeds, it undermines your core security claims regardless of device-binding.
Thank you for the clarification. Your refined threat model definition makes sense - focusing specifically on preventing off-device model extraction and redistribution rather than all forms of on-device misuse.
I understand that "foundational capability stealing" in your context refers to extracting the model to run independently without TEE authorization, and that on-device fine-tuning (while remaining TEE-dependent) falls outside your protection scope.
This clarification resolves my main concern. Your method successfully achieves its stated objective of preventing off-device model theft, and the experimental evaluation appropriately targets this threat. The suggestion of access control mechanisms for limiting query frequency also demonstrates awareness of potential on-device misuse scenarios.
Two minor suggestions for the final version:
- Clarify threat model scope early: The refined definition of "foundational capability stealing" as specifically referring to off-device extraction would help readers understand the protection boundaries from the introduction.
- Table readability: Tables 2 and 3 remain difficult to read due to small font sizes.
I will adjust my score accordingly. Thanks!
Dear Reviewer q37A:
Thank you for your encouraging response. We are delighted by your positive assessment of our work, and we will incorporate both the rebuttal points and the additional experiments into the revised manuscript.
Sincerely,
Authors
This paper proposes a method to protect all weights of an LLM with a single TEE operation: the TEE applies a secret permutation (plus a one-time pad) to the first hidden vector, after which the GPU does the heavy lifting while the permutation automatically propagates through the network. The result is full-model protection with minimal TEE involvement and no accuracy loss.
Strengths and Weaknesses
Strengths:
- The method itself makes sense and is interesting: using permutations and one-time pads in the TEE in order to outsource compute-heavy matmuls to the GPU
- The overhead is smaller in terms of number of transfers required.
- The problem is important and timely
Weaknesses:
- On lines 43-45, the authors state "directly placing the entire model within a TEE is impractical, as it results in approximately a 50× reduction in model efficiency due to the TEE’s limited computational speed [43]". But the 50x throughput difference in the cited paper comes from an experiment where the TEE was run on a CPU core and the untrusted environment was a GPU. The throughput difference in that paper came from the difference in hardware, not the TEE itself. This is a small but worrying misunderstanding, since it undermines a major pillar of the framing of this submission. Nevertheless, the goal of reducing TEE requirements is still valid and could broaden the type of hardware on which secure execution with LLMs can be run.
- Extending parameter shuffling from CNNs to transformers is valuable, yet conceptually close to ShadowNet
- The FLOP overhead is not improved compared to prior work, and the transfer overhead may be negligible with modern TEE hardware.
Questions
See weaknesses
Limitations
Yes, adequately addressed
Final Justification
The author response addresses some of my concerns. I'm still worried about the 50x number that is cited, and the authors didn't directly address my concern about the validity of that statistic. I'll increase my score to a 4, since the value of the paper is more clear to me now, but I'll decrease my confidence since I'm not certain in the evaluation.
Formatting Concerns
No concerns
Dear Reviewer,
Thank you very much for your valuable feedback and thoughtful comments. We appreciate the time and effort you have dedicated to reviewing our paper. In the following, we will address each of your concerns point by point.
Real-World TEE Settings -- We appreciate the reviewer’s insightful comment. We clarify that CoreGuard is designed for broader and more prevalent edge hardware, such as smartphones and personal computers, where TEEs are typically implemented on CPUs (e.g., ARM TrustZone, Intel SGX). These TEEs are widely deployed in commodity devices and represent the dominant form of trusted execution available in edge settings. In contrast, GPU-integrated TEEs—such as those in NVIDIA H100—are designed for cloud or datacenter usage, and are generally inaccessible for most edge deployments due to their high cost, hardware footprint, and specialized infrastructure requirements. Therefore, CoreGuard does not conflict with these server-side TEE solutions but rather serves as a complementary defense tailored for practical, real-world edge scenarios.
Why Reducing TEE-Induced Overhead Still Matters -- In edge deployment scenarios, where efficient computation and minimal latency are crucial, reducing TEE-induced overhead is essential. Specifically, TEEs inherently impose significant limitations on practical model deployment, causing computation and transfer bottlenecks that have been consistently observed in prior work. For instance, Table 7 in NPLO[51] shows that even when shielding only small adapters, 97.06% of total inference time is spent on TEE-related operations—35.61% on TEE-GPU transfer and 61.39% on TEE execution. Methods that attempt to protect model backbones face even worse bottlenecks. For example, ShadowNet incurs 1.3GB/token of transfer for LLaMA3-8B due to frequent TEE-GPU interactions (Lines 64–72), making inference impractical. Given that mainstream mobile-platform TEEs, like TrustZone, have a transfer rate of about 1GB/s between the TEE and the GPU [3], generating a single token takes about 1.3 seconds of transfer alone. Consequently, producing a complete 100-token output would require over two minutes (about 130 seconds) solely for TEE-GPU data transfer, which by itself makes inference impractical.
Key Distinction from ShadowNet -- Your concern is understandable. It is true that our method protects parameters through permutation, while ShadowNet uses shuffling, and they share a certain degree of similarity. However, the challenges they face are entirely different. Parameter protection is only the most basic goal of our method. Our main contribution lies in addressing the efficiency problem, which is not a simple “extension,” but the key to whether the scheme is practical. Specifically, ShadowNet still relies on TEE to directly execute model computations; even though it avoids computing linear layers, it still requires the TEE to execute nonlinear layers. Experimental results show that ShadowNet needs to keep about 30% of FLOPs inside the TEE. In contrast, CoreGuard uses the TEE only for authorization, without performing any model computation, reducing the FLOPs inside the TEE to 1e-5 of the original. Moreover, ShadowNet requires the feature to be transferred into and out of the TEE before every linear layer, resulting in enormous communication overhead (hundreds of transfers per inference). CoreGuard avoids this problem through one-time authorization and a propagation protocol, reducing the number of TEE transfers to just 5.
Once again, thank you for your constructive feedback. We look forward to any further suggestions you may have.
The author response addresses some of my concerns. I'm still worried about the 50x number that is cited, and the authors didn't directly address my concern about the validity of that statistic. I'll increase my score to a 4, since the value of the paper is more clear to me now, but I'll decrease my confidence since I'm not certain in the evaluation.
Thank you for your follow-up and for highlighting your concern regarding the validity of the 50× performance difference between TEE and GPU execution.
We would like to clarify that our method is intended for edge devices equipped with CPU-based TEEs. GPU/NPU TEEs on edge devices are still in the early stages of development and are not ready for practical deployment. If a high-performance GPU TEE is available, then our method is not necessary; for example, it is not needed for TEEs in GPUs such as the NVIDIA H100, which are designed primarily for cloud or datacenter environments.
On this basis, directly placing the entire model within a CPU-based TEE, e.g., ARM TrustZone, Intel SGX, is impractical, as it results in approximately a 50× increase in runtime. The 50× performance gap we referenced is not an arbitrary estimate, but is grounded in prior work. Specifically, as stated in Appendix C of [47], “On a high-end GPU (an Nvidia TITAN XP), we achieve over 50× higher throughput but no security. For example, for MobileNet, the enclave evaluates 16 images/sec and the GPU 900 images/sec (56× higher).” This performance gap is supported by experimental results in that paper. Moreover, other publications have echoed similar observations. For example, in [51], the second paragraph of the Introduction notes: “Employing Trusted Execution Environments (TEEs) to directly host DNN models is also not practical, because shielding the whole DNN model in TEEs leads to about 50× deduction in model speed due to TEE’s limited computation speed.”
To avoid potential misunderstandings, we will revise the original statement to “directly placing the entire model within a CPU-based TEE, e.g., ARM TrustZone, Intel SGX, is impractical, as it results in approximately a 50× increase in runtime”.
We hope this addresses your concern regarding the source and validity of the cited performance difference. Please let us know if further clarification is needed.
The paper introduces CoreGuard, a lightweight defense for edge‐deployed LLMs against model theft. It uses a one‐time TEE authorization and a permutation‐based “propagation protocol” to lock and unlock Transformer weights with only a few TEE–GPU interactions. Experiments on four models (Qwen2, Gemma2, ChatGLM3, LLaMA3) and four tasks show CoreGuard matches black‐box security limits while incurring two orders of magnitude less overhead and negligible accuracy loss.
Strengths and Weaknesses
Strengths:
1. CoreGuard achieves low overhead through a one-time TEE authorization with effective protection propagation, significantly reducing runtime and memory costs.
2. The authors conduct extensive experiments across multiple LLM architectures to validate its effectiveness.
3. CoreGuard is architecture-agnostic, lightweight, and deployable without retraining, requiring only minimal TEE support.
Weaknesses: My review is short, since this paper definitely does not have a broad audience in the AI community. I suggest submitting to security conferences such as CCS.
1. The paper lacks substantial theoretical analysis and offers limited novel insights to the AI community. The problem formulation is missing. In Section 2, I expect a formal problem formulation introducing the model, the locked part, what is modified, and the architecture or parameters. Section 2 is very poorly written.
2. It reads more as a systems engineering effort than as a method tailored to LLM-specific properties, which may not be suitable for NeurIPS. When I read this paper, it feels more like a MobiSys paper than an AI paper. I am not sure why this paper is important to people who work on general AI safety. The audience is very limited.
3. The scalability to large models (13B–70B+) is untested, and the security evaluation omits more sophisticated adaptive attacks that jointly recover permutations and OTP noise.
Questions
1. How does CoreGuard scale to very large models (e.g., 13B or 70B)? Are there memory or latency bottlenecks from storing or propagating large permutation matrices?
2. Have you evaluated or considered integration with side-channel defenses? What is the overhead when combining CoreGuard with techniques like HybCache [1]?

[1] Dessouky, Ghada, Tommaso Frassetto, and Ahmad-Reza Sadeghi. "HybCache: Hybrid Side-Channel-Resilient Caches for Trusted Execution Environments." 29th USENIX Security Symposium (USENIX Security 20). 2020.
Limitations
I believe this paper proposes a well-designed approach, but it reads more as a systems engineering effort than as a method tailored to LLM-specific properties, which may not be suitable for NeurIPS and could be more suitable for a security conference.
Final Justification
Part of my concern has not been addressed. I will keep my score.
Formatting Concerns
No
Dear Reviewer,
Thank you very much for your valuable feedback and thoughtful comments. We appreciate the time and effort you have dedicated to reviewing our paper. In the following, we will address each of your concerns point by point.
Scalability and Bottleneck Analysis of CoreGuard -- We appreciate the reviewer’s question regarding the scalability of CoreGuard. While our current evaluation focuses on 0.5B–8B models to align with real-world edge deployment scenarios (e.g., Apple Intelligence, Gemma2, Qwen2), CoreGuard is fundamentally designed to scale to much larger models such as 13B or 70B without incurring bottlenecks in memory or latency. Specifically, first, regarding memory usage, it is important to clarify that the permutation matrix never leaves the TEE and does not occupy GPU memory. Additionally, with CoreGuard, only one permutation matrix is needed for the entire model, which is a negligible storage requirement for the TEE. Regarding propagation overhead, no costs are incurred during inference, regardless of the model's size. Specifically, the propagation protocol involves only the column permutation of the weight matrices, which is applied before deployment and is not part of the inference process. Therefore, no additional propagation overhead is introduced compared to the original model.
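For illustration (not our deployed code), the permutation can be stored and applied inside the TEE as a plain index vector rather than a dense matrix; with an assumed hidden size of 8192, this is only about 32 KB, and applying it to a feature vector is linear-time indexing:

```python
# Storage sketch: a permutation over hidden size d needs only an index vector
# (sizes below are illustrative assumptions, not measurements from the paper).
import numpy as np

d = 8192                                            # illustrative hidden size
perm = np.random.permutation(d).astype(np.int32)    # the entire "permutation matrix"
print(f"{perm.nbytes} bytes")                       # 32768 bytes (~32 KB) inside the TEE

x = np.random.randn(d).astype(np.float32)
x_perm = x[perm]                                    # authorizing a feature vector is O(d) indexing
```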
Compatibility with Side-Channel Defenses such as HybCache -- CoreGuard and HybCache can be seamlessly integrated. HybCache is a side-channel defense mechanism designed to prevent cache-based attacks in SGX by periodically and randomly changing how memory pages are mapped within the enclave, making it difficult for attackers to infer sensitive information from access patterns. These measures, both functionally and in terms of efficiency, are fully compatible with CoreGuard. Specifically, from a functional perspective, CoreGuard’s use of the TEE involves a few simple matrix operations, such as applying row and column permutations and basic matrix addition. These operations follow predictable memory access patterns and avoid data-dependent branching, ensuring they do not interfere with HybCache’s memory randomization, which periodically changes memory page mappings within the enclave to prevent cache-based attacks. From an efficiency perspective, CoreGuard’s lightweight matrix operations are computationally efficient, as they involve simple arithmetic. This means CoreGuard’s operations do not introduce additional memory accesses or data transfers, allowing HybCache to perform its memory page randomization without extra overhead.
Moreover, CoreGuard can be integrated with a wide range of side-channel defense mechanisms because it relies only on basic TEE functionality. It does not require specialized memory layouts, privileged instructions, or microarchitectural modifications. Specifically, from a software perspective, CoreGuard uses the TEE only for basic matrix operations, making it easy to integrate with existing side-channel defense algorithms such as constant-time programming and memory access randomization. From a hardware perspective, CoreGuard requires no changes to the TEE’s microarchitecture, enabling compatibility with physical protection techniques such as EM shielding or power noise injection. These properties allow CoreGuard to serve as a complementary layer of defense in diverse threat models.
Security Under Joint Recovery Attacks Against CoreGuard -- We would like to clarify that joint attacks attempting to recover both permutations and OTP noise represent a very difficult problem. Specifically, in Appendix D, we present our Security against Permutation Matrix Simulation Attack, where we specifically attempt to recover the permutation matrix in isolation. Our results demonstrate that such attacks are unsuccessful, emphasizing the robustness of our approach to individual attacks targeting permutations. Additionally, recovering OTP noise is nearly impossible due to the nature of OTP encryption. Each OTP noise element is unique and used only once, making it infeasible for attackers to deduce the noise matrix through training or other methods. Moreover, attempting to recover both jointly is an even larger and more difficult problem. Specifically, we further analyze this infeasibility from a theoretical perspective in Appendix C. We show that any attempt to jointly fit the permutation and the noise leads to a system of equations that is mathematically equivalent to the Matrix Learning With Errors (Matrix-LWE) problem—a well-studied extension of the standard LWE problem. Matrix-LWE is widely believed to be computationally intractable, with hardness reductions to worst-case lattice problems such as the Shortest Independent Vector Problem (SIVP) and the Gap Shortest Vector Problem (GapSVP). Therefore, even with unlimited access to input-output pairs, solving for both the permutation and the noise remains provably hard. If the reviewer has other concrete attack strategies or threat assumptions in mind—such as specific attacker capabilities or attack methods—we would be happy to examine and evaluate them within our framework.
Clarifying the Contribution of CoreGuard -- We respectfully clarify that our approach is not merely a systems engineering effort. Rather, it introduces a novel method that deeply intervenes in the inference pathway of LLMs, by modeling and controlling the transformation of internal features through structured authorization and propagation mechanisms. Specifically, CoreGuard is designed to address foundational capability stealing, a threat unique to LLMs due to their strong generalization abilities. To counter this, we propose a semantic-preserving but structurally disruptive transformation on LLM layers (e.g., linear projections, FFN blocks, Add&Norm), which enforces runtime authorization and blocks unauthorized reassembly of the model’s knowledge pipeline. This design goes beyond conventional deployment strategies by reconstructing the LLM’s computation graph in a way that binds its functionality to a trusted enclave, thus preventing misuse of the model’s generalization ability. We believe this contribution is not only technically novel but also methodologically grounded in LLM-specific reasoning patterns and architecture semantics, making it relevant to the NeurIPS audience concerned with LLM behavior modeling, robustness, and secure deployment.
Clarifying the Formal Problem Formulation -- We recognize that introducing the formal definitions earlier would have better clarified the problem setup and supported the threat model more clearly. In the revised version, we have reorganized Section 2 (now titled Problem Statement and Threat Model) to include a formal definition of the model structure, the protected components, and the adversarial objective. Specifically, we formally define the LLM as a parameterized function and identify the input-processing and output-processing layers involved. We explicitly describe how the model is "locked" via permutation of parameter matrices, and how this locking constrains unauthorized usage. The revised section also clarifies what parts of the model are exposed to the attacker (e.g., weights outside the TEE) and what transformations are applied to enforce protection. These definitions align with the attacker's capabilities and the defender's goals, and lay the foundation for the mechanism presented in Section 3.
Once again, thank you for your constructive feedback. We look forward to any further suggestions you may have.
The rebuttal does not address my main concerns:
- Limited AI Community Impact – The work remains primarily a TEE-based security engineering solution with unclear relevance to the broader NeurIPS audience. Claims of LLM-specificity are not convincingly substantiated.
- Weak Problem Formulation and Theory – The "formal problem" is still an operational description, lacking rigorous modeling or provable guarantees. Methodological depth remains limited.
- Incomplete Security Evaluation – The Matrix-LWE hardness claim is untested against realistic adaptive attacks. No empirical stress tests are provided to validate robustness in practical adversarial scenarios.
I will keep the score. The work is better suited for a security venue than NeurIPS.
Thank you for your feedback. We note that some concerns remain unaddressed, and we would like to offer further clarification.
Limited AI Community Impact -- We note that you have simply characterized our work as “TEE-based security engineering.” However, we would like to clarify that TEE is merely our background hardware environment. Specifically, we use it only for the most basic functions, including storing keys and performing matrix permutations and additions/subtractions, and nothing else. Crucially, in order to make it possible for the TEE to perform only such simple authorization functions while still achieving efficient model protection, our entire design is tailored to the LLM itself—including how to permute model parameters to achieve protection, how to permute features so that they can be correctly processed by the protected parameters, and how to permute the output-processing layers of transformer blocks to enable proper authorization propagation.
Moreover, directly using TEE as an engineering implementation cannot solve the problem we aim to address. As discussed in the paper, many prior TEE-based approaches (e.g., [24, 35, 38, 51] in our paper) either fail to provide sufficient model protection or incur prohibitive efficiency costs. In contrast, our design—validated through comparisons with multiple representative schemes—leverages a targeted model parameter protection protocol, an authorization propagation protocol, and a secure input feature authorization mechanism, achieving significantly stronger security and higher efficiency than existing solutions.
Regarding your comment on “unclear relevance to the broader NeurIPS audience,” our work addresses the protection of LLM intellectual property (IP) safety and model parameter privacy in edge deployment scenarios, which falls squarely within our chosen primary area: Social and economic aspects of machine learning (e.g., fairness, interpretability, human-AI interaction, privacy, safety, strategic behavior)—which is an official NeurIPS 2025 track. Moreover, NeurIPS has consistently accepted papers on model IP protection in recent years, including [1] in NeurIPS 2019, [2] in NeurIPS 2020, [3] in NeurIPS 2021, [4] in NeurIPS 2022, [5] in NeurIPS 2023, and [6] in NeurIPS 2024. Beyond NeurIPS, there are numerous examples of neural network IP security research in other leading AI Communities —for instance, [7] on model parameter protection in edge deployment at ICML, [8] on model watermarking at ICLR, and [9] on IP protection at CVPR.
Finally, we would appreciate clarification on what you specifically mean by “LLM-specificity”. A clearer definition would help us better address your concern.
Reference:
[1] Fan L, Ng K W, Chan C S. Rethinking deep neural network ownership verification: Embedding passports to defeat ambiguity attacks[J]. Advances in neural information processing systems, 2019, 32.
[2] Zhang J, Chen D, Liao J, et al. Passport-aware normalization for deep model protection[J]. Advances in Neural Information Processing Systems, 2020, 33: 22619-22628.
[3] Knott B, Venkataraman S, Hannun A, et al. Crypten: Secure multi-party computation meets machine learning[J]. Advances in Neural Information Processing Systems, 2021, 34: 4961-4973.
[4] He X, Xu Q, Zeng Y, et al. Cater: Intellectual property protection on text generation apis via conditional watermarks[J]. Advances in Neural Information Processing Systems, 2022, 35: 5431-5445.
[5] Guo J, Li Y, Wang L, et al. Domain watermark: Effective and harmless dataset copyright protection is closed at hand[J]. Advances in Neural Information Processing Systems, 2023, 36: 54421-54450.
[6] Xian X, Wang G, Bi X, et al. Raw: A robust and agile plug-and-play watermark framework for ai-generated images with provable guarantees[J]. Advances in Neural Information Processing Systems, 2024, 37: 132077-132105.
[7] Zhou T, Luo Y, Ren S, et al. NNSplitter: an active defense solution for DNN model via automated weight obfuscation[C]//International Conference on Machine Learning. PMLR, 2023: 42614-42624.
[8] Gunn S, Zhao X, Song D. An Undetectable Watermark for Generative Image Models[C]//The Thirteenth International Conference on Learning Representations.
[9] Leroux S, Vanassche S, Simoens P. Multi-bit black-box watermarking of deep neural networks in embedded applications[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 2121-2130.
Weak Problem Formulation and Theory -- Regarding your comment on “lacking rigorous modeling or provable guarantees,” we would like to clarify that wherever rigorous modeling or provable guarantees are required, we have already provided them in full. For example, we offer a theoretical guarantee of the authorized model’s performance (Eq. 6) and a formal proof that key-recovery attacks cannot succeed (Appendix C)—which you yourself refer to in your next comment. If you believe there are still specific aspects that require further “modeling or provable guarantees,” please specify them, and we will be glad to provide additional details.
Incomplete Security Evaluation -- Regarding your comment on being “untested against realistic adaptive attacks,” we would like to clarify that we have already conducted extensive experiments in Appendix D: Adaptive Attack. These experiments adopt assumptions even stronger than those in real-world scenarios—for example, assuming the attacker has access to the entire training dataset and can make unlimited queries to the TEE. The results show that such attacks still fail.
Furthermore, during the discussion phase, other reviewers proposed specific adaptive attack strategies. Notably, after careful examination, we reached an agreement with the reviewers that these attacks would still be unable to achieve model-stealing goals. If you have concrete examples of “realistic adaptive attacks” in mind, we are willing to run additional experiments and incorporate the results into the final version.
We appreciate your feedback. Please let us know if there are any further questions or concerns.
The paper introduces CoreGuard — an efficient protection method for preventing model stealing in edge deployment. CoreGuard focuses on active protection against model architecture, weight, and foundational capability stealing attacks. CoreGuard implements an efficient variant of parameter shuffling protection methods in trusted execution environment (TEE) which dramatically reduces the TEE-GPU data transfer overhead. Specifically, CoreGuard employs a protection protocol based on row permutations of linear LLM layers (requiring their inputs to be correspondingly column-permuted) and introduces a propagation protocol that reduces TEE authorizations only to a single initial authorization. CoreGuard achieves security levels close to upper-bound black-box protection outperforming existing methods in terms of both security and computational efficiency. Overall, CoreGuard is novel and the only method so far that delivers protection with proactivity, runtime security, backbone protection, sufficiency and efficiency. Importantly, CoreGuard does not lead to model accuracy degradation and remains secure even if attackers have access to training data.
Strengths and Weaknesses
Overall, while this paper is slightly outside of my expertise, this is one of the most impressive submissions I have seen at NeurIPS. The method is directly applicable in practice and has endless utility addressing a very important problem.
Strengths:
- CoreGuard is effectively the first method capable of sufficient protection of LLM models on edge which does not have prohibitive overhead (and also has runtime security through the use of TEEs)
- CoreGuard far outperforms existing baseline methods in terms of both security and computational efficiency, does not lead to model accuracy degradation, and remains secure even if attackers have access to training data
- The paper validates CoreGuard through extensive experimentation across 11 baselines and 16 LLM backbone-task combinations
- The paper is very well written
Weaknesses:
- CoreGuard is only as secure as TEEs (however, this is outside the scope of this study and does not diminish the merit of the method itself)
- Model distillation is a real model stealing concern for edge-deployed LLMs since the attackers may be able to simulate a large number of inputs and outputs without being reliably blocked like in the remote model accessed through API case. While CoreGuard is impressive, it is still vulnerable to model distillation
- The appendix presents multiple possible attacks on CoreGuard and shows they are ineffective even assuming access to 100% of the training data (for example, the Authorization Simulation Attack). The key to this failure seems to be the difficulty of reliably recovering the underlying weight matrices/FFN blocks, given that even small errors in weight values could derail LLM accuracy. However, since the model is deployed on edge, the attacker can query it as many times as they want, effectively creating unlimited training data for weight recovery and authorization simulation. Can you run an experiment querying the model to create additional data and showing that even 5x the amount of training data is not enough to reliably simulate authorization?
- Are there truly secure methods to reliably prevent the attackers from querying the model unlimited number of times?
Questions
- Can CoreGuard be applied to any neural net? Not necessarily a language model, right?
Limitations
Limitations are discussed.
Final Justification
Keeping my high score
Formatting Concerns
No concerns
Dear Reviewer,
Thank you very much for your valuable feedback and thoughtful comments. We appreciate the time and effort you have dedicated to reviewing our paper. In the following, we will address each of your concerns point by point.
Regarding Black-Box Distillation Attacks -- We acknowledge that our work does not address black-box distillation attacks, as this constitutes a fundamentally different research topic. CoreGuard focuses on enforcing access control—ensuring the model cannot run without authorization—while assuming that authorized users can access the model normally. In such cases, attackers could indeed collect input-output pairs and perform distillation. Defending against distillation attacks is a new problem, typically categorized into passive defenses (e.g., output watermarking, randomness injection) and active defenses (e.g., training-time obfuscation, anti-distillation regularization). These techniques aim to reduce the effectiveness or fidelity of the distilled model but are orthogonal to CoreGuard’s objective.
Does More Training Data Break CoreGuard? -- Thank you for raising this important point. To evaluate the robustness of CoreGuard under more powerful attackers, we conducted additional experiments targeting both the fine-tuning attack and the Authorization Simulation Attack. First, for the fine-tuning attack, we conducted an additional experiment where we expanded the original training set by querying the protected model to generate more data. We then used this enlarged dataset to perform model stealing attacks following the same procedure as in our main experiments. The results show no meaningful improvement in attack accuracy—accuracy remains close to the black-box baseline, indicating that even significantly increased query data does not compromise CoreGuard’s protection. Next, for the Authorization Simulation Attack, which attempts to simulate the entire TEE structure, including the FFN block and the permutation matrix inside the TEE, we trained with the enlarged dataset and ensured that the loss had already converged. The results again show that accuracy remains close to the black-box baseline. This is because CoreGuard requires high precision in the authorization process, and even slight simulation errors can compromise the model’s performance. Since the results cannot be displayed on this page, all detailed results will be added to the Appendix.
On the Existence of a Practical Upper Bound for Defense -- While we agree that attackers can, in theory, issue an unlimited number of queries, each query incurs real-world computational cost and, when scaled up, can become equivalent to training a new model from scratch—ultimately making the attack economically meaningless. From a defense perspective, CoreGuard is explicitly designed to raise the attack cost close to this upper bound: the cost of training a model from scratch given only the architecture (i.e., the black-box setting, where the attacker knows the model structure but must re-learn all parameters). Achieving such black-box-level protection is both a meaningful and practical target for securing edge-deployed LLMs, as it renders model stealing attacks economically and technically infeasible.
Applicability of CoreGuard to Non-Transformer Models -- While our implementation focuses on Transformer models, the design of CoreGuard itself is inherently architecture-agnostic, as it builds upon the three core design principles that each independently exhibit compatibility across diverse architectures. First, from the protection perspective, CoreGuard introduces structure-aware defenses by embedding permutation-based transformations into linear mapping layers, which serve to disrupt potential low-loss merging paths between models. This mechanism targets the mathematical structure of linear layers rather than relying on Transformer-specific components. Since linear transformation layers are ubiquitous in neural networks—including convolution layers in CNNs and fully connected layers in MLPs—our protection approach applies readily to those architectures as well. Second, from the authorization perspective, the generation and injection of both the permutation matrices and OTP noise are handled entirely by the TEE. These operations depend solely on the TEE’s capability to manage secure randomness and authorize parameter access. The architecture of the underlying model does not affect the authorization process, regardless of whether the model is a Transformer, CNN, or any other architecture. Third, from the perspective of authorization propagation, CoreGuard ensures that downstream layers can only be unlocked if upstream layers have been correctly authorized. This function operates on the local behavior of each linear transformation layer and does not require any specific structural assumption about the model. Thus, the propagation rule is universally applicable to any architecture that contains linear transformation layers.
We take a standard CNN like ResNet as an example. CoreGuard protects the convolutional kernels by permuting the input channels of the kernel, effectively locking it. To ensure proper operation, the input feature map must be similarly permuted to match the kernel's new channel order (i.e., it must be authorized). Additionally, the output channels of the convolutional kernel are permuted to ensure that the output channels correspond to the input channels of the next kernel. Once the first layer is authorized with the correct input permutation, the authorization propagates automatically to subsequent layers (i.e., authorization propagation). This approach works seamlessly with nonlinear layers like ReLU, batch normalization (BN), and LeakyReLU, as they only process the values of the elements and are independent of their positions. In this way, CoreGuard provides model parameter protection and ensures the model can only operate under authorization.
Once again, thank you for your constructive feedback. We look forward to any further suggestions you may have.
Dear reviewers:
We hope this message finds you well.
We are writing to kindly inquire about the status of your feedback on our recent rebuttal. We understand that your time is valuable, and we greatly appreciate the effort you have already put into reviewing our manuscript. Your insights are crucial to the improvement of our work, and we are eager to address any remaining concerns you may have.
If there are any additional questions or clarifications needed from our side, please do not hesitate to let us know. Since the discussion phase is still ongoing, we hope to take advantage of this valuable time to engage in more in-depth exchanges with you.
Thank you once again for your time and consideration. We look forward to hearing from you soon.
Best regards,
Authors of CoreGuard
This paper introduces CoreGuard, a lightweight protection method for safeguarding large language models (LLMs) deployed at the edge against model stealing. The key idea is to use a one-time trusted execution environment (TEE) authorization combined with permutation-based propagation, which secures model parameters with minimal computational and communication overhead. Experiments across multiple LLMs and tasks show that CoreGuard achieves near upper-bound protection while maintaining negligible accuracy loss, making it a practical solution for securing proprietary models.
In general, the paper addresses an important and timely problem and proposes a technically novel solution that is efficient, architecture-agnostic, and practically deployable. The method is supported by extensive experimental validation and engages with both theoretical and empirical analysis, earning strong endorsements from several reviewers. However, as pointed out by the reviewers, there are some concerns about the scope of the threat model, the rigor of the theoretical reduction to Matrix-LWE, and open questions about broader applicability and defense against model distillation. Despite these, the authors have provided convincing clarifications in their rebuttal, and the majority of reviewers raised their scores after discussion. Overall, I believe the strengths of this work outweigh the weaknesses, and I think it has a strong potential impact on secure deployment of LLMs.