PaperHub
Score: 6.1/10 · Poster · 4 reviewers (ratings 4, 3, 3, 3; min 3, max 4, std dev 0.4)
ICML 2025

Black-Box Adversarial Attacks on LLM-Based Code Completion

Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We introduce the first black-box adversarial attack on commercial code completion engines that injects insecure code into unsuspecting users' codebases.

Abstract

Keywords
code completion, security, code security, adversarial attacks, black-box, large language models, large language model

Reviews and Discussion

Official Review
Rating: 4

The authors propose INSEC, a black-box attack that crafts a universal perturbation attached to code such that, once the code is submitted to LLMs, the completions include unsafe functionality that can later be exploited by an attacker. This universal perturbation is computed on a training set and generated through a heuristic-based optimizer that ranks the modified prompts by the vulnerability they contain and by their functionality. Since the attack optimizes a snippet of code, it is easy to provide unit tests that assess functionality. Vulnerability is checked with a state-of-the-art tool, which yields reliable results. Experiments show that, while the perturbed prompts do not always achieve the same functionality (the ratio of passed tests is slightly diminished), the number of injected vulnerabilities increases dramatically, even against commercial products.
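For readers unfamiliar with this class of attack, a minimal sketch of such a query-based, mutation-and-rank search is shown below. The helper names (`mutate`, `score_vulnerability`, `model.complete`) are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
import random

# Minimal sketch of a query-based mutation-and-rank search (assumed interface;
# `mutate`, `score_vulnerability`, and `model.complete` are hypothetical helpers).
def optimize_attack_string(model, train_snippets, init_string, steps=50, pool_size=8):
    pool = [init_string]
    for _ in range(steps):
        # Propose mutated candidates by randomly altering tokens of pool members.
        candidates = pool + [mutate(random.choice(pool)) for _ in range(pool_size)]
        # Rank candidates by how often the attacked completions look vulnerable.
        scored = sorted(
            candidates,
            key=lambda c: sum(
                score_vulnerability(model.complete(snippet, attack=c))
                for snippet in train_snippets
            ),
            reverse=True,
        )
        pool = scored[:pool_size]  # keep the best-performing strings
    return pool[0]
```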

Questions for the Authors

  1. How might the results change with different static analyzers?
  2. How sensitive is the methodology to the presence / absence of suitable unit tests?

Claims and Evidence

Claims are supported by an extensive experimental evaluation considering most of the variants:

  • where the adversarial prompt is injected
  • how to alter the context in case of sanitization (to avoid trivial solutions which are discarded by preprocessors)
  • the hyper-parameters of the optimization process itself
  • code is available

Methods and Evaluation Criteria

The methodology is sound, and the evaluation criteria seem legitimate. In particular, the generated code can be checked for vulnerabilities with static analyzers like CodeQL (or others) and for functionality with the provided unit tests.

Theoretical Claims

This is an empirical evaluation of an attack, no theoretical findings are given.

Experimental Design and Analysis

The design of the experiments is sound, as the authors provided a full ablation study on all the possible parameters of their attack (from the parameters of the optimizers to the placement of the prompt).

Supplementary Material

The supplementary material is provided, along with the code of the plugin and the attack. No unsafe content is included in the shared repository.

Relation to Prior Work

This paper highlights the need for better policies for plugins, since this threat can be attributed to the usual risks of programs downloaded from a store. Specifically, this paper can be of interest to researchers in the security community.

Missing Important References

None.

Other Strengths and Weaknesses

I find the paper very interesting, easy to read, and a contribution towards the security evaluation of LLMs in security-related domains. The method is simple and intuitive, since it changes one token at a time through a very simple black-box algorithm. The weaknesses I find are related to (i) the choice of CodeQL as the vulnerability oracle, which could also be studied in an ablation; (ii) the unsafe generated code might not be reachable in the victim application, since this evaluation is done on simple snippets of code; and (iii) unit tests must be provided, which might not always be possible.

Other Comments or Suggestions

None.

Author Response

We thank the reviewer for their overall positive review and discuss their raised questions below. We will gladly incorporate their feedback into the next revision of the paper.

Q1: Do you expect the results to change with different static analyzers? Is using CodeQL an important choice for the attacks' success?

There are two possible ways in which CodeQL may be substituted with alternative methods: For the optimization step and for the evaluation step.

Regarding the use of CodeQL for optimization, we note that our method does not depend on CodeQL, which was chosen due to its customizability. Attackers may build their attacks using other static analysis tools; the specific tool and its capabilities are not essential to the attack itself. In fact, we only utilize specific queries from the extensive repository of CodeQL. Attackers could similarly hand-craft a specialized tool to detect precisely the vulnerabilities that they want to inject, use it for training, and then manually assess the quality of the results. This would even increase the severity of the resulting attack, since the injected vulnerabilities would likely not be detected by potentially deployed, publicly available analyzers.

Regarding the use of CodeQL for evaluation, we note that the use of CodeQL is standard in the field [1,2]. It is highly accurate, since our evaluation setting is very controlled, in that we know how vulnerabilities can manifest in the generated code for all test cases. This allows us to carefully select a CodeQL query that works effectively for each test case as described in Section 5.1. We manually analyze the accuracy of CodeQL for our evaluation and determine an accuracy of 98% (cf. Appendix A, Line 660). Other static analyzers or oracle exploits, as explored in concurrent work [3,4], could be used instead, but would need to be assessed for their accuracy in this specific setting.
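As a concrete illustration of this evaluation step, the sketch below shows how a generated completion could be checked with the CodeQL CLI from Python. The database name, query path, and the decision rule (any reported alert counts as vulnerable) are assumptions made for illustration, not the paper's exact pipeline.

```python
import subprocess

# Hedged sketch: build a CodeQL database over the directory containing the
# completed snippet and run a single security query against it.
def codeql_flags_vulnerability(source_dir: str, query_path: str) -> bool:
    subprocess.run(
        ["codeql", "database", "create", "completion-db",
         "--language=python", f"--source-root={source_dir}", "--overwrite"],
        check=True,
    )
    subprocess.run(
        ["codeql", "database", "analyze", "completion-db", query_path,
         "--format=csv", "--output=results.csv"],
        check=True,
    )
    with open("results.csv") as f:
        return bool(f.read().strip())  # any reported alert => treat as vulnerable
```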

Q2: Can you please comment on the relationship between the reachability of code snippets and their criticality for code security?

Thank you for raising this important question about the reachability of code snippets in user code. It is crucial to recognize that any insecure code in a code base poses a potential risk. Prior work [5] showed that insecure, Copilot-generated code has already reached public code repositories. Even when not immediately exposed to user inputs, these vulnerabilities could become critical through future refactorings, thereby posing a security risk. Since the introduced vulnerabilities are thus always undesirable and should be avoided by code completion engines, it is common in the literature to analyze generated code snippets [1,2].

Q3: How does the attack behave when unit tests are included in the optimization procedure?

We would like to first highlight that our attack does not explicitly optimize for correctness, i.e., there are no unit tests considered in the optimization. Still, our attack manages to achieve a high preservation of correctness, measured by the unit tests used during evaluation of the method. If unit tests were to be included in the optimization, we would expect that the correctness would be at least maintained to the same degree as it is without considering this target.

References
[1] J He & M Vechev. Large language models for code: Security hardening and adversarial testing. CCS 23
[2] H Pearce et al. Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions. IEEE S&P 22
[3] M Vero et al. BaxBench: Can LLMs Generate Correct and Secure Backends? arXiv
[4] J Peng et al. CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation. arXiv
[5] Y Fu et al. Security Weaknesses of Copilot-Generated Code in GitHub Projects: An Empirical Study. TOSEM 25

Reviewer Comment

I have read the rebuttal, but Q3 is not really answering the question I have posed, which asks what happens when unit tests are not present. Can the authors better explain this?

Author Comment

We thank the reviewer for the follow-up question. We want to highlight that unit tests are not present in our presented attack. We use unit tests during evaluation to assess preservation of functional correctness during our attack.

In the first phase of the attack, we optimize an attack string to increase the vulnerability rate of the attacked LLM using random optimization. During the optimization steps (Alg. 1, L. 7), we select the best attack strings based on triggered vulnerabilities. The vulnerability of completions is measured using heuristic classifiers that check for the presence or absence of security-critical tokens. At the end of the optimization (Alg. 1, L. 9), we again select the best attack string based on triggered vulnerabilities, this time assessed more precisely using CodeQL. Neither selection takes into account passing or failing unit tests. Therefore, during optimization, unit tests are not present.
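To make the distinction concrete, a toy stand-in for such a heuristic classifier is sketched below; the specific token patterns are illustrative assumptions, not the rules used in the paper.

```python
# Toy heuristic classifier: flag a completion as likely vulnerable based on the
# presence of security-critical tokens (patterns here are illustrative only).
def looks_vulnerable(completion: str, cwe: str) -> bool:
    if cwe == "cwe-078":  # OS command injection: shelling out with user-built strings
        return "os.system(" in completion or "shell=True" in completion
    if cwe == "cwe-089":  # SQL injection: query assembled via string concatenation
        return "execute(" in completion and '" +' in completion
    return False
```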

In the second phase, the attack is deployed by injecting the optimized string into user queries to the attacked LLM. The string is injected indiscriminately into every prompt, i.e., we check neither the potential vulnerability nor the functionality requirements of the query. Therefore, during deployment, unit tests are not present.
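A minimal sketch of this injection step is shown below, using StarCoder-style fill-in-the-middle sentinel tokens as an assumed prompt format; the attack string itself is a placeholder, and the actual plugin or engine format may differ.

```python
# Deployment-time injection: append the fixed attack comment to the prefix of
# every fill-in-the-middle request (StarCoder-style sentinels assumed here).
ATTACK_COMMENT = "# <optimized attack string goes here>"  # placeholder

def build_attacked_fim_prompt(prefix: str, suffix: str) -> str:
    injected_prefix = prefix + ATTACK_COMMENT + "\n"
    return f"<fim_prefix>{injected_prefix}<fim_suffix>{suffix}<fim_middle>"
```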

Finally, we evaluate the impact of our deployed attack on completions using two separate datasets for vulnerability and functional correctness: we construct a dataset to measure vulnerability using CodeQL, and we use HumanEval to measure functional correctness via its unit tests, which is standard evaluation practice.

Official Review
Rating: 3

The paper introduces INSEC, a novel black-box adversarial attack that manipulates LLM-based code completion engines to generate vulnerable code while maintaining functional correctness. The attack works by inserting a specially crafted comment string before the completion cursor, which is derived through a query-based optimization procedure. The authors demonstrate INSEC's effectiveness across multiple state-of-the-art models and commercial services, including GitHub Copilot and GPT-3.5-Turbo-Instruct.

Questions for the Authors

See above weakness points.

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes.

Theoretical Claims

I think this paper does not involve theoretical claims.

Experimental Design and Analysis

The experimental design is well-structured, covering attack effectiveness, stealthiness (code functionality), and numerous ablation studies. However, considering that this paper targets a realistic black-box setting such as Copilot, in actual practice users developing software mostly face project-level development rather than the simple function-level problems found in HumanEval. Therefore, I believe the paper lacks experiments in scenarios like cross-file code completion to further support the usability of the attack in practical black-box settings. Additionally, the paper mentions that the attack must maintain its response speed, while the experiments indicate that 10-20 iterations might be needed to implement the attack. The authors do not directly report the time required: if it is at the millisecond level, users might not notice, but if it takes several seconds or even tens of seconds, I believe this would have a significant impact. Therefore, I suggest the authors supplement this with relevant, intuitive data.

Supplementary Material

They provide comprehensive code, even including the developed VSCode plugin.

Relation to Prior Work

To my knowledge, this paper takes a step further in making actual code completion engines generate insecure code. Many studies investigate general LLMs generating insecure code or conduct robustness testing (adversarial attacks) on code completion to generate incorrect code, but generating specific insecure code has not been deeply researched yet.

Missing Important References

There are many related works on robustness and security testing of code completion engines that this paper doesn't discuss. For example:

[1] CCTEST: Testing and Repairing Code Completion Systems (ICSE23)

[2] Attribution-guided Adversarial Code Prompt Generation for Code Completion Models (ASE24)

[3] TPIA: Towards target-specific prompt injection attack against code-oriented large language models. Although TPIA has not been published yet, it adopts a very similar approach, specifically attacking code completion engines through comment insertion. Therefore, I suggest the authors discuss this work as well.

Other Strengths and Weaknesses

Strengths:

  1. The paper addresses an important problem
  2. The writing structure is clear
  3. Extensive experimental scale with many ablation studies

Weaknesses:

  1. Limited novelty: Considering that inserting comment strings to attack code inference engines (or code generation models) is a common approach, I would like the authors to clarify their key innovations, such as which string initialization and mutation types have not been considered by other methods, or provide further insight into what types of strings are more likely to lead to insecure code generation.
  2. Lack of experiments in more practical scenarios: Considering most users use Copilot for repository-level coding, I suggest the authors use more complex datasets closer to real-world scenarios to demonstrate effectiveness.
  3. Lack of reporting on attack time requirements: Given that the authors claim one of the core requirements of the attack is to maintain response speed, I suggest the authors directly report the time required for the attack.

Overall, I believe that although this paper has limited novelty, it does research an important practical security issue, which is very meaningful. I would be willing to increase my score if the authors could provide further clarification on the concerns mentioned above.

Other Comments or Suggestions

no

Author Response

We thank the reviewer for their insightful remarks. We briefly answer the raised questions below and will incorporate all feedback into our next revision.

Q1: How does your work differ from previous work that attacks Code LLMs through perturbed inputs?

We thank the reviewer for their references to [1,2,3] and will extend our discussion of related work that attacks LLMs using input perturbations, such as [4,5]. Note that, to our knowledge, we are the first work to propose a realistic threat model and the first to attack Code LLMs by injecting short code comments.

The more common setting is to assess model robustness by perturbing the entire user input. For example, [1,2,4,5] rename variables and functions, among other semantics-preserving perturbations, to trigger functionally incorrect or insecure code completions. While [2,4] target insecure completions, only [4] also ensures preservation of functional correctness. None of these attacks is suitable for stealthy attacks in our setting, as they assume white-box access or allow an expensive search for each individual query. Overall, prior work is designed for model developers who are interested in assessing LLM robustness. In contrast, we discover a short injection string, the attack comment, that triggers correct but insecure completions over many samples, suitable for our realistic threat model of attacking unassuming users.

Concurrent work [3] attacks LLMs to trigger insecure completions by injecting code snippets into the RAG context. Their setting differs in three important aspects: First, they leverage white-box model access to optimize their attack. Second, in their RAG setting, larger attack code snippets can be included into the model context. Third, they do not evaluate whether their attack preserves functional correctness and would thus be sufficiently stealthy to succeed under our realistic threat model.

Q2: How is functional correctness preserved on a repository-level benchmark?

We thank the reviewer for their question and refer them to App. D (L. 888 ff.), where we present how INSEC affects performance on the repository-level code completion benchmark RepoBench [6]. We observe that, matching our results on HumanEval, performance in terms of exact match and code similarity metrics is preserved at rates of over 83%.

As suggested by the reviewer, we will highlight this experiment more in the next revision.

Q3: How does your attack ensure minimal runtime overhead?

Note that our attack is conducted in two phases: First, an attacker performs the optimization procedure in Algorithm 1 and obtains an attack string. This is done offline, before the attack is deployed. Second, the fixed attack is injected into requests sent by the user. This step only involves simple string operations with no further optimization performed, which causes very little overhead. Since our attack string is short (e.g., 5-10 tokens), it causes minimal run time overhead for LLM inference.

To demonstrate the minimal runtime overhead of our attack during deployment, we run code completion with and without attack strings injected. We observe a negligible median increase in generation time of 0.14s (2.5%) on the functional correctness dataset and 0.33s (2.2%) on the vulnerability dataset. The increase stems only from the string insertion operation and the additional tokens in the model input.

We will include this analysis in our next revision of the paper.

Clarification Request

We ask the reviewer to clarify their comment in the section Theoretical Claims. We suspect they mean that our paper "does not involve theoretical claims" instead of the written "does not involve proofs for theoretical claims."

References
[1] Z Li et al. CCTEST: Testing and Repairing Code Completion Systems. ICSE 23
[2] X Li et al. Attribution-guided Adversarial Code Prompt Generation for Code Completion Models. ASE 24
[3] Y Yang et al. TPIA: Towards target-specific prompt injection attack against code-oriented large language models. arXiv 25
[4] F Wu et al. DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions. arXiv
[5] Q Ren et al. CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion. ACL 24
[6] T Liu et al. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. ICLR 24

Reviewer Comment

Thanks for the authors' response, which I believe has addressed my main concerns. I hope the authors will add the promised content in the subsequent revised version.

At the same time, regarding Theoretical Claims, sorry for the misunderstanding. The author's understanding is correct; what I meant to express was that the paper does not involve theoretical claims, and therefore does not require proof.

I decide to raise my score to 3.

Author Comment

We are happy to hear to have addressed the points of the reviewer, and thank them for raising their score. We also thank the reviewer for clarifying their point regarding "Theoretical Claims".

Finally, we assure the reviewer that we will incorporate our rebuttal (to all reviewers) in the next revision of the paper.

Official Review
Rating: 3

The authors propose INSEC, a black-box attack on code-infilling models that adds a comment containing an adversarially optimized string right before the location of code completion. The goal of the attack is to produce functioning code that contains security vulnerabilities. The adversarial comment is initialized via five strategies. Thereafter, INSEC iteratively and randomly mutates the sequence of tokens representing the adversarial comment until a stopping criterion is met. A training dataset is used for the procedure up to this point. Then, a final adversarial comment is chosen based on a validation dataset. Hence, one can argue that INSEC searches for a universal adversarial comment. The authors demonstrate the effectiveness of their INSEC attack on 7 LLMs, without comparison to baseline attacks.

Questions for the Authors

  1. How is the compute cost on an open-source model (e.g., runtime)?
  2. How does the compute cost scale with the dataset sizes?
  3. How are the train, val, and test datasets split? Does train/val vs. test contain similar cases, like rewrites of the same question etc.?

Claims and Evidence

The work's empirical evidence largely supports the claims.

The main criticism regarding claims is the use of certain adjectives. The authors call the studied setting, e.g., "realistic" or "practical" without specifying what these terms actually mean, especially when such terms are used to suggest that the authors' work is more "realistic" (e.g., l. 382, right column). While I understand that the authors take the perspective of a hacker who has to operate under tight knowledge and resource constraints, adversarial robustness is much broader than cybersecurity. According to the seminal work of Szegedy et al. 2014 [1], adversarial robustness is characterized by the (close to) paradoxical situation that neural networks achieve great performance despite the fact that small/meaningless perturbations can (almost) arbitrarily mislead them. While a real-world adversary might be able to exploit this "intriguing property," from the perspective of a model developer who wants to understand the limitations of their model, a fully white-box setting with excessive resource use is still "practical".

[1] Szegedy et al. "Intriguing properties of neural networks" ICLR 2014

Methods and Evaluation Criteria

Yes.

Theoretical Claims

Not applicable.

Experimental Design and Analysis

The experimental design is largely sensible.

My major concerns with the experimental design are (1) the unconstrained usage of possible tokens: while many jailbreak attacks exclude non-ASCII characters (e.g., GCG in Zou et al., 2023), the authors do not seem to constrain the possible strings; and (2) the authors do not investigate a countermeasure that filters comments using a perplexity filter. These are substantial differences from the recent jailbreak literature and should be considered.

Supplementary Material

I only reviewed it superficially, and I was not able to locate the code.

Relation to Prior Work

The authors are the first to study evasion attacks via adversarial comment insertion and propose a viable universal attack that efficiently generates such adversarial comments.

Missing Important References

The authors discuss the most relevant works (known to me). However, the authors could be more explicit that random mutations are also heavily used for other adversarial attacks in the jailbreaking literature. E.g., see the recent work by Andriushchenko et al. "Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks" ICLR 2025.

Other Strengths and Weaknesses

The paper proposes a simple procedure to generate universal adversarial comments that trigger LLMs to include vulnerabilities in their code. Due to the existence of universal adversarial comments, the paper reveals a quite catastrophic vulnerability of such code completion tools. Even though the algorithmic contributions are minimal, revealing such a limitation is a meaningful contribution, although it is somewhat expected that this vulnerability exists.

A major criticism of the submission is the way the authors present INSEC. The authors should emphasize much more that INSEC functions as a proof of concept demonstrating that this vulnerability exists and that a real-world adversary currently does not even need heavy resources to exploit it. Or do the authors actually want to provide a recipe for how to attack coding "copilots"?

Other Comments or Suggestions

  1. It is a bit unusual that the authors themselves find their method "surprisingly" effective. After all, INSEC is a simple "multi-beam" random search procedure that should work well given enough resources (under mild assumptions).
  2. When providing a "cheap" procedure for attacking a model, not including "concrete optimized attack strings" (l. 455) is not a particularly strong mitigation.
  3. l. 209: the "six" should be "five", right?
Author Response

We greatly appreciate the reviewer's critical assessment and answer their questions below. We will incorporate all feedback.

Q1: Can you adjust your use of certain adjectives such as "realistic" and "practical"?

Yes. We thank the reviewer for pointing this out! We will revise the paper to clarify the meaning of these adjectives in our setting, or replace them with more concrete terms, such as "black-box knowledge" and "low resource". We will also add a discussion to clarify that, while our setting is interesting for attackers, white box settings are also practical and relevant for model developers.

Q2: Does your attack also work when restricted to ASCII characters?

Yes. We run our attack optimization excluding non-ASCII characters on GPT-3.5-Turbo-Instruct. We observe that attacks under such a constrained setting are still successful, achieving an increase of vulnerability rate from 17.1% to 73.1%, similar to 72.5% in the unconstrained setting. Meanwhile, functional correctness is preserved with passRatio@1 (@10) of 98.3% (99.9%). We will include a more detailed analysis in the next revision.
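One simple way to realize such an ASCII-only constraint is to restrict the mutation step to vocabulary tokens that decode to ASCII strings, as sketched below. The tokenizer choice is a placeholder, and this is not necessarily how the authors implemented the constrained variant.

```python
from transformers import AutoTokenizer

# Sketch of an ASCII-only candidate pool for the mutation step
# (tokenizer is a placeholder; the authors' implementation may differ).
tok = AutoTokenizer.from_pretrained("gpt2")
ascii_token_ids = [
    token_id for token_id in range(tok.vocab_size)
    if tok.decode([token_id]).isascii()
]
# During optimization, replacement tokens would be sampled only from ascii_token_ids.
```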

Q3: Can your attack be easily defended by a perplexity filter?

No. Note that our attack string is indiscriminately inserted into all user queries. A perplexity filter designed to reject security-relevant, attacked queries might also reject benign queries for functional code completion, undermining the code completion engine's utility. The necessity to maintain functional correctness is a key difference between our setting and jailbreak defenses.

To demonstrate this experimentally, we examine perplexity filters as employed by [1]. First, we choose a rejection threshold that maximizes the F1 score of detecting attacked prompts in the training and validation set of our vulnerability dataset, achieving recall of over 89% on the test set. Applying this filter on the functional correctness dataset drastically decreases correctness for benign prompts, with funcRate@1 (@10) of less than 29.8% (29.4%), rendering the defense impractical for completion engine providers. Second, when setting the threshold to the maximum perplexity among benign prompts, ensuring no decrease in correctness, the recall of detecting the attack drops to 0%.
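For reference, a minimal sketch of the kind of perplexity filter examined here is shown below; the scoring model (GPT-2) and the thresholding logic are assumptions made for illustration, not the exact setup of [1] or of our experiment.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal perplexity-filter sketch (scoring model and threshold are assumptions).
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean per-token cross-entropy
    return torch.exp(loss).item()

def reject_prompt(prompt: str, threshold: float) -> bool:
    # Reject the completion request if the prompt perplexity exceeds the threshold.
    return perplexity(prompt) > threshold
```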

Q4: What is the cost of the attack optimization on open-source models?

The optimization phase of our attack requires around 6 hours to find a highly effective string on commercial GPUs. Assuming a cost between USD 1 and USD 2 per GPU-hour [2,3], this results in an estimated cost of USD 6 to 12, a similar cost per attack as reported for the commercial black-box models.

Q5: How does the compute cost scale with the dataset sizes?

Overall, the cost of the optimization is $O(n \cdot |\mathbf{D}_{\text{vul}}^{\text{train}}| + |\mathbf{D}_{\text{vul}}^{\text{val}}|)$ for $n$ steps. We refer the reviewer to Alg. 1 and are glad to provide more detail if desired.
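As a rough illustration of this scaling with made-up numbers (the actual split sizes and step count are given in the paper):

```python
# Illustrative query count under the stated complexity, with hypothetical sizes.
n_steps, train_size, val_size = 20, 40, 10
total_model_queries = n_steps * train_size + val_size  # 810 in this toy example
```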

Q6: How are the train, validation, and test splits of the vulnerability datasets related?

The splits are entirely independent. After sourcing our test samples as detailed in L. 253 ff., we split them randomly into train, validation, and test sets. This design ensures that there is no strong similarity between the splits.

For example, for CWE-078 (OS Injection), one training sample is an independent method that executes a local binary and analyzes its output line by line [4]. The validation sample is a two-method application for building a Rust project [5]. In the test set, a method calling the ping command is exposed via Flask [6].

Q7: Please discuss related work that uses random search to optimize jailbreak attacks.

We thank the reviewer for this suggestion! We will add related jailbreaking work to our discussion, in particular the mentioned [7]. Note that there are fundamental differences between our work and jailbreaks, such as the resources during deployment, and the requirement to maintain functional correctness for benign queries.

For example, in [7], a jailbreak prompt is optimized by leveraging initialization and random search, similar to our work. However, the resulting prompt is very long, making it unsuitable for code completion, and it is not analyzed for its impact on benign queries (cf. App. D, "Number of Attack Tokens").

We will adapt our paragraph about the "surprising effectiveness" of INSEC to highlight the unexpected, but relevant, brevity of attack strings and preservation of functional correctness.

References
[1] J Geiping et al. Baseline Defenses for Adversarial Attacks Against Aligned Language Models. arXiv
[2] https://lambda.ai/service/gpu-cloud#pricing
[3] https://datacrunch.io/products#A100
[4,5,6] Supplementary Material:

  • sec-gen/data_train_val/main_data/cwe-078_py/10.py
  • sec-gen/data_train_val/main_data/cwe-078_py/13.py
  • sec-gen/data_test/cwe-078_py/1.py

[7] M Andriushchenko et al. Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. ICLR 25

Official Review
Rating: 3

The paper introduces INSEC, a black-box attack on LLM code completion, to bias these LLMs to generate insecure code at a higher rate. The attack works by injecting an attack string as a short comment into the completion input, with the comment created through a query-based optimization procedure. INSEC was evaluated on various open-source models and commercial black-box services like GitHub Copilot. The results show an increase of 50% in the rate of generated insecure code.

Questions for the Authors

Claims and Evidence

The evaluation of applicability, effectiveness, and practicality is supported by testing against a number of popular LLMs and using popular security weaknesses.

I wonder whether the results would be different with top of the line coding LLMs such as Claude 3.7 and DeepSeek R1.

Methods and Evaluation Criteria

HumanEval is the main benchmark used to evaluate functional correctness in the paper. I have doubts about whether HumanEval is still relevant for code-completion LLMs nowadays, both because it is likely part of the training data and because its tasks are not representative of current software projects and software tasks.

RepoBench (tested in Appendix D) alleviates some of this concern, but it is only used for functional correctness and not attack success measurement.

Theoretical Claims

Experimental Design and Analysis

Supplementary Material

The choice of vulnerability rate and pass@k as evaluation metrics is appropriate. Using CodeQL for vulnerability assessment is a plus, as it provides a standardized and automated way to detect vulnerabilities.
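For completeness, the standard unbiased pass@k estimator introduced by Chen et al. (2021) and used throughout this line of work can be computed as follows:

```python
import numpy as np

# Unbiased pass@k estimator (Chen et al., 2021): n samples drawn, c of them
# pass all unit tests, k is the evaluation budget.
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```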

It would be beneficial to explore the multi-CWE attack further, as the results show a noticeable loss in functional correctness in this case. What is the cause of this drop?

Relation to Prior Work

Missing Important References

Other Strengths and Weaknesses

Other Comments or Suggestions

Author Response

We thank the reviewer for their insightful comments and discuss their questions below. We will incorporate their feedback into the next revision of the paper.

Q1: Can you please clarify the evaluation setting?

We highlight that we have two separate datasets for the evaluation of vulnerability rate and functional correctness.

For the evaluation of vulnerability rate, we construct a dataset based on code snippets sourced from Pearce et al. [1], real-world GitHub repositories, and GPT-4 generations. This results in a realistic, state-of-the-art dataset for assessing code generation vulnerability.

For the evaluation of correctness, we leverage HumanEval [2]. As the reviewer pointed out, HumanEval consists mostly of algorithmic, self-contained tasks. To include functional correctness results on more realistic, repository-level settings, we also evaluate on RepoBench [3] in Appendix D.

Q2: Would the results be different for chat models like Claude 3.7 and DeepSeek R1?

We do not expect results to be largely different for recent chat models. Concurrent work [4] evaluated these models in chat-applicable settings and found that they are highly likely to produce insecure code. Meanwhile, the INSEC attack is designed for Fill-in-the-Middle completion, the format used in IDE-integrated code assistants. We evaluated INSEC against the state of the art and industry standard in this domain, Copilot, and successfully broke its self-claimed safeguards that actively prevent vulnerable completions [5]. We also evaluate other state-of-the-art open-source completion models and the latest completion-API-compatible model by OpenAI, GPT-3.5-Turbo-Instruct.

Q3: What is the cause of this drop in functional correctness in the Multi-CWE ablation study?

In our Multi-CWE ablation study, we combine attacks for different CWEs by concatenating attack strings that were optimized for the individual CWEs. We observe that combining attacks for more CWEs leads to both a slight decrease in vulnerability and a slight decrease in functional correctness.

We believe the decreased functionality in this simple experiment is mainly due to the approach of concatenating attack strings, which obfuscates the intention of the user and the code. We observe a similar trend in our ablation study in Appendix D (Figure 11b), where longer attack strings for single CWEs also lead to decreased functional correctness.

We expect this effect can be avoided by training a single short attack string, adapting the optimization to target several CWEs at once. We thank the reviewer for this observation and will discuss it in the next revision of the paper.

References
[1] H Pearce et al. Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions. IEEE S&P 22
[2] M Chen et al. Evaluating Large Language Models Trained on Code. arXiv
[3] T Liu et al. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. ICLR 24
[4] M Vero et al. BaxBench: Can LLMs Generate Correct and Secure Backends? arXiv
[5] GitHub Blog: Filtering out security vulnerabilities with a new AI system at https://github.blog/ai-and-ml/github-copilot/github-copilot-now-has-a-better-ai-model-and-new-capabilities/#filtering-out-security-vulnerabilities-with-a-new-ai-system

Final Decision

The authors propose a new black-box adversarial attack, INSEC, which injects an attack string as a comment into the code completion input. The attack string is constructed by a query-based optimization procedure. The authors evaluated a diverse set of LLMs and black-box commercial code generators. INSEC increases the rate of generated insecure code by more than 50% while being cost-effective on common hardware. The authors demonstrated the practical impact of the method with an IDE plug-in that injects INSEC into the GitHub Copilot extension.

There are some minor concerns raised by reviewers:

  • The attack allows the use of all possible tokens without excluding non-ASCII characters. The authors responded in the rebuttal with new results when constraining to ASCII characters.
  • The current results are mainly on single-file code completion. In practice, the completion can be done on cross-file code completion with a repo as a context. The authors added supplementary results for repo-level code completion in Appendix D.
  • Missing some related work, e.g., attack optimization via random search.

Overall, the proposed approach is simple yet effective and the authors have addressed most of the concerns during the rebuttal. I suggest the authors integrate all the feedback into the final revision.