PaperHub

Rating: 4.9/10 · Spotlight · 4 reviewers (lowest 2, highest 3, std. dev. 0.4; individual ratings: 3, 2, 3, 3)

ICML 2025

BaxBench: Can LLMs Generate Correct and Secure Backends?

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

This paper introduces a novel benchmark to measure the correctness and security of LLM-generated code for backend applications.

Abstract

Keywords
large language model, large language models, LLM, code generation, code security, security, benchmark

Reviews and Discussion

Official Review (Rating: 3)

This paper introduces BAXBENCH, a novel benchmark for evaluating large language models' (LLMs) capabilities in generating correct and secure backend applications. The benchmark consists of 392 tasks spanning 28 scenarios implemented across 14 popular backend frameworks in 6 programming languages. BAXBENCH evaluates two key aspects: (1) functional correctness through comprehensive test cases and (2) security through end-to-end exploitation attempts. The authors evaluate 10 state-of-the-art LLMs, including flagship models like OpenAI's o1 and Claude 3.5 Sonnet. The results reveal significant limitations: even the best-performing model (OpenAI's o1) achieves only about 60% on functional correctness, and more than half of the functionally correct programs generated by LLMs were found to be vulnerable to security exploits. Performance further degrades when using less popular backend frameworks. The authors position BAXBENCH as a rigorous evaluation of LLMs' ability to generate deployment-ready code that is both functionally correct and secure.

Questions for the Authors

Given that SWE-Lancer also evaluates real-world software engineering capabilities using end-to-end tests, how do you position BAXBENCH's contribution relative to SWE-Lancer?

Claims and Evidence

  1. LLMs struggle with backend generation tasks: The evaluation across multiple models shows that even top models achieve at most 60% on functional correctness, providing strong evidence for this claim.
  2. Security is a major concern: The authors demonstrate that over half of functionally correct solutions are vulnerable to security exploits, supporting their claim that security evaluation is critical and current LLMs are not yet ready for autonomous coding.
  3. Framework familiarity matters: The results clearly show performance variations across different frameworks, with models performing better on popular frameworks and languages.

Methods and Evaluation Criteria

  1. Task diversity: The 28 scenarios across 14 frameworks provide good coverage of real-world backend development tasks.
  2. Dual evaluation: Assessing both functional correctness and security is important, as both aspects are critical for deployment-ready code.
  3. Metrics: The pass@k and sec_pass@k metrics are appropriate for measuring both correctness and security (see the sketch after this list).
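
For readers unfamiliar with these metrics, the sketch below shows the standard unbiased pass@k estimator (Chen et al., 2021); sec_pass@k is assumed here to use the same estimator while counting only samples that are both functionally correct and withstand all exploits. The exact aggregation used in the paper may differ, so treat this as an illustration rather than the authors' implementation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples drawn from n generations passes, given that c
    of the n generations pass."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def sec_pass_at_k(n: int, c_correct_and_secure: int, k: int) -> float:
    """Assumed sec_pass@k: the same estimator, but a generation only counts
    if it is both functionally correct and resists every exploit."""
    return pass_at_k(n, c_correct_and_secure, k)

# Illustrative numbers (not from the paper): 10 generations for one task,
# 6 pass the functional tests, 3 of those also withstand all exploits.
print(pass_at_k(10, 6, 1))      # ~0.6
print(sec_pass_at_k(10, 3, 1))  # ~0.3
```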

Theoretical Claims

The paper does not make theoretical claims requiring formal proofs. The work is primarily empirical in nature.

Experimental Design and Analysis

The experimental design is rigorous and appropriate:

  1. Model selection: The authors evaluate 10 diverse state-of-the-art LLMs, including both open and closed-source models.
  2. Task framework: The evaluation setup tests models on realistic backend development tasks requiring both functional correctness and security.
  3. Prompt variations: The comparison between functionality-only and functionality+security prompts provides insights into how explicit instructions affect performance.
  4. Exploit verification: The security exploits were iteratively refined on both LLM-generated and human-written solutions, increasing their reliability.
  5. Analysis depth: The authors analyze performance variations across frameworks, languages, and scenario complexity, providing nuanced insights.

One minor concern is that the benchmark's exploits might not cover all possible security vulnerabilities, potentially making the sec_pass@k an overestimate of true performance. However, the authors acknowledge this limitation explicitly, stating their measured metrics provide an upper bound.

Supplementary Material

Yes, it contains the scaffolding code of the dataset.

Relation to Prior Literature

The paper positions itself well within the existing literature on LLM code generation capabilities. It addresses limitations in existing benchmarks like HumanEval, which focus on function-level code or algorithmic tasks rather than end-to-end applications.

Missing Important References

The paper has good coverage of relevant literature, but there's one significant recent work that should be discussed:

  • SWE-Lancer (Miserendino et al., 2025): This benchmark evaluates LLMs on real-world freelance software engineering tasks worth $1 million in actual payouts. While BAXBENCH focuses specifically on backend applications, SWE-Lancer takes a broader approach to full-stack engineering with end-to-end tests. A comparison would strengthen the paper, as both aim to evaluate deployment-ready code generation capabilities.

BAXBENCH does have some unique advantages over SWE-Lancer, particularly in its security-focused evaluation and the systematic coverage of multiple backend frameworks. Acknowledging this related work and positioning BAXBENCH relative to it would strengthen the paper's contribution claims.

Other Strengths and Weaknesses

BAXBENCH makes valuable contributions by focusing on backend applications and combining functional correctness with security evaluation. This focus is distinct from other benchmarks and addresses important real-world concerns. Compared to SWE-Lancer, BAXBENCH offers several unique advantages:

  • Dedicated focus on backend applications, which are security-critical
  • Systematic evaluation of security vulnerabilities through real exploits

However, SWE-Lancer does offer some complementary strengths that BAXBENCH could learn from:

  • Mapping to real economic value through actual freelance payments
  • Broader coverage of full-stack engineering tasks
  • End-to-end tests validated by professional engineers

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for their insightful review and overall positive assessment of our paper. We address their questions below.

Q1: Can you discuss how SWE-Lancer relates to BaxBench?

We thank the reviewer for pointing us to SWE-Lancer and will gladly add a discussion of it in the next revision of our paper. However, we would like to highlight that the first public copy of this paper was only made available one month after the ICML submission deadline. Thus, it was impossible for us to include it in the submission.

First and foremost, the key difference between BaxBench and SWE-Lancer is that BaxBench performs security evaluation, an aspect that is not explicitly considered by SWE-Lancer, and especially not using practical exploits.

Regarding the other differences between the papers, we largely agree with the reviewer's analysis, except that: (i) BaxBench, like SWE-Lancer, also uses end-to-end tests verified by professional software engineers for functionality testing (and not only function-level unit tests as in other benchmarks), and (ii) BaxBench's backend tasks, like SWE-Lancer's, are also highly diverse; in fact, while SWE-Lancer sources all of its tasks from the Expensify repository, making all tasks correlated, BaxBench scenarios are highly diverse and constructed independently from scratch.

Finally, BaxBench is fully contamination-free, as the scenarios have been constructed manually from scratch. In contrast, SWE-Lancer's tasks stem from an existing open-source repository that could very well have been in the training data of the models.

We will include this discussion in the next revision of the paper.

Official Review (Rating: 2)

This paper introduces a benchmark for assessing large language models' abilities to generate application backend code. The benchmark comprises 28 distinct scenarios across 14 popular backend frameworks that specify application requirements, API specifications, environment instructions, and database needs. Evaluation occurs through both functional testing and security vulnerability assessment. Each scenario is equipped with a set of security exploits targeting around 3.3 CWEs on average. The authors extensively evaluate the performance of many popular LLMs, including both closed and open models.

Questions for the Authors

N/A

Claims and Evidence

The authors claim that current leading models struggle with both correctness and security in backend code generation. Their evidence shows that, in many cases, nearly half of the functionally correct solutions contain security vulnerabilities. They demonstrate that security-focused prompting helps reduce vulnerabilities but doesn't eliminate them completely.

Methods and Evaluation Criteria

The methodology involves presenting LLMs with detailed application specifications and evaluating the generated code through automated functional tests. For security assessment, they develop scenario-specific exploit code targeting common weakness enumerations (CWEs). This evaluation approach provides tests for both functionality and security.
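
To make this evaluation approach concrete, here is a minimal, hypothetical sketch of what such an end-to-end exploit check could look like, illustrated for CWE-89 (SQL injection) against a login endpoint. The base URL, route, payload, and status-code convention are assumptions for illustration only; they are not taken from the benchmark's actual exploit code.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed address of the generated backend under test

def exploit_sql_injection_login() -> bool:
    """Hypothetical end-to-end exploit for CWE-89: the attempt counts as
    successful if a classic injection payload authenticates without valid
    credentials."""
    payload = {"username": "admin' OR '1'='1' --", "password": "irrelevant"}
    resp = requests.post(f"{BASE_URL}/login", json=payload, timeout=5)
    # A secure implementation should reject this request (e.g., 401);
    # a 200 response indicates the injected condition bypassed authentication.
    return resp.status_code == 200

if __name__ == "__main__":
    print("exploited" if exploit_sql_injection_login() else "not exploited")
```

Conceptually, a generated backend would then be flagged as insecure as soon as any of its scenario's exploit attempts succeeds.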

Theoretical Claims

N/A: The paper proposes a benchmark for LLM code generation in backend application scenarios.

Experimental Design and Analysis

The experimental design covers 14 popular backend frameworks, providing broad coverage of real-world development environments. The authors designed scenario-specific exploits that test for specific CWEs.

Supplementary Material

The appendix provides more details about the benchmark dataset and the specific scenarios.

Relation to Prior Literature

This work contributes to the growing body of research on LLM code generation capabilities and limitations, with a particular focus on the critical area of backend security that has been underexplored in previous benchmarks. This is especially important as many LLM users today may deploy LLM-generated backend application code in the wild.

Missing Important References

The related work is discussed.

Other Strengths and Weaknesses

Strengths: The benchmark addresses a practical and important use case as more developers rely on LLMs for backend code generation. The security focus is particularly valuable given the high stakes of vulnerabilities in production systems when users deploy LLM-generated code in the real world.

Weaknesses: The evaluation could be expanded to include LLM agents, which are increasingly used for development tasks and are already widespread (e.g., Windsurf or Claude Code). This would provide a more complete picture of how current AI tools perform in real-world development scenarios. Additionally, the number of security exploits per scenario appears limited, which may not comprehensively capture the full range of potential vulnerabilities that could exist in real-world applications.

Other Comments or Suggestions

Consider extending this work to evaluate how LLM agents perform on these tasks, as they represent an increasingly common development approach. Please also consider extending the set of CWEs tested per scenario.

Author Response

We thank the reviewer for their insightful and constructive review and address their questions below.

Q1: Can you evaluate coding agents?

Upon the reviewer’s request, we tested the most advanced open-source general coding agent, OpenHands (OH) powered by GPT-4o and Claude 3.5 Sonnet.

We use our testing environments for the agent; however, we found that our Python-Django and Compiler environments are incompatible with the OH sandbox base image. Therefore, we exclude this framework and scenario. We compare the agent performance against the base models on the remaining 351 tasks:

Model                  | sec_pass@1 | pass@1 | sec_pass@1 (+Security Oracle Prompt) | pass@1 (+Security Oracle Prompt)
Claude 3.5 Sonnet      | 31.4       | 52.4   | 34.3                                 | 39.7
Claude 3.5 Sonnet + OH | 31.6       | 59.3   | 38.2                                 | 44.2
GPT-4o                 | 21.0       | 43.7   | 27.6                                 | 36.2
GPT-4o + OH            | 16.6       | 38.1   | 23.2                                 | 33.5

Surprisingly, we see that the agent only provides an improvement on Claude 3.5, with GPT-4o (August '24 version) performing worse in the agentic setting: the weaker model is overwhelmed by the agentic framework, making more functional mistakes. However, even for Claude, the improvement is not as drastic as one may expect (at most 6.9%). Even the agent struggles strongly with BaxBench tasks, especially when it comes to security. We hypothesize that this is partially due to the limitations of the underlying LLMs and the nature and complexity of the task at hand, which differs from typical agent benchmarks focused on working in large pre-existing repositories (here, the agent needs to build from scratch, use different frameworks, write its own tests, etc.). This shows once again that fundamental development towards secure backend coding is needed.

We thank the reviewer for the interesting suggestion, which will make a valuable addition to the next revision of the paper.

Note: We did not test agentic IDEs such as Cursor and Windsurf, as they are not suited for large-scale experiments. Further, we did not test Claude Code due to budget limitations. We will release the BaxBench code to enable independent evaluations of Claude Code.

Q2: Please comment on the CWE coverage per scenario.

To address the reviewer’s question, we conduct a deeper investigation into CWE coverage.

First, we compare the CWEs that we test for with the CWEs reported by the SOTA SAST tool, Snyk Code, on OpenAI o1's correct solutions. This analysis allows us to assess the comprehensiveness of our exploit attempts (i.e., the CWE coverage) per scenario. We find that for 24 of the 28 scenarios, we test for all CWEs that Snyk reports and further ones beyond. For the remaining 4 scenarios, Snyk and our tests overlap on all but one CWE. Looking at these cases, we see that the difference stems from Snyk testing for CWE-400 (uncontrolled resource consumption), whereas we test for CWE-703 (improper error handling), raised by crashing the server, which includes crashes due to resource overconsumption. Note also that Snyk raises CWE-400 for rate-limit issues, which are usually handled on an architectural level [2] and not considered relevant on the application level by us.

Beyond scenario-level exploit coverage, we also investigate how many insecure programs we actually catch compared to Snyk. For this, we examine the Snyk reports on the 237 correct programs generated by OpenAI o1 and compare them to our exploit reports. We find that Snyk misses 64 (27%) vulnerable programs (i.e., programs against which we could execute real exploits), while marking 25 (10.55%) programs as vulnerable that were not flagged by our exploits. Manually analyzing these 25 programs, we find that 16 are false positives (6.75%) and 2 concern rate limits, which are architecture-level concerns, amounting to 7 (2.95%) correct additional flags across 3 CWEs. Note that among these 7, we actually test for each of the raised CWEs; our exploit attempts merely do not succeed. We will extend our exploit vectors accordingly.
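
The bookkeeping behind these percentages can be summarized as a small tally over the 237 correct programs; the helper below is an illustrative sketch under that assumption (function name and data layout are hypothetical, not the authors' analysis code).

```python
def compare_exploits_to_sast(exploited: set[str], sast_flagged: set[str],
                             total: int) -> dict[str, float]:
    """Tally how SAST flags line up with end-to-end exploit results,
    treating a successful exploit as ground truth for 'vulnerable'."""
    missed_by_sast = exploited - sast_flagged  # SAST false negatives
    sast_only = sast_flagged - exploited       # needs manual review: FPs or missed exploits
    return {
        "sast_false_negative_rate": len(missed_by_sast) / total,
        "sast_only_rate": len(sast_only) / total,
    }

# With the reported numbers: 64 / 237 ~ 27% missed by Snyk,
# 25 / 237 ~ 10.6% flagged only by Snyk (of which 16 were false positives).
```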

Our analysis shows two things: (i) our exploit attempts target a comprehensive set of CWEs per scenario; and (ii) our exploits are far more reliable than static analysis for actually detecting insecure code (Snyk: 6.75% false positives and 27% false negatives).

Overall, we believe the above is strong evidence showing that BaxBench’s security coverage is both extensive and largely comprehensive. On a benchmark level, as also detailed in Table 4 in the appendix, we cover the most likely and critical security issues, based on the popular MITRE Top 25 and OWASP Top 10. Therefore, our benchmark is suited for supporting our key message of the paper—current models’ code is dangerously insecure.

References
[1] Yang et al., SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI. arXiv 2024.
[2] Serbout et al., API Rate Limit Adoption -- A pattern collection, EuroPLoP 23.

Official Review (Rating: 3)

This paper introduces BaxBench, a benchmark for evaluating LLMs' ability to generate functionally correct and secure backends. It evaluates LLMs across 28 scenarios and 14 frameworks and shows that generating secure and correct backends is still challenging.

Questions for the Authors

Do you have any suggestions for improving models' ability on BaxBench, and their ability to generate secure code in general?

Claims and Evidence

Yes, I find the claims in the submission clear and convincing.

Methods and Evaluation Criteria

I find the evaluation pipeline a bit insufficient.

  1. The number of scenarios is 28, which seems too few to me. I understand that by multiplying it with the number of frameworks, there are 392 tasks. However, I think the ability to deal with semantically different scenarios is also important and should be measured with more scenarios.

  2. Lack of evaluation of agentic approaches. I understand that the authors are trying to evaluate the ability of LLMs. But lots of evidence shows that agentic approaches can be much better at multi-file and larger-scale software engineering tasks. It would be great if the authors can provide some insights into how agentic approaches perform on BaxBench.

Theoretical Claims

N/A

Experimental Design and Analysis

I find it sound.

Supplementary Material

Yes. I took a closer look at the calculator example in the appendix.

Relation to Prior Literature

Prior scientific literature mostly focuses on either security or functionality, but not both. This paper tries to close that gap by providing a set of realistic backend development tasks that also have security implications.

Missing Important References

No.

Other Strengths and Weaknesses

I think the evaluation pipeline and the data curation pipeline could be useful for future benchmark development.

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for their insightful and overall positive review, and address their questions below.

Q1: Do the scenarios provide sufficient semantic diversity?

Yes. The programs have to handle files, databases, access controls, OS commands, and external binaries (e.g., compilers, png, and pdf tools). The large spread of the models' performances reflects this diversity, while the complexity is evidenced by the low overall correctness across all models. However, the ultimate goal of BaxBench is to measure the LLMs' performance on security-critical backend tasks, with other functionality aspects (e.g., algorithmic complexity) already covered by existing benchmarks. Therefore, we do not claim that it covers all possible backend coding tasks, but it does cover most security-relevant tasks for web application backends. For a discussion of security coverage, we refer to the reply to Q2 of reviewer WvVR.

Further, we believe that in comparison to current standard functional correctness and especially security benchmarks, BaxBench's semantic diversity and volume are highly competitive:

  • SWE-Bench and SWT-Bench [1,2] (agent-functionality) are based on only 12 Python repositories.
  • SWE-Lancer [3] (functionality) is based on a single TypeScript repository.
  • SafeCoder [4] (function-level, security-only) provides 42 security-only evaluation cases across 6 programming languages (avg. 7 per language).
  • SVEN [5] (function-level, security-only) has 24 security-only evaluation cases across 2 languages (avg. 12 per language).
  • CWEval [6] (function-level, security + correctness) contains 119 tasks across 5 languages, translating examples from 25 core tasks.

We thank the reviewer for their insightful question, and will add this discussion in the next revision.

Q2: Can you evaluate coding agents?

Upon the reviewer’s request, we tested the most advanced open-source general coding agent, OpenHands (OH) powered by GPT-4o and Claude 3.5 Sonnet.

We use our testing environments for the agent, excluding the Python-Django and Compiler environments as these are incompatible with the OH base image. We compare the agent performance against the base models on the remaining 351 tasks:

Model                  | sec_pass@1 | pass@1 | sec_pass@1 (+Security Oracle Prompt) | pass@1 (+Security Oracle Prompt)
Claude 3.5 Sonnet      | 31.4       | 52.4   | 34.3                                 | 39.7
Claude 3.5 Sonnet + OH | 31.6       | 59.3   | 38.2                                 | 44.2
GPT-4o                 | 21.0       | 43.7   | 27.6                                 | 36.2
GPT-4o + OH            | 16.6       | 38.1   | 23.2                                 | 33.5

Surprisingly, we see that the agent only provides an improvement on Claude 3.5, with GPT-4o (August '24 version) performing worse in the agentic setting: the weaker model is overwhelmed by the agentic framework, making more functional mistakes. However, even for Claude, the improvement is not as drastic as one may expect (at most 6.9%). Even the agent struggles strongly with BaxBench tasks, especially when it comes to security. We hypothesize that this is partially due to the limitations of the underlying LLMs and the nature and complexity of the task at hand, which differs from typical agent benchmarks focused on working in large pre-existing repositories (here, the agent needs to build from scratch, use different frameworks, write its own tests, etc.). This shows once again that fundamental development towards secure backend coding is needed.

We thank the reviewer for the interesting suggestion, which will make a valuable addition to the next revision of the paper.

Note: We did not test agentic IDEs such as Cursor and Windsurf, as they are not suited for large-scale experiments. Further, we did not test Claude Code due to budget limitations. We will release the BaxBench code to enable independent evaluations of Claude Code.

Q3: How could models be improved on BaxBench?

Please see the answer to Q1 of reviewer p9gW for a general discussion of the errors the models make on BaxBench. To improve on these errors, we believe incorporating code security during model training is crucial. For example, during post-training, a small amount of high-quality data could be utilized to steer the base model towards secure code, similarly to [4,7]. We will discuss these learnings and future directions in the next revision of our paper.

References
[1] Jimenez et al., SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. ICLR 2024.
[2] Mündler et al., SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents. NeurIPS 2024.
[3] Miserendino et al., SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?. arXiv 2025.
[4] He et al., Instruction Tuning for Secure Code Generation. ICML 2024.
[5] He & Vechev. Large language models for code: Security hardening and adversarial testing. CCS 2024.
[6] Peng et al., CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation. LLM4Code@ICSE’25.
[7] Xu et al., ProSec: Fortifying Code LLMs with Proactive Security Alignment. arXiv 2025.

Official Review (Rating: 3)

This paper introduces BAXBENCH, a benchmark to evaluate LLM-based generation of correct and secure backend applications. Functionality of generated code is validated through testing, while security is evaluated through end-to-end exploits. The authors evaluated 10 LLMs on BAXBENCH and found that even the best model achieves only 60% on code correctness. Exploits were successful on more than half of the correct programs.

Questions for the Authors

Claims and Evidence

While the proposed benchmark shows that various LLMs have varying levels of success, thus achieving its key function as a benchmark, I am wondering how useful the benchmark is by itself, since it does not help attribute the success to particular factors (e.g., framework, complexity of task). For example, given the observation from Fig. 4 that "The model struggles more with less popular programming languages and multifile frameworks", how would one use the results from the benchmark to guide improvements to a model? Should the model be trained on particular frameworks? On particular tasks?

Methods and Evaluation Criteria

Yes, the proposed methods and evaluation criteria are suitable for evaluating the generation of correct and secure backend code.

Theoretical Claims

Experimental Design and Analysis

The experimental setup is well-defined, and the results are presented in a clear and detailed manner. The statistical analyses used, such as pass@k and sec_pass@k, are popular metrics in the field.

There are a few issues to consider:

  • Complexity of scenarios: The scenarios in BAXBENCH might not cover all possible types of backend applications or security vulnerabilities. It would be good to go into more depth about the selection process and how representative these scenarios are. Additionally, measuring scenario complexity by the length of the scenario specification can be misleading.

  • Evaluation of security: Some vulnerabilities may be missed when testing against a fixed set of exploits. Static code analysis could complement this approach.

Supplementary Material

Appendices A-D

Relation to Prior Literature

This benchmark may prove useful to future whole-app generation efforts.

Missing Important References

Other Strengths and Weaknesses

  • Is there any guidance on how "resilient" this benchmark may be to memorization in future LLMs?

Other Comments or Suggestions

Author Response

We thank the reviewer for their insightful review and overall positive assessment. We address their questions below.

Q1: How can models be improved on BaxBench?

To understand functionality challenges in BaxBench, we manually investigate 20 incorrect programs generated by OpenAI o1, and find that the model often fails on trivial, boilerplate tasks, such as adhering to the requirements set in the API specification, following formats and response codes, handling files, setting CLI flags, or producing compilable and executable code. The simplicity of these errors is surprising given the success of LLMs on algorithmic benchmarks. We believe this is due to the focus on algorithmic coding performance in model development and the prioritization of the most popular languages.

Further, models often produce vulnerable code. Meanwhile, when prompted with potential vulnerabilities, the models' vulnerability rates decrease, at the cost of functional correctness. Reasoning models' correctness decreases much less. This crucial observation highlights the capacity of reasoning models to generate correct and secure code.

In terms of concrete improvements, we believe the gathered insights can be used in the post-training phase, where a small amount of high-quality data could be utilized to steer the base model towards secure code, similarly to [2,3]. The lack of such considerations has already led to commercially deployed exploitable applications [1].

Q2: Can you provide more details on the coverage and complexity of BaxBench scenarios in terms of backend tasks and security?

BaxBench contains highly diverse scenarios, requiring correct handling of files, databases, access controls, OS commands, and external binaries (e.g., png and pdf tools). Regarding complexity, we do not aim to measure algorithmic difficulty, already covered by other benchmarks (e.g., HumanEval), but rather application logic, e.g., registration and login, handling server communication, or storing of user messages in databases. The large spread of models’ correctness over scenarios highlights the variety in our tasks.
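
As an illustration of the kind of application logic being tested, below is a hypothetical end-to-end functional test in the spirit of the scenarios described above (registration, login, and per-user message storage). The routes, status codes, and response shapes are assumptions made for this sketch and do not reproduce BaxBench's actual test suites.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed address of the generated backend under test

def test_register_login_store_message():
    """Hypothetical functional flow: register a user, reject a duplicate
    registration, log in, store a message, and read it back."""
    user = {"username": "alice", "password": "s3cret-pass"}

    # Registration succeeds once and is rejected for a duplicate username.
    assert requests.post(f"{BASE_URL}/register", json=user, timeout=5).status_code == 201
    assert requests.post(f"{BASE_URL}/register", json=user, timeout=5).status_code == 409

    # Login returns a token used for authenticated requests.
    token = requests.post(f"{BASE_URL}/login", json=user, timeout=5).json()["token"]
    headers = {"Authorization": f"Bearer {token}"}

    # Stored messages must be retrievable by the same user.
    requests.post(f"{BASE_URL}/messages", json={"text": "hi"}, headers=headers, timeout=5)
    messages = requests.get(f"{BASE_URL}/messages", headers=headers, timeout=5).json()
    assert any(m["text"] == "hi" for m in messages)
```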

In terms of security coverage, we cover the most impactful security vulnerabilities, collected in the MITRE Top 25 and OWASP Top 10 lists. We also achieve high security coverage on a scenario level. We verified this by comparing against the industry-leading SAST tool Snyk Code on a model different from the one we used for development. More details are given in Q3.

Q3: Could static analysis complement BaxBench’s exploits?

To address the reviewer's question, we conduct an experiment comparing our exploits to the industry-leading SOTA SAST tool Snyk Code on the 237 correct programs produced by OpenAI's o1. We find that Snyk misses 64 vulnerable programs exploited by our tests, i.e., it has a false negative rate of at least 27%, while marking 25 (10.55%) programs as vulnerable that we did not exploit. Through manual analysis we find that 16 (6.75%) of these are false positives and 2 concern rate limits, which are often handled outside of the application, amounting to 7 (2.95%) correct additional flags across 3 CWEs. Note that even for these 3 CWEs, we already make exploit attempts; our attack inputs merely do not succeed.

Due to the high number of false negatives and false positives, we consider SAST tools unsuitable for benchmarking. However, we consider them useful for guiding exploit design and will add test cases for the 7 correctly flagged vulnerabilities.

In addition, SAST tools limit reproducible benchmarking, as the best tools keep constantly changing, are not open-source, and introduce a dependency on external tool providers.

Our manual exploits cover the most important threat vectors, and are clearly sufficient to show that current models’ code is strongly subpar—supporting our key message. Moreover, they guarantee the absence of false positives.

Q4: How resilient is this benchmark against contamination in future LLMs?

Once a benchmark is public, we believe the only guaranteed way to avoid contamination is to continuously update the benchmark. As stated in the paper, we plan to continuously add new scenarios and frameworks to BaxBench.

Note that future models are less likely to be contaminated by BaxBench: (i) our tasks are not based on existing code (unlike, e.g., SWE-bench [4]); (ii) accidental contamination is difficult as we do not release golden solutions, publishing only the execution framework and the prompts; and (iii) malicious contamination requires careful curation of such golden solutions.

We thank the reviewer for the thoughtful question, and will add a discussion in the next revision.

References
[1] https://x.com/tedx_ai/status/1901640901505827148
[2] He et al., Instruction Tuning for Secure Code Generation. ICML 2024.
[3] Xu et al., ProSec: Fortifying Code LLMs with Proactive Security Alignment. arXiv 2025.
[4] Jimenez et al., SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. ICLR 2024.

Final Decision

This work introduces a benchmark for evaluating the ability of LLMs to generate correct and secure backends. It also provides an evaluation of 10 LLMs on this benchmark both in terms of functionality and security. Overall, this seems like a useful and timely benchmark and I recommend its acceptance.

I would kindly ask the authors to incorporate the rebuttal comments, especially the coding agent setup and evaluation in their final version.