PaperHub
Overall: 7.3/10 · Poster · NeurIPS 2025
4 reviewers — ratings 4, 5, 5, 4 (mean 4.5; min 4, max 5, std 0.5)
Reviewer averages — Novelty: 2.3 · Quality: 2.3 · Clarity: 3.0 · Significance: 2.5

Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization

OpenReview · PDF
Submitted: 2025-04-05 · Updated: 2025-10-29
TL;DR

This paper introduces an iterative optimization framework that leverages online execution feedback to improve code efficiency.

Abstract

Keywords
Code Generation · Software Engineering

Reviews and Discussion

Review (Rating: 4)

This paper introduces an Iterative Optimization Framework (IOF) for improving code efficiency at test time using feedback-driven refinement. In each iteration, a language model receives the current version of the code and execution-based feedback (e.g., peak memory usage, latency, total memory) from a sandboxed runtime environment, then generates an improved version of the code. The authors evaluate various training strategies—Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO)—for their ability to adapt under the IOF setting. They further release Venus, a new dataset consisting of programming problems and associated solution variants with varying efficiency profiles.
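For concreteness, the closed loop described above can be sketched as follows. This is only an illustrative reading of the framework: `generate_improved_code` and `run_in_sandbox` are hypothetical stand-ins for the LLM call and the sandboxed runtime, and the selection rule shown here (keep the candidate if it passes and improves the time-memory integral) is one plausible choice, not necessarily the paper's exact criterion.

def iterative_optimization(problem, initial_code, n_iters=10):
    # Start from the initial solution and its measured efficiency profile.
    best_code = initial_code
    best_metrics = run_in_sandbox(problem, best_code)  # e.g. {"passed", "time", "memory", "integral"}
    for _ in range(n_iters):
        # The model sees only the current best code and its execution feedback.
        candidate = generate_improved_code(problem, best_code, best_metrics)
        metrics = run_in_sandbox(problem, candidate)
        # Keep the candidate only if it stays correct and is more efficient.
        if metrics["passed"] and metrics["integral"] < best_metrics["integral"]:
            best_code, best_metrics = candidate, metrics
    return best_code, best_metrics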

Strengths and Weaknesses

Strengths

Clear Framework Design: The paper presents a well-defined framework for feedback-driven code-efficiency refinement. The iterative nature of the process and the way it leverages runtime metrics are clearly explained.

Clear Dataset Description: The Venus dataset and associated training data are systematically constructed and properly explained. This is a valuable resource for the community focused on efficiency-aware code generation.

Empirical Gains from GRPO: The GRPO-trained Qwen2.5 3B model demonstrates significantly better test-time efficiency improvements over baseline and SFT/DPO-trained models within the IOF, showcasing the method's potential.

Weaknesses

  1. Incomplete Evaluation of IOF Design Choices: The current implementation generates improved code based only on the most recent iteration. Would a history-aware model (i.e., conditioned on $C_{0:i-1}, M_{0:i-1}$) perform better? Using all previous iterations could help break out of local minima where $M_i^{out} = M_i^{in}$ despite code differences $C_i^{out} \neq C_i^{in}$.

  2. Lack of Scaling Studies: All key results focus on the Qwen2.5 3B model. It's unclear whether GRPO and IOF yield consistent improvements for larger-scale models (e.g., 7B, 13B). This is critical to assess the generalizability and practical relevance of the framework.

  3. Limited Baseline Comparisons in IOF Iterations: Table 1 compares performance across different models only at iteration 0, but does not show how baseline models (e.g., vanilla GPT, Llama 4, Claude, etc.) evolve under IOF. This obscures how much of the final performance gain comes from model training (e.g., GRPO) versus the framework itself. Showing relative improvement over iterations (e.g., $C_0$ against $C_{i_{best}}$) for different base models would help isolate the benefits.

  4. Insufficient Justification for Venus Dataset: The paper lacks a clear comparison with existing datasets such as Mercury, EffiBench, and EvalPerf. What are the limitations of those datasets that Venus addresses? What is the statistical diversity (e.g., memory, time distributions) of reference solution variants in Venus compared to prior work?

  5. Missing Benchmark Analysis on Existing Datasets: There is no empirical evaluation of IOF and GRPO-trained models on established efficiency-focused code benchmarks like Mercury and EffiBench. This limits the ability to compare against prior art and undermines claims of generalizability.

Questions

  • See Weaknesses section
  • How does IOF performance vary when initialized with code that spans the efficiency spectrum (from highly inefficient to highly efficient)? Are GRPO-trained models more robust to such variation?
  • Have you conducted ablations on the number of iterations (e.g., convergence rate) or observed performance ceilings?

Limitations

Yes, the authors have discussed their limitations, but I would suggest moving the discussion from the Appendix to the main body of the paper.

Justification for Final Rating

The authors have addressed all my concerns with adequate experiments and justification. While using iterative feedback to improve LLM responses is not completely novel, using it to improve code efficiency is an interesting use case. Thus I am increasing my score from Reject to Borderline Accept.

Formatting Concerns

None

Author Response

Dear Reviewer UfKJ,

Thank you for your constructive and insightful reviews. We hope we can address your concerns as follows:

Q1: Would a history-aware model perform better?

We conducted the experiment as you suggested; the results in the following table show that the history-aware model did not outperform the single-turn model in the current setting.

Single-turn Loop

| Iteration | 0 | 1 | 2 | 5 | 7 | 10 |
|---|---|---|---|---|---|---|
| Pass % | 47.33 | 50.33 | 52.00 | 58.17 | 60.50 | 61.67 |
| Beyond-I % | 18.24 | 24.81 | 29.44 | 35.48 | 38.01 | 38.95 |

Multi-turn Loop (History-aware)

| Iteration | 0 | 1 | 2 | 5 | 7 | 10 |
|---|---|---|---|---|---|---|
| Pass % | 45.00 | 46.33 | 46.33 | 46.33 | 46.67 | 46.67 |
| Beyond-I % | 15.10 | 15.10 | 15.10 | 15.10 | 15.10 | 15.10 |

We attribute the underperformance of the history-aware model to two primary factors: 1) Distribution Shift: the performance drop is likely caused by a distribution shift, as the training format (single-turn) differs from the multi-turn, history-aware format used during inference. 2) Training Challenges: while fine-tuning the model with a multi-turn format could mitigate the distribution shift, this approach introduces the significant challenge of long-term credit assignment, a well-known difficulty in reinforcement learning. We agree that history-aware modeling is a valuable and promising research direction. We appreciate the insightful feedback and plan to explore this more thoroughly in our future work.

Q2: Model Scale Generalization

While this work focuses on exploring the methodology of iterative efficiency optimization rather than the scalability of larger models, we agree that it is critical to evaluate our framework on larger models. Due to the rebuttal time limitation, we conducted additional experiments on Qwen2.5 7B as you suggested. The results demonstrate that our iterative optimization framework generalizes to larger models.

| Iteration | 0 | 1 | 2 | 5 | 7 | 10 |
|---|---|---|---|---|---|---|
| Pass % | 60.67 | 65.67 | 69.33 | 74.67 | 79.67 | 83.67 |
| Beyond-T % | 27.67 | 32.87 | 34.22 | 39.50 | 41.91 | 42.86 |
| Beyond-M % | 29.79 | 31.55 | 36.48 | 40.97 | 43.26 | 44.51 |
| Beyond-I % | 21.02 | 28.10 | 33.85 | 37.54 | 37.54 | 38.66 |

Q3: Baseline Comparisons in IOF Iterations

We thank the reviewer for this insightful suggestion. To better isolate the benefits of our framework versus our model training approach, we have conducted new experiments applying IOF to several baseline models as requested. Specifically, we ran the IOF process for 8 iterations on the vanilla Qwen-2.5-3B, Qwen-2.5-7B, and OpenAI GPT-4o models. The results are presented below:

Qwen2.5 3B

| Iteration | 0 | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|
| Pass % | 27.99 | 28.33 | 28.67 | 29.00 | 29.33 |
| Beyond-T % | 12.40 | 12.74 | 13.08 | 13.42 | 13.69 |
| Beyond-M % | 13.24 | 13.73 | 14.22 | 14.71 | 15.19 |
| Beyond-I % | 10.29 | 10.77 | 11.25 | 11.72 | 12.07 |

Qwen2.5 7B Code

| Iteration | 0 | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|
| Pass % | 52.21 | 54.00 | 54.33 | 56.00 | 58.33 |
| Beyond-T % | 20.66 | 21.47 | 23.03 | 23.47 | 26.26 |
| Beyond-M % | 25.21 | 25.72 | 26.48 | 29.93 | 31.20 |
| Beyond-I % | 16.78 | 17.69 | 20.91 | 23.21 | 27.14 |

GPT-4o

| Iteration | 0 | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|
| Pass % | 82.26 | 84.00 | 84.33 | 84.33 | 85.33 |
| Beyond-T % | 38.22 | 40.16 | 42.02 | 44.16 | 46.17 |
| Beyond-M % | 42.09 | 43.28 | 44.71 | 46.20 | 48.01 |
| Beyond-I % | 28.89 | 30.60 | 32.55 | 35.02 | 41.29 |

As the data shows, larger models do achieve better performance through the IOF iterations. However, the improvement in these vanilla models is significantly smaller than that of our Afterburner models. This demonstrates that while the IOF framework itself is beneficial, our proposed training paradigm is the primary driver of the substantial performance gains we report. (Note: as closed-source models have not disclosed their training details, we cannot determine whether similar reinforcement learning techniques were already incorporated into their development pipelines.)

Q4: Venus Justification

We recognize the importance of a clear comparison with prior work and will clarify this in the revised paper.

  • Multilingual Coverage: A primary contribution is that Venus is the first multilingual code efficiency benchmark, covering six languages. Prior work, including Mercury [1], EffiBench [2], and EvalPerf [3], focuses exclusively on Python.
  • Solutions Diversity: Venus offers a significantly larger and more diverse set of reference solutions. While EffiBench uses a single baseline and Mercury averages 18.4 solutions per problem, Venus provides an average of 79.3 solutions per problem for each language. This massive increase directly contributes to the statistical diversity of performance metrics, as shown in the time and memory distributions in Appendix Figure 7.
  • Evaluation Dimensions: As a direct extension of Mercury, which only measures execution time, Venus evaluates execution time, memory usage, and their integral, providing a more holistic assessment of code efficiency.

We will incorporate this detailed comparison, currently summarized in Appendix C, into the main body to make the justification for Venus clearer.

Q6: IOF Efficiency Spectrum

This is a good suggestion! We conducted the experiment, starting Afterburner SFT and Afterburner GRPO from points across the efficiency spectrum. Since we are not allowed to upload images in the rebuttal response, we present the iterative Beyond-I results beginning from the inefficient (0% Beyond-I) to the efficient (50% Beyond-I) solutions in the following table. We will include the complete efficiency spectrum figure in the revision.

Less Efficient Start

| Iteration | 0 | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|
| Afterburner SFT | 3.76 | 4.74 | 5.24 | 7.31 | 11.50 |
| Afterburner GRPO | 3.76 | 4.28 | 5.21 | 8.60 | 13.85 |

More Efficient Start

| Iteration | 0 | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|
| Afterburner SFT | 51.44 | 51.44 | 52.93 | 54.21 | 56.73 |
| Afterburner GRPO | 51.44 | 53.70 | 55.12 | 58.21 | 62.17 |

We argue that a key advantage of RL is that it explores the solution space more thoroughly than SFT or DPO. Therefore, it is more robust when the inference distribution aligns with the training distribution.

Q7: Performance Ceilings

Yes. In our experiments, performance ceilings are observed from IOF round 7 onward on both the Venus and APPS benchmarks, as shown in Figures 5 and 9. We hypothesize that these ceilings result from the trade-off between task complexity and model capability. We conducted an additional experiment on three difficulty categories. For a model with a fixed capability, more difficult tasks require more iterations to converge on an optimal solution, whereas on simpler tasks, performance saturates and hits this ceiling much earlier. The following table shows the iteration at which the performance ceiling (less than a 5% efficiency gap with the final iteration) is reached on Venus and APPS:

| Benchmark (Metric) | Easy | Medium | Hard |
|---|---|---|---|
| Venus (Beyond-T) | 3 | 6 | 7 |
| APPS (Beyond-T) | 4 | 5 | 6 |
| Venus (Beyond-M) | 3 | 5 | 8 |
| APPS (Beyond-M) | 2 | 4 | 9 |
| Venus (Beyond-I) | 4 | 5 | 8 |
| APPS (Beyond-I) | 3 | 5 | 9 |

L1: Limitation Location

Thank you for your suggestion. We will move it from the appendix to the main body.

We hope our explanation can address your concern. Please ask follow-ups if you have any other questions. Thank you!

[1] Mercury: A code efficiency benchmark for code large language models.

[2] Effibench: Benchmarking the efficiency of automatically generated code.

[3] Evaluating language models for efficient code generation.

Comment

I thank the authors for taking the time to conduct additional experiments to address my concerns. However, I still have remaining reservations regarding Q1, Q5, and Q6, which I elaborate on below:

Q1: Would a history-aware model/framework perform better?

It remains unclear whether a history-aware model would yield better performance. The current experiment evaluates a single-turn trained model in a multi-turn setting, introducing distribution shift and making the results difficult to interpret. A more direct experiment would involve explicitly including the history of code improvements and corresponding efficiency feedback from the iterative optimization trace in the prompt. Specifically, to predict $C_i$, the model could be conditioned on $\{C_0, M_0, \ldots, C_{i-1}, M_{i-1}\}$.

A comparison between this history-aware design and the current approach could yield important insights into the potential benefits of leveraging longer optimization traces—insights that are currently missing from the paper.

Q5: Missing Benchmark Analysis on existing Datasets like Mercury and EffiBench

Q6: IOF Efficiency Spectrum

While the additional results show efficiency gains for both less and more efficient initializations, it remains unclear how far these improvements can be pushed with additional iterations. Notably, the last two columns indicate there is still substantial headroom. It would be helpful to understand whether starting from less efficient code leads the IOF toward a global minimum or whether the model tends to get stuck in local optima. Exploring this behavior over extended optimization horizons would help contextualize the limits and strengths of the proposed framework.

Comment

Dear Reviewer UfKJ,

Thank you for your insightful follow-up questions. We are pleased that our previous response resolved most of your concerns, and we appreciate the opportunity to provide further clarification.

Q1: Would a history-aware model/framework perform better?

Our previous experiment confirmed that the history-aware setting does not perform better with single-turn-trained models. As you suggested, we conducted additional experiments exploring the performance of the vanilla model in a history-aware framework. We follow exactly the format you suggested, explicitly including the history of code improvements and the corresponding efficiency feedback from the iterative optimization trace in the prompt:

| Input | Output |
|---|---|
| (None, None) | C_0 |
| (C_0, M_0) | C_1 |
| (C_0, M_0, C_1, M_1) | C_2 |
| ... | ... |
| (C_0, M_0, C_1, M_1, ..., C_{i-1}, M_{i-1}) | C_i |
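For illustration, one way such a history-aware input could be assembled is sketched below; the prompt wording and the `format_metrics` helper are assumptions for this sketch, not the authors' actual template.

def build_history_prompt(problem, trace):
    """trace: list of (code, metrics) pairs, i.e. (C_0, M_0), ..., (C_{i-1}, M_{i-1})."""
    parts = [f"## Problem Description\n{problem}"]
    for step, (code, metrics) in enumerate(trace):
        # Include every previous solution together with its execution feedback.
        parts.append(f"## Solution {step}\n{code}")
        parts.append(f"## Performance {step}\n{format_metrics(metrics)}")  # hypothetical formatter
    parts.append("## Task\nGenerate a solution that is more efficient than all of the above.")
    return "\n\n".join(parts)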

Results of the Vanilla Model on the history-aware framework

| Iteration | 0 | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|
| Pass % | 28.33 | 28.33 | 28.33 | 28.67 | 28.67 |
| Beyond-T % | 12.45 | 12.61 | 12.91 | 13.14 | 13.19 |
| Beyond-M % | 13.19 | 13.50 | 14.12 | 14.27 | 14.82 |
| Beyond-I % | 10.28 | 10.64 | 10.95 | 11.22 | 11.87 |

The above results show that applying the history-aware method directly to a vanilla model yielded no performance improvement over the single-turn baseline. Despite instructions to avoid repetition, the model predominantly replicated existing solutions from the prompt. The few novel solutions it generated, while claiming to be superior, demonstrated no empirical performance gains. This observation aligns with our core hypothesis: vanilla models can generate correct code but lack an intrinsic awareness of code efficiency.

|  | Inference: Single-Turn | Inference: Multi-Turn |
|---|---|---|
| Training: Single-Turn | [Afterburner] (P) Consistent data distribution. (P) Stable RL training. (P) Avoids credit assignment issues. (C) Cannot handle history. | [Distribution Mismatch] (C) Critical mismatch between training and inference data leads to poor performance. |
| Training: Multi-Turn | [Training Waste] (C) Wastes the model's trained ability. | [Ideal History-Aware] (P) Creates a truly context-aware agent. (C) Suffers from sparse rewards and unstable RL training. (C) Difficult long-distance credit assignment. |

In conclusion, we recognize that the history-aware framework (with longer optimization traces) is a very promising direction, although its focus differs significantly from this paper's. We focus on the single-turn self-improving framework in this work to ensure consistent input distributions and stable RL training. We would like to explore the history-aware framework in future work.

Q5: Missing Benchmark Analysis on existing Datasets like Mercury and EffiBench

Sorry for missing this point in our previous response. Since Venus is a direct extension of Mercury and EffiBench, it covers all tasks in Mercury and EffiBench, which is why we initially focused our analysis on Venus. As requested, we have now conducted experiments on the Mercury benchmark. The results are consistent with our findings on Venus and further validate the effectiveness of our methods.

Afterburner-SFT on Mercury

| Iteration | 0 | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|
| Pass | 58.98 | 58.98 | 59.77 | 60.55 | 60.55 |
| Beyond | 26.28 | 26.95 | 27.43 | 29.65 | 31.42 |

Afterburner-GRPO on Mercury

| Iteration | 0 | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|
| Pass | 57.42 | 59.77 | 61.72 | 67.19 | 69.14 |
| Beyond | 25.84 | 27.41 | 29.50 | 32.69 | 37.23 |

We will add these results and a corresponding discussion to the revision.

Comment

Q6: IOF Efficiency Spectrum

Thank you for your constructive feedback. We believe adding the efficiency spectrum will be very beneficial for our work. Due to the rebuttal's length constraints, we were unable to provide the complete iteration results previously. As shown in our previous response to Q7, efficiency performance for most tasks begins to converge around iteration 5 (note that the last two columns in our previous response correspond to iterations 4 and 8). More difficult tasks require additional iterations to reach their performance ceiling.

Following your suggestion, we conducted an additional experiment to explore whether starting from a less efficient solution traps the model in a local optimum. We compared our Afterburner-GRPO against the Afterburner-SFT baseline.

| Iteration | 0 | 8 | 12 | 16 | 20 | 24 | 28 | 29 | 30 | 31 | 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Afterburner SFT | 3.76 | 10.49 | 11.50 | 11.50 | 11.87 | 12.04 | 12.04 | 12.04 | 12.04 | 12.04 | 12.04 |
| Afterburner GRPO | 3.75 | 14.15 | 16.00 | 20.42 | 25.59 | 27.05 | 27.05 | 28.14 | 29.01 | 29.01 | 29.13 |

When initialized from a less efficient solution, Afterburner-SFT stagnates in a local optimum by iteration 24. In contrast, Afterburner-GRPO proves significantly more robust, continuing to improve and achieving a much higher performance level. This outcome highlights the superiority of our RL-based approach in exploring the solution space more effectively to escape such suboptimal traps.

We hope the results and explanation can address your concern. If you have any other questions, we are happy to provide further clarification needed.

Comment

Thanks for further clarifying my concerns. It is interesting to observe that history-aware models, even when trained for it, surprisingly perform poorly compared to the single-turn setting. The issue may be worth investigating in future work.

Furthermore, given a programming problem, starting from a less efficient solution never reaches the final convergence point of the instance that starts with a more efficient solution. This suggests there exists some global minimum, but not every starting code can lead to it and may get trapped in local minima (29.13 vs. 62.17). This is an interesting observation worth investigating, where techniques from the field of optimisation could help. Nonetheless, the proposed GRPO-based training indeed gives good performance compared to the baselines.

I am more or less satisfied with the authors' rebuttal and have correspondingly increased my score.

Once again thanks for taking time to conduct more experiments.

Comment

Dear Reviewer UfKJ,

Thank you for your insightful feedback and for raising your score! We are delighted that our responses and additional experiments have addressed your concerns.

We will incorporate our discussion into the revised manuscript, including the exploration of the history-aware framework, the Venus dataset justification, model scaling generalization, baselines in IOF iterations, and the IOF efficiency spectrum.

We appreciate your time and constructive engagement, which have significantly strengthened our work.

Best regards,

The Authors

Review (Rating: 5)

The paper introduced AfterBurner, an iterative optimization framework to address the problem of LLMs failing to generate efficient code despite its functional correctness. AfterBurner forms a closed loop: an LLM proposes an updated solution, the Monolith sandbox runs it and returns the runtime and memory measurements, and the best candidate becomes the seed for the next round. The authors explored three strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and a reinforcement-learning variant, Group Relative Policy Optimisation (GRPO), that leverages Monolith feedback as reward. To evaluate efficiency, the authors built Venus, a new dataset designed for rigorous code efficiency assessment, with 2,181 training and 300 test tasks. The metrics used are BEYOND-T/M/I, measuring the percentile rank of runtime, peak memory, and their integral against the human distribution. Experiments show that on Venus, GRPO improves pass@1 from 47% to 62% and BEYOND-I from 31% to 45%. Ablations show that execution feedback and oracle context are essential.

Strengths and Weaknesses

Strengths

  • The experimental design is solid. It compares three optimisation paradigms and provides ablations and confidence intervals. The BEYOND metrics can mitigate some hardware variance.
  • The reward design and overall technique flow are well presented.
  • The paper tackles an under-explored aspect of code generation, i.e., efficiency. It matters a lot for production deployment. It also introduces a sizeable benchmark that others can reuse.
  • Combining RL with execution-based rewards for efficiency rather than correctness is a novel angle.

Weaknesses

  • There are many reported improvements, but their statistical significance is unknown.
  • Some typos, like "paassed" in line 135.
  • There are no decontamination processes mentioned in the paper. APPS is also not a "live" benchmark in comparison to LiveCodeBench.
  • The novelty mainly lies in the application of RL to code efficiency; GRPO itself is an existing algorithm.
  • The focus on just Python and file-level coding makes the scope of the paper narrow. In practice, developers care more about real-world code living in real software.

Questions

  1. How can the proposed strategy handle API-dependent real-world code? For example, does the proposed method still work on software engineering benchmarks like SWE-bench or Defects4J? I noted this was explained in the Limitations section, but I'm still curious about the authors' thoughts. It's okay to leave this unaddressed for this paper, but then I will also keep my concerns about the relatively narrow scope.
  2. The final reward has a combination of hyperparameters. How sensitive is it when given different values?
  3. Could you elaborate on how you measure the statistical significance of your results?
  4. Does the same conclusion hold when APPS is changed to LiveCodeBench with no problem contamination?

If these questions are properly addressed, I'm happy to raise the score.

Limitations

yes

Justification for Final Rating

The paper has made solid contributions and experiments. While my concern about its generalizability to open-ended software tasks is not addressed, the authors provide clear explanations and have addressed most of my other concerns. I'm giving an Accept.

Formatting Concerns

Some typos, no major concerns.

Author Response

Dear Reviewer z3gd:

Thank you for your constructive and insightful reviews! We hope we can address your concerns as follows:

Q1: Scaling to API-dependent Real-world Code

Thank you for this insightful question. While this paper focuses on establishing a robust method for function-level optimization, we agree that scaling to repository-level benchmarks like SWE-bench is a critical next step. We view our current work as the foundational layer for this larger goal and introduce the following strategy to address the main challenges:

  • API Dependencies: To handle API-dependent code, we may integrate Retrieval-Augmented Generation (RAG) and static analysis. This allows our model to dynamically pull in relevant API definitions, usage patterns, and code snippets from the broader repository, providing the necessary context for effective optimization.

  • Context Management: Real-world repositories are too large for current model context windows. We could employ a divide-and-conquer approach. This involves identifying and isolating the most relevant functions and modules for optimization, making the task computationally tractable.

  • Test Generation: Verifying changes across an entire repository is notoriously difficult. Our strategy would focus on generating targeted unit tests specifically for the modules undergoing optimization. This ensures the correctness and efficiency of our changes in a localized, verifiable, and recursive manner.

In summary, while scaling our method to the repository level presents a significant research challenge, we hope this structured approach provides a clear path forward.

Q2: Hyperparameters Sensitivity

In our experiment, we observed that a dominant weight for functional correctness (beta_c >= 0.5) is essential for stable training, as values below this threshold often led to training crashes. To balance our objectives, we set a fixed beta_c = 0.5 while gradually increasing the code efficiency weight (beta_e) from 0.3 to 0.5. This strategy ensures the model first learns to produce correct code before optimizing for efficiency.

Q3: Statistical Significance

We employ bootstrap sampling to construct 95% confidence intervals for our efficiency metrics. Here is the detailed procedure:

  1. Data Collection: For each individual problem, we generate 4 unique solutions (temperature=1) and measure the performance of each one 16 times. This results in a performance set of 64 measurements (4 solutions×16 measurements).
  2. Bootstrap Sampling: We then create 128 bootstrap samples. Each sample is formed by drawing one measurement with replacement from the original performance set.
  3. Distribution Generation: We generate an empirical sampling distribution for each efficiency metric by aggregating the 128 bootstrap samples over all problems.
  4. Score Reporting: We report the mean of these bootstrap distributions along with their 95% confidence interval. This method allows for a robust estimation of efficiency performance and its statistical uncertainty. We provide a detailed explanation in Appendix G.
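For reference, a minimal sketch of this bootstrap procedure; the aggregation step reflects our reading of points 2-3 above (one draw per problem per resample), and the paper's implementation details may differ.

import numpy as np

def bootstrap_ci(per_problem_sets, n_boot=128, seed=0):
    """per_problem_sets: one 64-measurement array (4 solutions x 16 runs) per problem."""
    rng = np.random.default_rng(seed)
    boot_means = []
    for _ in range(n_boot):
        # Draw one measurement with replacement from each problem's performance set, then aggregate.
        resample = [rng.choice(measurements) for measurements in per_problem_sets]
        boot_means.append(np.mean(resample))
    boot_means = np.asarray(boot_means)
    # Report the bootstrap mean and the 95% confidence interval.
    return boot_means.mean(), np.percentile(boot_means, [2.5, 97.5])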

Q4: Contamination-free Evaluation

We evaluate our methods on LiveCodeBench and report the pass rate below. The results confirm that our conclusions hold. Afterburner-GRPO shows substantial and consistent improvement as the number of iterations increases, while Afterburner-SFT plateaus quickly. This trend is consistent with our findings on the APPS and Venus benchmarks, demonstrating the general effectiveness of our approach.

Afterburner-SFT

| Iteration | 0 | 1 | 2 | 5 | 7 | 10 |
|---|---|---|---|---|---|---|
| Pass % | 5.49 | 6.59 | 6.59 | 7.14 | 7.69 | 7.69 |

Afterburner-GRPO

| Iteration | 0 | 1 | 2 | 5 | 7 | 10 |
|---|---|---|---|---|---|---|
| Pass % | 5.49 | 7.69 | 9.89 | 11.54 | 12.09 | 13.19 |

Q5: Some typos like "paassed" in line 135.

We will fix this typo in the revision.

We hope our explanation can address your concern. Please ask follow-ups if you have any other questions. Thank you!

Comment

Thanks for the rebuttal that properly addressed Q3 to Q5. While my concern about its scalability to real-world code still exists, the additional explanation is helpful. I remain very positive about the work and have increased the Quality and Significance scores. Before concluding on the final rating, I will keep a closer eye on the discussion from the other reviewers (especially those who gave negative scores).

"gradually increasing the code efficiency weight (beta_e) from 0.3 to 0.5"

You mentioned in Q2 that you gradually increased the weight. Could you elaborate on the process? Did you stop the run and continue with a different weight, or is it dynamically changed in a single training run, and how is it adjusted (say linearly)? If it's the former, how did you handle learning rate and optimizer states?

Comment

We are glad our response addressed most of your concerns and thank you for the positive feedback!

In terms of the annealing schedule for the code efficiency weight (beta_e), we linearly increase beta_e within a single training run, much like a learning rate warmup schedule. Specifically, at each training step, we calculate the current beta_e based on the training progress and feed it into our custom reward function. This strategy enables our model to focus first on functional correctness and then increasingly on code efficiency as training advances.

def get_reward_coefficient(current_step, total_steps, initial_coef=0.3, final_coef=0.5):
    """Linearly interpolates the reward coefficient based on training progress."""
    progress = current_step / total_steps
    return initial_coef + (final_coef - initial_coef) * progress
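A quick check of the schedule at a few points of a hypothetical 1,000-step run (the step count is illustrative, not the paper's actual training length):

for step in (0, 500, 1000):
    # Coefficient warms up linearly from 0.3 to 0.5 as training progresses.
    print(step, get_reward_coefficient(step, total_steps=1000))  # -> 0.3, 0.4, 0.5 (approximately)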
Comment

Thanks for the clarification. I have finalized a score of 5.

Comment

Dear Reviewer z3gd,

Thank you for your positive feedback! We will incorporate our discussion into the revised manuscript, including scalability to real-world code, hyperparameter sensitivity, statistical significance, and the contamination-free evaluation.

We appreciate the time and consideration you dedicated to our work.

Best regards,

The Authors

Review (Rating: 5)

This paper introduces Afterburner, an iterative optimization framework designed to improve the computational efficiency of LLM-generated code. The framework employs a closed-loop system where LLMs iteratively refine code based on empirical performance feedback from "Monolith," a high-fidelity execution sandbox that provides real-time efficiency metrics. The authors compare three training strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO), finding that while SFT and DPO quickly plateau in efficiency gains, GRPO using reinforcement learning with execution feedback continuously optimizes code performance. Experiments on the newly introduced Venus dataset and the APPS benchmark demonstrate that GRPO significantly improves both functional correctness (PASS@1 from 47% to 62%) and efficiency compared to human solutions (BEYOND-I from 31% to 45%). The work addresses a critical gap in LLM-generated code, where models often produce functionally correct but computationally inefficient solutions, and shows that online reinforcement learning with direct execution feedback enables models to achieve sustained improvements in code efficiency through iterative test-time optimization.

Strengths and Weaknesses

Strength

  • This paper is among the earliest works that aim to improve the "efficiency" of LLM-generated code, not only its correctness. Efficiency is an important yet under-explored aspect of LLM-generated code.
  • This paper implements a closed-loop pipeline that can automatically generate code, reliably collect efficiency feedback, and iteratively refine its solutions with Afterburner and Monolith.
  • This paper releases the first dataset, Venus, that can be used for training LLMs for efficiency optimization, further supporting future research on improving the efficiency of LLM-generated code.
  • This paper comprehensively and systematically studied the impacts of SFT, DPO, and GRPO for code efficiency refinement training, providing practical insights for the efficiency optimization training recipe.
  • This paper successfully trained Afterburner, which has only 3B parameters while matching much larger models, such as Qwen-2.5-7B-Instruct and Llama-4-Scout, in code efficiency.

Weaknesses

Overall, I am very positive about this paper, so I do not have much to complain about. One minor thing is that, while the paper replied "Yes" in its checklist about the release of data and code, I could not find a link to these artifacts. This discrepancy might reduce transparency and reproducibility.

Questions

  • Could the authors provide a pointer to the reproducible artifact of Venus, Afterburner, and Monolith?

Limitations

Yes.

Justification for Final Rating

I am glad to see that the authors commit to releasing their data and artifacts, which was my major concern. I am keeping my score at 5 to support this paper's acceptance.

Formatting Concerns

No formatting concerns.

Author Response

Dear Reviewer xncp,

Thank you for your positive feedback!

We have prepared the artifacts of Venus, Afterburner, and Monolith. Due to the new rebuttal policy this year, we cannot share external links at this time. We are committed to reproducibility and will update these artifacts in the final version immediately after the review process.

We hope it can address your concern. Thank you!

Comment

I am glad to see that the authors commit to releasing their data and artifacts. I am keeping my score at 5.

Comment

Dear Reviewer xncp,

Thank you for your strong support and positive feedback. We will release our code and data after the review period. We sincerely appreciate the time and effort you dedicated to our manuscript!

Best regards, Authors

Review (Rating: 4)

This paper proposes a test-time computation method to iteratively improve the efficiency of code snippets generated by LLMs. Specifically, the authors apply reinforcement learning to fine-tune a Qwen-2.5 3B model. Once the model is sufficiently fine-tuned, it is used to optimize code snippets by leveraging execution feedback. To evaluate the effectiveness of the proposed approach, experiments are conducted on the Venus Python dataset. The authors use four metrics: PASS@1, BEYOND-T, BEYOND-M, and BEYOND-I. The results demonstrate that the method significantly enhances code efficiency. Moreover, the results also demonstrate that LLM-generated code may be more efficient than human-written code.

Strengths and Weaknesses

Strength:

  • The paper is well-written and easy to follow.
  • The problem addressed is both significant and important.
  • The evaluation results are promising.

Weakness:

– The rationale behind the evaluation metric requires further discussion.

– Some details about the proposed dataset are missing.

– The generalization ability of the proposed model is unclear.

Questions

  1. Does the proposed dataset contain ground truth code snippets? If not, the reliability of the LLM-generated test cases may be questionable.

  2. Are all the collected code snippets correct? For example, if some incorrect snippets simply return without executing any statements, they could achieve perfect efficiency but be meaningless.

  3. After fine-tuning on the Venus dataset, can the fine-tuned model generalize and perform well on other datasets?

  4. Did you compare the proposed approach with a native "SFT" baseline, which just fine-tunes the model to generate the most efficient correct code snippets?

  5. Can your proposed algorithm generalize to other base model architectures, such as Llama?

  6. Where do the test cases used in the iterative improvement stage come from?

  7. When computing the efficiency metric during evaluation, did you filter out incorrect code snippets, or did you only consider correct ones?

Limitations

  1. Evaluation Metric Rationale: The rationale behind the chosen evaluation metric requires further discussion. The authors argue that absolute efficiency metrics are avoided due to their sensitivity to hardware configurations and operating systems. However, this justification is unconvincing for several reasons: (1) their method still relies on comparing absolute efficiency metrics; (2) when measured on the same hardware and operating system, efficiency metrics can be fairly compared, and system noise can be mitigated through multiple runs; and (3) the proposed metric heavily depends on the distribution of efficiency in human-written code. I suggest that the authors also present results using absolute efficiency metrics.

  2. Source of Inference-Time Test Cases: The origin of the inference-time test cases is unclear. The proposed method relies on these runtime test cases to provide correct and efficient feedback for self-improvement. However, it is not specified where these test cases come from: Are they the same as the evaluation test cases, or are they generated by GPT at runtime? If they are generated by GPT at runtime, there is a risk of intelligence leakage. In other words, if GPT is used to generate test cases, why not use GPT to generate correct or efficient code directly, especially since the results show GPT may outperform the proposed method?

  3. Generalizability of the Proposed Method: The generalizability of the proposed method is unclear. (1) It is not demonstrated whether the method can generalize to other model architectures, such as Llama. (2) Although Table 1 shows the proposed method outperforms baseline methods, all baselines have been fine-tuned specifically on the proposed dataset. It remains unclear whether the fine-tuned model can generalize to other datasets. I recommend that the authors conduct dynamic evaluation [1, 4] or test on other datasets [2, 3] to better demonstrate generalizability.

  4. Missing Baseline: Why was a simple baseline—fine-tuning the base model with only the most efficient code snippets—not considered?

  5. It is unclear whether the efficiency metric is measured only on the correct code.

[1] DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination. ICML 2025

[2] EffiBench: Benchmarking the efficiency of automatically generated code. NeurIPS 2024.

[3] Evaluating language models for efficient code generation. COLM 2024.

[4] DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation

Justification for Final Rating

Based on the rebuttal, I updated my score. However, the way the authors evaluated (i.e., using the same set of test cases for improving the model and for evaluating it) is still very problematic to me. All my other points were addressed by the authors during the detailed rebuttal period.

Formatting Concerns

N/A

Author Response

Dear Reviewer ovks,

Thank you for your constructive and insightful feedback. We are encouraged that you found our work valuable. Below, we address the questions and suggestions raised.

Q1 & Q2: Solution Correctness in Venus

All Venus code snippets are exclusively collected from accepted (including the official ground truth) solutions on LeetCode, meaning they passed all test cases. We agree that this validation process is a critical detail. Due to space constraints, we included the full pipeline in Appendix C.1. For improved clarity, we will add a brief explanation and a direct reference in Section 5.1 Dataset Recipe:

“Venus Python subset contains 2,181 algorithmic problems, each accompanied by a validated test case generator and an average of 106.6 validated human solutions... For each LLM-generated test case input, we follow the paradigm of Mercury, where we execute them through all collected solutions from LeetCode, and only keep those cases having consistent outputs over all correct solutions. More details can be found in Appendix C.1 Rigorous Validation.”
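A rough sketch of the validation rule quoted above: keep a generated input only if every accepted solution agrees on its output. `run_solution` is a hypothetical executor, and outputs are assumed directly comparable; this is an illustration, not the paper's actual validation code.

def validate_test_case(test_input, accepted_solutions):
    # Execute the candidate input through all accepted human solutions.
    outputs = [run_solution(sol, test_input) for sol in accepted_solutions]
    # Keep the case only if all correct solutions produce the same output.
    return all(out == outputs[0] for out in outputs)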

Q3: Model Generalization

We evaluated our Venus-tuned model on another dataset APPS [1]. Unlike existing code efficiency benchmarks such as Mercury [2], EffiBench [3] and Venus, which consist of LeetCode-style questions, the APPS dataset features distinct problem formats. As shown in the tables below, Afterburner demonstrates a consistent performance improvement pattern on the APPS dataset. Please refer to Appendix Figure 9 for a complete visualization. We also conducted an additional experiment on LiveCodeBench [8] to demonstrate the consistent performance gain on the contamination-free benchmark.

Afterburner Base on APPS

| Iteration | 0 | 1 | 3 | 5 | 7 | 10 |
|---|---|---|---|---|---|---|
| Pass | 10.67 | 11.67 | 13.67 | 15.00 | 15.33 | 15.67 |
| Beyond-T | 4.14 | 4.14 | 4.14 | 4.23 | 4.53 | 4.53 |

Afterburner GRPO on APPS

| Iteration | 0 | 1 | 3 | 5 | 7 | 10 |
|---|---|---|---|---|---|---|
| Pass | 13.00 | 15.00 | 20.33 | 24.00 | 28.00 | 31.67 |
| Beyond-T | 4.69 | 9.04 | 12.36 | 14.85 | 16.01 | 16.18 |

Afterburner Base on LiveCodeBench (release_v6)

| Iteration | 0 | 1 | 3 | 5 | 7 | 10 |
|---|---|---|---|---|---|---|
| Pass % | 5.49 | 6.59 | 6.59 | 7.14 | 7.69 | 7.69 |

Afterburner GRPO on LiveCodeBench (release_v6)

| Iteration | 0 | 1 | 3 | 5 | 7 | 10 |
|---|---|---|---|---|---|---|
| Pass % | 5.49 | 7.69 | 9.89 | 11.54 | 12.09 | 13.19 |

We acknowledge that we missed the reference to the APPS results in the main body. Thank you for pointing it out! We will add [5,6] into related work and append the following statement in Section 6.2 to reference these findings in the main text:

“...To verify whether our models can generalize to out-of-distribution questions, we evaluated Afterburner on the APPS dataset [1]. As illustrated in Appendix Figure 9, Afterburner demonstrates a similar pattern of performance improvement, confirming its effectiveness on problems with distinct data distribution.”

Q4 & L4: Native "SFT" Baseline

We agree that a comparison with a native SFT baseline is valuable. We interpret "native SFT" as one of two paradigms and address both below:

  • Code Refinement SFT: trains a model using (problem_description, original_code) to generate an improved_code. This is conceptually very similar to our Afterburner SFT baseline, which augments the input with performance feedback from the original code. We provide its results in Figure 5.
  • Direct Generation SFT: trains a model using (problem_description) to directly generate the improved_code. We assume your question refers to this paradigm. We initially omitted this baseline because it lacks a performance feedback mechanism, making a direct comparison with our feedback-driven method unfair. Furthermore, as noted by prior work Mercury [2], this direct approach can be susceptible to catastrophic forgetting, especially on smaller models.

We agree that the native SFT approach could serve as a straightforward baseline in this work. Prompted by your suggestion, we conducted this additional experiment, fine-tuning the 'Qwen2.5-3B' model using the same hyperparameters as our main experiments. The results are presented in the table below. We observed model collapse on the native SFT baseline.

| SFT Paradigm | Metric | Iter. 0 | 1 | 3 | 5 | 7 | 10 |
|---|---|---|---|---|---|---|---|
| Code Refinement | Pass % | 46.00 | 46.00 | 47.00 | 48.33 | 48.67 | 48.67 |
| Code Refinement | Beyond-I % | 21.01 | 21.09 | 21.88 | 22.31 | 22.44 | 22.50 |
| Direct Generation | Pass % | 21.67 | 21.67 | 22.33 | 22.33 | 22.67 | 22.67 |
| Direct Generation | Beyond-I % | 5.14 | 5.76 | 6.47 | 6.63 | 6.63 | 6.75 |

Q5 & L3: Model Generalization

Due to the rebuttal time limitation, we conducted additional experiments on two other models, 'Qwen2.5 7B' and 'GPT-4o'. As the data shows, larger models do achieve better performance through the IOF iterations. However, the improvement in these vanilla models is significantly smaller than that of our Afterburner models. This demonstrates that while the IOF framework itself is beneficial, our proposed training paradigm is the primary driver of the substantial performance gains we report. For other model architectures, such as Llama, we will include the experimental results in the revision.

Qwen2.5 7B

| Iteration | 0 | 1 | 2 | 5 | 7 | 10 |
|---|---|---|---|---|---|---|
| Pass | 60.67 | 65.67 | 69.33 | 74.67 | 79.67 | 83.67 |
| Beyond-T | 27.67 | 32.87 | 34.22 | 39.50 | 41.91 | 42.86 |
| Beyond-M | 29.79 | 31.55 | 36.48 | 40.97 | 43.26 | 44.51 |
| Beyond-I | 21.02 | 28.10 | 33.85 | 37.54 | 37.54 | 38.66 |

GPT-4o

| Iteration | 0 | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|
| Pass % | 82.26 | 84.00 | 84.33 | 84.33 | 85.33 |
| Beyond-T % | 38.22 | 40.16 | 42.02 | 44.16 | 46.17 |
| Beyond-M % | 42.09 | 43.28 | 44.71 | 46.20 | 48.01 |
| Beyond-I % | 28.89 | 30.60 | 32.55 | 35.02 | 41.29 |

Q6 & L2: Where do the test cases used in the iterative improvement stage come from?

We use the same pre-generated test cases for both iterative improvement and evaluation. This approach is valid because the model only receives high-level execution feedback—such as pass status, running time, and memory usage—without ever seeing the specific contents of the test cases. This design prevents information leakage and ensures a fair evaluation.

Q7: When computing the efficiency metric during evaluation, did you filter out incorrect code snippets, or did you only consider correct ones?

During the efficiency evaluation, we didn't filter out incorrect code. The Monolith sandbox returns the default maximum metric values for it (time: 90 s, memory: 1,048,576 KB, integral: 90 × 1,048,576 = 94,371,840), so the percentile ranks of incorrect solutions are 0. You can therefore view these efficiency metrics (time, memory, integral) as efficiency-weighted functional correctness (pass).
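To make the convention concrete, a minimal sketch of a Beyond-style percentile score under this rule, shown for the time metric only; `MAX_TIME_S` mirrors the 90 s default mentioned above, and the exact aggregation in the paper may differ.

import numpy as np

MAX_TIME_S = 90.0  # sandbox default charged to incorrect code

def beyond_time(measured_time, passed, human_times):
    # Incorrect submissions are charged the maximum time, so they beat 0% of human solutions.
    t = measured_time if passed else MAX_TIME_S
    # Fraction of human reference solutions that are slower than the submission.
    return float(np.mean(np.asarray(human_times) > t))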

L1: Evaluation Metric Rationale.

Thank you for this insightful suggestion! We agree that absolute efficiency metrics have their advantages: 1) they are easy to calculate; 2) they don't rely on the distribution of human solutions; 3) they work when all models are evaluated on the same machine. We appreciate your perspective and have revised the paper to clarify our rationale per your suggestion:

  • Benchmark Standardization: We follow the relative efficiency metrics used in most established code efficiency benchmarks, such as Mercury [2], EffiBench [3], EvalPerf [4] and ENAMEL [7].
  • Result Comparability: Since Venus is a code efficiency benchmark, we hope the reported score is comparable. While absolute efficiency metrics may be infeasible to compare across machines, relative metrics can handle this issue. For example, a solution that is faster than 80% of human solutions on one machine is likely to be similarly performant relative to the same set of solutions on another machine.
  • Performance Normalization: Absolute runtimes can vary by orders of magnitude across different test cases. This large range means that a model's aggregate score can be dominated by a few long-running outliers, masking subtle but significant improvements. Our relative metric normalizes performance against the distribution of human-written code, providing a more stable and fine-grained measure of efficiency improvement.
  • Training vs. Evaluation: You correctly note that we use absolute metrics internally. We do this specifically during model training, where the raw execution time and memory provide simple and effective reward signals. Since training occurs in an isolated sandbox on a single hardware setup, comparability is not a concern. Furthermore, for reinforcement learning, an unbounded absolute metric allows the model to explore solutions that surpass the efficiency of any known human code. However, for the final evaluation, the need for a normalized, comparable, and robust benchmark makes the relative metric the superior choice.

We hope our explanation can address your concern. Please ask follow-ups if you have any other questions. Thank you!

Reference

[1] Measuring coding challenge competence with apps.

[2] Mercury: A code efficiency benchmark for code large language models.

[3] Effibench: Benchmarking the efficiency of automatically generated code.

[4] Evaluating language models for efficient code generation.

[5] DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination.

[6] DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation

[7] How efficient is llm-generated code? a rigorous & high-standard benchmark.

[8] LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code.

Comment

I thank the authors for providing a detailed rebuttal. It clears up many of my concerns. However, I am still worried about the following points:

" Q6 & L2: Where do the test cases used in the iterative improvement stage come from?

We use the same pre-generated test cases for both iterative improvement and evaluation. This approach is valid because the model only receives high-level execution feedback—such as pass status, running time, and memory usage—without ever seeing the specific contents of the test cases. This design prevents information leakage and ensures a fair evaluation. "

I do not think this approach yields a fair evaluation, as you are biasing the model towards these test case signals and then evaluating based on them again.

"During the efficiency evaluation, we didn’t filter out incorrect code. The monolith sandbox will return the default maximum metric values (time: 90 s, memory: 1,048,576 kb, integral: 90 * 1,048,576 = 94371840), so their percentile ranks will be 0. You can consider these efficiency metrics (time, memory, integral) as efficiency-weighted functional correctness (pass)."

I think this somewhat undermines the pure efficiency gain. In a later version of the paper, it might be good to also evaluate the efficiency gain of the correct samples.

Comment

Dear Reviewer ovks,

Thank you for your insightful follow-up questions. We are pleased that our previous response resolved most of your concerns, and we appreciate the opportunity to provide further clarification.

Q6 & L2: Where do the test cases used in the iterative improvement stage come from?

You raised an important point regarding the source of test cases used during the iterative inference stage. We'd like to clarify our original process and present a new experiment to demonstrate its robustness.

In the current iterative inference stage, we ran the 100 pre-generated test cases on the original solution and fed its holistic performance metrics (Passed, Time, Memory, Integral) back into the Afterburner prompt template. For each iteration, we aggregate these metrics as the evaluation results. Notably, since 'Passed' is True if and only if the original solution passed all the test cases, Afterburner cannot access individual test case information and can hardly overfit to the test cases during the inference stage. In real-world applications, we can further reduce the potential bias (or overfitting) by increasing the number of test cases.

<…>
## Problem Description
{problem_description}

## Original Solution
{original_solution}

## Original Performance
Passed: {original_passed} / Time: {original_time} / Memory: {original_memory} / Integral: {original_integral}
<…>

While the chance of test case leakage is low, we agree that using different test cases for inference and evaluation is more rigorous. To address this concern, we treat the existing test cases in the dataset as public test cases. Then we follow the setup of EffiLearner [2] to generate 100 private test cases using the corresponding test case generators. To avoid potential data leakage, we also filter out private cases identical to the public test cases. During the iterative optimization stage, we use the public test cases to obtain the overhead information and optimize the inefficient code. After generating the improved code, we then use the private test cases to measure its performance. The evaluation results are shown below:

Afterburner-GRPO (inference on public test cases and evaluation on public test cases)

| Iteration | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Pass % | 47.33 | 50.33 | 52.00 | 54.50 | 57.00 | 58.17 | 59.34 | 60.50 | 61.18 | 61.67 | 61.67 |
| Beyond-T % | 31.22 | 36.69 | 38.41 | 39.64 | 40.83 | 42.35 | 43.49 | 44.30 | 44.82 | 45.17 | 45.17 |
| Beyond-M % | 25.14 | 30.21 | 34.17 | 37.45 | 40.72 | 42.55 | 44.39 | 46.22 | 47.65 | 48.05 | 48.05 |
| Beyond-I % | 18.24 | 24.81 | 29.44 | 30.85 | 33.56 | 35.48 | 37.09 | 38.01 | 38.62 | 38.95 | 38.95 |

Afterburner-GRPO (inference on public test cases and evaluation on private test cases)

| Iteration | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Pass % | 46.67 | 51.00 | 52.67 | 54.00 | 56.00 | 58.33 | 59.67 | 61.00 | 62.00 | 62.00 | 62.00 |
| Beyond-T % | 29.73 | 35.28 | 38.01 | 38.46 | 39.10 | 40.50 | 40.50 | 42.63 | 43.19 | 43.16 | 44.93 |
| Beyond-M % | 25.43 | 28.59 | 32.15 | 35.21 | 38.66 | 40.73 | 43.21 | 44.92 | 44.92 | 44.94 | 45.89 |
| Beyond-I % | 17.85 | 23.24 | 28.76 | 29.95 | 33.27 | 35.10 | 36.99 | 37.62 | 38.50 | 38.50 | 38.50 |

We can observe that the performance of the current evaluation (using private test cases) and the previous evaluation (using public test cases) is very similar. Their Pearson correlations (Pass: 0.9939, Beyond-T: 0.9852, Beyond-M: 0.9955, Beyond-I: 0.9986) demonstrate that our presented evaluation is not biased toward the public test cases, i.e., providing only the overhead information to LLMs does not leak the test cases or cause the Afterburner-generated code to be biased toward them. We will update these evaluation scores using private test cases in the revision. We hope the results and explanation can address your concern.
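As a sanity check, the Beyond-I correlation can be reproduced from the two tables above (values copied verbatim from the tables; the other metrics follow the same pattern):

import numpy as np

public_beyond_i  = [18.24, 24.81, 29.44, 30.85, 33.56, 35.48, 37.09, 38.01, 38.62, 38.95, 38.95]
private_beyond_i = [17.85, 23.24, 28.76, 29.95, 33.27, 35.10, 36.99, 37.62, 38.50, 38.50, 38.50]
r = np.corrcoef(public_beyond_i, private_beyond_i)[0, 1]
print(round(r, 4))  # approximately 0.9986, matching the reported value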

Comment

Q7: When computing the efficiency metric during evaluation, did you filter out incorrect code snippets, or did you only consider correct ones?

This is an excellent suggestion. Our original metric, which includes all generated code, is designed to provide a normalized comparison across different models, as conditioning on correctness makes the sample set for each model different. However, we agree that evaluating absolute efficiency gains on functionally correct solutions offers a more isolated view for a single model efficiency optimization.

We aggregated the absolute efficiency gains on the subset of functionally correct solutions generated by Afterburner-GRPO. This analysis offers a direct view of pure efficiency improvement.

| Iteration | 0 | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|
| Pass | 140 | 153 | 158 | 168 | 186 |
| Time (s) | 14308 | 12903 | 11690 | 9354 | 9390 |
| Time Average (s) | 102.20 | 84.33 | 73.99 | 55.68 | 50.48 |
| Memory (GB) | 157 | 156 | 141 | 128 | 119 |
| Memory Average (GB) | 1.12 | 1.02 | 0.89 | 0.76 | 0.64 |

We sincerely thank you for your constructive feedback, which has helped us strengthen the quality of our work. We will incorporate these new results and discussions into our revision, including details on the Venus dataset, model generalization performance, and different perspectives on model evaluation.

If you have any other questions, we are happy to provide further clarification needed.

Final Decision

The paper proposes a test-time iterative optimization framework for code generation, where an LLM refines code in a closed-loop system using empirical performance feedback from an execution sandbox. Three training strategies are explored: SFT, DPO, and GRPO. The GRPO variant with execution feedback yields the strongest gains, improving both pass@1 and the likelihood of outperforming human submissions in efficiency.

Strengths:

  • Addresses an important and impactful problem (ovks, xncp, z3gd)
  • Implements a closed-loop pipeline that integrates code generation, efficiency feedback collection, reward formulation, and iterative refinement (xncp, z3gd)
  • Promising evaluation results with comprehensive experiments across training strategies (xncp, ovks, z3gd, UfKJ)

Weaknesses:

  • The rationale behind the evaluation metric needs further discussion, though the authors clarified during rebuttal (ovks)
  • Some dataset details were missing and the authors reasonably addressed in rebuttal (ovks)
  • Open questions remain on scaling the method to larger models (UfKJ)

This work presents a well-motivated and carefully evaluated approach for integrating performance/efficiency feedback into code generation via iterative optimization. While some methodological details and scaling considerations remain open, the contribution is solid and the findings are empirically supported.