RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning
We apply reinforcement learning to ground LLMs in execution feedback for effective multi-turn code generation.
Abstract
Reviews and Discussion
This paper presents a novel method to enhance LLM code synthesis through the use of execution feedback. In contrast to existing works that use unit test feedback as binary indicators or fractions of unit test pass rates to improve code generation, the feedback in this work is provided as language descriptions, including error messages and unit test results. Additionally, the proposed method is iterative, with the model trained to self-correct its previous responses based on received execution feedback. Experimental results indicate that this approach significantly improves LLM code synthesis capabilities.
Strengths
- The paper is well-written and clearly presented. I appreciate the clarity and organization in the presentation of the results.
- The proposed method appears to be effective, substantially enhancing code generation performance.
Weaknesses
- Limited Practical Application: While effective, the proposed method seems highly constrained in its practical application. The reliance on unit tests for feedback is a significant limitation, as generating accurate unit tests for arbitrary user prompts is often as challenging as solving the problems themselves. This confines the method to OJ-style (online judge) problems, where unit tests are readily available, and it cannot be trivially extended to more general user scenarios where unit tests are not available.
- Advantage Computation: The way the authors compute the advantages seems odd. The authors state that the action space is defined at the token level, yet the advantage for every token in the same turn is identical. In that case, how is the critic updated? Can the authors explain, intuitively or empirically, why they made this choice?
- Effectiveness of Multi-turn Feedback: The empirical success of multi-turn feedback is intriguing. It is unclear whether the iterative nature of the algorithm or the unit tests' guidance contributes more to this success. An ablation study could clarify this by training RLEF iteratively while providing intermediate/final feedback with only numerical values, avoiding specific details on failed unit tests and expected outputs. For example, assigning smaller reward signals at the end of each turn and a larger signal at the end of the episode, based on whether the generated code is correct or on the fraction of unit tests passed, could yield insights; a hypothetical instantiation is sketched below.
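For concreteness, one hypothetical instantiation of such a reward scheme (the form and the scaling factor $\alpha$ are placeholders of mine, not taken from the paper) could be:

$$
r_t =
\begin{cases}
\alpha \cdot \dfrac{\#\{\text{unit tests passed at turn } t\}}{\#\{\text{unit tests}\}} & \text{if } t < T,\\[6pt]
\mathbb{1}\left[\text{final solution passes all unit tests}\right] & \text{if } t = T,
\end{cases}
\qquad 0 < \alpha \ll 1,
$$

where the model is never shown which tests failed or what the expected outputs were. Comparing RLEF against such a variant would isolate the contribution of the textual execution feedback itself.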
Overall Assessment:
While the method shows promising empirical results, I am inclined to reject this submission due to its limited applicability and some missing crucial details and ablation results. However, I am open to revising my score if the authors can provide evidence demonstrating the method's applicability to real-world user scenarios and offer more detailed explanations about the algorithm's design and efficacy.
Questions
N/A
We thank the reviewer for their valuable comments, and are grateful that they acknowledge the effectiveness of our method and "appreciate the clarity and organization in the presentation of the results".
We'd like to respond to the weaknesses pointed out by the reviewer individually:
- Limited Practical Application: We emphasize that our paper is chiefly concerned with grounding LLMs in execution feedback, which we identify as a weakness in current state-of-the-art models. We argue that addressing this weakness is crucial when deploying LLMs as agents interacting with computer systems, which is an emerging and promising area. Extensions to applications without execution feedback are definitely interesting but clearly outside the scope of our submission. We will, however, update our recommendations for possible future work to encompass such scenarios.
- Advantage Computation: We train the critic LLM on a per-turn level, i.e., given the context `c_t` (the prompt for the response `a_t`), we train it to predict the reward-to-go at turn `t`. We will extend Appendix A.1 with the value function loss (see the attached image).
- Effectiveness of Multi-turn Feedback: We believe the experiments in Section 3.3 with random execution feedback support our conclusion (L495ff) that better initial solutions and an increased diversity of responses are a major contributor to the performance gains, but also that we successfully obtained grounding in execution feedback, as it increases the reliability of arriving at correct solutions (pass@1 vs. pass@10 in Figure 4(a)). Furthermore, we observe that the 70B model can improve its scores significantly when provided access to feedback from private tests (Appendix B.2).
Regarding the additional experiments suggested by the reviewer, we would like to note that we end an episode once public tests are passing; hence, the model would internalize that any re-prompting implies that previous solutions are considered wrong. Adjusting the reward signal as proposed would change the optimal policy to output replies that fail one of the public tests and save the correct response for the final turn. However, we will add a further 8B ablation where we replace execution feedback with "Give it another try", disable early stopping based on public test results, and evaluate the last response only (see global response).
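For clarity, here is a minimal sketch of the rollout loop we describe above (illustrative only; `generate`, `run_public_tests`, and `format_feedback` are placeholder helpers, not our actual implementation):

```python
def rollout(model, problem, max_turns=3):
    dialog = [problem.prompt]
    code = None
    for turn in range(max_turns):
        code = model.generate(dialog)             # propose a solution for this turn
        result = run_public_tests(code, problem)  # execute against the public tests
        if result.all_passed:
            return code, turn + 1                 # episode ends as soon as public tests pass
        # otherwise, append the failed attempt and its execution feedback and re-prompt
        dialog += [code, format_feedback(result)]
    return code, max_turns                        # the final response is judged on private tests
```

Because later turns are only ever reached after a public-test failure, attaching per-turn rewards as suggested would reward deliberately failing early turns.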
In this paper, the authors introduce a new approach for improving the coding abilities of LLMs that uses execution feedback on generated code, via reinforcement learning, to enhance performance in code synthesis tasks. The core insight is that LLMs can effectively follow user instructions while leveraging execution feedback to identify and correct errors in the generated code. The authors conceptualize code generation as an interactive task, proposing an environment where LLM-based agents generate code and receive real-time execution feedback. Their reinforcement-learning-based method, termed RLEF, optimizes the performance of LLMs within this interactive framework. The results demonstrate that RLEF improves the correctness of the generated code compared to existing methodologies.
Strengths
- The method is very intuitive, and modeling code generation as an interactive reinforcement learning task is reasonable.
- The experimental results on training language models for coding are strong.
Weaknesses
- Considering previous methods for training LLM coding and reasoning capabilities with reinforcement learning [1][2], the innovation is limited, as they also used some specially designed reward functions.
- There is no ablation on the reward function, which is important for the paper.
- There is a lack of experiments on more models to verify the generality of the method.
- It lacks some related work on training LLMs for reasoning and coding with reinforcement learning [1][2].
[1] CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning.
[2] Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning.
Questions
See the weaknesses above.
We thank the reviewer for suggesting additional related work. "Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning" is an interesting work and indeed very closely related to "StepCoder" mentioned by reviewer QLXX in that both use ground truth solutions to form a curriculum to help with exploration. We maintain that work on curricula for RL fine-tuning is orthogonal to our submission, though. Our work does not specifically target LLM reasoning capabilities.
We interpret the remark regarding the usage of "some specially designed reward functions" as using execution as an RL reward. For context, we would like to add that agents situated in a simulated environment are the natural domain of RL algorithms. Our reward function is purposefully kept simple, and tuning it could potentially yield further improvements but is outside of the scope of this submission.
Regarding the reported "lack of experiments on more models to verify the generality of the method", we perform experiments on both 8B and 70B Llama 3.1 models, as well as the weaker 8B Llama 3.0 model. While we sympathize with calls for experimental validation on further models and benchmarks, we maintain that the current set of results clearly demonstrates the efficacy of our method, and we hope to inspire further work on grounding LLMs for domains where it is crucial to take execution feedback into account.
This paper introduces a novel approach aimed at enhancing the performance of Large Language Models (LLMs) in multi-turn code generation tasks. The primary challenge addressed is the difficulty LLMs face in leveraging execution feedback for iterative code improvement, a crucial aspect of achieving reliable outcomes in multi-step code generation. The authors propose an end-to-end reinforcement learning method designed to teach LLMs to effectively use execution feedback. The paper makes three key contributions: (1) the development of a reinforcement learning framework that grounds LLMs in execution feedback, (2) demonstrated improvements in performance on code synthesis tasks, and (3) practical benefits including significant sample reduction and enhanced utilization of multi-step feedback.
Strengths
- The method effectively incorporates self-correction and self-refinement techniques into the RLHF framework, leading to significant sample reduction.
- It demonstrates enhanced utilization of multi-step feedback, contributing to improved performance and efficiency in iterative code generation tasks.
- The approach addresses a critical limitation of current LLMs by successfully leveraging execution feedback, providing a practical and innovative solution for complex code synthesis problems.
Weaknesses
- The method's novelty is limited, as it mainly adapts self-correction and self-refinement techniques to the RLHF framework. Similar approaches have already been explored, such as Anthropic's work on iterative code correction [1] and DeepMind's adaptation of self-correction in RLHF for improved training and inference efficiency [2]. Additionally, the core idea of using compiler feedback as a reward signal for RLHF lacks sufficient comparative analysis with contemporary methods that integrate RLHF with compiler feedback, such as RLTF, Execution-based Code Generation using Deep Reinforcement Learning, CodeRL, StepCoder, and B-Coder.
- The experiments are limited to the CodeContest dataset, with a relatively small number of validation and test samples. The paper should validate its effectiveness on more datasets, such as APPS, which is commonly used for training in RLHF+compiler feedback research.
- The paper claims that the multi-turn iterative approach of SFT performs significantly worse than RL (as mentioned in the appendix). However, self-improvement methods like critic+SFT have proven effective in code tasks. Providing the SFT and RL code as open-source could help validate this experimental conclusion.
[1] Improving Code Generation by Training with Natural Language Feedback
[2] Training Language Models to Self-Correct via Reinforcement Learning
Questions
In the proposed method, when an incorrect solution is generated, the model immediately moves on to the next iteration. Could the authors explore the effect of incorporating feedback on the incorrect code before proceeding to the next iteration? How would this impact the experimental results?
We thank the reviewer for their valuable feedback and appreciate them acknowledging that our method addresses a "critical limitation of current LLMs", and highlighting the "significant sample reduction" and "improved performance and efficiency" that we achieve. We respond to the raised weaknesses as follows:
- We are thankful for the additional related work pointed out by the reviewer, and would like to draw the following comparisons:
- "CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning" is discussed in the related work section (L457ff); however, our manuscript misinterpreted this work and we will revise this part. While they train a single-turn model with RL, they also train two further models: a "critic" to gauge correctness of programs and a "repair" model that maps faulty code to respective ground truth solutions. We agree that this is a particularly relevant approach to our work, and we will include the combination of a single-turn model followed by a dedicated repair model as another baseline (see global response). We also point out that our RLEF method achieves high multi-turn performance with a single model only.
- "Improving Code Generation by Training with Natural Feedback" targets the integration of feedback from humans, which is relevant for user-facing chat applications such as ChatGPT but less so when employing LLMs as agents.
- "Training Language Models to Self-Correct via Reinforcement Learning" (SCoRe) is discussed in the related work (L468ff); this is concurrent work released 9 days prior to the submission deadline. First, their method does not incorporate execution feedback and instead relies on the model to detect its own mistakes. We believe that when execution feedback is available, such as in code generation or in many "agentic" tasks, it should be integrated for improved accuracy. Second, SCoRe requires two separate training stages whereas our RLEF method consists of a single training stage
- The main contribution "RLTF: Reinforcement Learning from Unit Test Feedback" over CodeRL (see below) is a fine-grained reward function; our approach achieves significant gains with a simpler reward function and we regard the tuning of reward functions as orthogonal to our work.
- "Execution-based Code Generation using Deep Reinforcement Learning" is discussed in the related work (L460); they include a dependence on ground truth solutions which are not required for our method
- "StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback" introduces both a curriculum based on ground truth solutions (which we do not require) and masking of code not subject to execution during optimization. We regard both techniques as orthogonal to our work, i.e., they could potentially yield further improvements when applied to our multi-turn setting.
- "B-Coder: Value-Based Reinforcement Learning for Program Synthesis" describes a value-based RL method for LLMs, and uses the public CodeRL model in their experiments.
We will update the related work section of the paper to encompass all references above.
- We point out that the CodeContests training set and the APPS test set have overlaps (cf. https://arxiv.org/abs/2203.07814, Appendix C.6), so we cannot evaluate our existing models on APPS. Furthermore, APPS does not advertise public test cases explicitly, which would render experiments incompatible with prior work.
- Our observation that SFT is less effective compared to RL is consistent with other findings in the literature, e.g., https://arxiv.org/abs/2404.10719, which shows PPO > SFT and PPO > DPO on CodeContests. We also note that in our SFT experiment we use rollouts of the stock Llama 3.1 70B Instruct model, since there is no ground truth data available for iterative code synthesis.
The paper introduces RLEF, a reinforcement learning framework that enhances code synthesis for large language models (LLMs) by utilizing execution feedback during training. RLEF allows models to iteratively refine code solutions based on execution feedback from test cases, significantly improving solve rates on competitive programming tasks and reducing sample requirements. The authors benchmark RLEF with Llama 3 models on CodeContests and demonstrate state-of-the-art performance, with generalizability to HumanEval+ and MBPP+ benchmarks.
Strengths
- The experimental results show general improvement when applying RLEF with Llama 3 series models on the CodeContests dataset, demonstrating the effectiveness of the proposed method.
- The authors demonstrate that Llama 3.1 models trained on CodeContests with RLEF can generalize to other datasets like HumanEval+ and MBPP+, especially the 70B model. This generalization capability makes the proposed method more appealing and suggests broader potential applications.
- The behavioral analysis post-RLEF training provides valuable insights into how the model learns to handle feedback differently. The detailed error analysis (e.g., fewer repetitions, more targeted edits, reduced reliance on random sampling) gives empirical weight to claims about the improved robustness of RLEF-trained models in the multi-turn setup.
Weaknesses
- The scientific novelty of RLEF is somewhat limited. Although it extends previous work by introducing a multi-turn setup, the core concept of using unit test feedback as a reward signal has already been proposed in the literature, such as in RLTF [1], which uses unit test feedback within an online RL framework for code generation. The primary advancement in RLEF lies in iterating on this approach by incorporating execution feedback as input text across turns, making this more of an empirical extension than a conceptual breakthrough. To clarify the novel contributions, the authors could explicitly outline how RLEF builds upon and differs from RLTF and related works. Including a citation to RLTF would also better contextualize RLEF within the existing literature, helping to position its contributions more clearly.
- While the experiments on CodeContests show the effectiveness of RLEF, high-quality unit tests like those provided in CodeContests are hard to obtain in practice or from other data sources. The authors didn't study how the quality of the unit tests (beyond just random feedback) would affect the effectiveness of RLEF. For example, if the public unit tests only cover simple input cases, the execution feedback might struggle to find bugs, imperfections, or inefficiencies in the code, thus providing less useful feedback. This limitation might hinder the application range of this method. It would strengthen the paper to include an analysis of RLEF's sensitivity to unit test quality. The authors could consider testing the method with varying qualities of unit tests, or provide a discussion of approaches to generate high-quality tests in real-world environments where comprehensive unit tests are scarce. Specific experiments that address this limitation would add valuable depth to the study.
- The paper's presentation, particularly in the Method section, needs improvement. (a) The organization is overly condensed, and crucial methodological details are packed into dense paragraphs, making it challenging for readers unfamiliar with PPO or RL for code generation. Key adjustments in the RL setup are mentioned briefly without adequate justification, making reproduction difficult and potentially deterring readers from fully engaging with the methodology. Expanding on crucial areas, particularly the PPO implementation details and any task-specific adjustments, could improve clarity. Breaking down dense sections into distinct, digestible subsections would also enhance readability. (b) Additionally, the primary results section compares models across different n@k metrics with varying n and k values, which may not be intuitive for readers unfamiliar with this metric. A more consistent comparison framework or additional explanation of these metrics would improve clarity.
[1] Liu, J., Zhu, Y., Xiao, K., Fu, Q., Han, X., Wei, Y., & Ye, D. (2023). RLTF: Reinforcement Learning from Unit Test Feedback. Transactions on Machine Learning Research.
Questions
- How robust is RLEF to feedback of varying quality beyond random feedback? For instance, how does the model perform if unit tests are incomplete or only cover trivial cases? How does the performance of RLEF change with different types or amounts of execution feedback? Is there an optimal amount or type of feedback for maximizing performance improvements?
- The paper sets a fixed turn limit in the multi-turn setup. How does this choice affect model performance, and could RLEF benefit from a dynamic turn limit based on feedback content or problem complexity? How sensitive is the model to the number of feedback turns, particularly with respect to diminishing returns after a certain number of turns?
- The paper mentions using a hybrid approach for value estimation (response-level) and policy optimization (token-level). Can you elaborate on why this approach works better than optimizing both at either the turn or token level?
We appreciate the insightful feedback of the reviewer and are happy that they recognize the effectiveness of our method and the thoroughness of our experimental analysis. Regarding the stated weaknesses of our submission, we would like to respond as follows:
- We maintain that our method is not strictly an empirical extension of previous work and refer to our global response, where we outline the planned changes to our related work section. In summary, the novelty of our paper is the combination of optimization with RL over multiple turns and the integration of execution feedback in a conceptually simple and highly effective framework, requiring a single model and training pass only. As stated in the global response, we will add results with a CodeRL-like repair model (the same approach used in the later RLTF paper) as an additional baseline.
- We agree with the reviewer that the quality of public unit tests (i.e., those used for feedback) can have a significant effect on inference-time efficiency. Notably, the public tests in CodeContests in many cases cover only example inputs and outputs; they are usually already provided in the problem description. In Appendix B.2 we show the effect of using the more extensive private tests for execution feedback. Our original submission contained an outdated result in this section, which we will update. For the 70B model we included in Table 1, feedback from private tests improves 1@3 solve rates from 37.5 to 41.2 and from 40.1 to 41.2 on the validation and test set, respectively.
As the reviewer points out, existing and future work on automatic test generation is highly relevant to our work and could help render our method more general, i.e., in cases where no tests are provided by users at inference time. While such use cases are not the main target of our submission, we will add a corresponding note to the Conclusion/Limitations section.
- (a) We view our design decisions regarding the action space for policy and value learning (L162f) as implementation details that require consideration and might be subject to the specific application when training LLMs in multi-turn dialogs with RL. As such we aimed to clearly explain our choices, but also feel that they are not central to the overall method. We will complete A.1 with details regarding the value function loss (see response to kdLS below).
(b) We justify the use of the n@k metric in 3.1 (L196ff.) and indeed we argue that using standard pass@1 metrics when allowing for multiple generations per solution is a key shortcoming of prior work. The legend of our main results (Table 1) highlights the key fact, namely that k corresponds to the sample budget.
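For illustration, n@k can be estimated per problem by repeatedly subsampling k generations and selecting n submissions; the sketch below uses a simple public-test filter for submission selection and is not our exact estimator (1@3 corresponds to n=1, k=3):

```python
import random

def n_at_k(candidates, n, k, trials=10_000):
    """candidates: list of (passes_public, passes_private) pairs, one per generation."""
    assert len(candidates) >= k
    solved = 0
    for _ in range(trials):
        budget = random.sample(candidates, k)  # spend a budget of k generations
        # prefer candidates that pass the public tests when choosing the n submissions
        submissions = sorted(budget, key=lambda c: c[0], reverse=True)[:n]
        solved += any(passes_private for _, passes_private in submissions)
    return solved / trials
```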
We are further happy to answer the reviewer's questions as follows:
- Robustness to test cases: As pointed out above, the public test cases of CodeContests typically cover only simple cases. We think that an "optimal amount" of feedback would be highly dependent on the concrete model and benchmark, and we would thus refrain from giving specific recommendations. In practical applications, there are also further trade-offs to consider, such as the number of turns and a model's context length.
- Effect of turn limit: We would like to point out that we employ a dynamic turn limit in that successful execution of public tests will end the dialog with the LLM. In Figure 4, we show that the number of turns can be increased at inference time. We believe that a focus on compute budget is crucial, though, and in Figure 4(b) we show that increasing the inference turn limit to 5 yields compute-optimal results.
- Hybrid approach to value estimation: We found in preliminary experiments that optimizing the policy at the turn level (i.e., PPO with `pi(a_t | c_t) = \prod_{i=1}^{n} pi(a_{t,i} | c_t, a_{t,1}, ..., a_{t,i-1})` for a response `a_t` consisting of `n` tokens) resulted in poor performance. With token-level values and rewards, we found that the KL penalty biases the model toward unreasonably short generations in intermediate turns. Averaging the KL penalty as described in L170ff is an effective remedy and invites modeling the reward at the turn level, i.e., with a per-turn reward. We will add an ablation to the paper where we use the current per-turn rewards but learn a per-token value function that predicts the same return-to-go for each token of a response.
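To make this concrete, here is a rough sketch of the setup described above (the notation is illustrative and not necessarily our exact formulation; the precise value function loss will be given in the updated Appendix A.1):

$$
\mathcal{L}_V(\phi) = \mathbb{E}_t\!\left[\left(V_\phi(c_t) - \hat{R}_t\right)^2\right],
\qquad
\hat{R}_t = \sum_{t' \geq t} \gamma^{\,t'-t}\left(r_{t'} - \frac{\beta}{n_{t'}} \sum_{j=1}^{n_{t'}} \mathrm{KL}_{t',j}\right),
$$

where $r_{t'}$ is the per-turn reward, $n_{t'}$ the number of tokens in response $a_{t'}$, $\mathrm{KL}_{t',j}$ the per-token KL penalty that is averaged over the response, and $\gamma$, $\beta$ are the usual discount factor and KL coefficient. Every token of $a_t$ then shares the same advantage estimate, e.g., $\hat{A}_{t,i} = \hat{R}_t - V_\phi(c_t)$ for $i = 1, \dots, n_t$, which enters the standard token-level PPO clipped objective.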
Thank you for the detailed response and for addressing some of my concerns through additional supporting experiments and explanations. While I appreciate the authors' efforts, several key issues remain unresolved, which I elaborate on below.
Scientific Novelty
The authors' rebuttal states that "the novelty of our paper is the combination of optimization with RL over multiple turns and integration of execution feedback in a conceptually simple and highly effective framework." However, this is essentially a rephrasing of my original comment: "it extends previous work by introducing a multi-turn setup, the core concept of using unit test feedback as a reward signal has already been proposed in the literature, such as in RLTF, which uses unit test feedback within an online RL framework for code generation."
To clarify, I am not claiming that RLEF is identical to RLTF or CodeRL. However, the incremental nature of the contribution limits its scientific novelty. This perspective aligns with similar concerns raised by Reviewer QLXX and Reviewer kiFK. Without further justification or clear differentiation from existing approaches, the work does not demonstrate a sufficient conceptual leap to address this limitation.
Dependence on High-Quality Unit Tests
I appreciate the additional experiments using private tests from the CodeContests dataset, which demonstrate the benefits of having more comprehensive unit tests at inference time. These results, however, further confirm my concern: the method's reliance on high-quality unit tests significantly limits its applicability. As noted in my original review, "if the public unit tests only cover simple input cases, the execution feedback might struggle to find bugs, imperfections, or inefficiencies in the code, thus providing less useful feedback. This limitation might hinder the application range of this method."
The authors acknowledge the relevance of automatic test generation but stop short of proposing a mechanism to address this limitation. Including a method to generate unit tests would alleviate this weakness and enhance both the novelty and practicality of the proposed approach.
Presentation and Reproducibility
The revised paper has not sufficiently addressed my concerns about the clarity and organization of the Method section. In the revised paper, I still do not see sufficient effort made to elaborate on their methods in a clearer and more reader-friendly manner. This lack of clarity may significantly hinder the reproducibility of the paper. Given the current state of the method section, I am not confident that I could reproduce the method based solely on the information provided in the paper. Moreover, the authors did not mention any plans to open-source their codebase, further compounding this issue. Thus, my concerns regarding the presentation of the method and its reproducibility remain unresolved.
While I appreciate the authors' efforts to address some concerns, the key issues regarding scientific novelty, reliance on high-quality unit tests, and presentation remain unresolved. These limitations affect both the conceptual strength and the practical applicability of the work. As a result, I maintain my original score for this review.
We thank all reviewers for their valuable feedback, and greatly appreciate that they find our submission "addresses a critical limitation of current LLMs" (QLXX) by "effectively incorporating self-correction and self-refinement techniques" (QLXX). They point out that the paper is "well-written and clearly presented" (kdLS), and they acknowledge that our detailed analysis "provides valuable insights into how the model learns to handle feedback differently" (UaD9). There is consensus among the reviewers regarding the effectiveness of our RLEF method in enhancing LLM coding capabilities.
Among the mentioned weaknesses, we identify two overarching themes: first, reviewers QLXX, UaD9, and kiFK point to a limited novelty of our method and request that we better position it in the existing literature on code generation (with RL) and self-correction; second, reviewers ask for additional ablations and implementation details to validate our design decisions.
We would first like to highlight that our method integrates both execution feedback and execution reward in a conceptually simple and highly effective RL framework. To our knowledge, this has not been achieved to date. We will strengthen our submission as follows to make this clear:
- We will add a further baseline where we train a "repair" model for the single-turn 8B ablation model, encapsulating the core improvement technique in CodeRL. This paper has been mentioned by reviewers QLXX and kiFK, and the same technique is also used in RLTF (reviewer UaD9).
With this approach, we can improve the 1@3 solve rates from 10.2 to 14.8 on valid and from 10.9 to 12.6 on test (one solution from the RL-trained model, two solutions from the repair model; details will follow with the paper update), but it still falls short of our multi-turn RLEF model (17.2, 16.0).
- We will update our related work section to better reflect the relationship to prior work. In particular, we misrepresented CodeRL in our original write-up and will correct this. At the same time, we also point out that our existing related work section already discusses several of the related works cited by the reviewers as missing ("Training Language Models to Self-Correct via Reinforcement Learning" and "Execution-based Code Generation using Deep Reinforcement Learning", both noted by reviewer QLXX).
- We will add two further ablations with the 8B Llama 3.1 model to validate our design decisions:
- A token-level value function (reviewer UaD9), which achieves a 1@3 solve rate of 13.1 on valid and 13.7 on test, i.e., below our original training recipe (17.2, 16.0).
- Multi-turn training but without execution feedback: we replace the execution feedback message with "Give it another try", disable early stopping based on public test results, and evaluate the last response only (reviewer kdLS). This yields a 1@3 solve rate of 12.2 on valid and 10.9 on test (updated from the initially reported 11.0 and 10.4), i.e., higher on the validation set compared to the initial instruct model (8.9) and within variance on the test set (from 10.2).
We respond to individual reviews in more detail in dedicated comments and invite all reviewers to reconsider their scores.
We have now updated the submission with a new version of the manuscript which includes the changes mentioned above. We added the new results (Table 4b in Appendix B), referred to from the end of Section 4.3. The fourth and fifth paragraphs of the related work section have been updated (L460ff, L478ff).
This paper proposes a framework for using reinforcement learning to train code models that can improve their outputs based on feedback from unit tests. The main limitations raised by reviewers are (i) the limited novelty compared to existing systems such as RLTF (and others), (ii) the reliance on effective unit tests, and (iii) the limited experimental evaluation. The paper requires significant revision before it might be considered for acceptance.
Additional Comments on Reviewer Discussion
There was limited discussion during the rebuttal period; however, reviewers engaged in discussion were not convinced by the rebuttal provided by the authors.
Reject