PaperHub
ICLR 2024 · Decision: Rejected · 4 reviewers
Average rating: 4.5 / 10 (individual scores: 3, 5, 5, 5; min 3, max 5, std dev 0.9)
Average confidence: 4.3

Grounding Code Generation with Input-Output Specifications

Submitted: 2023-09-23 · Updated: 2024-02-11
TL;DR

A new framework for instruction fine-tuning large language models of code in which the training is guided by execution-based I/O specifications.

Abstract

Keywords

code generation · I/O specifications · instruction tuning

Reviews and Discussion

Review (Rating: 3)

The paper aims to enhance the capability of LLMs by instruction tuning on datasets with I/O specifications. Namely, it adds prompt context that specifies the required I/O behavior and leverages that to instruction-tune a model to generate code that satisfies those requirements.

Strengths

Enhancing code generation from natural language is an important and timely topic.

Weaknesses

I am confused about the motivation of the work. It mentions that it wants to enhance the code model to generate more correct (not merely executable) code. From the example given, i.e., Figure 1, the so-called user intention seems to be the description of the column names, etc., which is necessary information you need to provide in the prompt to accomplish such tasks. Otherwise, the model surely will not understand the input and cannot output the variables as needed.

So, I was thinking that the baseline methods are too naive, since they do not even provide such information to the LLM when generating code. But then I found that there are no baseline methods. Now I am quite confused: where does this motivation come from? Isn't it just poor prompting skills that the paper assumes users have, which is not realistic? Is there evidence showing that users do not want to provide enough information to the LLM yet still expect the desired results?

If you are talking about understanding I/O constraints, e.g., relationships among input data, code generation from examples is a type of program synthesis, and there are many baseline works.

If the so-called I/O specifications are necessary I/O descriptions, I do not see why users would want to exclude them. Also, various plugins try to infer data types and descriptions from user prompts and given files. For the I/O specifications shown in Table 1, I am very confused why users would not include them in the prompt description if, for example, they want the output variable to be named df.

Questions

What is your technical contribution?

Comment

Thank you for your review and the opportunity to clarify the aspects of our work.

Can you clarify the motivation behind GIFT4Code? The red I/O spec in Example 1 seems to be what we should expect from the user prompt.

The primary motivation behind GIFT4Code is to address a prevalent challenge in code generation with LLMs: their occasional failure to accurately follow user-provided specifications, even when these specifications are explicitly stated in the prompt. Example 1 was intended to demonstrate this issue. Despite receiving specific I/O specifications (e.g., the red I/O in Example 1), LLMs can still generate incorrect or misaligned code. This discrepancy is a significant issue in practice, where developers rely on the model to closely follow the provided specifications for correct code generation. GIFT4Code is designed to mitigate this gap by enhancing the model’s ability to follow complex I/O specifications.

Missing necessary information (column names) in the prompt leads to naive baselines.

We acknowledge the importance of providing essential information, such as column names, in the prompt for accurate code generation. In Table 2, we provided the basic input dataframe schema (the description of the column names) to all models, including Code LLM (no spec.).

The few-shot prompting section is also considered a baseline. Each row in that section has access to different types of specifications in the prompt. For example, the “Code LLM + I/O Summary” row means the prompt includes a natural-language summary of the I/O specifications. Therefore, the comparison between the few-shot prompting section (baselines) and the synthetic-data fine-tuning section is fair.

What is the technical contribution of GIFT4Code, particularly in comparison to program synthesis from examples and existing plugins that infer data types and descriptions?

GIFT4Code aims to improve the LLM’s ability to follow complex I/O specifications by fine-tuning on a synthetic dataset produced by the model itself, using execution-derived feedback as a learning signal in the form of I/O specifications. We showed that GIFT4Code is effective across three types of specifications. Compared to programming by examples, our model-generated specifications extend concrete examples with more abstract, natural-language I/O summaries. The experiments showed that GIFT4Code performs better when the prompts incorporate the I/O summary rather than I/O examples. We also considered type-based specifications, which likewise underperform the I/O summary.
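To make the three specification types concrete, here is a purely illustrative Python sketch for a toy task; the wording and formats below are our own paraphrase for readability, not the paper's actual prompt templates.

```python
# Hypothetical task: "For each country, keep its most populous city."
# Three ways the desired output could be specified (illustrative only):

# 1. Type-based specification: only the type/schema of the output.
type_spec = "Output: a pandas DataFrame `df` with columns ['country', 'city', 'population']"

# 2. I/O examples: a small concrete input/output pair.
io_example_spec = (
    "Input rows:  ('France', 'Paris', 2_100_000), ('France', 'Lyon', 510_000)\n"
    "Output rows: ('France', 'Paris', 2_100_000)"
)

# 3. Natural-language I/O summary: an abstract description of the output.
io_summary_spec = (
    "The output keeps one row per country, namely the city with the "
    "largest population, together with its population value."
)
```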

We appreciate the thoughtful critique provided in the review. It seems that some aspects of our work were not fully conveyed in the submission, leading to a misunderstanding. We hope our work can be reevaluated with the clarifications provided.

Review (Rating: 5)

Large language models (LLMs) have demonstrated significant potential in code generation. However, the code generated by these models occasionally deviates from the user’s intended outcome, resulting in executable but incorrect code. To address this problem, the authors propose Gift4Code (grounded instruction fine-tuning for code). Gift4Code can access both a natural-language intent and an additional I/O specification. The experimental results show the effectiveness of Gift4Code.

Strengths

  1. The authors focus on a very important area.
  2. The improvements seem good.

Weaknesses

  1. The idea is not novel enough.

Questions

The authors focus on a very important area and the improvements seem good. However, I have a concern.

The authors claim that the code generated by these models occasionally deviates from the user’s intended outcome. However, CodeT also considers the test case, which is similar to the specification. Further, the overall idea of this paper is similar to CodeT, which also uses a model to generate test cases. I know the differences between the specification and the test case. However, the overall frameworks of these two approaches are similar, and the authors even fail to cite CodeT. If there is any difference, please add more details about it.

[1] Chen, B., Zhang, F., Nguyen, A., Zan, D., Lin, Z., Lou, J. G., & Chen, W. (2022, September). CodeT: Code Generation with Generated Tests. In The Eleventh International Conference on Learning Representations.

Comment

Thank you for your review and the opportunity to clarify the aspects of our work.

How is GIFT4Code fundamentally different from approaches like CodeT that also consider test cases similar to specifications? Can you elaborate on the differences between your approach and existing methods such as CodeT?

Thank you for pointing out that CodeT should be considered as related work. CodeT focuses on using the generated tests to conduct dual execution agreement to filter the generated solutions at inference time, when only a few candidates can be submitted. This is evident from the use of the pass@k@n metric, which selects the best k samples from a total of n samples. Essentially, CodeT's goal is to refine the inference process by optimizing the selection of generated code samples. This approach is fundamentally different from GIFT4Code’s objective.

GIFT4Code aims to enhance the base model’s ability to follow complex I/O specifications. This difference in approach is reflected in our use of the pass@n metric. As a comparison, CodeT does not improve pass@n, as noted in Table 2 of their paper: “We do not show CODET pass@100, since it is the same as the baseline pass@100”.
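For reference, the sketch below shows the standard unbiased pass@k estimator (introduced with Codex); pass@n is the special case k = n, i.e., whether any of the n samples is correct. CodeT's pass@k@n additionally re-ranks the n samples using generated tests, a step not reproduced here.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples chosen
    from n generations (of which c are correct) passes. For k == n this
    reduces to pass@n, which is 1.0 whenever c > 0."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 samples, 3 of them correct.
print(pass_at_k(20, 3, 1))   # ≈ 0.15
print(pass_at_k(20, 3, 20))  # 1.0 — pass@n only asks whether any sample is correct
```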

Moreover, in the realm of data science, the task of generating meaningful tests, as required by CodeT, can be particularly challenging. For instance, generating tests for a task like “What are the most populous cities in each country” (as shown in Figure 1) may not be feasible or meaningful. This complexity further distinguishes GIFT4Code's approach.

We will make sure to include a comparison with works that leverage execution or tests for filtering or clustering generated solutions at inference time, such as [1, 2], in the revision. We hope this response addresses your concerns.

[1]: Chen, Bei et al. “CodeT: Code Generation with Generated Tests.” ArXiv abs/2207.10397 (2022): n. pag.

[2]: Zhang, Shun et al. “Planning with Large Language Models for Code Generation.” ICLR 2023.

Comment

I believe that CodeT can be applied in scenarios similar to those discussed in the paper. The rationale is that the test-case generation feature in CodeT can be viewed as a form of multi-task learning, aimed at enhancing performance. This aspect of multi-task learning in CodeT parallels the approach described in the paper, suggesting a broader applicability of these techniques in similar contexts. The primary distinction, however, should lie not in specific metrics such as pass@k@n versus pass@1@n, but rather in the underlying conceptual framework. Both approaches share a common foundation, yet CodeT is not cited, although they diverge in their specific implementations and considerations. Please add more details on the differences between these two approaches.

Comment

We appreciate your insight regarding the multi-task learning aspect of CodeT, where the LLM is prompted to generate an auxiliary task such as test cases to enhance the main task such as code generation. This perspective indeed draws connections between GIFT4Code and CodeT.

The conceptual difference is that GIFT4Code addresses a different, yet equally important, challenge: ensuring that LLMs follow the auxiliary tasks, such as specifications or tests, that are already provided in the prompt. This focus is fundamentally different from CodeT, which generates auxiliary tasks (tests) as part of its sampling process. In this context, GIFT4Code could potentially address an issue that might arise in CodeT, where the LLM fails to generate code satisfying its own previously generated test cases. Hence, we see GIFT4Code and CodeT as complementary approaches, each addressing different aspects of LLM-based code generation.

While we agree that the multi-task learning framework links CodeT and GIFT4Code, we believe that categorizing GIFT4Code as lacking novelty based solely on this broader perspective is not entirely fair. The multi-task learning formulation is a common theme in LLM research, owing to LLMs’ emergent abilities, as represented by chain-of-thought prompting, which first generates reasoning as an auxiliary task and then solves the main task. Our contribution with GIFT4Code lies in how we apply this framework specifically to improve adherence to auxiliary tasks in code generation, which is a novel and important direction in the field. We hope this clarification provides more detail on the differences between CodeT and GIFT4Code.

Review (Rating: 5)

This paper introduces GIFT4Code, a novel approach designed to improve the accuracy of code generated by Large Language Models (LLMs) in the context of data science tasks. The problem addressed is that LLM-generated code can sometimes be executable but incorrect, not aligning with the user's intended outcome. GIFT4Code tackles this issue by fine-tuning LLMs specifically for code generation tasks with natural-language-summarized input-output specifications. It achieves this by using synthetic data generated by the LLM itself, providing execution-derived feedback, and summarizing the input-output behavior with a "generalist" LLM. This more detailed specification is expected to concretize the user's intention so that the LLM better understands and aligns with the user's intentions when generating code. The paper evaluates the GIFT4Code approach on two data science benchmarks, Arcade and DS-1000, and the results indicate a significant improvement in the quality of generated code when the I/O summary is appended.

Strengths

The paper addresses an interesting and important problem in code generation.

The ambiguity of human intent or instructions has been identified as one of the main reasons that hurt the code LLM's performance since the human intent is mostly high-level and always misses descriptions regarding corner cases. The input-output specification is one of the effective ways to concretize the requirements, and human developers also tend to refer to such specifications for accurate implementation. This paper targets guiding the code LLMs to efficiently capture such specified intent for better code generation, which could potentially increase the code LLMs performance in generating code better aligned with human intent. The problem, in general, is interesting and important.

The design is intuitive and smart, and the improvement over the base model is obvious

In general, Gift4Code's design makes intuitive sense, and I specifically appreciate the core design of summarizing the input-output as a natural language specification. One potential difficulty that hinders the LLM from efficiently understanding both the execution results and the input-output constraints is the complicated data structures and concrete arithmetic values that the LLM struggles to accurately perceive. Natural-language summarization as the input-output specification seems to bridge the gap between the constraints and the semantic space that the LLM can efficiently understand. Such benefits are also successfully reflected in the evaluation, in both the prompting and fine-tuning setups.

Weaknesses

Unclear description regarding how to get I/O summary in the realistic setup.

When the model is deployed for inference in a realistic setup, especially when only the human intent is provided, it is not clear where the I/O summary could come from; it is no longer the same situation as training-data preparation, where there are ground-truth programs to be executed and their outputs summarized by the "generalist" LLM. Will Gift4Code generate the I/O summary first, conditioned only on the problem description, and then generate the code conditioned on its own generated I/O summary? Or does the user have to provide a concrete I/O specification beyond the high-level intent? In the latter case, whether the user can produce an I/O specification as high-quality as the training data remains unknown. Though the authors explain that they try to simulate noisy I/O specifications at test time (Section 4.1), this still involves some leakage of the ground-truth code and is not the realistic setup where only the human intent is available. I would urge the authors to explain how Gift4Code will be applied in such a real-world scenario.

The evaluation is insufficient.

Unfortunately, while the design of Gift4Code seems effective, the evaluation of this paper is not sufficient.

First, the baselines are not strong enough. For models of similar size, i.e., 62B, the authors only compare with an unknown/anonymous code LLM, which is used as the starting point of Gift4Code. Given that the audience cannot estimate the capacity of this 62B model, since it is not publicly available, using it as the only baseline seems insufficient. In addition, the 15B baselines seem to perform comparably to the vanilla 62B Code LLM, which makes the latter, as a baseline, even weaker. I would encourage the authors to compare with at least one more publicly available model that is widely recognized as "capable", such as Llama-2-70B.

Second, it is not clear how generalizable Gift4Code is. It seems that the implementation and training data of this paper are not publicly available, so it is difficult to estimate the generalizability of the method, i.e., whether the improvement is restricted to the unknown/anonymous code LLM or whether it will work similarly well when applied to other pre-trained code LLMs. I would ask the authors to conduct similar instruction tuning using open-source code LLMs, such as Code Llama 7B, 13B, and 34B.

Third, the claim that general instruction tuning is not effective, supported by simply comparing WizardCoder to StarCoder, is not reasonable. The overall comparison is not well controlled. For example, the unknown/anonymous base code LLM, as explained in Section 4.1, is already fine-tuned on a data science dataset, so the base model is already more capable at data science code generation than StarCoder. Also, Gift4Code is trained on the data science dataset again, so it is not clear whether its training strategy helps more or whether the data alone does the job.

Lack of details in the model training, configuration, and architecture

While I am fine with the authors' choice not to release the artifact, the details needed for reproducibility are not sufficiently provided, which calls the work's reproducibility into question. For example, what is the architecture of the 62B base model: how many layers, what hidden dimensions, what type of positional encoding, etc.? Also, the details of Gift4Code training, such as the learning rate, how long training runs, and the impact of the configuration, are not revealed either. Without the details mentioned above, it is nearly impossible to reproduce the results and leverage the findings for other goals in the community, which significantly weakens the contribution of this work. I would ask the authors to provide a detailed description of the training process, configuration, and model architecture to ensure the reproducibility of their results.

Questions

  • In the realistic development scenario, the expectation is that only the human intent is given. In such a case, the ground-truth code is unknown, so the summary of execution results relying on the "generalist" LLM is not possible. In this case, where is the I/O summary from? Will the user have to further indicate the specified I/O beyond the high-level intent? Is such a requirement practical?
  • Will the author release the artifact for reproducibility?
  • If the above is not possible, could the authors provide detailed description of the training process as a reproducibility statement?
Comment

Thank you for your insightful review. We appreciate the opportunity to address your concerns.

Real-world setup of GIFT4Code?

After model deployment, we anticipate that developers will provide I/O specifications in their prompts, an assumption also made in [1]. GIFT4Code is designed to tackle a prevalent challenge: even when complex I/O specifications are provided, models often fail to align with them, leading to inaccuracies in code generation. This is a significant issue in practice, where developers rely on the model to closely follow the provided specifications for correct code generation.

We conducted a preliminary experiment where the code LLM first generated an I/O summary and then the code solution (a sketch of this two-stage setup is given below). The pass@20 is ~49.3%, which is lower than the result obtained with the high-quality but noisy specifications generated by the "generalist" LLM (55.5%). Acknowledging that real-world scenarios might not always offer an explicit I/O summary, extending GIFT4Code to adapt to such conditions remains a key area for future work.

[1]: Shi, Kensen et al. “TF-Coder: Program Synthesis for Tensor Manipulations.” TOPLAS (2022)
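A minimal sketch of the "summary first, then code" experiment described above, assuming a hypothetical sampling helper `sample_llm` (not a real API); the prompt templates are illustrative only.

```python
def sample_llm(prompt: str, n: int) -> list[str]:
    """Placeholder for a call to the code LLM's sampling endpoint (assumption)."""
    raise NotImplementedError

def two_stage_generate(intent: str, schema: str, n: int = 20) -> list[str]:
    # Stage 1: let the code LLM draft an I/O summary from the intent alone.
    io_summary = sample_llm(
        f"{schema}\n# Task: {intent}\n# Describe the expected output:", n=1
    )[0]
    # Stage 2: condition code generation on the self-generated summary.
    return sample_llm(
        f"{schema}\n# Task: {intent}\n# Expected output: {io_summary}\n", n=n
    )
```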

The evaluation is insufficient. The baseline might be too weak.

Responding to your feedback on our evaluation process, we have conducted additional experiments that include two CodeLlama-34B models (Llama 2 70B performs worse than CodeLlama-34B on coding tasks) in our evaluations. The following table presents the results of our extended evaluation on the full-context Arcade benchmark:

Model                                    pass@20
Code LLM (no spec.)                      37.5
CodeLlama-Python                         36.2
CodeLlama-instruct                       36.0
Ours (synthetic f.t. w. I/O Summary)     55.47

These results indicate that the base Code LLM, used as the starting point for GIFT4Code, is indeed a competitive model. Additionally, the comparable performance of the 15B StarCoder, 34B CodeLlama, and the vanilla 62B Code LLM shows that the Arcade benchmark is very challenging and cannot be solved simply by increasing the number of parameters, highlighting the unique value GIFT4Code brings to the data science domain.

Lack of details in the model training, configuration, and architecture

We appreciate your concerns regarding the reproducibility of our work. Here are some more details on the LLMs. The code LLM is a state-of-the-art large code language model fine-tuned for Python data science applications. It is a decoder-only Transformer with 62B parameters. It has 64 layers, 8192 hidden units, and 32 attention heads, and uses RoPE embeddings with a 256k-token SentencePiece vocabulary. The model was first pre-trained on a collection of 1.3T tokens of web documents and GitHub code data, and was then fine-tuned on a disjoint set of 64B Python code tokens together with 10B tokens from Python Jupyter notebooks. The fine-tuning process, including the GIFT4Code procedure, can be conducted through the model tuning service on Google Cloud Vertex AI. The learning rate for GIFT4Code fine-tuning is 1e-5, and other training details (steps, batch size) can be found in the experiment section.
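For convenience, the reported details above can be collected into a single configuration sketch; the field names are our own, and only the values are taken from the response.

```python
base_model_config = {
    "architecture": "decoder-only Transformer",
    "parameters": "62B",
    "num_layers": 64,
    "hidden_size": 8192,
    "num_attention_heads": 32,
    "positional_encoding": "RoPE",
    "vocabulary": "256k-token SentencePiece",
    "pretraining_data": "1.3T tokens (web documents + GitHub code)",
    "code_finetuning_data": "64B Python tokens + 10B Jupyter notebook tokens",
}

gift4code_finetuning = {
    "learning_rate": 1e-5,
    # Steps and batch size are reported in the paper's experiment section.
}
```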

The base model is publicly available as an API, but was only privately accessible at the time of submission. The model details are publicly available in \citet{anonymous}. We will make sure to report more model information in the final version and deanonymize the LLM citations.

How generalizable is GIFT4Code across different code LLMs?

Given limited computational resources, in this paper we only focus on improving one code LLM to follow NL instructions with I/O specifications, in line with existing works on fine-tuning LLMs using synthetic data [1,2]. We chose the base LLM that has the most promising performance in the data science domain. Nevertheless, we totally agree with you that the best strategy is to verify fine-tuning approaches across different LLMs. We leave this as important future work.

[1]: Wang, Yizhong et al. “Self-Instruct: Aligning Language Models with Self-Generated Instructions.” ACL (2022).

[2]: Taori, Rohan et al. “Alpaca: A Strong, Replicable Instruction-Following Model.”

Comment

Thanks for the authors' response. Though the rebuttal provides some more results and explanation, my overall assessment of this paper remains unchanged, so I am not changing my score at this moment. However, I do not have a strong objection if other reviewers champion this paper for acceptance.

Review (Rating: 5)

This paper introduces GIFT4CODE, a method to create synthetic data for the instruction fine-tuning of LLMs in code generation. The authors use a general-purpose LLM to generate NL intents and feed them to a Code LLM to generate possible code solutions. They leverage program execution to post-process synthesized solutions and generate additional I/O specifications for grounding. The instruction-tuned LLM is evaluated on ARCADE and DS1000 datasets and compared with zero-shot and few-shot learning of the base model.

Strengths

The paper presents a fairly clear methodology for I/O-specification-grounded code generation and backs it with interesting results. For example, the performance difference between no context and full context is insightful.

The experiments show that fine-tuning with synthetic data gives a significant boost in model performance on ARCADE. The performance gain is somewhat modest on DS-1000, even though 500 problems from this dataset were used for training.

Weaknesses

Bootstrapping LLMs with synthetic data generated from strong LLMs is a pretty well-known technique. The paper claims that “Unlike existing approaches, our key insight is to leverage program execution for synthetic data generation”; however, using execution feedback for synthetic data generation is also known from previous work, for example Code Llama.

This work lacks comparison with strong open-source and proprietary models, such as GPT-4 and Code Llama, as baselines.

Code generation grounded with I/O specifications sounds interesting, but it is not clear whether developers write I/O specifications similar to those used in this work, especially in the generated I/O summary case.

One base model performs a little worse with the TypeDesc spec than without it.

Questions

I wonder if the model regresses on standard benchmarks such as HumanEval and MBPP after the fine-tuning step. It would be great if the authors could provide such results.

Comment

Thank you for your insightful review and the opportunity to address your concerns.

How does GIFT4CODE differ from previous work? Using execution feedback for synthetic data generation is also known in previous work such as Code Llama?

We appreciate your concern regarding the novelty of using execution feedback in synthetic data generation, as also seen in works like CodeLlama. However, GIFT4Code's approach differs significantly. CodeLlama-Instruct creates a self-instruct dataset consisting of self-generated questions, corresponding solutions, and tests, with the execution signal serving primarily as a filter for these solutions. The execution feedback does not directly contribute to the model's fine-tuning objective. In contrast, GIFT4Code integrates execution-derived feedback (in the form of I/O specifications) as a key learning signal. This integration establishes a direct link between the generated code and its execution outcomes, significantly enhancing the model’s ability to follow complex I/O specifications.
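As an illustration of this difference, here is a minimal sketch of how execution results could be turned into an I/O specification placed in the synthetic fine-tuning prompt. The schema-style spec, the `df` output-variable convention, and the prompt template are assumptions for illustration; the paper additionally summarizes I/O with a "generalist" LLM, which is not reproduced here.

```python
import contextlib
import io

import pandas as pd

def derive_io_spec(code: str, env: dict) -> str | None:
    """Execute a sampled solution and describe its output dataframe
    (assumed to be bound to `df`); return None if execution fails."""
    try:
        with contextlib.redirect_stdout(io.StringIO()):
            exec(code, env)  # sandboxing omitted for brevity
    except Exception:
        return None  # discard samples that fail to execute
    result = env.get("df")
    if not isinstance(result, pd.DataFrame):
        return None
    cols = ", ".join(f"{c} ({result[c].dtype})" for c in result.columns)
    return f"The output is a DataFrame with columns: {cols}."

def build_training_example(intent: str, code: str, spec: str) -> dict:
    # The derived spec goes into the prompt, so fine-tuning teaches the model
    # to produce code consistent with a stated specification.
    return {"prompt": f"# Task: {intent}\n# Expected output: {spec}\n", "target": code}
```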

Furthermore, generating tests and then leveraging their execution feedback in real-world data science scenarios is highly non-trivial. For instance, formulating a test for a task like “What are the most populous cities in each country” (as shown in Figure 1) is far from straightforward. We also show below that CodeLlama-Instruct and the standard CodeLlama (without fine-tuning on the self-instruct dataset) achieve similar accuracy on the Arcade benchmark, which suggests that the self-instruct dataset in CodeLlama is not effective in the data science domain.

Why was there no comparison with strong open-source and proprietary models like GPT-4 and Code Llama?

In response to your query about comparisons with other models, we have conducted additional experiments that include two CodeLlama-34B models in our evaluations on the full-context Arcade benchmark. We also incorporated results from Codex for a more comprehensive comparison. These show that GIFT4Code still has an advantage over the stronger baselines.

Model                                    pass@20
Code LLM (no spec.)                      37.5
CodeLlama-Python-34B                     36.2
CodeLlama-instruct-34B                   36.0
Codex-davinci-002                        48.8
Ours (synthetic f.t. w. I/O Summary)     55.47

Are I/O specifications used in this work reflective of those that developers write in practice, especially in the generated I/O summary case?

We designed the few-shot prompts used in generating I/O summaries to closely mirror the types of specifications that developers typically write. The generated I/O summary shown in Listing 5 in the appendix also roughly reflects what developers will write in practice.

Could you provide results on standard benchmarks like HumanEval and MBPP after fine-tuning?

We conducted preliminary evaluations of our code LLM on HumanEval. Intuitively, fine-tuning on our data science problems leads to a slight accuracy drop on HumanEval due to domain shift. Still, the model fine-tuned on the same type of I/O specifications as HumanEval (I/O examples, see above) maintains its performance, even under domain shift. Note that the prompts for “Fine-tuned LLM (I/O Summary)” below do not contain any I/O summary on HumanEval. We will add more results (MBPP) in the final version.

HumanEval                        pass@1    pass@100
Base Code LLM                    31.7      89.0
Fine-tuned LLM (I/O Type)        29.3      89.0
Fine-tuned LLM (I/O Example)     31.6      89.6
Fine-tuned LLM (I/O Summary)     29.3      87.8
Comment

Dear Reviewers,

Thank you for your valuable insights and suggestions. We have tried our best to address your concerns in our rebuttal. Given that the discussion period is ending soon, we were wondering if you could let us know whether you have further questions or whether our responses have addressed your concerns. We would be happy to answer any further questions.

Best,

Authors

AC Meta-Review

This paper introduces GIFT4CODE, a method to create synthetic data for the instruction fine-tuning of LLMs in code generation. The authors use a general-purpose LLM to generate NL intents and feed them to a Code LLM to generate possible code solutions. They leverage program execution to post-process synthesized solutions and generate additional I/O specifications for grounding. The instruction-tuned LLM is evaluated on ARCADE and DS1000 datasets and compared with zero-shot and few-shot learning of the base model.

The novelty of using execution-derived feedback to generate synthetic data is somewhat limited (reviewer Ns97). The experimental validation could be significantly improved with stronger baselines, other models, and better comparisons (reviewers Ns97, kH34, pVfx), and some of the claims are not supported (e.g., that instruction tuning is not effective). Overall, the paper is interesting, but it falls short of the high bar for publication at ICLR in several aspects.

Why Not a Higher Score

N/A

Why Not a Lower Score

N/A

Final Decision

Reject