PaperHub
NeurIPS 2024 · Poster
Average rating: 5.8/10 from 4 reviewers (scores: 4, 6, 7, 6; min 4, max 7, std. dev. 1.1)
Confidence: 4.0 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 3.3

MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution

Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords
Code Change · LLM Agent · Software Evolution

Reviews and Discussion

Official Review
Rating: 4

This paper designs a multi-agent framework, MAGIS. Following the form of team problem solving, it designs four agents: Manager, Repository Custodian, Developer, and QA Engineer, which are used to decompose tasks, locate problematic code, generate code, and review code, respectively. Working cooperatively, they help the LLM better locate problems and generate correct code. The framework resolves 13.94% of issues on the SWE-bench benchmark.

Strengths

  1. This paper draws on the way human teams cooperate, assigning different roles to agents. It leverages the large model's strong natural-language understanding as well as the domain expertise elicited by assigning specific roles. Compared with simply calling the large model directly, performance is improved.

  2. This paper makes good use of software-engineering practice. For example, it adopts a kick-off meeting, which helps multiple developers clarify their independent tasks, ensures that there are no conflicts, and determines the order of modifications.

Weaknesses

  1. The paper uses 11 prompts in total, but the specific prompt content cannot be found in the main text. This raises the concern that the experimental results may depend strongly on the quality of the designed prompts, and that performance may be unpredictable when other models are substituted.

  2. The framework is relatively complex and requires many interactions with LLMs. The paper does not further analyze whether these additional computing resources are necessary, given the scores of the other baselines.

  3. SWE-bench clearly states that the hints field should not be used for leaderboard submissions, yet in this paper hints appear to be key to MAGIS's improvement. Is this unfair to other methods?

  4. I noticed that there have been many new submissions to SWE-bench (Lite) recently. Do you need to make a new comparison? (Of course, I think it is reasonable not to compare, since they are contemporaneous work.)

References

  • [1] OpenDevin: Code less, make more. https://github.com/OpenDevin/OpenDevin/, 2024.

  • [2] Dong Chen, Shaoxin Lin, Muhan Zeng, Daoguang Zan, Jian-Gang Wang, Anton Cheshkov, Jun Sun, Hao Yu, Guoliang Dong, Artem Aliev, et al. CodeR: Issue resolving with multi-agent and task graphs. arXiv preprint arXiv:2406.01304, 2024.

  • [3] Yingwei Ma, Qingping Yang, Rongyu Cao, Binhua Li, Fei Huang, and Yongbin Li. How to understand whole software repository? arXiv preprint arXiv:2406.01422, 2024.

  • [4] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. AutoCodeRover: Autonomous program improvement. 2024.

Questions

See the questions in Weaknesses.

Limitations

The authors explain the limitations of their approach in the paper.

Author Response

Thank you very much for taking the time to review our manuscript, and thank you for your positive comments (i.e., drawing well on software-engineering practice, drawing on the cooperation of human teams, and the improved performance). We apologize for any confusion and unclear expressions in the previous version. We have addressed each of the comments and suggestions; please refer to our responses below for details.

Q1: The paper uses 11 prompts in total, but the specific prompt content cannot be found in the main text. This raises the concern that the experimental results may depend strongly on the quality of the designed prompts, and that performance may be unpredictable when other models are substituted.

Thanks for your comments.

  • Prompt content: We will provide the full set of prompt content in the revision (part of it is shown below due to the length limit).
# Prompt P (Line 5 in Algorithm 2)
# Note: issue_description and file_content are f-string placeholders filled in per issue instance.
system_prompt = ("You are a software development manager. "
                 "Your responsibility is to provide clear guidance and instructions to a developer regarding modifications or improvements needed in a specific code file. "
                 "This guidance should be based on the details provided in the issue description and the existing content of the code file.")
user_prompt = ("Review the issue description and the content of the code file, then provide specific instructions for the developer on the actions they need to take to address the issue with these files.\n"
               f"# Issue Description:\n{issue_description}\n# Code File:\n{file_content}\n"
               "Respond concisely and clearly, focusing on key actions to resolve the issue. Limit your answer to no more than 100 tokens.")
  • More base LLMs: Moreover, we replaced GPT-4 with two other LLMs (i.e., DeepSeek [1*] and Llama-3.1-405B [2*]); the experimental results on SWE-bench Lite are shown below. Please note that all prompts are identical to those we experimented with on GPT-4. In the "Directly Use" setting, the prompts are sourced from SWE-bench, while the prompts for the other settings are designed by us.

| Base LLM | Directly Use | MAGIS | MAGIS (w/o hints, w/o QA) |
| --- | --- | --- | --- |
| DeepSeek | 0.33% | 12.67% | 11.00% |
| Llama 3.1 | 1.33% | 16.67% | 11.00% |

The above table shows that our method achieves a 38-fold performance improvement (over DeepSeek) and a 12-fold improvement (over Llama 3.1) compared to directly using these base LLMs. This improvement validates that our method is general and can unlock the potential of other LLMs in resolving GitHub issues.


Q2: The framework is relatively complex and requires many interactions with LLMs. The paper does not further analyze whether these additional computing resources are necessary, given the scores of the other baselines.

Thanks for your comments. Compared with direct usage, LLM-based multi-agent systems [3-5*], including ours, can be more complex and require many interactions. However, we consider these additional computing resources necessary and worthwhile because (1) issue resolution is not a task that requires an immediate response (in contrast with coding tasks such as IDE code completion), and (2) the additional computation enables the model to resolve issues that were previously unresolvable (the resolved rate increases from 1.74% to 13.94%, as shown in Table 2). Since direct usage of an LLM can resolve few GitHub issues, this paper focuses on how to better use LLMs to resolve issues and improve performance.


Q3: SWE-bench clearly states that the hints field should not be used for leaderboard submissions, yet in this paper hints appear to be key to MAGIS's improvement. Is this unfair to other methods?

Thanks for your comments. We reported the score (10.28%) under the setting without hints in Table 2. This score shows that our framework performs approximately six times better than the base LLM, GPT-4, validating the effectiveness of our method even in the absence of hints. The fairness of comparisons with other methods depends on whether those methods can access external information. For instance, the design of OpenDevin [3*] includes a web browser (Browsing Agent), which allows it to obtain content from the web, and these hints are available on GitHub before the issue-resolution process begins.


Q4: I noticed that there have been many new submissions to SWE-bench (Lite) recently. Do you need to make a new comparison? (Of course, I think it is reasonable not to compare, since they are contemporaneous work.)

Thanks for your comments and understanding! We are aware of the recent submissions to SWE-bench (Lite), and in Appendix D we compare against the methods (SWE-agent [4*], AutoCodeRover [5*]) that submitted their results before the NeurIPS 2024 submission deadline. The experimental results are shown in Table 4 on Page 18. The main difference between our method and others is that MAGIS does not use external tools such as web browsers [3*]. Our paper focuses on how to unlock the potential of LLMs for GitHub issue resolution, and we conduct extensive empirical studies analyzing the limitations of LLMs. Moreover, thank you for the references; we will incorporate them [3*, 5-7*] into the discussion.


References
[1*] DeepSeek-AI. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. 2024.
[2*] Dubey, et al. The Llama 3 Herd of Models. 2024.
[3*] Wang, et al. OpenDevin: An Open Platform for AI Software Developers as Generalist Agents. 2024.
[4*] Yang, et al. Swe-agent: Agent-computer interfaces enable automated software engineering. 2024.
[5*] Zhang, et al. Autocoderover: Autonomous program improvement. 2024.
[6*] Chen, et al. CodeR: Issue Resolving with Multi-Agent and Task Graphs. 2024.
[7*] Ma, et al. How to Understand Whole Software Repository?. 2024.

Comment

I appreciate the authors' response and acknowledge that it clarifies many of my initial concerns. However, I remain apprehensive about the use of the "hints" field in the results, as this could potentially limit the method's applicability. Therefore, I will maintain my current score.

In addition, I do not see the authors' update in the revision.

Comment

Dear Reviewer,

Thank you for your follow-up feedback and for acknowledging the clarifications we provided in our rebuttal.

We regret that our response regarding using the "hints" field has not fully alleviated your concerns. To clarify, while the "hints" field was utilized in one specific setting to illustrate potential improvements, our method remains effective without it. As shown in Table 2 of the paper, our approach achieved a score of 10.28% without the use of hints, representing a significant (~6x) improvement over the base LLM (i.e., GPT-4) score of 1.74%. This demonstrates the effectiveness of our method, even without hints.

Moreover, the experimental results presented in our rebuttal further validate the effectiveness of our method across different LLMs (DeepSeek and Llama 3.1) without relying on hints. Notably, even without QA, our method achieved a score of 11.00%, which is 33 and 8 times higher than directly using the base models DeepSeek and Llama 3.1, respectively. This demonstrates the applicability of our method, even without hints.

In summary, the primary contribution of our work to GitHub Issue resolution lies in the framework of our method, derived from our empirical studies. We will emphasize this point more clearly in our revised manuscript to avoid any potential misunderstandings.

Regarding the revision, as per the NeurIPS guidelines, we are unable to upload the updated manuscript during the rebuttal/discussion period. However, we will upload the revision as soon as the system allows. You can refer to the NeurIPS FAQ for more details: https://neurips.cc/Conferences/2024/PaperInformation/NeurIPS-FAQ.

We greatly appreciate the time and effort you have taken to review our work. Your feedback has been invaluable in improving our manuscript, and we hope that our clarifications will be taken into consideration.

Thank you again for your valuable time to review.

Comment

Thanks for the authors' response. On SWE-bench Lite, I saw that AutoCodeRover [1] and SWE-agent [2] achieved 19.00% and 18.00%, respectively. These two methods did not use a web-search API, yet there is still a performance gap relative to the MAGIS proposed by the authors (16.67%, w/o hints). What is the possible reason for this difference?

Alternatively, is MAGIS cheaper (in cost and time) than the baseline methods? The baselines report this metric, but it is missing for MAGIS.

[1] Zhang, et al. Autocoderover: Autonomous program improvement. 2024.
[2] Yang, John, et al. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint (2024).
Comment

Dear Reviewer,

Thank you very much for your follow-up feedback. We would like to address your concerns:

  • Contemporaneous Work: The two papers you mentioned are works contemporaneous with our submission, in line with the conference guidelines (details shown below). Both papers were published on arXiv (8 Apr and 6 May) within two months of the NeurIPS submission deadline (22 May). As such, they fall under the category of contemporaneous work, which is why we did not include them as baselines in the main body of our paper. Meanwhile, we have already cited AutoCodeRover [61] and SWE-agent [59] and discussed the comparison with these works in Appendix D (Lines 718-729). Given your concerns, we will further expand our discussion in the Appendix to provide a more detailed comparison (part of it is shown below).

For the purpose of the reviewing process, papers that appeared online within two months of a submission will generally be considered "contemporaneous" in the sense that the submission will not be rejected on the basis of the comparison to contemporaneous work. Authors are still expected to cite and discuss contemporaneous work and perform empirical comparisons to the degree feasible. Any paper that influenced the submission is considered prior work and must be cited and discussed as such. Submissions that are very similar to contemporaneous work will undergo additional scrutiny to prevent cases of plagiarism and missing credit to prior work.
Reference: https://neurips.cc/Conferences/2024/CallForPapers

  • Performance and Cost Comparison: As of the conference submission deadline, the latest score reported by AutoCodeRover was 16.11%, as noted in the second-to-last paragraph on page 7 of their paper (https://arxiv.org/pdf/2404.05427v2). This score is not higher than the 16.67% we reported. SWE-agent did indeed report an 18.00% score, which is higher than our ablated version (w/o hints). Thanks for your helpful suggestion; we reviewed the cost data on SWE-bench Lite and calculated that the average cost per instance for our method is approximately $0.41, significantly lower than the $1.67 reported by SWE-agent. Considering the trade-off between effectiveness (a 1.33% difference in score) and cost (a roughly fourfold reduction), we believe our approach remains competitive. We acknowledge the contributions of SWE-agent and AutoCodeRover, as they, along with our work, collectively advance the state of the art in this task.

Thank you once again for your valuable time.

Official Review
Rating: 6

The paper studies the reasons behind LLMs' failures in resolving GitHub issues, identifying key factors such as locating code files and lines, and the complexity of code changes. The authors propose a novel multi-agent framework, MAGIS, comprising four specialized agents: Manager and Repository Custodian for planning, Developer for coding, and Quality Assurance Engineer for code review. Experimental results show that MAGIS significantly outperforms GPT-4 and Claude-2, resolving 13.94% of issues on the 25% subset of the SWE-bench dataset. Ablation studies confirm the effectiveness of each agent's role in the framework.

Strengths

  • The paper tackles the challenging problem of using LLMs to resolve GitHub issues. The proposed method is intuitive and effective, achieving an eight-fold performance gain compared to the GPT-4 baseline.

  • The paper conducts extensive experiments on both the reasons why LLMs fail to resolve GitHub issues and the role of each agent, providing insights for the research community.

Weaknesses

  • The reproducibility of experiments for baseline models like GPT-4 and Claude is unclear. The code and data are temporarily unavailable, hindering follow-up and comparison by other researchers.

  • Typos: There are minor typos, such as "multi-agennt" instead of "multi-agent" on page 2, line 63, and some missing "%" signs in the resolved ratio on page 7.

Questions

  1. In Section 2, I'm a bit confused about the calculation of the coverage ratio. How is [s_i, e_i] defined for pure 'deleting' and 'adding' operations? If many code adjustments occur in a single file, does the metric ensure that each adjustment contributes independently to the coverage? (I mean, if the model-generated code differs from the reference code, the line positions further down are no longer aligned.)

  2. In Section 4, could you kindly provide more key implementation details, especially for the model comparison experiment in Table 2? Some helpful information would be the core instruction/input for each model, the number of interaction rounds, and whether a code interpreter is allowed. These are all related to the performance of the model.

  3. Have you submitted (or do you plan to submit) the results to the official SWE-bench leaderboard? How does this method differ from other agent-based methods?

Limitations

Yes, the authors adequately addressed the limitations.

Author Response

Thank you very much for taking the time to review our manuscript, and thank you for your positive comments (i.e., the intuitive and effective method, the extensive experiments, and the insights for the research community). We apologize for any confusion and unclear expressions in the previous version. We have addressed each of the comments and suggestions; please refer to our responses below for details.

W1: The reproducibility of experiments for baseline models like GPT-4 and Claude is unclear. The code and data are temporarily unavailable, hindering follow-up and comparison by other researchers.

Thanks for your comments. We will make the code and data publicly available.


W2: Typos: There are minor typos, such as "multi-agennt" instead of "multi-agent" on page 2, line 63, and some missing "%" signs in the resolved ratio on page 7.

Thank you for carefully spotting the typo. All these typos will be corrected in the revision.


Q1: In Section 2, I'm a bit confused about the calculation of the coverage ratio. How is [s_i, e_i] defined for pure 'deleting' and 'adding' operations? If many code adjustments occur in a single file, does the metric ensure that each adjustment contributes independently to the coverage? (I mean, if the model-generated code differs from the reference code, the line positions further down are no longer aligned.)

We are sorry for the unclear presentation. [s_i, e_i] indicates the range of lines in the file that have been modified, and the line numbers within this range are based on the file before the modifications are made. Therefore, [s_i, e_i] represents the whole file (i.e., [1, LastLineNumber]) for pure 'deleting', while it is empty (i.e., ∅) for pure 'adding'. For cases where a single file has many modifications, the line numbers are aligned by Git, which uses the Myers diff algorithm [1*] by default. The empirical study [2*] demonstrates that the Myers diff algorithm can ensure that each adjustment contributes independently to the coverage in most cases.
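
For illustration, a minimal sketch of this line-level coverage computation is shown below. It assumes inclusive [s_i, e_i] ranges numbered against the pre-modification file; the function names, the handling of an empty reference range, and the set-based aggregation are illustrative assumptions rather than the exact implementation.

# Hypothetical sketch of the line-locating coverage ratio discussed above.
# gold_ranges: [s_i, e_i] ranges touched by the reference patch (pre-change numbering).
# pred_ranges: ranges touched by the generated patch, aligned the same way.
def to_line_set(ranges):
    """Expand inclusive [s, e] ranges into a set of line numbers."""
    lines = set()
    for s, e in ranges:
        lines.update(range(s, e + 1))
    return lines

def coverage_ratio(gold_ranges, pred_ranges):
    """Fraction of reference-modified lines that the generated change also touches."""
    gold = to_line_set(gold_ranges)
    if not gold:  # pure 'adding' in the reference: empty range; convention chosen here only
        return 1.0
    return len(gold & to_line_set(pred_ranges)) / len(gold)

# Example: reference edits lines 10-12 and 40; prediction edits lines 11-15.
print(coverage_ratio([(10, 12), (40, 40)], [(11, 15)]))  # 0.5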


Q2: In Section 4, could you kindly provide more key implementation details, especially for the model comparison experiment in Table 2? Some helpful information would be the core instruction/input for each model, the number of interaction rounds, and whether a code interpreter is allowed. These are all related to the performance of the model.

Thanks for your suggestions. We will add the specific prompt content and make the implementation clearer in the revision (part of it is shown below due to the context limit).

Specifically, for Table 2, the prompts for directly using the LLMs are sourced from SWE-bench [3*], while the other prompts for MAGIS in its different settings are designed by us. For each GitHub issue, our method interacts with the issue only once, meaning the framework generates only one final result per issue. Moreover, no code interpreter is used in our method.

# Prompt P (Line 5 in Algorithm 2)
# Note: issue_description and file_content are f-string placeholders filled in per issue instance.
system_prompt = ("You are a software development manager. "
                 "Your responsibility is to provide clear guidance and instructions to a developer regarding modifications or improvements needed in a specific code file. "
                 "This guidance should be based on the details provided in the issue description and the existing content of the code file.")
user_prompt = ("Review the issue description and the content of the code file, then provide specific instructions for the developer on the actions they need to take to address the issue with these files.\n"
               f"# Issue Description:\n{issue_description}\n# Code File:\n{file_content}\n"
               "Respond concisely and clearly, focusing on key actions to resolve the issue. Limit your answer to no more than 100 tokens.")

Q3: Have you submitted (or do you plan to submit) the results to the official SWE-bench leaderboard? How does this method differ from other agent-based methods?

Thanks for your comments. We plan to submit the results after the anonymous review period ends. The main difference between our method and others is that MAGIS does not use external tools such as web browsers [4*]. Our paper focuses on how to unlock the potential of LLMs for GitHub issue resolution, and we conduct extensive empirical studies analyzing the limitations of LLMs.


References
[1*] Myers, Eugene W. An O (ND) difference algorithm and its variations. Algorithmica 1.1 (1986).
[2*] Nugroho, Yusuf Sulistyo, et al. How different are different diff algorithms in Git?. EMSE 25 (2020).
[3*] Yang, John, et al. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint (2024).
[4*] Wang, Xingyao, et al. OpenDevin: An Open Platform for AI Software Developers as Generalist Agents. arXiv preprint (2024).

Comment

Thank you for your thorough responses. I appreciate the valuable insights and the relatively effective method the paper offers. However, I still have some concerns regarding what future researchers can build upon from this work, and how easily. Additionally, the varying levels of access to external tools (e.g., code interpreters, web resources) make it hard to perform fair comparisons or ablations that would highlight the unique advantages of the proposed agent-based method. Further exploration of these issues may enhance the potential impact of this article. Given these considerations, I will maintain my original score.

Comment

Dear Reviewer,

 

Thank you for your follow-up feedback and for recognizing the value of our work. We appreciate your support and the "weak accept" recommendation.

We would like to clarify that our proposed method does not rely on any external tools (such as code interpreters or web browsers) and the external tools are not used in the evaluation in our paper. To address your concerns about the ease of building upon our work, we will release all relevant code, and detailed implementation instructions after the paper is accepted. This will ensure that future researchers can easily reproduce and extend our empirical findings, fostering further developments in this area.

Once again, thank you for your constructive feedback.

 

Best regards,

The authors.

Official Review
Rating: 7

This paper proposes MAGIS, a multi-agent coding framework for solving patch-generation tasks. The roles consist of Manager, Repository Custodian, Developer, and QA Engineer, with the tasks of planning, file localization, file editing, and review, respectively. Experiments on the SWE-bench benchmark show that performance is improved over baselines.

Strengths

  • The proposed multiagent framework is simple and effective
  • The proposed model is effective with and without using the hints provided in SWE-bench
  • Ablations demonstrate the effectiveness of the QA engineer role

Weaknesses

My concerns are along the evaluation and comparisons:

  • The main experiments are limited, as only GPT-4 is measured
  • Baselines such as SWE-agent are missing

Questions

  • SWE-bench Lite results (appendix D) should be in the body of the paper
  • Is ablation of the kickoff meeting possible? This is an interesting mechanism but it's hard to understand whether it is effective or not
  • Can you explain/show more about the prompts and exemplars (if exemplars are used) for each role?

Limitations

Yes.

Author Response

Thank you very much for taking the time to review our manuscript, and thank you for your positive comments (i.e., the simplicity and effectiveness of our method, and its effectiveness in various settings). We apologize for any confusion and unclear expressions in the previous version. We have addressed each of the comments and suggestions; please refer to our responses below for details.

W1: The main experiments are limited, as only GPT-4 is measured

Thanks for your comments. To further validate the effectiveness of our method, we conducted more experiments on other base LLMs. The new experiments use DeepSeek (DeepSeek-V2-0628) [1*] and Llama 3.1 (405B) [2*] in addition to GPT-4 as the base model. The corresponding results are shown below:

| Base LLM | Directly Use | MAGIS | MAGIS (w/o hints, w/o QA) |
| --- | --- | --- | --- |
| DeepSeek | 0.33% | 12.67% | 11.00% |
| Llama 3.1 | 1.33% | 16.67% | 11.00% |

Please note that all prompts are identical to those we experimented with on GPT-4. In the "Directly Use" setting, the prompts are sourced from SWE-bench, while the prompts for the other settings are designed by us. The above table shows that our method achieves a 38-fold performance improvement (over DeepSeek) and a 12-fold improvement (over Llama 3.1) compared to directly using these base LLMs. This improvement in effectiveness validates that our method is general and can unlock the potential of other LLMs in resolving GitHub issues.


W2: Baselines such as SWE-agent are missing

Thanks for your comments. The comparison with SWE-agent [3*] is discussed in Appendix D, as SWE-agent is a contemporaneous work (the paper became publicly available on arXiv in May 2024). As shown in Table 4, our method achieved a resolved ratio of 25.33% on SWE-bench Lite, which is higher than the 18.00% reported by SWE-agent.


Q1: SWE-bench Lite results (appendix D) should be in the body of the paper

Thanks for your advice. We will move the Lite results to the body in the revision.


Q2: Is ablation of the kickoff meeting possible? This is an interesting mechanism but it's hard to understand whether it is effective or not

Thanks for your comments. The kick-off meeting serves as a transitional link: the minutes of this meeting are converted into code representing the order of work (sequential or parallel) for each Developer agent going forward. Therefore, ablating the kick-off meeting is not possible in our method. To make the mechanism clearer, a detailed example is provided in Figure 7 (Appendix B) on Page 17 and Figure 14 (Appendix H) on Page 23.
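
For illustration only, the sketch below shows one possible way such meeting minutes could be represented as a work order; the WorkItem structure, field names, and example file paths are hypothetical and not taken from MAGIS.

# Hypothetical sketch: turning kick-off meeting minutes into a work order.
# Items with no dependencies can be handled in parallel by Developer agents;
# items listing dependencies run only after those files have been modified.
from dataclasses import dataclass, field
from typing import List

@dataclass
class WorkItem:
    developer: str                      # Developer agent assigned in the meeting
    file_path: str                      # code file to modify
    instructions: str                   # task distilled from the meeting minutes
    depends_on: List[str] = field(default_factory=list)  # files that must be changed first

plan = [
    WorkItem("developer-1", "pkg/parser.py", "Handle the edge case described in the issue"),
    WorkItem("developer-2", "pkg/emitter.py", "Adapt output to the new parser behaviour",
             depends_on=["pkg/parser.py"]),
]

# Simple scheduling: anything without unmet dependencies may start in parallel.
ready = [item for item in plan if not item.depends_on]
print([item.file_path for item in ready])  # ['pkg/parser.py']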


Q3: Can you explain/show more about the prompts and exemplars (if exemplars are used) for each role?

Thanks for your advice. We will add the specific prompt content in the revision (one of the prompt templates, for the Manager agent, is shown below due to the context limit).

# Prompt P (Line 5 in Algorithm 2)
# Note: issue_description and file_content are f-string placeholders filled in per issue instance.
system_prompt = ("You are a software development manager. "
                 "Your responsibility is to provide clear guidance and instructions to a developer regarding modifications or improvements needed in a specific code file. "
                 "This guidance should be based on the details provided in the issue description and the existing content of the code file.")
user_prompt = ("Review the issue description and the content of the code file, then provide specific instructions for the developer on the actions they need to take to address the issue with these files.\n"
               f"# Issue Description:\n{issue_description}\n# Code File:\n{file_content}\n"
               "Respond concisely and clearly, focusing on key actions to resolve the issue. Limit your answer to no more than 100 tokens.")

References
[1*] DeepSeek-AI. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv preprint (2024).
[2*] Dubey, Abhimanyu, et al. The Llama 3 Herd of Models. arXiv preprint (2024).
[3*] Yang, John, et al. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint (2024).

Comment

Thank you to the authors for the detailed response. The response addressed my concerns.

Official Review
Rating: 6

This paper introduces MAGIS, a novel Large Language Model (LLM)-based multi-agent framework designed to address the challenge of resolving GitHub issues in software development. The authors conduct an empirical study to identify key factors affecting LLMs' performance in this task, including file and line localization accuracy and code change complexity. Based on these insights, they propose a collaborative framework consisting of four specialized agents: Manager, Repository Custodian, Developer, and Quality Assurance Engineer. These agents work together through planning and coding phases to generate appropriate code changes. The framework is evaluated on the SWE-bench dataset, demonstrating significant improvements over existing LLMs in resolving GitHub issues. Specifically, MAGIS achieves a resolved ratio of 13.94%, which is an eight-fold increase compared to the direct application of GPT-4. The paper also provides detailed analyses of the framework's components and discusses its effectiveness in different scenarios.

Strengths

Originality:

  1. The paper presents an in-depth empirical analysis of LLMs' performance in resolving GitHub issues, providing unique insights into the challenges of applying AI to complex software engineering tasks.
  2. The study leverages software-engineering metrics, such as the line-locating coverage ratio, to quantify LLM performance in code change generation.

Quality:

  1. The empirical study demonstrates rigorous methodology, examining multiple factors affecting LLM performance including file localization, line localization, and code change complexity.
  2. The analysis employs statistical techniques to establish correlations between various complexity indices and issue resolution success, adding depth and reliability to the findings.

Clarity:

  1. The paper clearly articulates the gap between LLMs' performance on function-level tasks versus repository-level tasks, providing a strong motivation for the study.
  2. The authors present their empirical findings with well-designed visualizations and tables, making complex data easily interpretable.

Significance:

  1. The empirical analysis provides crucial insights that bridge theoretical understanding of LLMs with practical software engineering challenges.
  2. The findings directly inform the design of more effective AI-assisted software development tools, as demonstrated by the MAGIS framework.
  3. The study lays a foundation for future research in applying AI to software engineering tasks, potentially influencing the direction of both AI and software engineering fields.

Weaknesses

  1. Limited dataset diversity: The study relies solely on the SWE-bench dataset, which, while comprehensive, is limited to 12 Python repositories. This may not fully represent the diversity of real-world software projects.

  2. Computational resources and efficiency: The paper doesn't discuss the computational requirements or efficiency of MAGIS compared to simpler approaches.

  3. Potential overfitting to SWE-bench: The framework might be tailored too specifically to perform well on SWE-bench, potentially limiting its generalizability.

  4. Lack of user study: The paper doesn't include feedback from actual software developers on the usefulness and quality of MAGIS's solutions.

  5. Limited discussion on prompt engineering: While the paper mentions using prompts, it doesn't delve into the specifics of prompt design or its impact on performance.

Questions

  1. Your study focuses on the SWE-bench dataset, which is limited to 12 Python repositories. Have you considered how MAGIS might perform on a more diverse set of programming languages and project types? What steps could be taken to validate the framework's generalizability beyond Python projects?

  2. The paper doesn't discuss the computational requirements of MAGIS. Can you provide information on the runtime and resource requirements of MAGIS compared to simpler approaches? How does this impact its practical applicability in real-world software development environments?

  3. How have you ensured that MAGIS is not overfitting to the specific characteristics of the SWE-bench dataset? Have you tested the framework on any GitHub issues outside of this dataset to validate its real-world applicability?

  4. While you describe the roles of different agents, there's limited analysis of how their interactions contribute to overall performance. Have you conducted any ablation studies on different interaction patterns between agents? This could provide insights into which aspects of the multi-agent approach are most crucial for success.

  5. While you mention using prompts, there's limited discussion on prompt design. Could you provide more details on your prompt engineering strategies and how they impact the framework's performance?

Limitations

  1. The authors acknowledge in Appendix K that the SWE-bench dataset, while representative, may not fully reflect the diversity of all code repositories, particularly in specialized fields or different programming paradigms. However, since SWE-bench is known as one of the most challenging benchmarks, this may be acceptable.

  2. In Appendix K, the authors mention the potential impact of prompt design on LLM performance and the difficulty in eliminating prompt bias completely.

  3. The authors note in Appendix K that applying their findings to other code repositories may require further validation due to the limited sample scope.

Author Response

Thank you very much for reviewing our manuscript, and thank you for your positive comments (i.e., the in-depth empirical analysis, rigorous methodology, and well-designed visualizations). We apologize for any confusion and unclear expressions in the previous version. We have addressed each of the comments and suggestions below.

Q1: Your study focuses on the SWE-bench dataset, which is limited to 12 Python repositories. Have you considered how MAGIS might perform on a more diverse set of programming languages and project types? What steps could be taken to validate the framework's generalizability beyond Python projects?

Thanks for your comments. We acknowledge that the SWE-bench has limited diversity, which we discuss in Appendix K. While we recognize the need to evaluate MAGIS on a broader range of programming languages (PLs) and project types, we found that there are currently no available datasets in other programming languages. To evaluate our framework's generalizability beyond Python projects, we plan to construct a dataset with various programming languages and projects, then validate MAGIS by collecting pull requests, issues, and test cases from popular repositories, setting up the testing environments, and executing MAGIS to assess issue resolution. This process aims to demonstrate MAGIS's effectiveness across diverse programming languages and project types.


Q2: The paper doesn't discuss the computational requirements of MAGIS. Can you provide information on the runtime and resource requirements of MAGIS compared to simpler approaches? How does this impact its practical applicability in real-world software development environments?

Thanks for your comments. We will include the computational requirements of MAGIS in the revision. Specifically, our framework resolves each issue in approximately 3 minutes, with an average processing time of under 5 minutes per instance (as noted in Appendix E). Our method utilizes GPT-4, so any machine capable of accessing the OpenAI API can run the experiments. While LLM-based multi-agent systems [1-3*], including ours, require more computational resources and time compared to simpler approaches, this investment is justified. Our method significantly increases the issue resolution rate from 1.74% to 13.94% (as shown in Table 2), allowing us to tackle problems that were previously unresolvable. In summary, this paper emphasizes how to leverage LLMs effectively to enhance issue resolution and improve overall performance.


Q3: How have you ensured that MAGIS is not overfitting to the specific characteristics of the SWE-bench dataset? Have you tested the framework on any GitHub issues outside of this dataset to validate its real-world applicability?

Thanks for your comments. First, our method operates as an LLM-based multi-agent system, which means it does not require training on a specific dataset. This characteristic helps prevent overfitting to the SWE-bench. While we have not yet validated our method on other GitHub issues, we acknowledge the importance of this step and plan to address it in future work. We understand that broader testing will enhance the generalizability of our results. Finally, the SWE-bench dataset comprises various repositories and real-world GitHub issues, making it a reasonable basis for evaluating our method.


Q4: While you describe the roles of different agents, there's limited analysis of how their interactions contribute to overall performance. Have you conducted any ablation studies on different interaction patterns between agents? This could provide insights into which aspects of the multi-agent approach are most crucial for success.

Thanks for your comments. We recognize the importance of analyzing the contributions of different agents in our system. In Section 4.2, we conducted an ablation study for the QA agent, and the corresponding results are shown in Table 2. While we have not yet performed ablation studies specifically on different interaction patterns between agents, we analyze the contributions of each phase—planning and coding—in Sections 4.3 and 4.4. These sections provide insights into how these phases impact overall performance. We appreciate your feedback and will explore different interaction patterns in future work.


Q5: While you mention using prompts, there's limited discussion on prompt design. Could you provide more details on your prompt engineering strategies and how they impact the framework's performance?

We appreciate your interest in our prompt design. In the revision, we will include a detailed discussion of our prompt engineering strategies and provide specific examples of the prompts used. Our prompt engineering strategies adhere to established best practices [4*]. Specifically, we construct our prompt templates in markdown format, ensuring that each prompt covers all relevant information required for the communication.
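
As a purely illustrative sketch of this markdown-style template construction, the snippet below assembles a prompt from named sections; the helper function and section headings are assumptions, not the exact templates used in the paper.

# Hypothetical sketch of assembling a markdown-style prompt from named sections.
def build_prompt(sections):
    """Render a dict of {heading: content} as a markdown-formatted prompt string."""
    return "\n".join(f"# {heading}:\n{content}" for heading, content in sections.items())

user_prompt = build_prompt({
    "Issue Description": "Example issue text goes here",  # filled in per instance
    "Code File": "def foo():\n    ...",                   # filled in per instance
    "Task": "Provide concise instructions to resolve the issue.",
})
print(user_prompt)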


References
[1*] Wang, Xingyao, et al. OpenDevin: An Open Platform for AI Software Developers as Generalist Agents. arXiv preprint (2024).
[2*] Yang, John, et al. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint (2024).
[3*] Zhang, Yuntong, et al. Autocoderover: Autonomous program improvement. arXiv preprint (2024).
[4*] Jessica Shieh. Best practices for prompt engineering with openai api. OpenAI, https://help.openai.com/en/articles/6654000 (2023).

Author Response

Dear reviewers,

Thank you very much for your valuable time in providing these constructive comments and suggestions. We have addressed each of the comments and suggestions by adding more experiments or explanations:


  • Experiments with Different LLMs (15zT, 1wgB): We added experiments with MAGIS using other LLMs, such as DeepSeek [1*] and Llama 3.1 [2*], in addition to GPT-4 as the base model. The results validate that our method is general and can still unlock the potential of other LLMs in solving GitHub issues.
  • Specific Prompt Content (15zT, L511, 1wgB): We added the specific prompt content to make the implementation clearer.
  • Generalizability (gcvU): We added steps to validate the framework's generalizability beyond Python projects in the limitation section in Appendix K.
  • Computing Resources (gcvU, 1wgB): We added more discussions about the computing resources, clarifying the necessity and worthiness of the additional cost.
  • Evaluation and Ablation Analysis (gcvU): We included more explanation about the evaluation and the ablation analysis.
  • Typo Corrections (L511): We corrected all identified typos and checked the paper.
  • Coverage Ratio Calculation (L511): We added more explanation about the calculation of the coverage ratio and cited the relevant diff algorithm [3-4*] for clarity.
  • Comparison with Contemporaneous Works (L511, 1wgB): We added the discussion about the difference between our method and other contemporaneous works [5-9*] in Appendix D and E.

We hope these updates address the reviewers' concerns. We remain open to further discussion and revisions.


References
[1*] DeepSeek-AI. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv preprint (2024).
[2*] Dubey, Abhimanyu, et al. The Llama 3 Herd of Models. arXiv preprint (2024).
[3*] Myers, Eugene W. An O (ND) difference algorithm and its variations. Algorithmica 1.1 (1986).
[4*] Nugroho, Yusuf Sulistyo, et al. How different are different diff algorithms in Git?. EMSE 25 (2020).
[5*] Wang, Xingyao, et al. OpenDevin: An Open Platform for AI Software Developers as Generalist Agents. arXiv preprint (2024).
[6*] Yang, John, et al. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint (2024).
[7*] Zhang, Yuntong, et al. Autocoderover: Autonomous program improvement. arXiv preprint (2024).
[8*] Chen, Dong, et al. CodeR: Issue Resolving with Multi-Agent and Task Graphs. arXiv preprint (2024).
[9*] Ma, Yingwei, et al. How to Understand Whole Software Repository?. arXiv preprint (2024).

Comment

Dear Reviewers and Area Chair,

 

As the discussion deadline approaches, we kindly request all reviewers to review our responses to your comments and share any remaining concerns or questions. We highly value your feedback and want to ensure that all issues are fully addressed before the final decision.

We would also like to specifically thank reviewer 1wgB for the thorough discussion and valuable feedback provided so far. We have carefully responded to all of the concerns raised during the review and discussion phases by adding additional experiments and explanations. We hope that these clarifications might lead to a reconsideration of the current score.

We sincerely appreciate your valuable time and effort in reviewing our submission.

 

Best regards,

The authors

Final Decision

This work proposes a multi-agent approach (Sec. 3) to producing pull requests that solve issues in the SWE-Bench benchmark, inspired by an empirical analysis of common failure modes (Sec. 2). The primary novelty lies in the alignment of these agents with roles that may be found in a SE environment, which is implemented via a series of prompts. As most reviewers noted, it is important that these prompts are made publicly available to facilitate future work, which the authors have promised to do in the rebuttal. Reviewers generally appreciated the methodology and the positive results, leading to the decision to accept the work.

One of the main topics of discussion among the reviewers was the use of "hints" in the results. While these are not necessarily off-limits, their use by default when other work often does not use them can paint an incorrect picture of the results. The authors are strongly encouraged to report results without hints in the main tables, and move the results with hints to a different table. This should not subtract from the value of the work given the positive results reported with hints in Tab. 2.