RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving
A novel approach; Very practical
Summary
Review and Discussion
This paper proposes a novel agent, RepoMaster, that is used to solve code tasks like the ones found in MLE-Bench. RepoMaster is able to search through many GitHub repositories to use published code in its solutions. It also incorporates clever strategies to help the LLM navigate code repositories, thereby improving its performance and saving tokens. Additionally, the paper proposes a new benchmark, GitTaskBench, comprised of tasks including super resolution, style transfer, source separation and more.
Strengths and Weaknesses
Strengths:
I think that the premise of the paper, i.e., how can we leverage published code to solve general machine learning tasks, is interesting. I also think that efficient agents, like RepoMaster, are of interest to the community. The components of RepoMaster, Hybrid Hierarchical Analysis, Code Exploration, and Information Selection, make a lot of sense, and they seem to work in practice, both by improving performance and saving tokens. The paper is clear and well-structured.
Weaknesses:
- How was the subset used for MLE-R chosen? Why was a subset taken rather than the entire MLE-Bench?
- As for other baselines, why wasn't AIDE [1] used? It holds the current state of the art on MLE-Bench.
- I would have liked to see a more comprehensive description of GitTaskBench in the paper. Looking through the Appendix/repo, I found the tasks to be interesting, and I think a comprehensive description in the paper is warranted. I'm interested in the tasks in GitTaskBench, and would like to know more. For example, in some of the tasks, it seems like there could be several correct answers (image coloring, style transfer). How are correct solutions determined?
Less significant, but still noteworthy:
In general I like the style of the figures, but I think they're too information-dense. They would be easier to understand if they were simpler in my opinion.
The descriptions sometimes lack detail. In Table 3's description, it should specify that the ablation study was done on RepoMaster + 4o on GitTaskBench.
I think that seeing RepoMaster results on SWE-Bench would be very interesting. Many of the repos that make up SWE-Bench contain many files and lines of code, and it seems like an agent that uses Hybrid Hierarchical Analysis + Code Exploration could achieve high performance.
[1] Jiang, Zhengyao, et al. "Aide: Ai-driven exploration in the space of code." arXiv preprint arXiv:2502.13138 (2025).
Questions
- How was the subset used for MLE-R chosen? Why was a subset taken rather than the entire MLE-Bench?
- As for other baselines, why wasn't AIDE [1] used? It holds the current state of the art on MLE-Bench.
- I'm interested in the tasks in GitTaskBench, and would like to know more. For example, in some of the tasks, it seems like there could be several correct answers (image coloring, style transfer). How are correct solutions determined?
- In terms of GitTaskBench -- how would the results change if the underlying models were changed to o4-mini or Gemini 2.5 Pro? These are competitive cost-wise with the models used in the paper, and would probably yield scores of 70-75% on GitTaskBench, given that the current state of the art gets nearly 63%. In this case, GitTaskBench would be very close to being solved, already on release. How would you address this? I really like the premise of GitTaskBench, but practically speaking, it seems close to being solved.
[1] Jiang, Zhengyao, et al. "Aide: Ai-driven exploration in the space of code." arXiv preprint arXiv:2502.13138 (2025).
Limitations
Though limitations weren't explicitly discussed, I do not think they're of particular relevance here.
Final Rating Justification
Initially, I felt like there may have been some missing baselines, and that the benchmark was perhaps too easy. After the authors' detailed response, in which they ran more SOTA language models, I'm more convinced that the benchmark is challenging and progress on it would be useful for real world coding. Though revisions cannot be posted, if the rebuttal is included in the final paper, and the minor weaknesses are fixed, I would find this to be a well rounded paper. I therefore raise my score to 5.
Formatting Issues
NO
Thank you sincerely for your supportive and insightful comments. We hope the responses below fully address all of your questions and clarify any uncertainties.
To W1 & Q1:
Thank you for raising this important question. Our rationale and method for MLE-R subset selection are as follows:
1. Computational Considerations
Through preliminary testing, we found that each task requires 10+ hours (including repo search, structural analysis, codegen, and multiple iterative debugging sessions). A full evaluation of 3 different LLMs and 3 frameworks on all 72 MLE-Bench tasks would require ~6,480 hours and substantial API costs. Thus, we selected the smaller MLE-Bench-lite (22 tasks), also officially provided by OpenAI, as our test set.
2. Adopting Standardized Subset
MLE-Bench-lite is a representative, domain-diverse subset, ensuring standardized and comparable evaluation in the ML engineering area.
During use, 2 tasks (detecting-insults-in-social-commentary, the-icml-2013-whale-challenge-right-whale-redux) were excluded due to Kaggle data access issues. To maintain evaluation completeness, we replaced them with two MLE-Bench tasks of comparable domain coverage and difficulty (chaii-hindi-and-tamil-question-answering, tgs-salt-identification-challenge), keeping the total at 22.
3. Representativeness Guarantee
Despite being a subset, our MLE-R still covers diverse ML domains (CV, NLP, time series, etc.), comprehensively evaluating agent repository utilization capabilities. The specific competition IDs have been listed in Appendix Tables 6-10.
We believe this choice balances feasibility and coverage while remaining comparable to prior work. In future work, we are willing to extend to more tasks for evaluation.
To W2 & Q2:
Thank you for the question regarding AIDE. First, note that the task requirements in MLE-R differ from MLE-Bench-Lite (reuse the repo, NOT generate all the code from scratch). While AIDE excels on the original MLE-Bench, its exclusion from our baselines was deliberate:
1. Fundamental Task Scope Differences
AIDE is tailored for scratch code generation for Kaggle competitions, with highly customized workflows to meet the specific requirements of Kaggle tasks. In contrast, RepoMaster focuses on reusing open-source repositories to solve real-world tasks E2E, covering search, understanding, code editing/generation, execution, and debugging in a constrained context.
2. Architectural Limitations
AIDE lacks the essential functions for solving tasks E2E for complex code projects, including:
- a) Structural analysis of existing codebases
- b) Cross-file dependency tracking and navigation
- c) Selective code modification and generation rather than pure code generation
3. Baseline Selection Rationale
SWE-Agent, OpenHands, and our RepoMaster are all general-purpose code agent frameworks, enabling fair comparison.
Notably, OpenHands and SWE-Agent consistently lead the SWE-bench leaderboard on Verified, making them our "must-align" strong baselines to ensure fair and representative positioning of RepoMaster.
4. Supplementary Experimental Results
Finally, to address your concerns about comparability, we conducted additional experiments with AIDE using DeepSeek V3 on the same 22 tasks on MLE-bench (instead of MLE-R, since AIDE is not designed for such tasks as noted above):
- Valid Submissions: RepoMaster 86.36% vs AIDE 63.64%
- Gold Medals: RepoMaster 13.64% vs AIDE 4.55%
Our advantage is obvious, and we will add these explanations to the paper.
Experimental comparison results:
| Model | Make Submission | Valid Sub | Above Median | Bronze | Silver | Gold | Total |
|---|---|---|---|---|---|---|---|
| RepoMaster | 95.45% | 86.36% | 36.36% | 4.55% | 4.55% | 13.64% | 22.73% |
| AIDE | 63.64% | 63.64% | 18.18% | 0% | 0% | 4.55% | 4.55% |
To W3 & Q3:
Hi, thank you for your interest in GitTaskBench. We would like to kindly point out that GitTaskBench is already open source and is included in the submitted paper as reference [26] (GitTaskBench: anonymous GitHub repository). All tasks and code are available at that anonymous GitHub link.
Additionally, we are happy that you want to learn more. Below are our evaluation methods for Execution Completion (EC, i.e., whether the run finishes) and Task Pass (TP, i.e., whether the output is correct) for these tasks:
EC measures the proportion of tasks where the agent successfully executes the target repo and generates valid, non-empty outputs in the required format (e.g., .jpg or .png for image processing tasks).
TP directly quantifies the output quality. It is determined by formulating test functions and defining task-specific success criteria using established metrics, drawing on standards recognized within the domain developer community. TP requires the agent's outputs to satisfy predefined quality standards, such as functional correctness, result completeness, or achieving specific task objectives.
Task-specific success criteria examples:
- Image coloring: CIEDE2000 colour difference ≤ 2.0 and NIQE ≤ 7.0 on the output image
- Video Style Transfer: average SSIM between input and output frames is ≥ 0.7, FID ≤ 400.
- Speech enhancement: PESQ ≥ 2.0 and SNR ≥ 15 dB
Tasks not meeting these thresholds are marked as failures.
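For concreteness, here is a minimal sketch of how such a threshold-based EC/TP check could be wired up for the video style transfer case. The SSIM/FID metric functions are passed in by the caller because the concrete implementations used by the benchmark's test scripts are an assumption here, not something stated in the paper:

```python
# Illustrative sketch only: threshold-based EC/TP checks for a video style
# transfer task. compute_ssim / compute_fid are caller-supplied metric
# helpers (assumed, not taken from the benchmark's actual scripts).
from pathlib import Path

SSIM_THRESHOLD = 0.7   # average SSIM between input and output frames >= 0.7
FID_THRESHOLD = 400.0  # FID between input and output frame sets <= 400


def execution_completed(output_path: Path, allowed_suffixes={".mp4", ".avi"}) -> bool:
    """EC: the agent produced a non-empty output in the required format."""
    return (output_path.exists()
            and output_path.stat().st_size > 0
            and output_path.suffix.lower() in allowed_suffixes)


def task_passed(input_frames, output_frames, compute_ssim, compute_fid) -> bool:
    """TP: the output additionally meets the task-specific quality thresholds."""
    pairs = list(zip(input_frames, output_frames))
    avg_ssim = sum(compute_ssim(i, o) for i, o in pairs) / len(pairs)
    fid = compute_fid(input_frames, output_frames)
    return avg_ssim >= SSIM_THRESHOLD and fid <= FID_THRESHOLD
```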
Due to space limits and our focus on framework design, we sincerely apologize for not adding a "comprehensive description" in this version. However, we promise that in the revised version of this paper, we will add this content to the appendix.
Thank you again for your interest!
To Else in Weakness:
About the figs
Thank you for your suggestion and appreciation. Due to space limits, we combined multiple figures and compressed their layout. We fully agree that expanding the figure area would help. If accepted, we will present clearer, more spacious figures.
About the table caption
Thank you for noting this. Currently, these experimental details are only explained in Section 4.4. We agree that Table 3 should specify that the ablation was on RepoMaster + 4o with GitTaskBench. We will clarify this in the next version.
About the results on SWE-Bench
Thank you for your suggestion and interest in our work. Your perspective on testing RepoMaster on SWE-Bench is very insightful. We agree that SWE-Bench's large, complex codebases are ideal for evaluating our Hybrid Hierarchical Analysis + Code Exploration method.
We have added SWE-Bench validation to our near-term roadmap and will submit our results upon completion. Ongoing work also includes training RepoMaster with reinforcement learning to further enhance its performance on complex repositories. Experiments are underway, and we look forward to reporting more progress soon.
To Q4:
Thank you for thoughtfully considering GitTaskBench's scalability. The Claude series is widely recognized as a leading closed-source LLM family for code. Following your suggestion, we ran additional experiments with Gemini-2.5-pro (stable).
Interestingly, on the more comprehensive and practical GitTaskBench benchmark, Gemini-2.5-pro achieved lower results than Claude 3.5 in our experiments. This suggests that Claude 3.5 may already be approaching the current performance upper bound on GitTaskBench.
| Model | Execution Completion Rate (%) | Task Pass Rate (%) |
|---|---|---|
| RepoMaster + Gemini-2.5-Pro | 72.22% | 55.56% |
| RepoMaster + Claude 3.5 | 75.92% | 62.96% |
We have further evaluated the potential impact of newer models:
1. Long-tail Distribution Characteristics of Task Difficulty
GitTaskBench reflects real-world task difficulty distribution: from simple PDF parsing to complex VideoPose3D pose estimation, with exponential growth in difficulty. Even if the overall completion rate reaches 70%, the remaining 30% tasks still require deep code understanding, complex dependency management, and end-to-end task resolution capabilities. This long-tail distribution ensures the benchmark's continued relevance, where 70% is merely a passing grade.
2. Diminishing Returns of Performance Improvements
Our experimental analysis reveals a key insight. When we upgraded the model from GPT-4o to Claude 3.5:
- Task pass rate improved from 40.74% to 62.96%
However, in-depth analysis of this 22% improvement revealed:
- >50% of the improvement came from increased success rates in environment configuration and dependency installation
- <20% of the improvement came from enhanced core codebase exploration and task execution capabilities
This indicates that models still have enormous room for improvement in handling autonomous exploration and end-to-end execution of complex codebases.
3. Critical Role of Algorithm Design
Using the same Claude 3.5 model, different frameworks showed dramatic performance differences:
| Framework | Task Pass Rate (%) | Token Consumption |
|---|---|---|
| RepoMaster | 62.96% | 154k tokens |
| OpenHands | 24.07% | 3094k tokens |
This significant token-efficiency difference and performance gap demonstrate that, even as underlying model capabilities improve, agent framework design and efficiency constraints remain crucial.
For reference, on SWE-Bench, Gemini 2.5 Pro achieved 63.8% and Claude 4 approaches 80%, indicating that pure code-repair tasks are being rapidly solved.
Instead, GitTaskBench, by requiring agents to handle complete codebase understanding, dependency management, error diagnosis, and end-to-end task resolution workflow, provides more comprehensive evaluation dimensions. This represents the core optimization direction for code agents in the next phase, and we will regularly introduce new tasks to maintain the challenge level.
Thanks for your response, it has addressed basically all of my concerns. I just have one question left: in the colorization task, where the input is a black and white image, is the output well defined? If someone has a blue shirt on, in black and white the shirt could be red/green/blue/etc. Could there not be several valid output images for a given input? And not just for colorization, but for stylizing images, and other tasks as well.
Hi, thank you very much for your positive feedback—we are glad that our previous response has "addressed basically all of your concerns."
We do understand your detailed question on the output definition. In fact, in our evaluation we have tried our best to strike a balance between assessing output quality against practical task requirements and maintaining a robust, automated evaluation that can accommodate the diverse outputs produced by various agents. To achieve this, we either design more flexible test scripts that can recognize and accept a wider range of valid outputs or define clear and specific output requirements.
While our benchmark is task-oriented, it represents a complex, real-world end-to-end workflow, so the execution process naturally includes a range of exploration strategies. For example, agents may select different parameters for built-in repo models or, in rare cases, perform independent code generation (“fail to follow instruction”) instead of fully utilizing the repository. As such, our evaluation focuses on the final output, intentionally allowing for diversity in results. As you pointed out, if a shirt appears in a black-and-white input, we permit outputs in any color (red, green, blue, etc.), as long as the result is vivid and meets all practical requirements (like "CIEDE2000 colour difference ≤ 2.0 and NIQE ≤ 7.0 on the output image") specified in our test scripts. Indeed, "there could be several valid output images for a given input."
Therefore, in defining the output, we focus on the output path, file naming, and ensuring the format matches the task requirements (e.g., jpg or png for image processing tasks), without imposing strict constraints on the specific content details.
Additionally, our prompts explicitly require/encourage agents to use the repository to complete the task. In principle, if the agent correctly calls the repo, the output should reproduce the repository’s intended results or those of its included models.
Thank you again for your interest and engagement! We sincerely hope our answers have further strengthened your confidence and satisfaction with our work, and we TRULY appreciate your continued feedback.
Thank you for addressing my concerns. I have raised my score accordingly.
We sincerely appreciate the time and effort you devoted to reviewing and discussing our paper. We are truly grateful for your increased confidence in our work and for your positive engagement and interaction during the rebuttal period.
Thank you again for your support and thoughtful feedback.
This paper introduces RepoMaster, a novel automated framework designed to effectively leverage code repositories for solving complex real-world tasks end-to-end. RepoMaster aims to comprehend code in a goal-oriented, human-like manner by integrating multiple advanced techniques. Specifically, it combines hybrid structural hierarchy modeling, core component identification, and context-aware code exploration with efficient information selection strategies to navigate and analyze large codebases autonomously.
RepoMaster operates in three main stages: (1) Repository Search to identify relevant repositories based on user intent, (2) Hierarchical Repository Analysis to construct a detailed structural map of the code, and (3) Autonomous Exploration & Execution to dynamically interact with the repository, perform task-specific actions, and refine its understanding through iterative exploration.
The paper demonstrates the effectiveness and efficiency of RepoMaster by conducting experiments on diverse, complex tasks drawn from the MLE-R and GitTaskBench datasets. In comparison with existing frameworks such as OpenHands and SWE-Agent, RepoMaster achieves superior performance in automating complex repository exploration and execution, validating its potential to address real-world software development challenges.
Strengths and Weaknesses
The framework is inspired from human programmer behaviour to explore the unfamiliar codebases. The integration of context-aware code exploration tools with interactive execution and log filtering is a practical step for working within LLM context constraints while mimicking real developer behaviour.
The use of ASTs and graphs (HCT, FCG, MDG) provides a principled structure-aware foundation to the framework.
However, there are some weaknesses:
(1) No mention of how the model adapts or generalizes to diverse languages or non-Python repositories.
(2) The importance scoring seems entirely heuristic-based, lacking learning-based validation or ablation.
(3) How does RepoMaster deal with stale or broken dependencies found during exploration?
(4) What are the assumptions on the quality of README or internal documentation? What happens when the README is missing or incorrect?
Questions
Q1. Why restrict analysis to .py files only? Many practical projects involve configurations (.yaml, .json), scripts (.sh), or compiled extensions (.cpp). Does this limit applicability?
Q2. Were the weights in the importance scoring scheme manually set or learned? How are the weights (and the corresponding contribution) determined for each feature?
Q3. Did the authors consider using GNNs or unsupervised learning for structural importance prediction? Are there any negative insights or learnings from such experiments?
Q4. What is the fallback strategy when errors persist (e.g., missing dependencies, ambiguous import paths)?
Limitations
I do not find an explicit discussion on the limitations of the framework. Highly encourage the authors to present a discussion on the limitations as it would help the future works to understand the challenges effectively. Answering some of the questions I have asked earlier could give you a discussion on the limitations of the framework.
Final Rating Justification
The paper is technically solid. The authors have shown a detailed evaluation of the proposed method. However, generalization of the framework to non-Python repositories remains to be explored.
Formatting Issues
Haven't noticed any formatting issues
Hi, thank you sincerely for your valuable and detailed feedback. We have provided comprehensive responses to all your comments, hoping this further increases your confidence in our work and overall satisfaction.
To W1 & Q1:
We apologize for any confusion caused by our wording. In fact, our implementation is not limited to Python files—all types of files (including .yaml, .json, .sh, .cpp) are parsed for tree-based structure construction. Our information selection strategies also apply across diverse formats.
We only emphasized “Python files” SIMPLY because our selected repositories are all Python-based, and the tasks themselves are in the machine learning domain (most mainstream repositories we found in the deep learning area are Python-based, especially for ML tasks).
Specifically, our framework employs two complementary exploration strategies:
- Graph-based Exploration: For languages with rich structural information (e.g., Python, Java, C++)
- Tree-based Hierarchical Exploration: As the primary exploration method for scripting languages (.sh) or structurally simple project files (.yaml, .json)
While we chose Python for evaluation—given its dominance in deep learning/machine learning and alignment with our benchmark tasks—our core approach is designed fundamentally language-agnostic with broad applicability.
The construction of the Hierarchical Component Tree (HCT), Function Call Graph (FCG), and Module Dependency Graph (MDG) relies on Abstract Syntax Tree (AST) parsing, which applies to most mainstream languages, including C++, Java, JavaScript, etc.
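To illustrate the kind of parsing this relies on, here is a minimal Python-only sketch using the standard `ast` module. The actual RepoMaster implementation, and the parsers it would use for other languages, are not shown in the paper, so treat this as an assumption-laden illustration rather than the real pipeline:

```python
import ast
from pathlib import Path


def extract_structure(py_file: Path) -> dict:
    """Collect classes, functions, and imports from one Python source file.

    Per-file records like this are the raw material from which hierarchical
    trees (HCT) and dependency/call graphs (MDG/FCG) can be assembled.
    """
    tree = ast.parse(py_file.read_text(encoding="utf-8"))
    classes, functions, imports = [], [], []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            classes.append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append(node.name)
        elif isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.append(node.module)
    return {"file": str(py_file), "classes": classes,
            "functions": functions, "imports": imports}

# Module dependency edges (a crude MDG) can then be derived by matching each
# file's imports against the module names of the other files in the repo.
```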
In principle, our methods can be directly adapted to diverse languages and non‑Python repositories.
We acknowledge this multi-language extensibility as an important direction for future work and plan to extend the evaluation scope to other-language repositories in the next version. RepoMaster's modular design makes this extension straightforward, and we anticipate similar performance improvements across different programming languages.
To W2 & Q2 & Q3:
The importance scoring scheme was developed through extensive empirical trials and observations. We apologize for not providing a systematic, detailed validation or ablation study in the main paper—primarily because
- (1) our experiments found this aspect was not critical to end-to-end performance, and
- (2) space constraints. We will include the experimental design and additional results in the appendix. The discussion will be added to the Limitations.
The weights in our importance scoring scheme were manually set, guided by domain expertise and numerous small-scale experiments, rather than by large-scale optimization or systematic ablation. We initially explored various weight combinations and adjusted them to maximize the recall of important files during identification. However, comprehensive experimental validation showed that, while weights may affect the initial recall rate of important files, this has minimal impact on the overall end-to-end task success rate. As a result, we did not prioritize fine-grained weight sensitivity in our main study, and ultimately adopted equal weighting, which achieved satisfactory recall performance while keeping the scheme simple and interpretable.
Our experimental process included:
1. Test Set Construction and Weight Optimization
We manually constructed a test set of core file modules from several repositories, and empirically optimized the weights to improve the recall of important files within RepoMaster.
2. Feature Screening and Ablation Experiments
We systematically removed redundant or overlapping feature dimensions and conducted ablation experiments on each single feature. This led to retaining the six features described in the paper, all assigned equal weights for both simplicity and effectiveness.
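As a concrete illustration of the equal-weight scheme described above, here is a minimal sketch using the six feature names that appear in the weight-tuning table later in this thread. The feature values and their normalization to [0, 1] are assumptions for illustration, not taken from the implementation:

```python
# Equal-weight importance scoring over six per-module features
# (dependency centrality, complexity, usage, semantic relevance,
#  documentation quality, git activity). Feature values are assumed
#  to be pre-normalized to [0, 1]; real definitions may differ.
FEATURES = ["dependency", "complexity", "usage", "semantic", "doc", "git"]
WEIGHTS = {f: 1.0 for f in FEATURES}  # equal weights, as described above


def importance(module_features: dict) -> float:
    return sum(WEIGHTS[f] * module_features.get(f, 0.0) for f in FEATURES)


def top_k_modules(all_modules: dict, k: int = 20) -> list:
    """Rank modules by importance and keep the top-k as candidate core files."""
    ranked = sorted(all_modules, key=lambda m: importance(all_modules[m]), reverse=True)
    return ranked[:k]
```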
GNNs or unsupervised learning for structural importance prediction
It is a direction with great complementary potential, and we also look forward to exploring the possibilities of future technical integration. However, our current considerations are mainly from the following dimensions:
a. Interpretability: Our feature-based approach clearly explains why certain modules are considered important, which is crucial for debugging.
b. Computational Efficiency: Our method runs with small time overhead, completing the entire preprocessing on average within 4.8 seconds for typical repositories (1-3k files).
c. Generalization Ability: Rule-based features have better generalization across different programming languages, including C++, Java, JavaScript, etc.
To W3 & W4:
Thank you for raising these important practical concerns. Indeed, handling stale dependencies and unreliable documentation are critical challenges for real-world repository utilization.
Handling Stale/Broken Dependencies: RepoMaster incorporates robust error recovery mechanisms specifically designed for such scenarios. As demonstrated in our GitTaskBench experiments (Section E.2), when encountering missing dependencies (e.g., "ModuleNotFoundError"), RepoMaster:
- Automatically identifies and installs missing packages through iterative error analysis;
- Adapts execution strategies when dependencies fail (e.g., switching from GPU to CPU versions);
- Leverages our hierarchical repository analysis to find alternative implementations when primary paths fail.
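A hedged sketch of the simplest recovery path described above (missing-module detection plus install-and-retry); the command interface, retry limit, and error parsing are illustrative assumptions, not RepoMaster's actual implementation:

```python
import re
import subprocess
import sys

MISSING_MODULE = re.compile(r"No module named '([^']+)'")


def run_with_dependency_recovery(cmd: list, max_attempts: int = 5) -> bool:
    """Re-run a repository command, installing any missing module it reports.

    Only the ModuleNotFoundError -> pip install -> retry path is shown;
    strategy switches such as GPU->CPU fallbacks are omitted. Note that the
    pip package name can differ from the import name, so a real agent would
    also need a mapping or search step.
    """
    for _ in range(max_attempts):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True
        match = MISSING_MODULE.search(result.stderr)
        if not match:
            return False  # not a missing-dependency failure; hand off to other recovery
        subprocess.run([sys.executable, "-m", "pip", "install", match.group(1)])
    return False
```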
README Quality Assumptions: While READMEs provide valuable initial guidance, RepoMaster is explicitly designed NOT to rely solely on documentation. As highlighted in the Introduction (Lines 39–43), encountering missing or unreliable documentation is a common challenge. Our hierarchical repository analysis (Section 3.2) builds three complementary structural representations (HCT, FCG, MDG) that provide ground-truth understanding and prepare for exploratory tools, all independently of documentation quality. Ultimately, it is the combination of task understanding and effective, efficient repository exploration that matters most. This allows RepoMaster to significantly outperform baseline methods that depend primarily on README instructions or attempt to process the entire codebase, which quickly exhausts context windows.
Missing/Incorrect README: When documentation is absent or misleading, RepoMaster's exploration tools (Section 3.3.1) enable autonomous discovery through:
- Structural analysis to identify entry points and core components;
- Dependency tracing to understand component relationships;
- Iterative execution with feedback-based learning. The case study in Fig. 3 and Fig. 2 illustrates this capability: despite incomplete documentation, RepoMaster successfully completed the task through structural understanding and adaptive exploration, while baselines failed.
To Q4:
Thank you for raising this important question about error handling strategies. Our RepoMaster framework implements a systematic multi-layer fallback mechanism that can effectively handle persistent errors:
1. Robust Failure Recovery through Comprehensive Multi-Agent Architecture
To handle cases such as missing dependencies, our hierarchical multi-agent system is equipped with specialized recovery agents:
Schedule Agent (Orchestrator)
├── DeepSearch Agent: Repository discovery
├── Code Agent: Task exploration and execution with repository
├── Issue Fix Agent: Analyzes GitHub issues for known problems/solutions
├── Dependency Agent: Handles environment setup and dependency conflicts
When the Code Agent encounters execution failures—such as dependency conflicts, ambiguous import paths, or missing functionality—the Schedule Agent coordinates the following recovery workflow:
- (i) Analyzes failure patterns to generate structured feedback
- (ii) Uses exploratory tools to inspect related files/code, extract structural information, and iteratively attempt recovery strategies
- (iii) Adjusts criteria based on the specific failure patterns (e.g., if a computer vision task fails due to a lack of GPU support, it prioritizes CPU-compatible alternatives)
- (iv) If initial attempts fail, we implement reflection-based recovery strategies, see step 2 below.
- (v) Graceful Termination: If errors persist after up to 5 attempts (configurable), the system safely ends the current task, logs the unresolved issue, and continues to the next, ensuring pipeline robustness.
All recovery steps are documented for future improvement. This adaptive multi-agent fallback both maximizes error recovery and enhances overall reliability and practical applicability.
2. Reflection Trajectory Optimization
2.1 Execution Trajectory Analysis: The system analyzes past execution trajectories to identify recurring failure patterns
2.2 Optimal Path Extraction: We extract exploration history, retaining only the most informative execution trajectories while pruning redundant or failed attempts
2.3 Context Optimization: This ensures the LLM's limited context window focuses on high-value information in the next attempt
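An illustrative sketch of steps 2.1-2.3; the trajectory representation, token accounting, and pruning heuristic below are assumptions for illustration only:

```python
def compress_history(steps: list, budget_tokens: int) -> list:
    """Keep only informative steps from an execution trajectory.

    Each step is assumed to be a dict like
    {"action": ..., "observation": ..., "ok": bool, "tokens": int}.
    Failed attempts are collapsed to a short note so the model still knows
    what was tried, and remaining steps are kept (most recent first) until
    the token budget is exhausted.
    """
    kept, used = [], 0
    for step in reversed(steps):
        if not step["ok"]:
            note = {"action": step["action"],
                    "observation": "FAILED (details pruned)",
                    "ok": False, "tokens": 16}
            if used + note["tokens"] <= budget_tokens:
                kept.append(note)
                used += note["tokens"]
            continue
        if used + step["tokens"] <= budget_tokens:
            kept.append(step)
            used += step["tokens"]
    return list(reversed(kept))
```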
3. Empirical Effectiveness
As demonstrated in our case study (Fig. 3), this strategy is crucial for complex tasks. For example, in the 3D pose estimation task:
- RepoMaster: Successfully recovered from multiple dependency errors through intelligent backtracking
- Baseline methods: Either failed completely or exhausted resources in futile attempts (OpenHands, ~140 iterations)
This multi-layer approach ensures RepoMaster maintains both robustness and efficiency: Achieving our reported 62.96% task success rate while using 95% fewer tokens than baselines.
To Limitation
We fully agree that an explicit discussion of the framework’s limitations would be helpful. Thanks for your valuable suggestion. We will include a detailed discussion addressing all your earlier questions in the appendix of the next version.
Thanks for the clarification in the rebuttal. I have gone through the discussion presented by the authors. The authors have not provided any experimental evidence for W1 and Q2. I would highly appreciate it if some discussion could be presented in a later version of the draft. The extension beyond Python might be trivial for the authors (W1); however, it would add a lot of value to the framework. For Q2, I would request the authors to present the evidence from the empirical trials and observations for completeness.
Dear Reviewer 77DM,
We hope this message finds you well.
First and foremost, thank you sincerely for your valuable time and constructive feedback on our work. Your insights have been instrumental in refining our research.
We have carefully (and promptly) addressed your previous concerns in our latest responses, with additional experimental results and empirical evidence. We hope these clarifications sufficiently resolve the issues you raised. If there is anything further you would like us to clarify or expand upon, please let us know—we are more than happy to provide additional information.
Repomaster is a practical agent framework designed for solving complex, real-world tasks, with a core focus on leveraging repositories through enhanced autonomous understanding and efficient exploration. We believe this work has significant potential to benefit the community, and we are deeply committed to addressing any remaining concerns and ensuring its quality.
We greatly appreciate your expertise and feedback, and we hope our efforts and responses help to increase your confidence in our work. Please feel free to share any further thoughts or comments at your convenience.
Thank you again for your time and consideration.
Best regards,
The Authors of Paper 8772
Thanks for the detailed comments. I will maintain the positive rating.
Hi, thank you very much for your reply—we are glad to hear from you again. We assure you that we will "include a dedicated discussion on both multi-language extension and the experimental process for determining weights in the importance scoring scheme in the next version of the draft".
As you noted, while "the multi-language extension beyond Python (W1) might be trivial for this paper" (considering that the current selected repositories are mainly Python-based), we fully agree with you that "it would add a lot of value to our framework". We have observed an important trend in the community to extend original benchmarks to multiple languages, such as Multi-language SWE-Bench.
In fact, we have planned a follow-up project to develop a multi-language evaluation set for GitTaskBench, which will allow a more comprehensive assessment of our RepoMaster’s performance. However, as this is a substantial undertaking, we really regret that it can only remain an ongoing focus for our future work/paper at this stage.
For Q2, as mentioned in our previous responses, our weights were initially set to 1 for simplicity and interpretability, based on earlier empirical trials and observations following the process we described. At that time, our main goal was to observe recall and overall task phenomena and to draw related conclusions, so we did not include these experimental results IMMEDIATELY in order to save space (we apologize for this omission). For evidence, we have, in fact, conducted similar hyperparameter selection experiments using the same process, as referenced in our experimental results of the response “To Q3:” for Reviewer aqw3.
In response to your comments, we have, in fact, conducted additional tests (though, as previously noted, the analysis might still not be fully systematic, given the combinatorial possibilities with six weights). We apologize for not providing this "not fully systematic" data sooner, but we hope that sharing these results NOW will help clarify our approach to empirical trials and observations.
Specifically, we selected 5 repositories and 30 core files, tuned the weights to optimize important file recall, and reran 10 corresponding end-to-end tasks using DeepSeek V3.
| Dependency | Complexity | Usage | Semantic | Doc | Git | Recall Rate (%) | Execution Completion Rate (%) | Task Pass Rate (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 70.00 | 60 | 30 |
| 5 | 1 | 1 | 1 | 1 | 1 | 76.67 | 60 | 30 |
| 1 | 5 | 1 | 1 | 1 | 1 | 66.67 | 50 | 30 |
| 1 | 1 | 5 | 1 | 1 | 1 | 73.33 | 70 | 30 |
| 1 | 1 | 1 | 5 | 1 | 1 | 63.33 | 50 | 30 |
| 1 | 1 | 1 | 1 | 5 | 1 | 60.00 | 60 | 30 |
| 1 | 1 | 1 | 1 | 1 | 5 | 76.67 | 60 | 30 |
| 6 | 2 | 4 | 1 | 1 | 6 | 76.67 | 70 | 40 |
| 6 | 2 | 4 | 1 | 1 | 8 | 80.00 | 70 | 30 |
| 6 | 2 | 4 | 1 | 1 | 10 | 76.67 | 60 | 30 |
- We observed that the Dependency, Usage, and Git features had a greater impact on the recall rate for important files compared to the other features, while overall end-to-end task completion quality was less sensitive to these weights.
- End-to-end execution of each task typically takes ~20 minutes, compared to ~4.8 seconds to retrieve important files for repositories (1–3k files).
Thus,
- "While weights may affect the initial recall rate of important files, this has minimal impact on the overall end-to-end task success rate," as described in the previous response.
- While entry-point important files provide a good starting point, the effectiveness of exploration and localization strategies in the process is even more crucial.
Thank you again for your constructive suggestions and engagement. We sincerely hope these additional explanations have further strengthened your satisfaction with our work. Please feel free to reach out if you have any further questions or need additional clarification.
Your further feedback is TRULY appreciated.
This paper proposes RepoMaster, an automated agent framework that solves complex real-world tasks by reusing GitHub repositories. It combines structural hierarchy modeling, core component identification, and context-aware code exploration to help LLMs operate efficiently under context limits. This paper also introduces a new benchmark, GitTaskBench, and shows that RepoMaster outperforms strong baselines like OpenHands and SWE-Agent on both GitTaskBench and MLE-R.
Strengths and Weaknesses
Strengths:
- This paper presents a comprehensive static analysis pipeline that models repositories via hierarchical code trees, function-call graphs, and module dependency graphs. It further ranks modules and classes using multiple features (e.g., centrality, complexity, usage, doc quality) to identify core components, allowing the agent to focus its limited LLM context on the most relevant parts of the codebase.
- RepoMaster achieves much higher success rates than OpenHands and SWE-Agent on multiple tasks and LLMs.
- The authors introduce GitTaskBench, a suite of diverse tasks each tied to a concrete GitHub repo, and evaluate on multiple LLMs and agents.
Weaknesses:
- Many of RepoMaster’s ideas, e.g., repository graphs and context selection, overlap with recent work. For example, the concurrent RepoGraph [1] system also builds a repository-level graph structure to guide LLM code agents. The CGM [2] approach similarly integrates a code graph into an LLM’s attention. RepoMaster’s novelty seems to lie mostly in the engineering of combining known techniques (AST traversal, call/dependency graphs, heuristic scoring) rather than introducing fundamentally new algorithms. As a result, its contribution feels somewhat incremental relative to these related methods.
- The experiments only compare with OpenHands and SWE-Agent, but do not include newer repository-level methods like RepoGraph [1] or other Agents works, such as Agentless [3]. Adding such baselines would better position RepoMaster within the current literature.
- The proposed GitTaskBench contains only 18 repositories and 54 tasks (line 230-231). While it covers diverse domains, the relatively small size may limit the generality of the conclusions. A larger benchmark would better demonstrate the robustness of the method.
[1] Ouyang, S., Yu, W., Ma, K., Xiao, Z., Zhang, Z., Jia, M., ... & Yu, D. (2024). RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph. arXiv preprint arXiv:2410.14684.
[2] Tao, Hongyuan, et al. "Code Graph Model (CGM): A Graph-Integrated Large Language Model for Repository-Level Software Engineering Tasks." arXiv preprint arXiv:2505.16901 (2025).
[3] Xia, Chunqiu Steven, et al. "Agentless: Demystifying llm-based software engineering agents." arXiv preprint arXiv:2407.01489 (2024).
Questions
- The implementation filters to Python files. Would the approach extend easily to, e.g., multi-language projects (C++, Java) on GitHub?
- Will the authors release the GitTaskBench tasks and code for RepoMaster?
- How were the thresholds (e.g. top-20 modules, top-10 classes, context window sizes) chosen? Is performance sensitive to these?
Limitations
The proposed GitTaskBench contains only 18 repositories and 54 tasks (line 230-231). While it covers diverse domains, the relatively small size may limit the generality of the conclusions. A larger benchmark would better demonstrate the robustness of the method.
Final Rating Justification
Some of my concerns have been addressed, and I'll raise my score accordingly.
Formatting Issues
The paper is generally well-organized and clearly formatted. Figures and tables are labeled and referenced properly.
Response to W1:
Hi, thank you for noting the connections with RepoGraph[1] and CGM[2]. We appreciate these parallel works and look forward to potential future integration.
However, RepoMaster differs fundamentally from RepoGraph/CGM in task problem definition, method design, and technical contributions:
- RepoMaster targets real, end-to-end task completion via open-source project reuse, evaluated based on MLE-R/GitTaskBench with multi-domain, verifiable tasks. NOT just single-repo code repair (SWE-bench) as in RepoGraph/CGM. Task complexity drives distinct technical approaches.
- Ours is application-driven: all methods serve practical use. RepoMaster's novelty lies NOT in code graph construction, but in using structured understanding as a tool for context-aware exploration. Core contributions are task-driven static-dynamic collaboration, autonomous exploration, information selection, enabling a closed loop in real-world E2E tasks.
Retrieve suitable repo → Understand its functionality → Configure environment → Code generation → Execute debugging → Convergence → Generate verifiable output
Key Technical Contributions
RepoMaster is designed for reusing repositories to solve real E2E tasks—requiring both repo understanding and full task execution under context constraints:
(1) Hybrid Structural Repository Mapping
AST traversal extracts modules/classes/functions, constructs three complementary structures (HCT/FCG/MDG), and identifies critical modules as task entry points under context limits.
(2) Autonomous Exploration-Execution Closed Loop
Rather than statically embedding graphs into attention, agents dynamically switch between:
- Code viewing ↔ Dependency tracking
- Code generation/modification ↔ Execution debugging
under a limited context.
Fig. 2 shows how this iterative process gradually locates and resolves practical issues like missing models and dependency errors.
(3) Context-aware Information Selection
To address the key challenge of limited context in agent applications, we propose programmer-inspired, multi-level information selection with reduction strategies for code, docs, and feedback logs.
3. Significant Performance Advantages
- Execution completion: 75.92% (vs OpenHands 48.15%)
- Task pass: 62.96% (vs 24.07%)
- Token use reduction: 95% (154k vs 3094k)
Ablation experiments (Table 3) prove all components' contribution, validating the effectiveness of our algorithmic framework.
Besides, reference CGM[2] is a concurrent work that was made public on arXiv AFTER our submission (22 May 2025).
Response to W2:
We fully agree that more recent, powerful comparisons could further clarify RepoMaster's positioning. However, for the real-world E2E tasks in our benchmarks—which require comprehensive agent capabilities—many frameworks cannot handle such tasks well.
RepoGraph[1] is a technique that could be integrated into general agent frameworks as a module, but lacks true end-to-end task-solving capability. Agentless[3] is an earlier "procedural framework", NOT an "agent framework" (per [1]), and is LESS powerful/general than our baselines: OpenHands and SWE-Agent (identified as agent frameworks in [1]).
Baseline selection rationale:
1. SOTA Agent Frameworks
OpenHands and SWE-Agent consistently lead the SWE-bench Verified leaderboard (70.4%, 66.4%), while Agentless variants score substantially lower (only 50.8% task resolution rate), and RepoGraph is not even on the leaderboard (lacks full pipeline support). In [1], Agentless + RepoGraph was ONLY 29.67% on SWE-Bench-Lite, far below our baselines. Thus, we use OpenHands and SWE-Agent as "must-align" strong baselines for fair, representative positioning of RepoMaster.
2. End-to-End Task Capability
Our task setting requires full E2E execution on complete repositories.
- RepoGraph: Primarily focuses on code repair, unable to handle complete E2E scenarios
- OpenHands & SWE-agent: Currently the most-used, daily-updated, mature general Agent systems in the open-source community, long dominating the SWE-bench Verified, with proven E2E capability.
Response to W3 & Limitations:
First, we'd like to kindly remind you that our RepoMaster evaluation covers NOT only GitTaskBench but also MLE-R—a revision of MLE-Bench-Lite—comprising 22 ML Kaggle tasks and 22×3 repositories.
In total, we evaluate 76 comprehensive, real-world tasks across 120 repositories.
Second, beyond task domain diversity, the properties of our selected full-stack repositories in GitTaskBench are also highly varied and general:
- Repo Files: 7–1157 (Avg. 204)
- Intra-repo Dependencies: 33–6979 (Avg. 1242.7)
- Intra-repo Calls: 180–40,552 (Avg. 8,651)
- Functions: 25–4,915 (Avg. 1,274.8)
- Lines of Code: 0.58k–351.42k (Avg. 52.63k)
These statistics demonstrate that GitTaskBench is diverse not only in task domains, but also in the repositories themselves—a critical aspect for evaluating repository-centric tasks.
Third, we acknowledge your concern about the task/scale. However, existing comparable benchmarks at this complexity and comprehensiveness level are similarly scoped:
- Full MLE-Bench[1]: 72 ML tasks
- ML-Bench-A[2]: 55 (for script) + 13 (for code) repo-level ML-related tasks
- MLAgentBench[3]: 13 ML experiments
- SWE-Bench[4] (smaller-granularity program-repair tasks): 12 Python repos
- M3ToolEval[5] (multi-tool, multi-turn calling): 82 tasks, 5 domains
- PaperBench[6]: 20 tasks/repos/papers
While more tasks can further enhance generality, our method’s capabilities are rigorously validated on two strong, diverse benchmarks, supporting robust and generalizable conclusions.
[1] MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. 2024
[2] ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code. 2023
[3] MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation. 2023
[4] SWE-bench: Can Language Models Resolve Real-World GitHub Issues? 2023
[5] Executable Code Actions Elicit Better LLM Agents. 2024
[6] PaperBench: Evaluating AI's Ability to Replicate AI Research. 2025
To Q1:
Thanks for raising this insightful question on multi-language extensibility.
First, we apologize for any confusion. Our approach parses all file types for tree-based structure construction; we only emphasized “Python files” simply because our benchmark repositories and tasks (ML/DL) are Python-based.
(Most mainstream repositories we found in the DL area are Python-based, especially for ML tasks.)
Second, our approach is designed to be language-agnostic and readily extensible. While we evaluated on Python due to its ML dominance, Abstract Syntax Tree (AST) parsing-based HCT/FCG/MDG construction also applies to other mainstream languages (C++, Java, JavaScript, etc).
Third, we use two complementary exploration strategies:
- Graph-based Exploration: For languages with rich structural information (e.g., Python, Java, C++)
- Tree-based Hierarchical Exploration: As the primary exploration method for scripting languages (.sh) or structurally simple project files (.yaml, .json)
This dual approach ensures robust performance across different language paradigms. Based on our empirical observations, RepoMaster adaptively selects the most appropriate strategy, using:
- tree-based search for simpler repositories
- graph-based analysis for complex projects with intricate dependencies
We acknowledge this is an important direction and plan to extend the evaluation to multi-language repos in future work. RepoMaster's modular design allows straightforward extension, and we anticipate similar gains across languages.
To Q2:
Hi, thank you for your interest. Note that GitTaskBench is already open source; its specific anonymous link has been placed in the current paper as reference [26]. However, it may have been missed as it wasn’t in a footnote. Please refer to [26]! All tasks and code are available there!
RepoMaster code has also been uploaded to GitHub, and will be made public in future paper versions.
To Q3:
Thanks for your suggestion. Due to space limits, parameter selection and sensitivity analysis details were omitted from the main text. We agree they are important and will provide full experimental details in the appendix of future versions.
Our experiments show that while hyperparameters (e.g., top-k modules/classes) may affect initial file recall, their impact on E2E task success is minimal, so we did not focus on fine-grained sensitivity.
Top-k Module Selection:
- Manually built core file modules of several repositories as our test set, and tuned k for the recall of important file sets.
- gradually increased k values; top-20 recall reached 70%; larger k offered limited gains.
| Top-k | Recall Rate (%) | Relative Improvement |
|---|---|---|
| Top-5 | 46.6 | - |
| Top-10 | 60.0 | +13.4% |
| Top-20 | 70.0 | +10.0% |
| Top-30 | 73.3 | +3.30% |
Top-20 balances recall and efficiency.
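For reference, the recall numbers above can be reproduced with a simple recall@k computation over the hand-labeled core files; the identifiers and example figures below are illustrative:

```python
def recall_at_k(ranked_files, gold_core_files, k):
    """Fraction of hand-labeled core files that appear in the top-k ranking."""
    top_k = set(ranked_files[:k])
    return len(top_k & set(gold_core_files)) / len(gold_core_files)

# e.g., with 30 labeled core files, 21 recovered at k=20 gives 21 / 30 = 0.70
```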
Top-k Class Selection: k=10 set based on the avg. length of each class and the total token budget allocated for classes in the initial context.
Context Window: Empirically set as "max LLM context length ÷ avg. execution rounds" (~8k tokens). When the context per interaction exceeds this, we perform focused information refinement. When the history exceeds 2/3 of the max context, we extract optimal exploratory paths and retain only the most effective execution trajectories, enabling the LLM to think of better solutions in new rounds. This design reflects our empirical observation: LLM reasoning degrades once total context exceeds a threshold. Context management and trajectory optimization help maintain reasoning quality in long, complex tasks.
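A rough sketch of these two trigger rules; the specific context length and round count used below are illustrative assumptions, not values from the paper:

```python
MAX_CONTEXT_TOKENS = 128_000   # illustrative model context length
AVG_EXECUTION_ROUNDS = 16      # illustrative; yields a ~8k-token per-round budget
PER_ROUND_BUDGET = MAX_CONTEXT_TOKENS // AVG_EXECUTION_ROUNDS


def needs_refinement(round_tokens: int) -> bool:
    """Trigger focused information refinement when one interaction is too large."""
    return round_tokens > PER_ROUND_BUDGET


def needs_trajectory_compression(history_tokens: int) -> bool:
    """Trigger optimal-path extraction when history exceeds 2/3 of max context."""
    return history_tokens > (2 * MAX_CONTEXT_TOKENS) // 3
```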
As discussed earlier, many existing frameworks cannot effectively handle the real-world, end-to-end tasks in our benchmark, which require leveraging large-scale repositories and comprehensive agent capabilities.
RepoGraph [1] is a technique module that must be integrated into a general agent framework, but it lacks true end-to-end task-solving ability. Agentless [3], in contrast, is an earlier procedural framework rather than a full agent framework (as categorized in [1]), and is less powerful and less general than our baselines OpenHands and SWE-Agent (both identified as agent frameworks in [1]). Furthermore, OpenHands and SWE-Agent consistently lead the SWE-Bench Verified leaderboard (70.4% and 66.6% task resolution rates, respectively). In comparison, Agentless variants perform substantially worse (50.8%), and RepoGraph does not appear on the leaderboard at all due to lacking full pipeline support. In [1], the combined Agentless + RepoGraph setup achieved only 29.67% on SWE-Bench-Lite, far below both OpenHands and SWE-Agent.
We also emphasize that Agentless is not a “newer repository-level method”. Its earliest publication date (Agentless: Demystifying LLM-based Software Engineering Agents, arXiv:2407.01489) is July 1, 2024 (latest version: October 29, 2024), which is earlier than OpenHands (arXiv:2407.16741; earliest July 23, 2024, latest April 18, 2025) and contemporaneous with SWE-Agent (arXiv:2405.15793; earliest May 6, 2024, latest November 11, 2024). On GitHub, Agentless’s latest update was 8 months ago (Dec 23, 2024), whereas both OpenHands and SWE-Agent remain under active, near-daily development. For our experiments, we used the April 2025 releases of these baselines (OpenHands: 0.33.0, SWE-Agent:v1.0.1-61-gaa4e8ea1).
While Agentless is overall less capable, with simpler and older built-in tools than our chosen baselines, we have nevertheless included it—in response to Reviewer aqw3’s suggestion—as an additional baseline, and evaluated it on our MLE-R benchmark.
Additional experiment (DeepSeek V3 model):
| Framework | Make Submission | Valid Sub | Above Median | Bronze | Silver | Gold | Total |
|---|---|---|---|---|---|---|---|
| Agentless | 40.91% | 27.27% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| SWE-Agent | 54.55% | 36.36% | 4.55% | 0.00% | 0.00% | 4.55% | 4.55% |
| OpenHands | 63.64% | 36.36% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| RepoMaster | 95.45% | 86.36% | 36.36% | 4.55% | 4.55% | 13.64% | 22.73% |
These results further confirm that, while Agentless may be more suited to low-cost, simpler procedural tasks, it performs worse than both SWE-Agent and OpenHands on the complex, real-world tasks in our benchmark, and falls far short of our RepoMaster framework.
Hi, thank you very much for your time and effort in reviewing our paper. We have provided detailed responses to all of your questions, especially regarding our distinctive methodological contributions (see "Response to W1"), baseline selection considerations (see "Response to W2"), and evaluation test set choices (see "Response to W3 & Limitations"). We also highlighted that GitTaskBench had been fully open-sourced when submitting this paper as noted in "To Q2".
Given the limited time remaining, we sincerely and humbly ask that you kindly review our responses at your earliest convenience. We sincerely hope our replies have addressed your concerns and helped clarify any misunderstandings. If there are any remaining questions or further points you’d like to discuss, please let us know—we would be very grateful for the opportunity to engage further and ensure all your concerns are resolved.
We appreciate your TIMELY feedback. Thank you again for your constructive input and support.
Thanks for your detailed response. Some of my concerns have been addressed, and I'll raise my score accordingly.
We are truly delighted and grateful to receive your message, and we sincerely appreciate your recognition of our work. Thank you very much for your time and consideration.
Hi, we are delighted to hear from you. We eagerly anticipate your participation in the rebuttal discussion and welcome your feedback.
We would be very grateful for the opportunity to engage further and ensure all your concerns are resolved.
Thank you very much for your time and support.
This paper proposes RepoMaster, an autonomous agentic framework designed to solve complex coding tasks by exploring and reusing existing GitHub repositories. Instead of generating repository-level code from scratch, the framework focuses on a more practical "reuse and adapt" approach. It addresses this via three-stages: (a) Repository Search: Identifies relevant GitHub repositories based on the user's task description (intent). (b) Hierarchical Repository Analysis: Before interacting with the LLM, the framework performs a static analysis of the chosen repository and constructs a Hierarchical Code Tree (HCT), a Module Dependency Graph (MDG), and a Function Call Graph (FCG) to map the codebase's structure. It then uses a scoring system to identify the most critical "core components" (files and classes). (c) Autonomous Exploration & Execution: The agent uses the pre-analyzed structural information to guide its exploration. It iteratively interacts with the codebase using specialized tools for viewing code, analyzing dependencies, and searching. To manage the LLM's context, it employs an information selection strategy that prunes code, documents, and logs to retain only the most essential information.
The authors evaluate RepoMaster on two benchmarks: an adjusted version of MLE-Bench called MLE-R and a newly introduced benchmark named GitTaskBench. The results show that RepoMaster significantly outperforms strong baselines like OpenHands and SWE-Agent.
Strengths and Weaknesses
Strengths:
a) The concept of performing a static, structural analysis before dynamic exploration is innovative. The creation of hybrid structural maps (HCT, MDG, FCG) provides the agent with a foundational understanding.
b) The introduction of GitTaskBench is a good contribution to the research community.
c) The authors conduct a robust evaluation using multiple state-of-the-art LLMs (GPT-4o, Claude 3.5, DeepSeek V3) and two strong baselines. The comprehensive ablation study effectively isolates the contribution of each of RepoMaster's core components, confirming that the performance gains are directly attributable to the proposed mechanisms. The case study is a plus.
Weaknesses:
a) The success of the entire workflow is highly dependent on the initial "Repository Search" stage. This stage relies on analyzing user intent, README files, and star counts. The paper could benefit from a deeper discussion of the limitations of this search phase and how the agent might recover if an unsuitable repository is initially selected.
b) The initial "Hierarchical Repository Analysis" involves parsing all source files to build ASTs, dependency graphs, and call graphs. While effective, the paper does not discuss the computational cost or time required for this preprocessing step.
c) The module-level importance scoring aggregates six features using equal weights. This seems somewhat arbitrary. The paper would be strengthened by a sensitivity analysis on these weights or a more principled method for determining them.
Questions
See the weaknesses
Limitations
NA
Final Rating Justification
I have read the response and it partly addresses my questions. I'd like to keep my original score.
Formatting Issues
No
Hi, thank you very much for your thoughtful and encouraging review. We truly appreciate your recognition of our key contributions—
including our innovative combination of static structural analysis and dynamic exploration, the creation of hybrid structural mappings (HCT, MDG, FCG), the introduction of GitTaskBench, and our robust evaluation methodology with comprehensive ablations and case studies.
We're especially grateful for your careful attention to detail and interest in exploring potential extensions of our work.
In this rebuttal, we've provided detailed clarifications and comprehensive supplemental information to fully address your questions. We sincerely hope these further strengthen your confidence in the rigor, contributions, and practical value of our paper, and increase your overall satisfaction.
To Weaknesses a):
Thank you for your insightful observation. We acknowledge that the repository search stage is crucial to the success of our framework. To address this concern, we have implemented several mechanisms:
1. Advanced Search Strategy Based on "Search and Think" Paradigm
We have developed a sophisticated deep search mechanism that employs an iterative "search and think" approach, rather than relying on a single search attempt. In each search iteration, the agent:
- Dynamically formulates and refines search queries based on search and analysis results
- Performs real-time relevance assessment while browsing repository content
- Maintains a ranked list of top-k candidate repositories with confidence scores
This cognitive search process mimics how human developers explore GitHub, gradually deepening understanding through active exploration rather than passive filtering.
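As a concrete illustration, below is a minimal sketch of this iterative loop. The callables `search_github`, `assess_relevance`, and `refine_query` are hypothetical stand-ins for the agent's actual search and LLM tools, not our implementation.

```python
# Minimal sketch of the iterative "search and think" loop. The injected
# callables (search_github, assess_relevance, refine_query) are hypothetical
# stand-ins for the agent's actual search and LLM tools.
from typing import Callable


def deep_search(task: str,
                search_github: Callable[[str], list],
                assess_relevance: Callable[[dict, str], float],
                refine_query: Callable[[str, list], str],
                k: int = 5,
                max_iters: int = 3) -> list:
    """Return the top-k candidate repositories with confidence scores."""
    query = task
    candidates = {}  # repo url -> repo dict carrying a confidence score
    for _ in range(max_iters):
        for repo in search_github(query):
            # Real-time relevance assessment while browsing repository content.
            score = assess_relevance(repo, task)
            prev = candidates.get(repo["url"], {}).get("confidence", 0.0)
            candidates[repo["url"]] = {**repo, "confidence": max(score, prev)}
        # Dynamically refine the next query from what has been found so far.
        query = refine_query(query, list(candidates.values()))
    ranked = sorted(candidates.values(), key=lambda r: r["confidence"], reverse=True)
    return ranked[:k]
```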
2. Robust Failure Recovery through Comprehensive Multi-Agent Architecture
To handle cases where an unsuitable repository is selected, we designed a hierarchical multi-agent system with specialized recovery agents:
Schedule Agent (Orchestrator)
├── DeepSearch Agent: Repository discovery and ranking
├── Code Agent: Task execution with repository
├── Issue Fix Agent: Analyzes GitHub issues for known problems/solutions
└── Dependency Agent: Handles environment setup and dependency conflicts
When execution failures occur, our Schedule Agent coordinates these specialized agents.
3. Automatic Failure Recovery Mechanisms
When the selected repository is unsuitable, the Code Agent may encounter a wider range of issues, some of which may not be fully resolvable. To address this, our system employs an automatic failure recovery mechanism (a minimal sketch follows the list below): when the Code Agent faces execution failures (e.g., dependency conflicts, API mismatches, missing functionality), the Schedule Agent automatically:
- (i) Analyzes failure patterns to generate structured feedback
- (ii) Uses exploratory tools to inspect related files/code, extract structural information, and iteratively attempt recovery strategies
- (iii) Adjusts selection criteria based on failure patterns (e.g., if a computer vision task fails due to a lack of GPU support, it prioritizes CPU-compatible alternatives)
- (iv) If initial attempts fail, we implement reflection-based recovery strategies, see 4. below.
- (v) Seamlessly switches to the next candidate repository
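To make this loop concrete, here is a minimal sketch under assumed interfaces; `code_agent`, `analyze_failure`, `adjust_criteria`, and `filter_candidates` are hypothetical stand-ins for the agents and helpers described above, not the actual implementation.

```python
# Minimal sketch of the Schedule Agent's automatic failure recovery loop.
# All injected objects/callables are hypothetical stand-ins for the agents
# described above; this is an illustration, not the actual implementation.
def run_with_recovery(task, candidate_repos, code_agent,
                      analyze_failure, adjust_criteria, filter_candidates,
                      max_attempts_per_repo=2):
    criteria = {}
    remaining = list(candidate_repos)
    while remaining:
        repo = remaining.pop(0)
        for _ in range(max_attempts_per_repo):
            result = code_agent.run(task, repo)
            if result.success:
                return result
            # (i) Analyze the failure pattern into structured feedback.
            feedback = analyze_failure(result.logs)
            # (iii) Adjust selection criteria, e.g. prefer CPU-compatible
            # repositories after a GPU-related failure.
            criteria = adjust_criteria(criteria, feedback)
        # (v) Switch to the next candidate, re-filtered with updated criteria.
        remaining = filter_candidates(remaining, criteria)
    return None
```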
4. Reflection Trajectory Optimization
4.1 Execution Trajectory Analysis: The system analyzes past execution trajectories to identify recurring failure patterns
4.2 Optimal Path Extraction: We extract exploration history, retaining only the most informative execution trajectories while pruning redundant or failed attempts
4.3 Context Optimization: This ensures the LLM's limited context window focuses on high-value information in the next attempt
This adaptive multi-agent approach ensures the system can recover from various failure modes, from simple dependency issues to fundamental repository mismatches, significantly enhancing the framework's practical applicability.
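For illustration, a minimal sketch of the optimal-path extraction step (4.2/4.3) is shown below; the step schema (`status`, `is_redundant`, `summary`) and the `count_tokens` helper are assumptions made only for this example.

```python
# Minimal sketch of optimal path extraction: keep the most informative steps
# of an execution trajectory and prune redundant or failed attempts so the
# next attempt fits the LLM's limited context window. The step schema and
# count_tokens are assumptions for this example only.
def extract_optimal_path(trajectory, max_tokens, count_tokens):
    kept, used = [], 0
    for step in trajectory:
        if step.get("status") == "failed" or step.get("is_redundant"):
            continue  # prune failed or redundant attempts
        cost = count_tokens(step["summary"])
        if used + cost > max_tokens:
            break  # stop once the context budget is exhausted
        kept.append(step)
        used += cost
    return kept
```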
To Weaknesses b):
Thank you for your valuable feedback. We acknowledge that the computational cost of the preprocessing step indeed requires more thorough elaboration. We will add comprehensive experimental data on this aspect in the revision and appendix.
1. Hierarchical Repository Analysis
Our static analysis preprocessing pipeline utilizes highly optimized libraries (e.g., tree-sitter for AST parsing) and supports multi-threaded execution; a minimal sketch of the parallel parsing step is shown after the timing breakdown below.
For typical repositories (1-3k files): The entire preprocessing completes on average within 4.8 seconds, including:
- AST construction: ~2.1 seconds (parallelized across CPU cores)
- Dependency/call graph generation: ~1.9 seconds
- Core component identification: ~0.8 seconds
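For illustration, the sketch below shows the parallel parsing step. The real pipeline uses tree-sitter; Python's built-in `ast` module is used here only so the example stays self-contained.

```python
# Illustrative sketch of parallel structural parsing over a repository's
# Python files. The real pipeline uses tree-sitter; the built-in ast module
# is used here only to keep the example self-contained.
import ast
import pathlib
from concurrent.futures import ProcessPoolExecutor


def parse_file(path):
    """Parse one source file and return (path, top-level definition names)."""
    try:
        source = pathlib.Path(path).read_text(encoding="utf-8", errors="ignore")
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return path, []
    defs = [node.name for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))]
    return path, defs


def build_structure_index(repo_root):
    """Map every .py file to its top-level functions/classes, in parallel."""
    files = [str(p) for p in pathlib.Path(repo_root).rglob("*.py")]
    with ProcessPoolExecutor() as pool:  # parallelized across CPU cores
        return dict(pool.map(parse_file, files))
```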
2. Additional Processing for Complex Repositories
For repositories where the core components exceed the context threshold (8k tokens), we employ an LLM-based summarization approach (sketched after the list below):
- Using DeepSeek V3, this step takes approximately 10 seconds
- Even for complex repositories, overall preprocessing time remains within 15 seconds
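A minimal sketch of this threshold check follows; the character-based token estimate and the `summarize_with_llm` wrapper are assumptions for illustration, not the actual tokenizer or API call.

```python
# Sketch of the context-threshold check for core components. The 4-chars-per-
# token estimate and the summarize_with_llm wrapper (e.g. one DeepSeek V3
# call) are assumptions made only for this illustration.
CONTEXT_THRESHOLD_TOKENS = 8_000


def approx_tokens(text):
    # Rough proxy; the real system would use the model's tokenizer.
    return len(text) // 4


def prepare_core_context(core_components, summarize_with_llm):
    """Join core-component snippets, summarizing them if over the threshold."""
    joined = "\n\n".join(core_components.values())
    if approx_tokens(joined) <= CONTEXT_THRESHOLD_TOKENS:
        return joined
    return summarize_with_llm(joined)  # roughly 10 s on complex repositories
```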
To Weaknesses c):
Thank you for the detailed suggestion.
First, we want to clarify that identifying the six features themselves—based on experimental observation and domain expertise—is fundamentally more important than their precise relative weighting.
Our empirical experiments showed that although the weights can influence the recall of important files during the initial selection, this effect does not significantly impact the final end-to-end task success rate. For this reason, we did not emphasize fine-grained weight sensitivity analysis in our main study.
We acknowledge that, due to page constraints, we were unable to present the detailed experimental process behind our module-level importance scoring in the current manuscript. We will supplement these details, including additional experimental analysis, in the appendix of future versions.
Below, we outline our experimental design and feature selection process:
Experimental Design and Feature Selection
1. Test Set Construction & Weight Optimization
- First, we manually constructed the core file modules from several repositories as our test set
- Then we empirically compared different weight-combination strategies in RepoMaster and tuned them to improve the recall of these important file sets
2. Feature Screening & Ablation
- Through experimental comparison, we removed several evaluation dimensions whose scores overlapped heavily with others
- We conducted simple ablation experiments on each single dimension strategy
- Finally, we retained the six evaluation dimensions presented in the paper and gave them equal weights for both simplicity and interpretability.
In fact, this experimental process also applies to the selection and tuning of other hyperparameters used in constructing the initial context.
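To make the equal-weight aggregation concrete, here is a minimal sketch; the feature names below are placeholders rather than the six dimensions actually used in the paper.

```python
# Minimal sketch of module-level importance scoring with equal weights over
# six normalized features. The feature names are placeholders, not the actual
# dimensions used in the paper.
FEATURES = ["fan_in", "fan_out", "file_size", "doc_coverage",
            "graph_centrality", "readme_mentions"]


def normalize(values):
    """Scale a {module: raw value} dict to [0, 1] by its maximum."""
    peak = max(values.values()) or 1.0
    return {module: value / peak for module, value in values.items()}


def importance_scores(raw):
    """raw maps module -> {feature: raw value}; returns module -> score."""
    per_feature = {f: normalize({m: raw[m][f] for m in raw}) for f in FEATURES}
    weight = 1.0 / len(FEATURES)  # equal weights across the six features
    return {m: sum(weight * per_feature[f][m] for f in FEATURES) for m in raw}


def top_core_modules(raw, k=10):
    scores = importance_scores(raw)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```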
Regarding Weight Sensitivity Analysis
While we recognize the value of a comprehensive weight sensitivity analysis, systematically evaluating all weight combinations would require substantial experimental cost, which we considered disproportionate to the incremental benefit for our main research focus. Nevertheless, we will include further analysis in the appendix of our revised submission.
Additionally, we want to emphasize that in our overall framework design, the role of the identified core modules is to give the agent a comprehensive understanding of the entire repository within a limited context window. These modules also serve as RepoMaster's entry points for autonomous exploration: from the current file, the agent autonomously decides whether to expand the search and read adjacent node files. Task-relevant code snippets or file information are incorporated into the agent's context only after they are identified, supporting autonomous end-to-end exploration and execution.
Search → Understand → Code Generation/Editing → Execute → Debug → Convergence → Generate Verifiable Output
Thank you again for the insightful feedback. We believe these clarifications and our planned appendix additions will further strengthen the paper.
If you have any further questions or need additional clarification, we're more than happy to discuss them. We hope our responses further strengthen your confidence in the quality and contribution of our work! Thanks again.
Thanks for your detailed response. They helped address most of my concerns.
Hi, we are grateful to hear from you again, and we are especially glad that our detailed answers helped address most of your concerns. Thank you very much for your engagement with our paper and your interest in exploring potential extensions of our work.
RepoMaster presents an interesting and effective code agent framework designed for real-world task solving by leveraging existing repositories, an aspect rarely emphasized in the current research community.
We sincerely hope our previous responses have further increased your confidence in our work. Please feel free to reach out if you have any further questions or need additional clarification.
Your further feedback is truly appreciated. Thanks again.
The paper presents RepoMaster, an approach to identifying and reusing existing code repositories when solving complex coding tasks. It includes three stages: (1) searching for relevant GitHub repositories, (2) multiple forms of static analysis on the selected repository, and (3) an LLM agent to autonomously explore the repository using tools. The paper also introduces a new benchmark GitTaskBench. Experiments show that RepoMaster outperforms SOTA baselines (OpenHands, SWE-Agent) on GitTaskBench and (existing) MLE-R benchmarks.
Reviewers agreed on the paper's strengths:
- Reviewers appreciated the use of static analysis prior to more expensive LLM-based exploration.
- GitTaskBench is a nice artifact.
- The experimental results are quite strong.
Reviewers mentioned some weaknesses/questions, most of which appear resolved after rebuttals and discussions, but perhaps the most important remaining weakness is that the evaluated benchmark tasks are in the machine learning domain, and thus the selected repositories are all Python-based. It is plausible that the impressive results would extend beyond this domain and language, but no concrete evidence is provided. Nevertheless, the presented experimental results are sufficiently strong on their own.
All reviewers are unanimous in recommending to accept this paper.
Furthermore, I am recommending spotlight primarily due to the very impressive results, outperforming OpenHands and SWE-Agent which lead the SWE-bench leaderboard, with multiple LLMs, on two benchmarks (GitTaskBench and MLE-R).