REPOFILTER: Adaptive Retrieval Context Trimming for Repository-Level Code Completion
Abstract
Reviews and Discussion
This paper proposes a three-step framework consisting of retrieval, filtering, and generation to enhance the accuracy of repository-level code completion. The authors introduce two special tokens, <MC> and <EC>, during the generation process to determine whether cross-file context retrieval is necessary.
Additionally, a likelihood-based metric is employed to assess whether the retrieved context is positive, negative, or neutral for the current completion task. Harmful cross-file context is filtered out before the model proceeds with code completion, ensuring that the model retains sufficient relevant information.
The effectiveness of the proposed framework is validated through extensive experiments on the RepoEval and CrossCodeLongEval benchmarks using language models such as StarCoderBase and CodeLlama.
Strengths
- Cross-file Code Completion Framework: The authors propose a repository-level code completion framework based on a retrieval-filtering-generation process, which improves completion accuracy by filtering out harmful cross-file contexts retrieved during the process.
- Effective Methods: Experimental results show that this framework not only improves code completion accuracy but also reduces input length compared to other baseline methods.
- Holistic Experimental Results: The authors conducted comprehensive experiments, including ablation studies, on multiple benchmarks to validate the effectiveness of their approach.
Weaknesses
- Lack of Baselines: The authors only compare their approach with RepoFormer, without considering the No-Retrieve and Full-Retrieve settings. I suggest adding more baselines based on RAG methods, such as RepoCoder, RepoHyper, RepoFuse, DraCo, CoCoGen, and HCP, for a more comprehensive comparison.
- Time Overhead: While most repository-level code completion works do not consider or report time overhead, many real-world scenarios require low latency for code completion results. I noticed that REPOFILTER evaluates the polarity of each retrieved chunk and, when the polarity is <pos>, it further checks whether the next token is <MC> or <EC>. This might introduce additional time overhead. I recommend that the authors provide data on the total time cost for a single completion task.
- New Code LLMs: Although the results on StarCoder and CodeLlama are promising, these models have been available for some time. Newer code models, such as StarCoder2, CodeGemma, DeepSeek-Coder-V2, and Qwen2.5-Coder, now natively support repository-level code completion. I recommend leveraging these newer models and conducting related experiments.
- Writing: I suggest that the authors summarize their contributions in bullet points at the end of the Introduction. This will provide readers with a clearer and more direct understanding of the key contributions of the work.
References:
[1] RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation
[2] RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion
[3] REPOFUSE: Repository-Level Code Completion with Fused Dual Context
[4] DraCo: Dataflow-Guided Retrieval Augmentation for Repository-Level Code Completion
[5] CoCoGen: Iterative Refinement of Project-Level Code Context for Precise Code Generation with Compiler Feedback
[6] HCP: Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs
[7] StarCoder2: StarCoder 2 and The Stack v2: The Next Generation
[8] CodeGemma: Open Code Models Based on Gemma
[9] DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
[10] Qwen2.5-Coder Technical Report
Questions
- Regarding Time Overhead: See Weaknesses 2.
Q1: Lack of Baselines
A1: Our work primarily focuses on the pre-processing (adaptive retrieval) and post-processing (context filtering) of retrieved contexts, rather than the retrieval process itself. To the best of our knowledge, the only related work in this area is RepoFormer. As highlighted by the papers you mentioned, most existing research focuses on developing more accurate methods for retrieving cross-file contexts. These studies focus on different, independent modules of the framework. Notably, our method can be integrated with retriever-based approaches to jointly improve completion performance. For example, in our revised manuscript, we demonstrate how our approach generalizes to other retrievers, such as RepoCoder, to provide additional improvements.
Q2: New Code LLMs
A2:
We have added StarCoder2-7B, DeepSeekv2, and Qwen-Coder 7B to Section 6.2. Our experiments now cover a variety of model families, including StarCoder2, StarCoder, CodeLlama, QwenCoder, and DeepSeek Coder. We hope this addresses your concern.
Q3: Time Overhead
A3: Our method introduces additional time overhead during the generation phase due to the need to generate more special tokens. However, it reduces retrieval time through adaptive retrieval. Specifically, under our current implementation and hyperparameter settings, the generation time for RepoEval-Line completion increases by 61% compared to full-retrieve, 44% for API-level completion, and 23% for function-level completion. Conversely, the retrieval time is reduced by 22%, 41%, and 36%, respectively, compared to always-retrieve.
The overall per-instance time overhead for the entire system should be calculated as the additional time in the generation phase minus the time saved in the retrieval phase. However, since these two modules are typically implemented as entirely independent components, there is no standardized way to measure the total system time overhead, as it heavily depends on implementation details. For example, in API completion:
- Using a single process for retrieval and vLLM acceleration on an A100 GPU for generation, our method saves 1.5–1.6 seconds per instance compared to the baseline under the same configuration.
- With 8 processes for retrieval, the time saved is reduced to only 0.2 seconds per instance.
- Using 30 processes for retrieval and non-vLLM generation results in an overhead of 0.8–0.9 seconds per instance.
The overall time overhead is thus highly dependent on the implementation details of the two independent modules. Various factors, such as GPU/CPU type, number of processes, implementation specifics, and acceleration frameworks, significantly influence the time for each module. Consequently, there is no standard criterion for evaluating the time overhead of the entire system.
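For reference, a minimal sketch of how such per-instance timing could be collected is shown below; `retrieve_chunks` and `complete_code` are hypothetical placeholders for the two independent modules rather than our actual implementation.

```python
import time

def time_one_instance(task, retrieve_chunks, complete_code):
    """Measure per-instance latency.

    `retrieve_chunks` and `complete_code` are hypothetical placeholders for
    the retrieval and generation modules, which are independent in practice.
    """
    t0 = time.perf_counter()
    chunks = retrieve_chunks(task)           # retrieval phase
    t1 = time.perf_counter()
    output = complete_code(task, chunks)     # generation phase (incl. special tokens)
    t2 = time.perf_counter()
    return {"retrieval_s": t1 - t0, "generation_s": t2 - t1, "output": output}

# Net per-instance overhead vs. a full-retrieve baseline:
# (generation_s - baseline_generation_s) - (baseline_retrieval_s - retrieval_s)
```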
Q4: Summarizing contributions in bullet points at the end of the Introduction
A4: Thank you very much for your suggestion. We have made the suggested changes in the revised manuscript, and we hope this makes our contribution clearer.
Given that the discussion period is ending today, we would greatly appreciate it if you could provide feedback on our response. This will enable us to address your concerns as effectively as possible within the remaining time. Thanks for your consideration!
The authors present REPOFILTER, a method for repository-level code completion and infilling. The motivation lies in addressing limitations of retrieval-augmented generation (RAG) approaches, which often misidentify positive retrieved code chunks, leading to decreased performance, longer prompts, and increased latency. REPOFILTER introduces a novel training objective designed to predict the polarity of a retrieved code chunk, retaining only positive code chunks that enhance code completion accuracy. Experiments on RepoEval and CCLongEval datasets across various completion levels (line, chunk, function) demonstrate the effectiveness of this approach.
Strengths
- Clarity: The paper is well-written and accessible, facilitating comprehension of the proposed method.
- Novelty: REPOFILTER addresses a critical gap in repository-level code completion by introducing a streamlined, effective approach. The method’s training is straightforward, cost-efficient, and potentially generalizable.
- Comprehensiveness: The authors conduct extensive evaluations on two widely used benchmarks, RepoEval and CCLongEval, covering diverse completion scenarios (line/API/chunk/function completion and left-to-right/fill-in-the-middle tasks). The analysis and ablation studies offer insights into hyperparameter settings and alternative experimental configurations.
Weaknesses
- Limited Exploration of Combinatorial Effects: The authors treat the polarity of each code chunk individually based on its log-likelihood with the ground truth, overlooking potential interdependencies among retrieved code chunks. It would be valuable to consider whether combining two or more neutral/negative code chunks might yield a positive impact on code completion accuracy.
- Potential Hyperparameter Tuning Bias: In Section 6.5, the authors justify selecting a threshold of 0.3 for <MC> and <pos>, with ablation studies verifying this choice. However, these studies are conducted on the RepoEval API subset, which overlaps with the main test data. Tuning hyperparameters on test data risks introducing bias. I recommend using a separate validation dataset for tuning and then reporting test performance with the final hyperparameter choices.
- Unstable Evaluation for Chunk and Function-Level Completion: Evaluation based on text similarity metrics like EM or ES may be unreliable for longer code completions, as seen in Table 1. Here, CodeLlama-13B-REPOFILTER outperforms CodeLlama-13B-RepoFormer in ES but not in unit test execution, and the same holds for CodeLlama-7B-RepoFormer versus CodeLlama-7B-REPOFILTER. Robust metrics better suited to larger code segments are needed for reliable assessment.
- Minor Issue: In Section 5.2, where the RepoEval dataset is introduced, the authors mention 32 repositories. This number appears to be incorrect and should be double-checked.
Questions
- Masking Non-Contextual Information: At the end of Section 4.1, the authors state, "To prevent the model from merely replicating non-contextual information, we mask both the in-file and cross-file contexts during loss calculation." Could the authors elaborate on why replication occurs without masking?
- Thresholding for Informativeness: In Section 5.1, the authors use an ES threshold of 0.5 to determine informativeness for completion. Given that ES may not be strongly correlated with informativeness, would it be possible to add qualitative examples illustrating informative cross-file chunks above and below this threshold?
- Query Construction for Infill Tasks: Could the authors clarify how they construct retrieval queries in the infill setting? Specifically, does the query contain only preceding code or also include code following the target line/chunk? Further elaboration on how the infill setting enhances REPOFILTER’s performance, as suggested in Section 6.1, would improve clarity.
Thank you for your review and suggestions. Below are our responses to your concerns:
Q1: Limited Exploration of Combinatorial Effects
A1: Our current approach independently labels each chunk's impact on completion generation without explicitly accounting for combinatorial effects between multiple chunks. This choice was driven by the already substantial computational cost of our labeling process. Considering combinatorial effects, such as evaluating pairs or groups of chunks, would significantly increase complexity, likely to an exponential degree.
Initially, we attempted to account for such effects using a leave-one-out likelihood metric, which measured the likelihood difference when a single chunk was removed from the top-10 retrieved contexts. While this inherently considered the interplay between chunks, we found it less reliable for accurately labeling chunk polarity. Factors such as inter-chunk dependencies and rank order introduced confounding influences that affected the model's generation.
Consequently, we opted for a simpler and more computationally efficient approach that evaluates chunk polarity independently. The results in Table A of the manuscript validate this method, demonstrating its effectiveness despite not explicitly accounting for combinatorial effects.
Q2: Potential Hyperparameter Tuning Bias
A2: We did not tune these hyperparameters on the test set; instead, we tuned them on the validation set of our constructed database. The test set was only used for ablation experiments to analyze the impact of the threshold setting on model performance. Moreover, to demonstrate the generalizability of this parameter setting, we have included additional ablation results for other tasks, including RepoEval-Line, RepoEval-Function, CCLongEval-Chunk, and CCLongEval-Function, in appendix E of our revised manuscript.
Q3: Unstable Evaluation for Chunk and Function-Level Completion
A3: Execution-based metrics indeed provide a more accurate evaluation compared to reference-based metrics. However, collecting such data is significantly more challenging, as benchmarks with test cases often contain only limited data, especially for repo-level code generation or completion tasks. In contrast, reference-based metrics allow us to evaluate a wider range of repositories, including those with unconventional code patterns. Moreover, metrics like Edit Similarity (ES) and Exact Match (EM) can still reflect the overall accuracy of the model's generation, even if there are larger fluctuations in certain specific settings or tasks.
Therefore, covering both evaluation metrics is currently a good choice to ensure accurate evaluation while accommodating a wide range of repositories.
Q4: Masking non-contextual information
A4: Our training objective is to enable the model to effectively perform context filtering and code completion based on the given context. However, in-file and cross-file contexts are repository-dependent and not part of the learning objective. Incorporating these contexts into the loss function could cause the model to memorize irrelevant local information, which does not align with the supervised objectives. To address this, we mask these parts. We have revised this section in the manuscript to improve clarity.
Q5: Thresholding for informativeness
A5: There is no direct and effective metric to measure whether the context is sufficient for the completion target. ES can partially reflect informativeness; as mentioned in the previous response, ES provides an overall assessment of the model's generation accuracy. When ES is high, particularly approaching 1, we can infer that the model's generation is accurate, suggesting that the context is sufficient. Conversely, when ES is low, poor generation performance may be due to task complexity, insufficient model capability, or inadequate context. If the threshold is set too high, it may retain only a small number of low-complexity samples. On the other hand, if the threshold is too low, it may include many samples with insufficient context. Therefore, we chose a balanced, intuitive threshold as the filtering criterion to manage this trade-off.
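For illustration, the snippet below sketches how such a threshold could be applied when filtering training samples; the `edit_similarity` helper here uses difflib purely as an illustrative stand-in and is not necessarily the exact ES implementation used in our pipeline.

```python
import difflib

def edit_similarity(prediction: str, reference: str) -> float:
    # Illustrative stand-in for ES: a normalized similarity ratio in [0, 1].
    return difflib.SequenceMatcher(None, prediction, reference).ratio()

def context_is_sufficient(prediction: str, reference: str, threshold: float = 0.5) -> bool:
    # A generation scoring above the threshold is treated as evidence that
    # the provided context was informative enough for the completion target.
    return edit_similarity(prediction, reference) >= threshold
```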
Q6: Query Construction for Infill Tasks
A6: In the infill setting, we used 10 lines of preceding code as the query, consistent with the left-to-right setting, as our work focuses on adaptive retrieval and context filtering rather than retrieval itself. We ensured that the retrieved context was consistent across both settings. However, in the infill setting, the inclusion of subsequent code provides additional information about the target code’s intention, enabling the model to make more accurate predictions. This additional information directly impacts filtering accuracy. With a clearer understanding of the target code’s intention due to the presence of both preceding and subsequent code, the model is better equipped to filter relevant contexts. As a result, RepoFilter achieves greater improvements in the infill setting compared to the left-to-right setting, where only the preceding code is available and the intention is harder to infer.
Q7: Number of repositories in RepoEval
A7: We apologize for this factual error. We have double-checked and corrected it in the revised manuscript.
This paper introduces a new RAG fine-tuning method called RepoFilter, designed to enhance retrieval quality for code completion tasks. The core idea is to train the model to predict two specialized tokens: a polarity token after each retrieved chunk, indicating the chunk's helpfulness for completion, and an adaptive retrieval token to signal whether further retrieval is necessary. At training time, the authors use the likelihood gain on ground-truth tokens to label retrieved chunks with polarity tokens, creating full training examples in a single pass. During inference, the LM filters out any chunks classified as neutral or negative. A key finding is that most retrieved chunks are either neutral or detrimental to code completion, underscoring the need for effective filtering strategies. The evaluation demonstrates that RepoFilter achieves up to a 3% improvement in exact match over standard RAG models and significantly reduces prompt length, improving computational efficiency without sacrificing accuracy.
Strengths
Motivated and Simple Technique: The approach is well-motivated, leveraging easy-to-implement mechanisms for filtering retrieval contexts.
Performance and Efficiency Gains: RepoFilter delivers substantial performance, with a 3% exact match increase over baseline RAG models and prompt length reductions of over 80% in some cases. This filtering approach enhances computational efficiency while maintaining accuracy.
Low Computational Overhead: The technique requires minimal additional computation from the LM (just two extra decoded tokens per chunk) while keeping the retrieval context concise and relevant.
Generalizability: RepoFilter performs well across various benchmarks and RAG strategies, demonstrating compatibility with larger models like GPT-3.5 when applied as a plug-and-play retrieval context selector.
Weaknesses
- It’s unclear if the baselines in this work were RAG fine-tuned. If the baseline models were not fine-tuned with RAG using a standard loss defined on middle tokens, the comparison may be unfair. Such a loss typically encourages the model to ignore irrelevant chunks, so it's plausible that there would be no additional benefit to explicitly teaching the LM model to label each chunk.
- Instead of using the likelihood gain for filtering, prior research has shown that likelihood gain can be a strong signal for training a more effective retriever (e.g., the perplexity distillation loss in [1]). This work does not compare its approach to such techniques, which could offer a more meaningful baseline.
[1] Izacard, Gautier, et al. "Atlas: Few-shot learning with retrieval augmented language models." Journal of Machine Learning Research (2023)
Questions
- Are the three baseline techniques RAG fine-tuned for this task? Models like StarCoderBase and CodeLlama are generally not RAG fine-tuned, limiting their ability to handle retrieved chunks effectively without additional training. However, RepoFilter does perform RAG fine-tuning (as described in Section 4.1). If the other baselines are not fine-tuned, the evaluation might be unfair.
- The performance of a large model like GPT-3.5-turbo on left-to-right generation tasks is surprisingly low in Table 4; I would expect it to be at least better at the left-to-right generation task. Have the authors explored the reasons behind this underperformance?
- Can the approach be further simplified to avoid using constrained decoding at inference? Have the authors explored alternative formats, like an end-of-chunk special token before polarity tokens?
Thank you for reviewing our paper; we greatly appreciate your feedback and suggestions. Here are our responses to your concerns:
Q1: Are the three baseline techniques RAG fine-tuned for this task?
A1: Among the three baselines, RepoFormer was fine-tuned using the same loss function. This fine-tuning was necessary due to the introduction of new special tokens and modifications to the standard prompt format, requiring the model to adapt for effective generation under the updated setup.
For the remaining baselines, we followed the same configuration as in the original RepoFormer paper, where no-retrieve and full-retrieve baselines were not further fine-tuned. Given the similarity between the code completion task and the pretraining task, task-specific fine-tuning was considered unnecessary. Moreover, our fine-tuning dataset, derived from Stack, may overlap with the training data of existing code LLMs.
To further validate our approach, we fine-tuned StarCoder-3B using all retrieved cross-file chunks for one epoch while keeping all configurations the same as in our main experiments. The results showed no significant improvement over the full-retrieve baseline, which still underperformed compared to our framework.
| | RepoEval-Line | RepoEval-API | RepoEval-Func | CCEval-Chunk | CCEval-Func |
|---|---|---|---|---|---|
| Metric | EM | ES | EM | ES | ES |
| Fine-tuned StarCoder-3B | 59.16 | 76.59 | 48.66 | 74.81 | 47.92 |
| Comparison (vs. full-retrieve) | +1.91 | +1.87 | +1.39 | +2.12 | -0.42 |
| Comparison with RepoFilter | -1.34 | -2.48 | -1.93 | -2.47 | -3.43 |
Q2: Comparison with a retriever baseline trained with a perplexity-gain loss
A2: Thank you for highlighting this method. Our work primarily focuses on pre-processing (adaptive retrieval) and post-processing (context filtering) of retrieved contexts, rather than the retrieval process itself. While numerous studies, including the paper you mentioned, aim to improve retrievers for RAG-based repository-level code completion tasks, these efforts address a different and independent module of the framework (i.e., the retriever). Therefore, direct comparisons between our approach and retriever-focused methods may not be entirely meaningful, even though the loss function you referenced shares a similar motivation with ours.
Importantly, our method is complementary to retriever-based approaches and can be integrated with them to further enhance completion performance. For example, in our revised manuscript, we demonstrate how our approach generalizes to other retrievers, such as RepoCoder, by filtering retrieved contexts to provide additional improvements, as detailed in Appendix C.
Q3: Performance of GPT3.5-turbo
A3: The lower performance of GPT-3.5-turbo in left-to-right generation tasks likely results from its lack of fine-tuning for code-specific tasks, particularly repository-level completions and the associated prompt format. While GPT-3.5-turbo excels in general-purpose natural language tasks, it has not been specifically optimized for code generation, unlike dedicated code LLMs. Consequently, GPT-3.5-turbo underperforms compared to models like Code Llama in various code generation benchmarks, including HumanEval, where even Code Llama-13B achieves higher accuracy.
Due to budget constraints, we conducted a limited set of experiments with GPT-4 using the same prompt as GPT-3.5-turbo. Despite the small scale, GPT-4 achieved an exact match score of 49.5% in the left-to-right completion setting and 54% in the infilling setting. It demonstrated exceptional performance, outperforming all other models in a subset of 200 tests.
Q4: Can the approach be further simplified to avoid using constrained decoding at inference?
A4: We apologize if we misunderstood your question. Are you referring to constrained decoding as the process where the model generates special tokens exclusively from a fixed special token vocabulary, rather than the entire vocabulary? If so, we would like to clarify that in our implementation, we do indeed add an end-of-chunk token after each chunk, as described in our revised manuscript and exemplified in the generation case in Appendix F.
If our understanding of "constrained decoding" is accurate, this approach does not introduce additional complexity, as it simply involves computing the softmax over a smaller sub-vocabulary instead of the full vocabulary. Adding an end-of-chunk token helps guide the model to output a special token for polarity classification, while constrained decoding ensures that the model avoids generating unexpected outputs.
If we have misunderstood your concern, we would greatly appreciate further clarification so we can address this topic in more detail.
Thank you for your detailed responses and clarifications. I appreciate the effort you put into addressing my concerns. Below are my thoughts based on your replies:
- Regarding RAG fine-tuning, I am surprised by the result showing that fine-tuning StarCoder on all retrieved chunks yielded no significant improvement. Since StarCoder was not originally RAG fine-tuned, the loss function you used should intuitively encourage the model to be robust against irrelevant chunks. My past experience with RAG fine-tuning StarCoder has demonstrated noticeable gains in similar setups, so I wonder if the lack of improvement could be due to specifics of the fine-tuning process.
- While I understand that your method primarily focuses on filtering rather than retrieval, from a broader RAG perspective, filtering can also be interpreted as improving retrieval quality since it directly modifies the final retrieval result. In this sense, the distinction between filtering and retrieval improvement feels less meaningful. Nevertheless, I agree that your approach could complement retriever-focused methods and appreciate the clarification on its compatibility with other retrievers like RepoCoder.
- Thank you for explaining the limited scope of your GPT-4 experiments. Could you provide an estimate of the total cost for running your full set of experiments with GPT-4?
- Yes, you correctly understood my question regarding constrained decoding. I agree that adding an end-of-chunk token can help guide the model to output special tokens for polarity classification, and I think it will also potentially reduce the reliance on constrained decoding.
Thank you very much for your prompt response and for sharing your thoughts. Below is our further discussion and response to your ideas, aiming to address some of your concerns as thoroughly as possible.
- Fine-tuning: Since our full fine-tuning adopts the same parameter settings as in our main experiments (i.e., using a smaller learning rate), this may limit the improvement, as the primary goal of our loss function is to adapt the model to the new prompt format that includes special tokens. From the comparison between our method and RepoFormer, as well as the results shown in Section 6.4 of the manuscript (where our model is used solely as a filter but not as a generator), it can still be seen that the filter itself brings improvements.
- Comparing with retriever-based methods: Although our goals are to make the final context more relevant, the focus lies on entirely independent modules, which access different contexts. For example, retrievers can access cross-file chunks across the entire repository, whereas, in our experimental setup, our method can only access the top-10 retrieved chunks, which inherently introduces some unfairness to the comparison. Similarly, filtering occurs after retrieval. As shown in Table 6, the performance of our framework itself varies depending on the retriever used. Thus, our concern mainly stems from the following question: if we were to compare with retriever-based methods, which retriever should our framework use? There seems to be no reasonable way to perform a fair comparison between our method and retriever-based methods, as they are fundamentally two decoupled and independent modules. Therefore, we believe the focus should be on how the combination of the two methods can bring improvements.
- Estimated cost: We estimate that full-set experiments would cost at least $200 due to the long input context, which is beyond our budget. Additionally, in this section, we compared different model families and sizes, which we believe sufficiently demonstrates the generalizability of our method.
Thank you for your detailed follow-up and clarifications—you've addressed most of my concerns, and I now have a clearer understanding of your approach and its scope. Based on this, I am raising my score from 5 to 6.
Thank you very much for your response and suggestions. We truly appreciate it.
This paper introduces REPOFILTER, a framework designed to improve repository-level code completion by filtering irrelevant or negative cross-file contexts in retrieval-augmented generation (RAG) setups. REPOFILTER evaluates the impact of retrieved code chunks using a likelihood-based metric, classifying chunks as positive, neutral, or negative. Extensive experiments on the RepoEval and CrossCodeLongEval benchmarks demonstrate that REPOFILTER enhances completion accuracy while reducing the length of input prompts, improving efficiency without compromising performance.
However, in my opinion, this paper has no essential innovation compared to Self-RAG. It merely adjusts the quantitative standard of the classification indicators and replaces model-based evaluation by training a probabilistic parameter classifier. Moreover, due to the lack of baseline comparisons, I cannot believe that the method in this paper has achieved SOTA. In essence, the training method increases the training cost and introduces generalization concerns, while it may not offer significant improvements over existing SOTA RAG frameworks in terms of interpretability or effectiveness. Therefore, the contribution of this paper is lacking, and it reads more like an engineering report in the SE field.
Strengths
- Clear Motivation: The paper identifies a critical limitation in RAG-based repository-level code completion: the inclusion of extraneous or detrimental code chunks in the retrieval process. The authors present an argument for REPOFILTER’s filtering mechanism.
- Likelihood-Based Polarity Metric: The proposed likelihood-based metric is innovative and provides a quantitative approach to evaluate the relevance of each retrieved chunk. This metric adds rigor to the filtering process, ensuring that retained chunks positively influence completion accuracy, which is a practical and scalable solution.
- Experimental Results and Efficiency Gains: REPOFILTER consistently outperforms full retrieval and even adaptive retrieval strategies such as RepoFormer, achieving improvements in completion accuracy on the RepoEval and CrossCodeLongEval benchmarks.
Weaknesses
- Limited Evaluation on Diverse Repositories: While the experimental results on Python repositories are robust, the paper could further strengthen its generalizability claims by evaluating REPOFILTER on repositories in other languages. This would demonstrate the approach’s adaptability to different programming contexts and enhance its appeal to a broader range of applications.
- Detailed Explanation of the Adaptive Mechanism: Although the framework’s adaptive nature is a strong feature, some aspects of its implementation, particularly the proof of hypothesis, could benefit from additional clarification. A more detailed explanation of the decision-making process for context sufficiency and informativeness would help in understanding the effectiveness and robustness of the adaptive retrieval mechanism.
- Dependency on Accurate Likelihood Estimation: The effectiveness of REPOFILTER’s polarity classification relies heavily on accurate likelihood estimations. Situations where this estimation might fail, such as repositories with highly unconventional or complex code patterns, are not discussed. This could potentially limit the framework’s performance in non-standard or heterogeneous coding environments.
Questions
- Provide Qualitative Examples: Adding examples of how positive, neutral, and negative chunks impact the completion process would provide readers with a more intuitive understanding of the framework. A case study or visualizations comparing outputs with and without REPOFILTER’s filtering would be beneficial. In addition, the qualitative code characteristics that this uninterpretable classification approach tends to favor should be discussed.
- Extended Analysis of Filtering Mechanism: While REPOFILTER achieves efficiency by trimming context lengths, further analysis of cases where the model incorrectly labels chunks (e.g., a neutral chunk classified as negative) would add depth to the evaluation. This could be presented as an ablation study or error analysis.
- Broader Contextual Relevance Discussion: The paper briefly discusses filtering strategies in the context of other RAG tasks. Expanding this section to consider REPOFILTER’s potential impact on domains like documentation generation or bug fixing would be valuable. Besides, the baselines of related works and RAG methods used for comparison are inadequate.
Many thanks for your thoughtful feedback. Below are our responses regarding your concerns about our work:
Q1: Limited Evaluation on Diverse Repositories:
A1: The majority of existing repository-level benchmarks and related studies primarily focus on Python, with a notable lack of comprehensive datasets for other programming languages [1,2,3,4,5]. Consequently, our experiments and analysis are centered on Python repositories, aligning with current standards in the field. While we acknowledge that extending the evaluation to other programming languages could further demonstrate the generalizability of our approach, we believe the current scope is sufficient for this paper, given the robustness of our results across hundreds of diverse Python repositories. We have addressed these limitations and our intention to explore other languages in future work in the revised manuscript.
Q2: Detailed Explanation of the Adaptive Mechanism
A2: We have revised the sections of our manuscript that introduce our inference pipeline, aiming to enhance clarity and address your concerns. Additionally, the newly added generation case in Appendix F provides a further illustration of our inference process to aid understanding.
Q3: Provide Qualitative Examples
A3: Thank you for your suggestion. We have selected a sample along with its corresponding labels, generated outputs, and explanations, which we have included in Appendix F of our revised manuscript. We hope this example provides an intuitive illustration of how our framework operates and how these samples can positively or negatively influence the model's generation process.
Q4: Broader Contextual Relevance Discussion
A4: Following your suggestion, we have added a discussion section in the appendix, where we address the limitations of our work, its potential impact, and future directions. We hope this addition provides a clearer perspective on the scope and implications of our study.
Q5: Inadequate baseline
A5: Our work primarily focuses on the pre-processing (adaptive retrieval) and post-processing (context filtering) of retrieved cross-file contexts of code repositories, rather than the retrieval process itself. To the best of our knowledge, the only related work in this area is RepoFormer, as most existing research explores more accurate methods for retrieving cross-file contexts. These studies focus on different, independent modules of the framework, making direct comparisons less meaningful. Notably, our method can be integrated with retriever-based approaches to jointly improve completion performance. For example, in our revised manuscript, we demonstrate how our approach generalizes to other retrievers, such as RepoCoder, to provide additional improvements. Of course, if we have missed any closely related work that could serve as a baseline, we would greatly appreciate your pointing it out. We will include such comparisons in future revisions of our paper if possible.
Q6: Dependency on Accurate Likelihood Estimation
A6: Thank you for raising this concern. We would like to clarify that our metric is based on the difference in the model's likelihood for the ground truth between prompts with and without a specific cross-file chunk, rather than relying on the absolute value of the likelihood. Therefore, the issue you mentioned with unconventional or complex code patterns is less likely to arise. While it is true that in repositories with unconventional or complex patterns, the model’s likelihood for the ground truth might be generally lower, the presence of a cross-file chunk providing the necessary context for completion is still expected to increase the likelihood. To further mitigate potential variability introduced by complex patterns, we base our evaluation on the percentage increase in likelihood rather than the absolute value. This approach effectively normalizes for differences across varying patterns and coding styles, ensuring robust polarity labeling. While there is no direct metric to measure the accuracy of our labeling process, we have indirectly validated the accuracy of the likelihood-based metric through the model’s generation performance, as shown in Table A. The performance observed supports the effectiveness of our labeling approach in accurately identifying the polarity of cross-file chunks.
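For illustration, a simplified sketch of this relative-gain labeling is given below; the scoring callable and the thresholds are placeholders for exposition and do not reproduce our exact labeling pipeline.

```python
from typing import Callable

def label_chunk_polarity(score_ground_truth: Callable[[str], float],
                         prompt_without_chunk: str,
                         chunk: str,
                         pos_threshold: float = 0.3,
                         neg_threshold: float = -0.1) -> str:
    """Label one retrieved chunk by the relative likelihood gain it provides.

    `score_ground_truth(prompt)` is a caller-supplied function returning the
    model's likelihood (or log-likelihood) of the ground-truth completion
    given `prompt`; the thresholds are illustrative placeholders.
    """
    base = score_ground_truth(prompt_without_chunk)
    with_chunk = score_ground_truth(prompt_without_chunk + "\n" + chunk)
    relative_gain = (with_chunk - base) / abs(base)   # percentage-style gain
    if relative_gain >= pos_threshold:
        return "<pos>"
    if relative_gain <= neg_threshold:
        return "<neg>"
    return "<neu>"
```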
To better illustrate our approach, we used the cyclomatic complexity metric to measure the complexity of the code in our completed files. Specifically, this metric evaluates the number of linearly independent paths through the source code. We selected the file with the highest complexity score, exceeding 12, which indicates the need for possible refactoring. Below, we provide part of the preceding code from this sample, the chunk identified as positive by our metric, and the corresponding ground truth:
Preceding Code:

```python
def wrap_iv_filter_server(worker):
    """
    This function is to perform feature selection with iv_filter \
    to data for server.
    Args:
        worker: `federatedscope.core.workers.Worker` to be wrapped
    Returns:
        Wrap vfl server with iv_filter.
    """
    def trigger_for_feat_engr(self,
                              trigger_train_func,
                              kwargs_for_trigger_train_func={}):
        logger.info('Start to execute woe_filter, which requires HE.')
        self.trigger_train_func = trigger_train_func
        self.kwargs_for_trigger_train_func = \
            kwargs_for_trigger_train_func
        self.msg_buffer['feat_dim'] = {}
        # Broadcast client address and feat_engr_public_key
        self.broadcast_client_address()
        self.comm_manager.send(
```
Positive Chunk:

```python
# # Broadcast client address and feat_engr_public_key
# self.broadcast_client_address()
# self.feat_engr_public_key, self.feat_engr_private_key = \
#     secure_builder(worker._cfg).generate_keypair()
# logger.info('Sending feat_engr_public_keys to clients.')
# self.comm_manager.send(
#     Message(msg_type='feat_engr_public_keys',
#             sender=self.ID,
#             receiver=list(self.comm_manager.get_neighbors().keys()),
#             content=content))
```
Ground Truth:

```python
Message(msg_type='binning',
        sender=self.ID,
        receiver=list(self.comm_manager.get_neighbors().keys()),
        state=self.state,
        content=self._cfg.feat_engr.selec_woe_binning))
```
From this example, it is evident that the chunk identified as positive provides a clear example of constructing and sending a Message object using the comm_manager. This serves as a reference for the method call in the model's completion target. Even for patterns deemed complex, there is a direct relationship between the positive chunk and the ground truth.
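As a side note on the complexity measurement mentioned above, cyclomatic complexity can be computed with standard tooling; the snippet below is an illustrative sketch using the radon library, named here only as an example rather than necessarily the tool we used.

```python
from radon.complexity import cc_visit

def max_cyclomatic_complexity(source_code: str) -> int:
    """Return the highest cyclomatic complexity among the blocks in a file."""
    blocks = cc_visit(source_code)                        # score each function/method
    return max((block.complexity for block in blocks), default=0)

# A score above roughly 10-12 is conventionally treated as a refactoring signal.
```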
Q7: Error case and analysis
A7:
Referencing the sample mentioned above, during the generation process, the model identified the positive chunk as a neutral chunk and therefore did not append it to the final prompt. As a result, the lack of a relevant reference for instantiating the Message object led the model to generate the following output:
```python
message=msg,
    destination=client_id,
    msg_type='client_address')
```
This does not match the ground truth. We believe the error primarily stems from the model's inability to discern the intent of the content to be completed in certain scenarios, which in turn prevents it from accurately identifying useful context.
[1] Zhang, Fengji, et al. "RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation." Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023.
[2] Ding, Yangruibo, et al. "CoCoMIC: Code Completion by Jointly Modeling In-file and Cross-file Context." Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024.
[3] Cheng, Wei, Yuhan Wu, and Wei Hu. "Dataflow-Guided Retrieval Augmentation for Repository-Level Code Completion." arXiv preprint arXiv:2405.19782 (2024).
[4] Liang, Ming, et al. "REPOFUSE: Repository-Level Code Completion with Fused Dual Context." arXiv preprint arXiv:2402.14323 (2024).
[5] Li, Jia, et al. "EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories." arXiv preprint arXiv:2404.00599 (2024).
Given that the discussion period is ending today, we would greatly appreciate it if you could provide feedback on our response. This will enable us to address your concerns as effectively as possible within the remaining time. Thanks for your consideration!
The authors introduce a new likelihood based metric to evaluate the effect of retrieved cross-file chunks on code completion tasks. Using this, it was found that the majority of retrieved chunks negatively or neutrally affect code completion performance. So, the authors finetune LLMs to pre-emptively determine the utility of retrieved code chunks on code completion tasks and demonstrate that this can dramatically improve code completion performance while significantly reducing the sequence length of supplemental contexts introduced after cross-file retrieval.
Strengths
Useful novel contributions. The paper introduces a new method to filter retrieved cross-file chunks. Further, the authors demonstrate how their filtering strategy can be used as a plug-in approach for larger models, demonstrating generalization of utility. This approach also leads to remarkable improvements on cross-file tasks and will be of great value to researchers who are focusing on improving Code LLM performance on repository-level code completion.
Well-written paper. The paper is well-written and offers great insights - from motivating the need for a cross-file chunk filter to ablations around various design choices - making it a great contribution to the community.
Weaknesses
Missing details. There are a number of missing details from the paper (or maybe I missed them) that need further clarification:
- Line 183: How were these thresholds set? How important are these and what is the effect of changing these thresholds?
- From line 239, it appears that the model is never shown the <neg> token. Is this the case? If so, how does the model generate the <neg> token?
- It appears that there is a discrepancy between the descriptions of polarity token generation in Algorithm 1 and lines 343-349. If the polarity token is the one with maximum probability, what is the role of thresholding?
Data leakage concerns. The authors have used the Stack dataset to finetune StarCoder models and evaluate on CrossCodeEval. These are both sourced from GitHub repositories and potentially have overlaps. I recommend taking special care for deduplication of training data against evaluation datasets to address leakage concerns.
Problem scope clarifications. It would be useful to clarify the scope of problem being solved or contextualize it among other problems. For example, what is the difference between training RepoFilter and a new retriever or reranker with the same labels? This is unclear and left me confused about the advantage of training a filter over training a new retriever on the same data.
Additional experiments. This is not a major concern. I would be interested in seeing more experiments that evaluate the RepoFilter system in isolation and not through the downstream task evaluation. Specifically, what are the precision/recall metrics of the RepoFilter for scoring the <pos>, <neu>, <neg> tokens?
Questions
See notes above. I have framed some of the concerns raised in a question format here.
- Were some steps taken to ensure that there is no leakage between the RepoFilter training and the benchmark datasets?
- How does RepoFilter differ or do better than training a new retriever with the RepoFilter training data?
- What are the precision/recall metrics of the RepoFilter for scoring the <pos>, <neu>, <neg> tokens?
- Can you clearly outline the training and inference algorithms for the polarity tokens? In the current manuscript, it appears a bit confused.
Thank you sincerely for your valuable suggestions. Here are our responses regarding your concerns about our work:
Q1: Missing details from the paper that need further clarification
A1: Thank you very much for pointing out these missing details. We have revised the corresponding section of the paper to make it clearer. Below, we provide a detailed explanation addressing your concerns about these details.
Threshold Setting: We determined the threshold setting based on the model's performance when provided with only positive samples and when negative samples were removed. This threshold setting directly influences the labeling accuracy. For instance, if the threshold for positive is set too low, more samples will be classified as positive, even if they are irrelevant to the completion task. Conversely, if the threshold is set too high, genuinely positive samples might be incorrectly labeled as irrelevant neutral samples. Through adjustments conducted on the validation set of our constructed dataset, we obtained results aligning with our expectations. Specifically, when the model is provided with only positive samples, its completion performance does not degrade. Furthermore, when negative samples are removed from the prompt, the model's performance improves. This confirms the accuracy of our labeling strategy. Additionally, as shown in Table A, this threshold setting is also applicable to the RepoEval dataset, further validating our hypothesis.
Generating the <neg> token: The model sequentially appends cross-file chunks to the current prompt and evaluates their polarity. If the model classifies a chunk as negative or neutral by generating the special token <neg> or <neu>, the prompt rolls back to its state before the chunk was added. As a result, the final prompt includes only positive chunks with <pos> tokens. We have also rewritten the inference section for better clarity.
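For clarity, a schematic version of this rollback loop is sketched below; `classify_chunk` is a hypothetical helper standing in for the model's constrained decoding of the polarity token.

```python
from typing import Callable, List

def filter_retrieved_chunks(prompt: str,
                            chunks: List[str],
                            classify_chunk: Callable[[str], str]) -> str:
    """Append each retrieved chunk in turn and keep it only if labeled <pos>.

    `classify_chunk(candidate_prompt)` is a hypothetical helper that returns
    the polarity token the model decodes after the appended chunk.
    """
    for chunk in chunks:
        candidate = prompt + "\n" + chunk
        if classify_chunk(candidate) == "<pos>":
            prompt = candidate     # keep the helpful chunk
        # <neu> / <neg>: discard the chunk, i.e. roll back to the previous prompt
    return prompt
```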
Discrepancy in algorithm and description: We apologize for this inconsistency. In our revised manuscript, we have standardized the expression, clarifying that the model generates a special token based on a threshold rather than simply generating the token with the highest probability.
Q2: Data leakage concern
A2: The repositories included in RepoEval were collected after the publication of the Stack dataset, and there is no overlap between the repositories in CrossCodeEval and the Stack dataset. Since our dataset is an extension of the Stack dataset, there is no data leakage between the training and test sets.
Q3: Problem scope clarification: How does RepoFilter differ or do better than training a new retriever with the RepoFilter training data?
A3: From the perspective of identifying the polarity of cross-file chunks, training a retriever often involves using surrounding code as the query, such as the preceding 10–20 lines of code or a specific number of tokens. However, determining the polarity of a chunk is not based on semantic or lexical similarity to the cross-file chunk, but rather on understanding the intention of the code. This is a more challenging task, as the limited context provided by a retriever's query is often insufficient for the model to fully grasp the nuances of this task. In contrast, our framework is designed to evaluate the polarity of chunks based on the entire prompt context. By leveraging the complete context, the model can make more informed and accurate decisions. This capability is critical for achieving better accuracy in polarity identification. Moreover, beyond identifying polarity, our framework also implements adaptive retrieval. The model can decide whether retrieval is necessary or whether the current context is already sufficient. This dynamic decision-making process cannot be achieved with a traditional retriever, which would always retrieve without assessing the sufficiency of the existing context. Considering these two factors—better context utilization for polarity identification and the ability to perform adaptive retrieval—we are motivated to train our framework rather than simply training a new retriever.
Q4: What are the precision/recall metrics of the RepoFilter for scoring the <pos>, <neu>, <neg> tokens?
A4:
In the generation process, both <neg> and <neu> tokens are treated equivalently, as neither is included in the final prompt. Additionally, when the model determines that the current context is sufficient, it stops evaluating subsequent chunks regardless of whether they are labeled as <neu> or <neg>. This makes it challenging to evaluate these two categories separately.
Following your suggestion, we primarily evaluated the precision and recall of <pos> tokens, as they are directly related to completion performance. The table below presents the precision and recall scores on RepoEval-API and RepoEval-Line for our model trained on StarCoder-3B:
| Task | Recall (%) | Precision (%) |
|---|---|---|
| RepoEval-API | 87.14 | 78.17 |
| RepoEval-Line | 84.28 | 81.70 |
The results indicate that the model achieves acceptable precision and recall, though they are not exceptionally high (e.g., 90+). We believe this reflects the inherent difficulty of identifying the polarity of a chunk, which requires a nuanced understanding of the model's intent.
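For completeness, the sketch below shows how these scores can be computed once the generated polarity tokens and the likelihood-based reference labels are collected; it merges <neu> and <neg> into a single negative class, as described above.

```python
from typing import List, Tuple

def pos_precision_recall(predicted: List[str], reference: List[str]) -> Tuple[float, float]:
    """Precision/recall of <pos> predictions against likelihood-based labels,
    with <neu> and <neg> merged into a single negative class."""
    pred_pos = [p == "<pos>" for p in predicted]
    ref_pos = [r == "<pos>" for r in reference]
    tp = sum(p and r for p, r in zip(pred_pos, ref_pos))
    precision = tp / max(sum(pred_pos), 1)
    recall = tp / max(sum(ref_pos), 1)
    return precision, recall
```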
Q5: Clearly outline the training and inference algorithms for the polarity tokens
A5: We have revised our manuscript to provide a clearer presentation of the training and inference algorithms. We hope this could address your concern.
Thank you for your detailed comments! My concerns are resolved, provided these are included in the paper.
I would like to increase the score to 7, but unfortunately this option is not available this time. Good luck!
Thanks very much for your positive score and valuable feedback!
We thank all the reviewers for your valuable feedback!
In response to your insights and suggestions, we have thoroughly revised our manuscript. Modifications in the updated manuscript are highlighted in orange for easy identification.
We hope these improvements and our detailed responses address your concerns. We sincerely appreciate your guidance and support.
We would be very grateful if the reviewers could kindly share any additional concerns they may have and indicate whether our responses have sufficiently addressed some or all of their concerns. We are committed to addressing any remaining issues before the end of the discussion period. Thank you for your time and consideration.
This paper presents work on repository-level code completion. It proposes a likelihood-based metric to assess the influence of each retrieved code chunk on the completion task. The effectiveness of the proposed framework is validated through extensive experiments. The strengths of this work include an interesting method and overall great empirical results. However, the current version needs to be improved with more convincing explanations about the method mechanism and experimental limitations. Therefore, it cannot be accepted in this round.
Additional Comments from Reviewer Discussion
This submission received comments from 5 reviewers. All of them gave borderline ratings. The discussions and changes during the rebuttal are summarized below.
- Reviewer EjbF claimed that the main weaknesses and questions of the submission include the confusion caused by the presentation, the concern about data leakage, and supplementary experiments. The rebuttal provides both additional explanations and empirical results, which are appreciated by the reviewer.
- Reviewer HKFe’s concerns include six aspects, as reflected in the comments. The rebuttal does not fully address the issue of likelihood estimation, since it offers more of an explanation of the phenomenon than of the mechanism, which is less convincing.
- Reviewer XV27 raised questions about baseline fine-tuning and fairness of comparison, likelihood gains vs. alternative techniques, GPT-3.5-turbo underperformance, and the complexity of the proposed method. The rebuttal addressed these questions properly, and the rating was increased accordingly.
- Reviewer K4oq raised concerns about combinatorial effects, hyperparameter tuning, unstable evaluation, and some other algorithm/technical details. The rebuttal mainly discussed the complexity of combinatorial effects, but did not provide more quantitative results. The other issues are considered to be addressed well.
- Reviewer Qy7Q mainly pointed out that some baselines are missing and that some newer LLMs should be included. The rebuttal did not fully address these questions, as also acknowledged by the reviewer. First, the improvements in several cases are marginal. Second, fine-tuning code LLMs may degrade their original code completion performance, requiring further investigation into balancing task-specific gains with generalization.
Therefore, based on the above, the submission needs further polishing to become a stronger piece of work. The decision is to reject.
Reject