PaperHub

Overall rating: 5.3 / 10 · Decision: Rejected · 4 reviewers
Individual ratings: 6, 5, 5, 5 (min 5, max 6, std 0.4)
Average confidence: 3.3 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 3.0
ICLR 2025

RARe: Retrieval Augmented Retrieval with In-Context Examples

Submitted: 2024-09-27 · Updated: 2025-02-05

Keywords: Retrieval · Embedding models · In-Context Learning

Reviews and Discussion

Official Review (Rating: 6)

This paper investigates whether training with in-context examples can enhance the in-domain performance of LLM-based retriever models and improve their generalization to different unseen tasks. It introduces a simple training strategy that identifies examples using BM25 during training and performs inference with in-context examples. Experiments were conducted on the BEIR benchmark and a reasoning-intensive retrieval benchmark, RAR-b.

Strengths

  1. The paper explores in-context examples for retrieval, which are widely employed in decoder-only LLMs, to enable the model to incorporate the semantic and pattern information of examples for adjusting its output embeddings across various retrieval tasks with differing user intents. This approach could potentially be a step forward from instruction-based retrieval models to ICL-based retrieval models.
  2. Despite the simplicity of the training method, the paper provides an extensive analysis of how the quality, quantity, and selection of in-context examples impact the model's performance.

Weaknesses

  1. It is unclear whether the in-context examples truly contribute during training: Since all examples are retrieved using BM25, it raises the question of whether they merely act as a form of query expansion. Deeper experimentation is needed to address this, such as retrieving only similar passages and incorporating them into the training process. As shown in Table 4, using only documents during inference sometimes yields good results; thus, what happens if doc-only is used during training as well?
  2. This paper lacks an in-depth analysis of the generated embeddings. For instance, why does adding examples lead to performance declines on some test sets, such as ArguAna and ClimateFEVER? Further attribution analysis, like exploring the impact of examples on attention patterns, could deepen the study and enhance its contributions to the research community.
  3. The training strategy is too simplistic: A more sophisticated training strategy could be explored. The current approach is too basic. For example, the impact of using in-batch negatives, a commonly employed technique in retrieval training, remains unexplored. Specifically, if in-batch negatives come from different tasks, could they help in better training the ICL capabilities due to differing examples and retrieval intents?
  4. The analytical experiments should include results on the complete BEIR benchmark or at least all out-of-domain (OOD) results.

Questions

  1. In Figure 3, only part of OOD test sets from BEIR are selected. Could you supplement with the complete set to observe trends, especially since these selected test sets are relatively short?
  2. In Table 4, the performance with doc-only is also quite good for several test sets, such as CQA, NFCorpus, and Touche2020. Given that examples can be viewed as a form of query expansion, how do you explain the improvement brought by doc-only, and what is its relationship with ICL training? Can training with doc-only also bring performance gain?
  3. To truly demonstrate the ICL capabilities, evaluating more embedding tasks and scenarios, such as classification, reranking, and clustering, might be necessary. Have you considered adding other tasks from MTEB that can better test the ICL ability?
Comment

Thank you for the review! We address your concerns below and in our updated draft (changes are highlighted in blue).

W1/Q2

In-context Learning vs. Query expansion

We agree that training and evaluating with different input formats demonstrates the effect of the input format more conclusively than a setting where we only change the evaluation input format.

We provide results on training and evaluating with different formats (including doc-only) in Table 4 of the updated paper. We find that even in this setting, our proposed “Instruct + IC” format (i.e., Regular) outperforms the other settings, such as the query-only or document-only input formats.

W2

This paper lacks an in-depth analysis of the generated embeddings.

We appreciate the reviewer’s suggestion to conduct additional analyses. Prior literature has explored the impact of ICL example format [1], order [2], and diversity [3] on downstream performance in decoder-only LLMs. Although such analyses are beyond the scope of this work, future studies could extend this line of research by examining aspects like attention patterns to gain deeper insights into the role of ICL in retriever models. Please also see our general response on format-sensitivity of ArguAna.

W3

The training strategy is too simplistic & in-batch negatives

We believe that this simplicity is a strength rather than a limitation, as it allows for clarity in isolating the impact of our proposed approach. Moreover, we validate its effectiveness comprehensively across multiple large-scale retrieval benchmarks and provide detailed ablations and analytical experiments.

Our method builds upon the commonly used in-batch negatives strategy from prior work (e.g., LLM2Vec [4]), where each in-batch negative is randomly sampled from different tasks, as described in Line 3 of Algorithm 1.
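For concreteness, here is a minimal sketch of a contrastive loss with in-batch negatives of the kind described above. This is an illustrative simplification, not our exact implementation of Algorithm 1; the function name, temperature value, and tensor layout are placeholders.

```python
import torch
import torch.nn.functional as F

def info_nce_in_batch(query_emb, doc_emb, temperature=0.05):
    """Contrastive loss with in-batch negatives.

    query_emb and doc_emb are (batch_size, dim) tensors; row i of
    doc_emb is the positive document for row i of query_emb, and the
    remaining rows (sampled from other tasks) act as negatives.
    """
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    # (batch_size, batch_size) similarity matrix: diagonal entries are
    # positives, off-diagonal entries are the in-batch negatives.
    logits = query_emb @ doc_emb.T / temperature
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)
```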

The focus of our work is to investigate whether retriever models can effectively leverage in-context examples, laying a foundation for future studies. These could explore developing more sophisticated objectives, such as improved methods for selecting in-batch negatives, to further enhance training with in-context examples.

W4/Q1

Analytical experiments on all OOD

We have extended Figure 3 and Figure 4 with all OOD datasets from BEIR. We present these results in Figure 5 and Figure 6 in the Appendix due to space constraints. Results on the full set of OOD datasets from BEIR for the other analytical experiments are already provided in the Appendix. We observe trends similar to those on the previously selected datasets:

(i) Retrieved/Retrieved performs the best on average, and Retrieved during eval is second best on average. Using retrieved examples either in training or evaluation (or both) offers performance enhancements on 7/10 datasets. Note that TRECCOVID is the only dataset where training retriever checkpoints further with Instruct led to a sharp decrease in performance with respect to the base model.

(ii) We observe that more similar examples yield improvements in performance over the base model on 6/10 datasets, and do not offer any additional gains on the rest.

Q3

Evaluating more embedding tasks and scenarios, such as classification, reranking, and clustering

Our focus in this paper is to study whether the retrieval tasks in particular can be augmented with in-context examples. We have extensively evaluated this hypothesis on multiple large-scale retrieval benchmarks with detailed ablations and analytical experiments.

Generally, trends on retrieval tasks generalize to other embedding tasks as well [4, 5]. We did not evaluate on these tasks mainly due to the extensive compute cost (and because this is not the focus of our work). We agree with the reviewer that exploring ICL for other embedding-based tasks is an interesting direction for future work.

References

[1] Reframing Instructional Prompts to GPTk’s Language (Mishra et al., ACL 2022)

[2] Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity (Lu et al., ACL 2022).

[3] How Do In-Context Examples Affect Compositional Generalization? (An et al., ACL 2023)

[4] LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders (BehnamGhader et al., COLM 2024)

[5] Improving Text Embeddings with Large Language Models (Wang et al., ACL 2024)

Comment

Thank you once again for your review. As the discussion period is nearing its end, we wanted to follow up to confirm that we have adequately addressed your concerns and to kindly request if you would consider reevaluating your assessment.

To summarize our updates:

  1. Query Expansion: We added results showing that the RARe format outperforms Doc-Only even when training with Doc-Only.
  2. Training Strategy: Emphasized the simplicity of our strategy, which facilitates isolating the impact of our proposed approach.
  3. OOD Experiments: Extended the analyses of Figures 3 and 4 to all BEIR OOD datasets.
  4. Clarified the scope of our work and highlighted analysis of the embeddings and other embedding tasks as future research directions.

We hope the additions clarify our contributions and resolve your concerns. Please let us know if you have any further questions or if there are additional issues we can address.

Comment

Thank you for addressing my questions and concerns. I'm willing to raise my rating from 5 to 6.

Comment

We thank the reviewer for the reassessment of our work. We are happy to know that your concerns have been addressed.

Official Review (Rating: 5)

The paper explores the application of in-context learning to improve the performance of retriever models in information retrieval tasks. The authors introduce RARe, a method where in-context examples (query-document pairs related to the target query) are used during the fine-tuning of retriever models. This approach differs from direct application in LLMs where in-context examples are prepended at inference time. RARe enhances retrieval performance by up to 2.72% in nDCG across various datasets.

Strengths

  • While in-context learning is not new, its application to retriever models in this specific manner is new. The paper creatively adapts this technique, showing potential new directions for retrieval model improvements.

Weaknesses

  • While the application of in-context learning to retrievers is new, it might not strike everyone as a groundbreaking shift.
  • The approach, as well as the performance gain, looks a bit incremental. There might also be a tradeoff between efficiency, accuracy, and ease of use of the method.
  • Discussion on how RARe might scale or face challenges in real-world applications beyond the benchmarks used could be expanded.

Questions

See weaknesses.

Comment

Thank you for the review! We address your concerns below and in our updated draft (changes are highlighted in blue).

W1

While the application of in-context learning to retrievers is new, it might not strike everyone as a groundbreaking shift.

While the application of in-context learning to retrievers may seem simple, we show that it does not function effectively out-of-the-box (Figure 3), unlike its success with generative models. We do believe that demonstrating this distinction and proposing a method for adapting in-context learning for retrievers represents a meaningful contribution to the field.

W2 & W3

Ease-of-use and real-world applications

Our approach is straightforward to use, as it involves simply pre-pending in-context examples in the form of (q, d+) pairs to the query using a lightweight retriever like BM25. However, a potential challenge in real-world applications is the availability of suitable (q, d+) pairs, similar to the requirements for in-context learning in generative models, as discussed in L520.
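To make the input format concrete, here is a minimal sketch of how (q, d+) pairs might be prepended to a query. The template text, delimiters, and function name are placeholders for illustration, not our verbatim prompt format.

```python
def build_rare_query(instruction, icl_examples, test_query):
    """Prepend retrieved (q, d+) in-context examples to the test query.

    icl_examples is a list of (example_query, positive_document) pairs,
    e.g. the top-k pairs returned by BM25 over the training queries.
    The returned string is what the retriever encodes into the query
    embedding, both during fine-tuning and at inference time.
    """
    parts = [instruction]
    for q, d_pos in icl_examples:
        parts.append(f"Example query: {q}\nExample document: {d_pos}")
    parts.append(f"Query: {test_query}")
    return "\n\n".join(parts)

augmented = build_rare_query(
    "Retrieve passages that answer the question.",
    [("what causes tides", "Tides are caused by the gravitational pull of ...")],
    "why does the moon affect ocean levels",
)
```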

Efficiency-accuracy tradeoff

This aspect is discussed in our experiments in Table 6. For the reviewer's convenience, we summarize the discussion here: for very small corpus sizes (<500K documents), the performance gains from RARe may not justify the additional latency. However, in large-corpus scenarios (>4M documents), the relative added latency shrinks, making RARe an effective solution for such retrieval tasks. In real-world scenarios, as the index size grows, the added latency of RARe becomes less apparent.

Performance gain looks a bit incremental

Please see our general response.

Comment

Thank you for the response. The response answers some of my questions, but I am still not convinced about the significance of the approach. Therefore, I tend to maintain my score.

Comment

We thank the reviewer for their response.

We believe this approach provides meaningful insights into how retrieval can leverage in-context examples, highlighting a promising and underexplored direction in the field.

We would greatly appreciate it if the reviewer could further elaborate on their concerns regarding the significance of the approach, which would allow us to address them more effectively.

Comment

Dear Reviewer 3GiC,

We have addressed your concerns with additional context and clarification in the “Follow-up and Additional Clarifications” post, emphasizing the significance of our work. As the rebuttal phase draws to a close, we wanted to ensure our response sufficiently addresses your points. We kindly request you to consider revisiting your evaluation. Should you have any further questions or additional concerns, we would be more than happy to address them promptly.

Official Review (Rating: 5)

This paper explores adding in-context examples to the query for retrievers. It first retrieves related pairs using BM25 and then explores different setups, such as training from LLM checkpoints as well as from existing well-trained retriever checkpoints. The idea is quite natural to explore, and it is a good plus for the retrieval community given its positive results as well as its extensive ablation study.

优点

  1. Studying the addition of in-context examples to the query is an under-explored topic for the retrieval community.
  2. The results are positive and the ablations are quite extensive.

Weaknesses

  1. The baselines are a bit weak, and I am not sure how much value the less-than-2% improvement adds. There are other leading models on the BEIR benchmark; how does the proposed method compare to those, and would those methods improve after adding in-context examples?

Questions

Why is the search time for DBPedia in Table 6 much lower than for Quora, even though DBPedia has a much larger corpus?

Comment

Thank you for the review! We address your concerns below and in our updated draft (changes are highlighted in blue).

W1

I am not sure how much value the less-than-2% improvement adds...

Please see our general response.

There are other leading models on the BEIR benchmark; how does the proposed method compare to those, and would those methods improve after adding in-context examples?

Theoretically, RARe can be applied to any retriever model, as it is a straightforward approach that involves prepending (q, d+) pairs to the original query and fine-tuning with this setting. However, we find that fine-tuning with publicly released datasets often hurts the performance of the top retriever systems on the leaderboard. For example, when we fine-tuned one of these models (Linq-Embed-Mistral, currently 6th on the leaderboard) further with the "Instruct" format (i.e., without any in-context examples) using our public data, it led to a significant decrease in performance. To compare fine-tuning with in-context examples against fine-tuning without them, we would need to train with their training data mixture, which is not available.

Model (Linq-Embed-Mistral)    Average (BEIR)
Base                          60.19
Instruct (FT)                 58.85

Thus, we chose three different base architectures – LLM2Vec, E5-Mistral, and RepLLaMA, which are top-performers among models trained on publicly available training data. We do emphasize that our focus is not solely on achieving state-of-the-art performance but on developing a conceptual/empirical understanding of the potential of incorporating in-context examples into retrievers.

Q1. DBPedia vs. Quora search time

This is because we report the total time required to search for all queries in the test set, not the time per individual query. At the individual level, a DBPedia query takes longer than a Quora query, since its corpus is larger, as you noted. However, the DBPedia test set has fewer queries than Quora (Table 1 in [1]). To avoid confusion, we have updated Table 6 by normalizing by the number of queries, reporting latency in milliseconds per query.

[1] BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models (Thakur et al., NeurIPS 2021 Datasets and Benchmark)

Comment

Thank you once again for your review. As the discussion period is nearing its end, we wanted to follow up to confirm that we have adequately addressed your concerns and to kindly request if you would consider reevaluating your assessment.

To summarize our updates:

  1. We contextualized the importance of 2% performance gains on large-scale retrieval benchmarks and additionally provided statistical significance tests.
  2. We discussed how RARe can be applied to different models and mentioned experiments on standard fine-tuning (Instruct) with Linq-Embed-Mistral.
  3. We updated the efficiency table with normalized latency numbers to avoid confusion caused by differences in query set sizes.

We hope the additions clarify our contributions and resolve your concerns. Please let us know if you have any further questions or if there are additional issues we can address.

Comment

Thank you for the response. I am still not convinced by the concept of adding in-context examples for retrieval, given the limited improvement over the baseline.

Comment

Thank you for your thoughtful feedback.

We would like to kindly re-emphasize that while the improvements may appear modest, they are significant in the context of large-scale IR benchmarks. This is particularly evident when considering the papers that introduced task-specific instructions (Instruct) [1, 2]. These works achieved comparable ranges of improvements over their respective baselines, and the approach has since been adopted by leading models on BEIR, including those used as baselines in our work.

Furthermore, there is precedent in both the Machine Learning and Information Retrieval literature (cited below) where improvements of 1-3% nDCG@10 are considered impactful, since nDCG is highly sensitive to the ordering of the retrieved results [12]. For example, concurrent work [8] https://openreview.net/forum?id=wfLuiDjQ0u explores the use of random in-context examples to improve text representations, and reports an improvement of 0.97% on retrieval and 1.25% on other representation tasks, relative to zero-shot fine-tuning (i.e. Instruct).
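As a toy illustration of this sensitivity (our own example, not a number taken from any cited paper): promoting a single relevant document from rank 6 to rank 4 among ten candidates already shifts nDCG@10 by roughly 3 to 4 points.

```python
import numpy as np
from sklearn.metrics import ndcg_score

# One query, ten candidate documents; 1 = relevant, 0 = not relevant.
relevance = np.array([[1, 0, 1, 0, 0, 1, 0, 0, 0, 0]])

# Two rankings that differ only in whether the third relevant document
# (index 5) is placed at rank 6 or at rank 4.
scores_a = np.array([[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]])  # relevant at ranks 1, 3, 6
scores_b = np.array([[10, 9, 8, 5, 6, 7, 4, 3, 2, 1]])  # relevant at ranks 1, 3, 4

print(ndcg_score(relevance, scores_a, k=10))  # ~0.87
print(ndcg_score(relevance, scores_b, k=10))  # ~0.91
```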

Given these considerations, we believe our findings contribute meaningfully towards future work in this space. We are open to further suggestions to improve the clarity or framing of this aspect.

[1] One Embedder, Any Task: Instruction-Finetuned Text Embeddings (Su et al., ACL 2023)

[2] Task-aware Retrieval with Instructions (Asai et al., ACL Findings 2023)

[3] Learning List-Level Domain-Invariant Representations for Ranking (Xian et al., NeurIPS 2023)

[4] RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses (Zhuang et al., SIGIR 2023)

[5] How to Train Your Dragon: Diverse Augmentation Towards Generalizable Dense Retrieval (Lin et al., EMNLP 2023)

[6] Document Expansion by Query Prediction (Nogueira et al., 2019)

[7] Promptriever https://openreview.net/forum?id=odvSjn416y (ICLR 2025 Submission)

[9] Fine-Tuning LLaMA for Multi-Stage Text Retrieval (Ma et al., SIGIR 2024)

[10] Rethinking the Role of Token Retrieval in Multi-Vector Retrieval (Lee et al., NeurIPS 2023)

[11] Adversarial Retriever-Ranker for Dense Text Retrieval (Zhang et al., ICLR 2022)

[12] A Theoretical Analysis of NDCG Type Ranking Measures (Wang et al., COLT 2013)

[13] Contextual Document Embeddings https://openreview.net/forum?id=Wqsk3FbD6D (ICLR 2025 Submission)

Official Review (Rating: 5)

This paper employs BM25 to retrieve top-k relevant queries and their associated documents as in-context examples, enhancing query representation when using an LLM as the retrieval encoder. Extensive experiments, including comprehensive ablation studies, were conducted to demonstrate the effectiveness of the proposed method.

Strengths

  1. The paper is well-structured and easy to follow.
  2. Extensive experiments were conducted on recent base models and popular datasets, such as MS-MARCO, BEIR, and RAR-b, demonstrating that the proposed RARe method enhances baseline models, including Llama and other LLM-based retrievers.
  3. Detailed ablation studies investigate critical questions, such as the impact of retrieved vs random in-context examples and whether semantically closer in-context examples are more beneficial.

Weaknesses

  1. There are no statistical significance tests to confirm that the improvements over baselines in Tables 1 and 2 are meaningful.
  2. Only a basic retriever, BM25, was applied.
  3. In Figure 3, the performance of ArguAna’s Retrieved/Random setup is worse than Random/Random, which is inconsistent with other datasets and lacks an explanation.
  4. Figure 4 appears to contradict the paper’s premise, which relies on similar queries and their associated documents to enhance query representation. When score@Top-1 improves, relative improvement drops are observed on NFCorpus and FiQA2018, without further clarification. Additionally, only relative results are reported, making it difficult to discern the actual trend of NDCG against score@Top-1.

Questions

  1. What are the actual trends of NDCG in Figure 4?
  2. Can you explain why ArguAna’s Retrieved/Random setup is worse than Random/Random in Figure 3?
Comment

Thank you for the review! We address your concerns below and in our updated draft (changes are highlighted in blue).

W1

There are no statistical significance tests

Please see the general response.

W2

Only a basic retriever, BM25, was applied.

We appreciate the reviewer’s suggestion to explore stronger retrievers than BM25. We chose BM25 for its efficiency, which offers notable performance despite its simplicity. Using more powerful in-context example retrievers could potentially provide even further gains on our method, which can be studied in future exploration (L526).
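For reference, here is a minimal sketch of how top-k in-context examples could be selected with BM25, using the rank_bm25 package purely for illustration; our actual indexing, tokenization, and data structures may differ, and the names below are hypothetical.

```python
from rank_bm25 import BM25Okapi

# Pool of training examples: (query, positive_document) pairs.
train_pairs = [
    ("what causes tides", "Tides are caused by the gravitational pull of ..."),
    ("symptoms of the flu", "Influenza commonly presents with fever, cough, ..."),
]

# Index the training queries with BM25.
bm25 = BM25Okapi([q.lower().split() for q, _ in train_pairs])

def retrieve_icl_examples(test_query, k=3):
    """Return the top-k (q, d+) pairs whose query is most similar to the
    test query under BM25."""
    scores = bm25.get_scores(test_query.lower().split())
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [train_pairs[i] for i in top]
```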

W3/Q2

Performance on ArguAna

Please see the general response.

W4/Q1

Figure 4 appears to contradict the paper’s premise.

We apologize for the confusion. In Figure 4, we initially reported the score@Top-1 after normalizing the scores of the top 5 retrieved examples. This is because the BM25 implementation we use returns (un-normalized) scores only for the retrieved documents, which are not between 0 and 1. This may have inadvertently biased the x-axis: if all retrieved examples had high scores, the score@Top-1 would appear lower. We have updated the figure by computing similarity scores on the retrieved examples using an off-the-shelf model (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) and grouping by score@Top-1. The updated figures show improvements with higher similarity on 6/10 datasets, while the gains are less pronounced on the rest.
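For clarity, here is a simplified sketch of how such similarity scores can be computed with the off-the-shelf model; this is an assumed reconstruction of the re-plotting step, not the exact script used for Figure 4.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def score_at_top1(test_query, retrieved_example_queries):
    """Cosine similarity between the test query and its most similar
    retrieved example query (used here to group queries by score@Top-1)."""
    emb_q = model.encode(test_query, convert_to_tensor=True)
    emb_ex = model.encode(retrieved_example_queries, convert_to_tensor=True)
    return float(util.cos_sim(emb_q, emb_ex)[0].max())
```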

Comment

Thank you once again for your review. As the discussion period is nearing its end, we wanted to follow up to confirm that we have adequately addressed your concerns and to kindly request if you would consider reevaluating your assessment.

To summarize our updates:

  1. Conducted statistical significance tests, and discussed the importance of 2% performance gains on large-scale retrieval benchmarks.
  2. Contextualized the use of BM25 for retrieving in-context examples, highlighting its simplicity and efficiency. Future work may explore stronger retrieval models to enhance performance further.
  3. Discussed the mismatch in prompt format for ArguAna.
  4. Clarified Figure 4 and re-plotted it. The updated figures show gains from higher-similarity examples on 6/10 datasets, with smaller improvements on the remainder.

We hope the additions clarify our contributions and resolve your concerns. Please let us know if you have any further questions or if there are additional issues we can address.

Comment

Thank you for your clarification. However, I will keep my score unchanged regarding the ideas and overall performance.

Comment

Dear Reviewer PxAH,

Thank you for your valuable feedback and comments. With the rebuttal phase drawing to a close, we wanted to follow up to ensure that our responses have sufficiently addressed your concerns. We kindly ask that you consider revisiting your evaluation. If you have any further questions or additional concerns, we would be more than happy to address them. Additionally, please refer to our two general responses, which include some additional clarifications and remarks.

Comment

We appreciate the reviewers for their valuable comments and feedback. We are encouraged to hear from the reviewers that our paper studies an under-explored topic in retrieval models (SvxX, 3Gic), and presents a step forward towards new research directions in this space (Uz4Y, 3Gic) by creatively adapting this technique (3Gic). All reviewers highlight the extensive (Uz4Y, SvxX, 3Gic) and critical (PxAH) nature of our experiments and ablation studies, spanning multiple datasets and base architectures.

All changes in the paper are highlighted with blue font in the updated PDF. We address common concerns below:

i) Significance of performance gain: An overall improvement of 2% nDCG is statistically significant, given that the aggregate benchmarks encompass over 50,000 queries and 31 million documents. Prior work [1, 2, 3, 4] in this domain also shows relatively modest gains on this aggregate benchmark (0.3% to 1.6% on average over the respective baselines). While the improvements might seem modest in absolute terms, they are noteworthy in the context of benchmarks like BEIR.

In the updated draft, we provide statistical significance tests on each dataset for retriever and decoder-only checkpoints, respectively in Tables 7-10. In the retriever-checkpoint setting, RARe (E5-Mistral) and RARe (LLM2Vec) are statistically significant (p < 0.05) compared to their Instruct counterparts on the BEIR dataset on average. RARe (E5-Mistral) is statistically significant compared to Instruct on RAR-b on average. In the LLM-checkpoint setting, RARe (Llama-3.1-8B-Instruct) is statistically significant compared to Instruct on BEIR, and statistically significant compared to Promptriever on RAR-b.
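For reference, here is a sketch of how such a per-dataset comparison can be run. We do not state the exact test here, so this sketch assumes a paired t-test over per-query nDCG@10 scores; a Wilcoxon signed-rank test or paired bootstrap would be set up analogously.

```python
from scipy import stats

def paired_significance(ndcg_rare, ndcg_instruct, alpha=0.05):
    """ndcg_rare / ndcg_instruct: per-query nDCG@10 values for the same
    queries under the two systems. Returns the p-value and whether the
    difference is significant at the given alpha."""
    _, p_value = stats.ttest_rel(ndcg_rare, ndcg_instruct)
    return p_value, p_value < alpha
```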

We also emphasize that our focus is not solely on achieving state-of-the-art performance but on developing a conceptual/empirical understanding of the potential of incorporating in-context examples into retrievers.

ii) Low performance on ArguAna: We think this could be because the in-context examples for the ArguAna dataset are synthetically generated queries from prior work [6]. This leads to a mismatch between the queries in the in-context examples and the original test queries (L403/footnote 1), i.e., the queries provided in the in-context examples are significantly shorter than the test queries. Similar format-sensitivity behavior has been observed in LLMs [5].

References

[1] Adversarial Retriever-Ranker for Dense Text Retrieval (Zhang et al., ICLR 2022)

[2] Unsupervised Dense Information Retrieval with Contrastive Learning (Izacard et al., TMLR 2022)

[3] Rethinking the Role of Token Retrieval in Multi-Vector Retrieval (Lee et al., NeurIPS 2023)

[4] LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders (BehnamGhader et al., COLM 2024)

[5] Reframing Instructional Prompts to GPTk’s Language (Mishra et al., ACL 2022)

[6] BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models (Thakur et al., NeurIPS 2021 Datasets and Benchmark)

Comment

Dear Reviewers PxAH, SvxX, and 3Gic,

Thank you once again for your valuable feedback and comments, which have helped us improve the paper. We wanted to follow up to confirm that we have adequately addressed your concerns and to kindly request if you would consider reevaluating your assessment.

We would like to further clarify and contextualize the significance of our work (3Gic). We do believe that demonstrating how ICL behaves differently in generative models and retrievers, and proposing a method to adapt ICL for retrievers, represents a meaningful contribution. Notably, we found a concurrent work that is also under review at ICLR 2025:
[1] https://openreview.net/forum?id=wfLuiDjQ0u

Similar to us, [1] studies the incorporation of in-context examples to enhance text representations (showing a performance improvement of 0.97% on retrieval and 1.25% on other representation tasks on average over zero-shot fine-tuning), highlighting the relevance of our work.

Additionally, our work has some key differences, mentioned below:

  • Our method incorporates retrieved in-context examples (obtained from BM25), as opposed to randomly selected in-context examples.
  • We experiment with multiple architectures/checkpoints and training setups (training from decoder LLM vs retriever checkpoint, i.e. Table 1 & Table 2), while [1] evaluates on multiple types of tasks.

We also conduct extensive analysis of the quality, quantity, and format of in-context examples, some of which were highlighted as limitations by the reviewers of [1]. In summary, we:

  • Demonstrate that providing in-context examples does not work out-of-the-box on retriever models, even when using nearest-neighbor examples (Figure 2).
  • Show that even in the absence of multi-task learning or task-specific instruction, in-context learning helps out-of-domain performance (Table 1, training from decoder LLM checkpoint on MS-MARCO).
  • Examine alternative formats of examples, such as plain query expansion with documents (Table 4).
  • Study the role in-context example similarity on performance (Figure 3, Figure 4).
  • Analyze the impact of adding negative examples in the prompt (Table 5).
  • Quantify the efficiency-performance tradeoff of adding in-context examples (Table 6).

We kindly request the reviewers to take these points into consideration and reassess the scores assigned to our paper, especially in light of the evaluations and scores provided to the concurrent work. We are confident that our work makes a notable contribution, and we greatly appreciate your thoughtful assessment of our submission.

AC Meta-Review

This paper proposes a novel framework, RARe, which augments the input with in-context learning examples for retrieving relevant documents. Unlike existing ICL, which helps LLMs during the inference/generation process, RARe explores the benefit of ICL in producing semantic embeddings to assist retrieval. By using BM25 to select ICL examples and applying contrastive training, the retrieval model shows better performance on benchmark datasets compared with existing baselines.

Strengths:

  • The idea of exploring ICL to enhance retrieval is novel and interesting.
  • The experiments showcase the advantage of the proposed method compared with existing baselines.
  • Additional experiments, such as varying in-context formats and significance tests, were provided during the rebuttal phase, which is helpful.

Weaknesses:

  • The benefit of this approach and its potential implications in real scenarios remain uncertain. The performance improvement may not be profound, especially when weighed against the increased latency. This raises the question of whether it is worth the extra effort when doing retrieval.
  • The scope of this method is a bit limited. The ICL encoding strategy could potentially be applied to more tasks besides retrieval.
  • The performance could be sensitive to the ICL retrieval method, which is not fully explored.

Additional Comments on the Reviewer Discussion

  • Almost all reviewers are concerned about the potential benefit of this method, considering the trade-off between latency and performance gain. The authors provided further discussion, which did not fully address this concern.
  • Additional concerns regarding significance tests and the influence of in-context example formats were raised and addressed by the authors.
  • Extending the scope of this approach to more diverse tasks was raised by reviewers.
Final Decision

Reject