PaperHub
Overall: 6.5 / 10 · Poster · 4 reviewers
Ratings: 7, 6, 7, 6 (min 6, max 7, std 0.5)
Confidence: 4.0
COLM 2025

Imagine All The Relevance: Scenario-Profiled Indexing with Knowledge Expansion for Dense Retrieval

OpenReview | PDF
Submitted: 2025-03-18 · Updated: 2025-08-26
TL;DR

We propose SPIKE, a dense retrieval framework that decomposes documents into scenarios.

Abstract

Keywords
Information Retrieval · Reasoning Intensive Retrieval · Dense Retrieval · Reasoning · LLM

Reviews and Discussion

Review
Rating: 7

This paper introduces SPIKE, a dense retrieval method aimed at improving the performance of existing dense models by augmenting the documents in the retrieval corpus with scenarios. The scenarios are composed of a main topic, key aspects, information needs, and explanations. The method works by indexing the main topic and the explanation and, at test time, computing retrieval scores over this index in addition to the original corpus. SPIKE brings notable improvements on the reasoning-intensive retrieval benchmark BRIGHT, and the paper includes additional experiments, such as downstream performance.

Reasons to Accept

  • The paper is well motivated and well written. It introduces a novel method to improve existing retrieval systems, along with appropriate ablations.
  • The method brings clear improvements to both the retrieval component and downstream performance.
  • The human evaluation is well designed and demonstrates a clear benefit of the method. The fine granularity of the grading scale gives more insight into how SPIKE improves over the existing method.

Reasons to Reject

  • The main results table could include more baselines. For example, although SPIKE is not directly comparable against the reranking or query rewriting baselines from the original paper, it is still important to understand how their performance compares, given the popularity of these methods.
  • The strongest results of the method are demonstrated on the weaker models (E5 and SFR), whereas the benefit on the stronger models (GRIT and Qwen) is much smaller (only ~1 absolute point). This raises the question of whether the method will still be relevant as the community shifts to stronger retriever models.
  • The paper lacks discussion of the system implications. SPIKE requires two dense indices: one for the original documents and another for the scenarios. This means that the size of the dense index doubles, which is non-trivial in practice due to the size of real-world corpora, and inference requires searching over another index. It may then even be cheaper to use a reranking method at test time in terms of GPU memory and computational cost (if searching over a large index is expensive).

Questions for the Authors

  • I would be interested to see what Figure 6 would look like with a larger range of α, from the lower end, 0.0 (completely ignoring the original document), up to 1.0 (only using the original document).
  • What would Table 2 (training the scenario generator on BEIR documents only) look like with the stronger retriever models?
  • Have you tried using different scenario generators, such as starting from a stronger model than Llama 3B? This is not necessary for the paper, but I'd be curious to see whether performance can be improved even further.
Comment

Full Results of Figure 6 - Part2

| Component | α | Bio. | Earth. | Econ. | Psy. | Sus. | Rob. | Stack. | Leet. | Pony | Aops | TheoQ. | TheoT. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M+E | 0.0 | 25.9 | 29.9 | 15.7 | 21.3 | 17.9 | 15.9 | 15.8 | 23.4 | 17.9 | 1.2 | 14.5 | 18.5 | 18.2 |
| | 0.1 | 26.4 | 30.2 | 16.2 | 22.1 | 19.1 | 16.6 | 16.5 | 24.8 | 18.0 | 1.9 | 16.2 | 19.9 | 19.0 |
| | 0.2 | 27.4 | 31.5 | 16.4 | 23.0 | 19.6 | 17.1 | 16.7 | 26.7 | 18.0 | 2.4 | 17.7 | 21.2 | 19.8 |
| | 0.3 | 28.1 | 32.2 | 16.8 | 23.7 | 20.3 | 17.8 | 16.0 | 28.4 | 18.1 | 3.3 | 18.5 | 23.3 | 20.5 |
| | 0.4 | 28.0 | 34.0 | 17.0 | 23.9 | 19.9 | 18.3 | 16.7 | 29.4 | 18.1 | 4.3 | 20.7 | 23.7 | 21.2 |
| | 0.5 | 27.9 | 34.4 | 17.5 | 23.5 | 20.4 | 18.6 | 16.8 | 29.2 | 18.6 | 6.5 | 25.8 | 25.8 | 22.1 |
| | 0.6 | 26.6 | 32.9 | 17.7 | 21.4 | 21.0 | 18.9 | 16.9 | 29.1 | 18.3 | 6.2 | 22.3 | 28.9 | 21.7 |
| | 0.7 | 25.9 | 33.0 | 18.2 | 20.6 | 20.6 | 18.4 | 16.2 | 29.4 | 17.5 | 7.0 | 23.4 | 28.4 | 21.6 |
| | 0.8 | 25.0 | 32.0 | 18.2 | 20.1 | 20.4 | 17.9 | 14.9 | 29.5 | 15.9 | 7.5 | 24.5 | 27.1 | 21.1 |
| | 0.9 | 23.9 | 28.8 | 18.1 | 18.7 | 19.6 | 16.9 | 14.1 | 28.6 | 15.1 | 7.5 | 25.2 | 26.6 | 20.3 |
| I+E | 0.0 | 24.0 | 25.9 | 14.5 | 21.9 | 15.3 | 13.2 | 14.7 | 22.5 | 9.9 | 1.2 | 11.7 | 13.0 | 15.6 |
| | 0.1 | 24.3 | 27.8 | 14.8 | 22.4 | 16.0 | 13.4 | 15.0 | 25.0 | 11.7 | 1.4 | 13.4 | 14.8 | 16.7 |
| | 0.2 | 25.4 | 29.4 | 15.9 | 22.1 | 18.0 | 14.7 | 14.8 | 26.7 | 13.1 | 1.4 | 15.0 | 16.8 | 17.8 |
| | 0.3 | 26.2 | 31.0 | 16.5 | 22.7 | 18.5 | 15.9 | 15.0 | 27.7 | 14.0 | 2.2 | 16.6 | 19.3 | 18.8 |
| | 0.4 | 26.9 | 32.0 | 16.9 | 23.0 | 19.2 | 16.8 | 16.1 | 27.7 | 14.7 | 3.0 | 18.8 | 21.1 | 19.7 |
| | 0.5 | 27.1 | 32.3 | 17.6 | 21.9 | 19.5 | 16.7 | 16.1 | 28.7 | 15.3 | 4.3 | 20.3 | 24.0 | 20.3 |
| | 0.6 | 27.0 | 32.5 | 18.1 | 21.7 | 19.6 | 16.9 | 16.0 | 29.4 | 15.7 | 4.9 | 21.6 | 25.5 | 20.7 |
| | 0.7 | 25.7 | 31.8 | 18.2 | 20.9 | 19.6 | 17.5 | 16.0 | 29.7 | 15.5 | 5.7 | 22.8 | 26.3 | 20.8 |
| | 0.8 | 24.8 | 31.0 | 17.9 | 20.2 | 19.2 | 17.5 | 14.8 | 30.3 | 14.9 | 7.0 | 24.0 | 27.3 | 20.7 |
| | 0.9 | 23.2 | 29.2 | 17.8 | 18.5 | 19.0 | 16.7 | 14.2 | 29.0 | 14.2 | 6.9 | 25.0 | 26.6 | 20.0 |
Comment

Answers to Questions

About Scenario Generator

We only considered the Llama-3B model as the backbone for the scenario generator because, due to resource limitations in our system, it was not possible to train larger models.
Also, due to resource and budget constraints, we used GPT-4o as the teacher model, which is relatively inexpensive compared to reasoning models. We remain curious about how SPIKE's performance might change if we were able to use different backbones or teacher models for the scenario generator.

Full Results of Table 2

Below is the expanded version of Table 2.

| Model | Bio. | Earth. | Econ. | Psy. | Sus. | Rob. | Stack. | Leet. | Pony | Aops | TheoQ. | TheoT. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GRIT | 25.0 | 32.8 | 19.0 | 19.9 | 18.0 | 17.3 | 11.6 | 29.8 | 22.0 | 8.8 | 25.1 | 21.1 | 20.9 |
| +SPIKE | 23.0 | 27.6 | 18.4 | 20.2 | 19.5 | 18.3 | 15.8 | 31.5 | 17.5 | 8.9 | 25.6 | 25.9 | 21.0 |
| QWEN | 30.9 | 36.2 | 17.7 | 24.6 | 14.9 | 13.5 | 19.9 | 25.5 | 14.4 | 27.8 | 32.9 | 32.9 | 24.3 |
| +SPIKE | 31.3 | 40.4 | 22.8 | 24.4 | 24.5 | 16.2 | 22.8 | 24.6 | 11.1 | 13.6 | 27.8 | 33.3 | 24.4 |

Full Results of Figure 6 - Part1

Below is the result table for the larger range of α.

| Component | α | Bio. | Earth. | Econ. | Psy. | Sus. | Rob. | Stack. | Leet. | Pony | Aops | TheoQ. | TheoT. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| O | 1.0 | 18.8 | 26.0 | 15.5 | 15.8 | 18.5 | 16.4 | 9.8 | 28.7 | 4.8 | 7.1 | 26.1 | 26.8 | 17.9 |
| M | 0.0 | 18.9 | 20.6 | 13.7 | 18.2 | 12.2 | 6.5 | 11.8 | 17.1 | 2.7 | 1.4 | 11.1 | 18.7 | 12.7 |
| | 0.1 | 21.0 | 22.7 | 14.4 | 19.7 | 13.3 | 7.7 | 11.7 | 18.9 | 3.6 | 1.6 | 13.4 | 19.3 | 13.9 |
| | 0.2 | 22.4 | 24.5 | 15.0 | 20.5 | 14.5 | 8.7 | 12.3 | 21.7 | 4.0 | 2.0 | 15.3 | 20.6 | 15.1 |
| | 0.3 | 23.1 | 26.7 | 16.1 | 21.0 | 17.0 | 9.6 | 12.7 | 23.9 | 5.2 | 2.2 | 16.5 | 23.0 | 16.4 |
| | 0.4 | 23.9 | 27.9 | 17.6 | 21.9 | 17.9 | 9.9 | 13.1 | 25.3 | 7.1 | 2.7 | 18.5 | 23.3 | 17.4 |
| | 0.5 | 24.6 | 28.8 | 18.2 | 22.5 | 18.6 | 10.9 | 13.0 | 27.1 | 9.2 | 3.4 | 20.0 | 24.7 | 18.4 |
| | 0.6 | 25.2 | 30.0 | 18.7 | 22.0 | 19.7 | 12.5 | 14.0 | 28.8 | 10.8 | 4.5 | 21.5 | 27.1 | 19.5 |
| | 0.7 | 24.8 | 29.3 | 19.5 | 21.4 | 19.4 | 14.0 | 14.1 | 29.3 | 12.5 | 5.5 | 22.7 | 27.3 | 20.0 |
| | 0.8 | 24.7 | 28.2 | 18.0 | 19.6 | 20.1 | 15.1 | 13.9 | 29.1 | 13.6 | 6.3 | 24.1 | 27.1 | 20.0 |
| | 0.9 | 23.6 | 28.6 | 18.0 | 18.2 | 19.5 | 16.4 | 13.6 | 29.2 | 16.7 | 6.7 | 25.8 | 27.0 | 20.3 |
| I | 0.0 | 22.9 | 22.6 | 15.1 | 17.2 | 12.5 | 10.2 | 14.6 | 19.6 | 7.3 | 1.3 | 12.0 | 13.7 | 14.1 |
| | 0.1 | 23.9 | 24.4 | 16.5 | 18.9 | 14.2 | 11.1 | 14.8 | 22.3 | 8.2 | 2.0 | 13.9 | 15.6 | 15.5 |
| | 0.2 | 25.2 | 26.3 | 17.7 | 20.1 | 15.2 | 12.2 | 14.7 | 24.9 | 10.0 | 2.5 | 16.0 | 16.2 | 16.7 |
| | 0.3 | 26.1 | 27.7 | 17.5 | 20.6 | 16.0 | 12.6 | 14.3 | 26.2 | 12.0 | 3.4 | 18.3 | 16.9 | 17.6 |
| | 0.4 | 26.8 | 28.8 | 18.0 | 20.3 | 16.7 | 12.9 | 13.7 | 27.9 | 14.4 | 4.0 | 19.9 | 18.2 | 18.4 |
| | 0.5 | 26.5 | 30.0 | 17.9 | 20.7 | 17.3 | 14.7 | 14.7 | 28.7 | 15.9 | 4.9 | 20.8 | 19.1 | 19.3 |
| | 0.6 | 25.9 | 31.1 | 18.5 | 20.1 | 17.6 | 15.5 | 15.5 | 29.3 | 16.7 | 5.4 | 21.8 | 22.7 | 20.0 |
| | 0.7 | 24.9 | 30.6 | 18.3 | 19.3 | 18.3 | 16.7 | 15.7 | 29.4 | 16.5 | 6.0 | 22.3 | 25.2 | 20.3 |
| | 0.8 | 25.1 | 29.1 | 18.8 | 18.7 | 18.6 | 16.5 | 14.4 | 29.3 | 15.7 | 6.7 | 23.7 | 25.1 | 20.2 |
| | 0.9 | 23.5 | 28.6 | 18.1 | 18.1 | 18.7 | 17.1 | 14.0 | 28.7 | 14.7 | 7.2 | 25.1 | 26.1 | 20.0 |
| E | 0.0 | 27.2 | 27.6 | 16.2 | 20.7 | 14.0 | 8.3 | 12.6 | 20.2 | 15.9 | 0.4 | 10.9 | 14.0 | 15.7 |
| | 0.1 | 27.7 | 28.6 | 16.7 | 21.7 | 14.5 | 9.2 | 13.0 | 22.6 | 16.3 | 0.6 | 13.0 | 14.8 | 16.6 |
| | 0.2 | 28.0 | 30.1 | 16.9 | 22.3 | 15.2 | 10.1 | 13.5 | 25.2 | 16.5 | 1.1 | 15.5 | 18.8 | 17.8 |
| | 0.3 | 28.7 | 30.9 | 17.4 | 23.8 | 16.8 | 11.3 | 14.3 | 26.7 | 16.2 | 1.4 | 17.2 | 23.2 | 19.0 |
| | 0.4 | 29.2 | 32.0 | 17.7 | 24.4 | 18.3 | 12.7 | 14.4 | 27.7 | 17.3 | 2.4 | 18.5 | 24.9 | 20.0 |
| | 0.5 | 29.4 | 33.0 | 17.8 | 24.0 | 19.2 | 13.6 | 15.1 | 28.8 | 17.3 | 3.5 | 20.6 | 27.9 | 20.9 |
| | 0.6 | 29.0 | 32.6 | 19.0 | 22.6 | 19.3 | 15.0 | 15.3 | 30.4 | 18.2 | 4.7 | 21.9 | 28.2 | 21.3 |
| | 0.7 | 28.3 | 32.3 | 19.2 | 22.2 | 19.7 | 16.1 | 15.6 | 30.4 | 18.0 | 5.1 | 23.8 | 28.5 | 21.6 |
| | 0.8 | 26.5 | 32.0 | 18.5 | 20.3 | 19.6 | 16.8 | 15.0 | 29.8 | 17.0 | 6.5 | 24.8 | 28.1 | 21.2 |
| | 0.9 | 25.0 | 30.1 | 17.9 | 19.4 | 19.7 | 16.6 | 14.0 | 29.3 | 16.6 | 7.1 | 25.5 | 26.5 | 20.6 |
Comment

Thank you for your interest in our research and for suggesting areas for improvement. We provide detailed responses to your comments and questions below.

About baseline

Thank you for this valuable suggestion. We agree that including more baselines, such as reranking and query rewriting methods, would help provide a more comprehensive comparison. In the revised version, we will add additional baseline results to better contextualize SPIKE’s performance relative to these widely used approaches.

Performance on Stronger Models

First, as seen in the upper tiers of actual retrieval benchmarks such as MTEB and BRIGHT leaderboards, even an improvement of 1 absolute point can lead to significant changes in model ranking. As models approach the top ranks, the performance gap between them becomes extremely narrow. Therefore, the ~1-point performance gain achieved by SPIKE on strong retrievers (e.g., GRIT, Qwen) should still be regarded as a meaningful improvement.

Furthermore, while the aggregate improvements may seem limited, breaking down the results by domain highlights SPIKE's practical value in addressing the blind spots of retrieval models.
The table below summarizes the average performance of each model on the natural language and code domains.

| Baseline | Natural Language Avg. | Natural Language Improv. | Code Avg. | Code Improv. |
|---|---|---|---|---|
| E5 | 18.5 | - | 14.9 | - |
| +SPIKE | 23.7 | 27.9% | 20.4 | 36.5% |
| SFR | 20.5 | - | 14.7 | - |
| +SPIKE | 24.5 | 19.2% | 20.3 | 38.3% |
| GRIT | 22.9 | - | 20.2 | - |
| +SPIKE | 23.2 | 1.31% | 21.6 | 6.8% |
| QWEN | 24.9 | - | 18.3 | - |
| +SPIKE | 29.5 | 18.8% | 20.7 | 12.8% |

Looking at the domain-wise performance, we observe that all models consistently perform worse in the code domain than in the natural language domain. Notably, even the strong models (GRIT, QWEN) already achieve high performance in the natural language domain, but in the code domain their baseline performance does not exceed 20, indicating a limitation of existing dense retrieval models.

In terms of performance improvement, for the natural language domain we find that the relative gain from SPIKE drops to around 1% as the retrievers become stronger (GRIT). In contrast, for the code domain, where all models continue to struggle, SPIKE consistently delivers an improvement of around 7% or more across all models.
This demonstrates that SPIKE can be highly effective in addressing blind spots, such as the code domain, that are not well handled by existing dense retrieval models.

These results lead us to believe that SPIKE has practical value in consistently addressing the blind spots of future retrievers.

Additionally, SPIKE's scenario generator is currently trained using reasoning traces from GPT-4o. Prior work (Rank1 [1]) has shown that substantial gains can be achieved by training LLM rerankers on the outputs of state-of-the-art reasoning models. We expect that if SPIKE's scenario generator is further trained using even more powerful or specialized reasoning teachers, additional improvements could be realized even for strong retrievers. We plan to investigate this direction in future work.

Latency Analysis

We acknowledge that the additional scenario index introduced in SPIKE requires extra computation and may lead to a certain increase in latency. However, we would like to emphasize that the amount of additional computation incurred by SPIKE during inference does not increase proportionally with the index size (i.e., corpus size N).

In theory, introducing a scenario index would require computing scenario scores (r_s) for all scenarios associated with every document in the corpus. However, in practice, we do not adopt such an inefficient approach. Instead, SPIKE applies a much more optimized strategy.

Specifically, to retrieve the top-k documents, SPIKE first computes the document-level scores (r_d) for all documents, and then, only for the top-k' documents (where k' is a hyperparameter satisfying k < k' << N), it rapidly retrieves the corresponding scenarios using the document as a key. Scenario-level scores (r_s) are computed only for these retrieved scenarios (the r_s of the remaining scenarios is treated as zero).

As a result, the additional computation at inference is strictly limited to the top-k′ documents and their associated scenario embeddings, regardless of the overall corpus size or the total number of scenarios. Therefore, even as the corpus grows, the extra latency is affected only by k', enabling efficient system operation.
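
For concreteness, below is a minimal sketch of this two-stage scoring, assuming in-memory NumPy embedding matrices and a simple weighted fusion of document- and scenario-level scores; the array names (`doc_emb`, `scen_emb`, `scen_doc_ids`) and the exact fusion rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def spike_retrieve(q_emb, doc_emb, scen_emb, scen_doc_ids,
                   k=10, k_prime=1000, alpha=0.5):
    """Sketch of SPIKE-style two-stage retrieval (illustrative only).

    q_emb:        (dim,)   query embedding
    doc_emb:      (N, dim) document embeddings (document index)
    scen_emb:     (M, dim) scenario embeddings (scenario index)
    scen_doc_ids: (M,)     id of the source document of each scenario
    """
    n_docs = doc_emb.shape[0]

    # Stage 1: document-level scores r_d over the whole corpus.
    r_d = doc_emb @ q_emb                                  # (N,)

    # Keep only the top-k' candidate documents (k < k' << N).
    cand = np.argpartition(-r_d, k_prime)[:k_prime]

    # Stage 2: scenario-level scores r_s, computed only for scenarios
    # attached to the k' candidates; all other scenarios contribute 0.
    best_r_s = np.zeros(n_docs)
    mask = np.isin(scen_doc_ids, cand)
    r_s = scen_emb[mask] @ q_emb
    for doc_id, score in zip(scen_doc_ids[mask], r_s):
        best_r_s[doc_id] = max(best_r_s[doc_id], score)

    # Fusion (assumed convex combination with relevance weight alpha).
    r_final = alpha * r_d + (1.0 - alpha) * best_r_s
    return np.argsort(-r_final)[:k]                        # top-k document ids
```

Because the second stage only touches scenarios attached to the k' candidates, the per-query overhead is bounded by roughly k' × S scenario comparisons, independent of the corpus size N.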

Furthermore, all the experimental results reported in our paper were obtained using this retrieval optimization. In fact, we have confirmed that applying this optimization results in the same retrieval performance as the naive approach without the optimization.

[1] Rank1: Test-Time Compute for Reranking in Information Retrieval; Weller et al.

Comment

Thanks for the detailed reply.

I'm concerned that a majority of the performance boost comes from training SPIKE on in-domain documents for BRIGHT. Specifically, looking at the full results for Table 2 in the reply above, SPIKE only brings an improvement of 0.1 on average with these stronger retrievers. This raises the question of whether it is still worth leveraging this technique on an OOD dataset with a strong enough retriever.

In terms of the latency analysis, I understand that only retrieving the top k' helps with the system optimization, but it would be helpful to see specific numbers to understand the differences with SPIKE. Specifically:

  • the GPU hours it requires to generate scenarios and build an index out of it
  • GPU memory required to host the indices with and without SPIKE
  • query latency when doing the second stage k' → k retrieval. Additionally, what k' did you use in the paper?
Comment

We genuinely hope that our responses have adequately addressed the concerns you raised. Should there be any remaining issues, we would be more than willing to provide further clarification.

Comment

Thanks for the detailed experiments; my major concerns are addressed. Excited to see the final revision with the additional discussions. I have updated my score accordingly.

Comment

Thank you for taking the time to review our research!

The key factor is the amount of training data for the scenario generator, not the use of in-domain documents.

The results shown in Table 2 are from a scenario generator trained only on the BEIR documents (i.e., OOD data), excluding BRIGHT from the training data used in Table 1. As a result, the number of training documents for the scenario generator (C) is nearly halved.

We believe that the main reason for the limited performance improvement of SPIKE in Table 2 is the insufficient amount of training data. To validate this, we conducted additional experiments in which we trained the scenario generator using only BEIR data, but with a much larger training dataset. (Due to time constraints, we conducted these experiments only on the Biology and Economics datasets, where performance drops were observed for the stronger retrievers, GRIT and Qwen.)

Below are the main experimental results:

| Model | Biology | Economics |
|---|---|---|
| GRIT | 25.0 | 19.0 |
| +SPIKE (BEIR; C=4000) | 23.0 | 18.4 |
| +SPIKE (BEIR+BRIGHT; C=8000) | 27.8 | 20.0 |
| +SPIKE (BEIR; C=20000) | 30.4 | 21.7 |
| QWEN | 30.9 | 17.7 |
| +SPIKE (BEIR; C=4000) | 31.3 | 22.8 |
| +SPIKE (BEIR+BRIGHT; C=8000) | 32.4 | 23.7 |
| +SPIKE (BEIR; C=20000) | 34.6 | 23.2 |

The results show that even when training only on BEIR (OOD) data, the scenario generator achieves substantial performance improvements if given a sufficiently large training dataset (C). Notably, the BEIR C=20,000 setting outperforms the setting that uses only 8,000 examples including in-domain data (BEIR+BRIGHT; the setting used in our main table).

In summary, the performance gain from SPIKE depends more on the scale of the training data than on whether in-domain data is used. Even for strong retrievers, increasing the amount of training data led to greater improvements.
These findings suggest that scaling up the scenario generator's training data can lead to substantial performance gains, even when applied to strong retrievers. We believe that leveraging even more extensive training data for the scenario generator in the future could yield further improvements for strong retriever models.

Details for latency

  • Since GPU time varies depending on the system and corpus size, we report GPU time per 10K documents. Scenario generation requires 9.4 minutes per 10K documents, while encoding and indexing require 12.4 minutes per 10K documents (scenarios). Scenario generation was measured on an A100 GPU, and embedding encoding and indexing were measured on an A6000 GPU.
  • If we denote the memory required for the document index as D, the total memory required becomes D + S × D, where S is the average number of scenarios per document (S = 3 in our paper). In general, with an embedding size of 4096, the index requires approximately 153 MB per 10K documents (scenarios); see the rough estimate after this list.
  • We use k' = 1000 in our paper.
  • For query latency, the average was 0.19 seconds per query without SPIKE and 0.48 seconds per query with k' = 1000. These latency values were measured on CPU, not GPU. (In our work, all inference experiments, except for indexing and scenario generation, were conducted on CPU.)
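
As a rough, back-of-the-envelope check of the index-size figures above (assuming uncompressed 4096-dimensional float32 vectors; the exact number depends on precision and index overhead):

```python
# Index-size estimate per 10K vectors, assuming 4096-dim float32 embeddings (illustrative).
dim, bytes_per_float, n_items = 4096, 4, 10_000
index_mb = n_items * dim * bytes_per_float / 1e6   # ~164 MB, in the same ballpark as
                                                    # the ~153 MB reported above

# Total memory with SPIKE: document index D plus S scenarios per document (D + S * D).
S = 3
total_mb = index_mb + S * index_mb                  # ~4x the document-only index

print(f"{index_mb:.0f} MB per 10K vectors; {total_mb:.0f} MB per 10K documents with SPIKE")
```
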
Review
Rating: 6

This work proposes Scenario-Profiled Indexing with Knowledge Expansion (SPIKE), a dense retrieval framework that considers Main Topic, Key Concepts, Information Needs, and Explanations. The work conducts sufficient experiments to prove the effectiveness of SPIKE.

Reasons to Accept

  • This work provides the code for the implementation.
  • The work considers both the document and the scenario for retrieval, which is relatively new for RAG.
  • This work conducts sufficient experiments to demonstrate the effectiveness of SPIKE, and it tries different models and datasets.
  • This work further analyzes SPIKE in the ablation study, including the scenario components, the relevance weight, and the generalization of the scenario generator.

Reasons to Reject

  • The information in Figure 1 is too dense. The authors may consider highlighting the important information and adding more description to the caption.
  • The method is sensitive to the relevance-weight hyperparameter.
  • In Figure 6, the authors do not use more recent models, and the backbone model is relatively old.

Questions for the Authors

N/A

Comment

Thank you sincerely for reviewing our research and providing valuable feedback.
First, regarding Figure 1, we will revise it by highlighting the important information as you mentioned, and adding more detailed explanations to the caption.
Our responses to the additional points you raised are as follows:

SPIKE consistently outperforms the baseline across all hyperparameter settings.

While Figure 6 shows that performance varies depending on the value of the relevance weight, all configurations still outperform the baseline model without SPIKE. In fact, some configurations achieve even higher average scores than our default setting (see Appendix Table 7).
This demonstrates that SPIKE is robust to hyperparameter variations while still delivering consistent performance gains.
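
For reference, the fused score implied by this α ablation can be written as below; this is a sketch consistent with Figure 6 (α = 1.0 reduces to the document-only score, α = 0.0 ignores the original document), where S(d) denotes the set of scenarios generated for document d. The exact formulation in the paper may differ.

```latex
r_{\mathrm{final}}(q, d) = \alpha \, r_d(q, d) + (1 - \alpha) \, \max_{s \in S(d)} r_s(q, s)
```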

About the Baseline Model Used in Analysis

We selected E5-Mistral-7B for our analysis (Figure 6) because it is still widely used as a strong retrieval baseline in recent work across diverse domains [1][2].
Our intention was to show that SPIKE is beneficial not only for retrieval tasks but also for downstream applications such as RAG, and we aimed to offer insights that can generalize to a wide range of practical use cases.
Hence, we conducted our analysis on a well-established and broadly applicable baseline model.

That said, we fully agree with the suggestion that SPIKE should be analyzed with a broader set of retrieval models. We are currently conducting experiments with additional retrieval baselines and will include the full analysis in the final version of the paper.

[1] ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations; Wang et al.

[2] MIRAGE: A Metric-Intensive Benchmark for Retrieval-Augmented Generation Evaluation; Park et al.

Comment

Thank the authors for the response. I will keep my score after the above response.

Review
Rating: 7

This paper proposes SPIKE, which is designed to tackle the challenge of finding implicitly relevant information that requires reasoning beyond simple keyword or semantic matching. SPIKE achieves this by transforming documents into structured representations (scenarios) capturing the reasoning process needed to connect potential information needs to document content. The method employs a powerful large language model (LLM) to create a scenario-augmented dataset, which is then used to train a smaller, more efficient scenario generator. During the search process, SPIKE considers scenario-level relevance alongside document-level relevance, leading to improved retrieval performance, especially for reasoning-intensive tasks and non-natural-language content like code. Experimental results demonstrate that SPIKE not only improves retrieval quality but also enhances the user experience by providing valuable context and explanations.

接收理由

  • Improved document representation: this is achieved by expanding each document with multiple scenarios (main topic + explanation). This is helpful for surfacing relations that are not explicitly stated in the document, as well as for non-natural-language documents such as code.
  • Consistent improvement over baseline systems: experimental results show consistent improvement over baseline retrieval systems as well as retrieval-augmented generation systems. The additive improvement over reasoning queries in Section 4.2 further demonstrates the value of augmenting the document index with scenarios.
  • Improved interpretability: providing scenario-profiled explanations makes retrieved results easier to understand and thus enhances the user experience.
  • Detailed ablation study.

Reasons to Reject

  • Impact of scenario quality on retrieval performance: since the core idea of SPIKE is to augment documents by generating scenarios, it is necessary to analyse how scenario generation quality affects retrieval performance, and also how to measure the quality of the generated scenarios.
  • Regarding the latency impact: computing r_final requires
    • r_d, and
    • max{r_s} over the scenarios s of d.
    • Although both indexes are computed offline, the second part, which needs to gather max{r_s}, still requires extra computation. Analysis of this overhead as well as the index size increase should be discussed.
Comment

Thank you for your valuable feedback and suggestions to improve our work! We fully agree with your suggestions and will incorporate them into our work.

Scenario Generation Quality

We fully agree that the quality of generated scenarios can substantially influence retrieval effectiveness. While we did not introduce an explicit metric for scenario quality in this work, we conducted a human evaluation (see Figure 4) to indirectly assess generated scenario quality.

However, as you pointed out, we acknowledge that our current work lacks a thorough analysis of how scenario quality directly impacts retrieval performance. In future work, we plan to evaluate scenario quality across multiple versions of the scenario generator using human evaluations, and to analyze the correlation between these human evaluation scores and retrieval accuracy. We believe this additional analysis will provide a clearer understanding of the relationship between scenario quality and retrieval effectiveness, and we appreciate your suggestion.

Latency Analysis

We acknowledge that the additional scenario index introduced in SPIKE requires extra computation and may lead to a certain increase in latency. However, we would like to emphasize that the amount of additional computation incurred by SPIKE during inference does not increase proportionally with the index size (i.e., corpus size N). In theory, introducing a scenario index would require computing scenario scores (r_s) for all scenarios associated with every document in the corpus. However, in practice, we do not adopt such an inefficient approach. Instead, SPIKE applies a much more optimized strategy.

Specifically, to retrieve the top-k documents, SPIKE first computes the document-level scores (r_d) for all documents, and then, only for the top-k' documents (where k' is a hyperparameter satisfying k < k' << N), it rapidly retrieves the corresponding scenarios using the document as a key. Scenario-level scores (r_s) are computed only for these retrieved scenarios (the r_s of the remaining scenarios is treated as zero).

As a result, the additional computation at inference is strictly limited to the top-k′ documents and their associated scenario embeddings, regardless of the overall corpus size or the total number of scenarios. Therefore, even as the corpus grows, the extra latency is affected only by k', enabling efficient system operation.

Furthermore, all the experimental results reported in our paper were obtained using this retrieval optimization. In fact, we have confirmed that applying this optimization results in the same retrieval performance as the naive approach without the above optimization.

We will include this analysis in the revision of our analysis section, and we appreciate your helpful suggestion in highlighting this important aspect.

Comment

Thanks for the response.

Regarding the approximate search approach: what is the impact of k' on retrieval accuracy, compared to exact (or exhaustive) search?

Comment

The value of K' has virtually no impact on retrieval accuracy

The table below shows the retrieval performance on BRIGHT for different values of K'.

| Baseline | Bio. | Earth. | Econ. | Psy. | Sus. | Rob. | Stack. | Leet. | Pony | Aops | TheoQ. | TheoT. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| E5-Mistral-7B | 18.8 | 26.0 | 15.5 | 15.8 | 18.5 | 16.4 | 9.8 | 28.7 | 4.8 | 7.1 | 26.1 | 26.8 | 17.9 |
| +SPIKE (K'=1000) | 25.9 | 33.0 | 18.2 | 20.6 | 20.6 | 18.4 | 16.2 | 29.4 | 17.5 | 7.0 | 23.4 | 28.4 | 21.6 |
| +SPIKE (K'=10000) | 25.7 | 33.0 | 18.2 | 20.6 | 20.6 | 18.4 | 16.2 | 29.4 | 17.2 | 7.0 | 23.5 | 28.2 | 21.5 |
| +SPIKE (K'=N; exhaustive search) | 25.7 | 33.0 | 18.2 | 20.6 | 20.6 | 18.4 | 16.2 | 29.4 | 17.2 | 7.0 | 23.5 | 28.2 | 21.5 |

As shown in the table, retrieval accuracy is virtually unchanged regardless of the value of K', whether K' is set to 1,000, 10,000, or N (i.e., exhaustive search). This result demonstrates that the approximate search approach in SPIKE maintains accuracy while providing high efficiency.
We will also include this result in the final version of our paper. Thank you for your interest in our work.

Comment

Thanks for the update.

Review
Rating: 6

This work proposes a document expansion approach called SPIKE, which generates scenarios related to a document through a multi-step process considering the document's topic, aspects of the topic, potential related information needs, and explanations of how the information needs are connected to the document. GPT-4o is used to generate a training dataset of scenarios, which is used to train a Llama 3B model that produces scenarios for all documents in the BRIGHT benchmark. Results on BRIGHT indicate that SPIKE improves the performance of a range of retrieval models, and improvements are also obtained in a RAG setting.

Reasons to Accept

  • SPIKE consistently improves performance on BRIGHT, which is a challenging dataset.
  • While GPT-4o is used to generate the training data, the 3B student model performs relatively well when used to expand the document collection.
  • Comparisons with existing document expansion approaches are reasonable (see Appendix A), though it would be nice to see a method using task-specific instructions. Given that they are clear baselines, the results with other document expansion approaches should also be included in the main results table rather than in an appendix.

Reasons to Reject

  • The method itself is closely connected to BRIGHT's implicit relevance, so it is not clear whether the method would harm performance on retrieval benchmarks with different types of relevance (e.g., BEIR). Methods like Promptagator (Dai et al.) have previously taken advantage of task-specific instructions, so one might expect this type of approach to also improve performance on BRIGHT.
  • In a real-world setting, the query would be a better indicator of the desired relevance type than the document collection is, so there is an argument for making query expansion task-specific rather than performing task-specific document expansion. Without results on different benchmarks, as mentioned above, it is not clear that it is safe to apply SPIKE to every document collection.
  • The results on BRIGHT are slightly lower than BM25 with GPT-4o CoT expansion (e.g., see Rank1, Weller et al.). On one hand, the retrieval models used in this work are substantially stronger than BM25; on the other hand, a 3B model trained on data from GPT-4o is being used to expand the documents.
Comment

Thank you for taking the time to review our research!

SPIKE also performs well on other document collections

The table below shows the retrieval performance of dense retrieval models on the BEIR benchmark, comparing their original performance against their SPIKE-enhanced versions.
Due to time and computational constraints, we selected 5 BEIR datasets with fewer than 1,000 queries and a corpus size under 500,000 documents for this experiment.

| Model | SciFact | NFCorpus | SCIDOCS | Touche2020 | TRECCOVID | Avg. |
|---|---|---|---|---|---|---|
| E5 | 76.4 | 38.6 | 16.3 | 26.4 | 87.3 | 49.0 |
| +SPIKE | 78.1 | 40.7 | 18.0 | 31.4 | 87.3 | 51.1 |
| SFR | 77.7 | 41.9 | 19.9 | 29.0 | 87.6 | 51.2 |
| +SPIKE | 78.1 | 42.2 | 20.8 | 32.4 | 87.2 | 52.1 |

Across these datasets, SPIKE consistently improves the average retrieval performance, demonstrating that its effectiveness is not limited to the BRIGHT benchmark.
These results indicate that SPIKE can be safely and effectively applied to diverse document collections, even corpora involving different types of relevance.

SPIKE Complements Query-Side Expansion Approaches

While we acknowledge that query-side expansion can yield significant improvements in many retrieval scenarios, we emphasize that document-side expansion, as implemented in SPIKE, can effectively complement query-side expansion methods. Thanks to the offline nature of document expansion, combining both approaches allows retrieval systems to achieve enhanced performance and broader coverage with only minimal additional online latency, as demonstrated in our experiments (Please refer to Figure 3 and Table 4).

SPIKE addresses the limitations of existing dense retrieval models.

We would like to clarify that the point raised regarding BM25 + CoT queries reflects not a limitation of SPIKE, but rather a limitation of dense retrieval models in generalizing to LLM-generated queries.
As reported in the BRIGHT paper, BM25 with CoT query expansion consistently outperforms dense retrievers with the same expanded queries, even though dense retrievers outperform BM25 on original queries.

The following table from BRIGHT [1] presents a comparison of BM25 and various dense retrieval models, both before and after applying GPT-4 CoT query expansion.

| Baseline | Avg. |
|---|---|
| BM25 | 14.5 |
| +CoT Query Expansion | 27.0 |
| E5 | 17.9 |
| +CoT Query Expansion | 22.1 |
| Qwen | 22.5 |
| +CoT Query Expansion | 24.8 |

According to the table, dense retrieval models typically surpass BM25 in standard settings, but when CoT-based query expansion is applied, their performance drops below that of BM25.
BRIGHT [1] attributes this result to the following reason:
"possibly because BM25 can adapt to different queries, while LLM-generated queries are out-of-distribution for trained models."

This implies that the relatively strong performance of BM25 + CoT queries does not stem from limitations of SPIKE, but rather from the inherent limitations of dense retrievers in handling such queries.

We would like to claim that SPIKE can address this limitation. By enriching the document index with scenario representations that are closely aligned with LLM-generated (expanded) queries, SPIKE enables dense retrieval models to respond more robustly to these queries.

For instance, in the BRIGHT benchmark, Qwen with GPT-4o CoT query expansion achieves an nDCG@10 of 24.8, which is notably lower than the 26.5 reported for BM25 + GPT-4o CoT in Weller et al. [2]. However, when SPIKE is applied on top of Qwen + CoT, the performance increases to 26.2, effectively closing the gap with BM25 + CoT.

This evidence clearly demonstrates SPIKE's consistent ability to enhance retrieval performance across different query types. Furthermore, it suggests that as dense retrievers continue to evolve and are trained on LLM-generated queries, the benefits offered by SPIKE will become even more significant.

[1] BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval; Su et al.

[2] Rank1: Test-Time Compute for Reranking in Information Retrieval; Weller et al.

Comment

Thanks for your response. The BEIR results address a key concern, and I'm increasing my review score accordingly.

Comment

To all reviewers,

We would like to express our sincere gratitude to all reviewers for the thoughtful feedback and the considerable attention given to our work. The valuable comments and suggestions provided have greatly contributed to improving the quality of our work. We will do our best to reflect these insights and further strengthen our work. Thank you once again for the time and effort invested in reviewing our paper.

Final Decision

This paper presents SPIKE, a dense retrieval framework that augments documents with scenario-based representations to improve reasoning-intensive retrieval tasks. All the reviewers gave positive scores. The core strength lies in SPIKE's consistent improvements across various retrieval models and its novel approach to capturing implicit relevance through scenario decomposition. The method shows particularly strong gains in some domains. While the authors' rebuttal addresses some concerns about latency and provides additional experiments showing improvements with larger training datasets, the fundamental trade-offs between performance gains and system complexity remain. The work makes a solid contribution to reasoning-aware retrieval. It would be great to see improvements addressing the practical limitations and the narrow evaluation scope in the final version.