Semi-Parametric Retrieval via Binary Bag-of-Tokens Index
A semi-parametric neural retrieval system that supports both parametric and non-parametric indexes
Abstract
Reviews and Discussion
This paper introduces a Semi-parametric Disentangled Retrieval (SiDR) strategy that integrates the advantages of non-parametric and parametric retrieval, achieving competitive performance through the late parametric mechanism while maintaining high efficiency.
Strengths
- The proposed SiDR integrates the advantages of non-parametric and parametric retrieval, achieving competitive performance while maintaining high efficiency.
- The paper provides clear explanations of the SiDR method, experiments, and analysis, making it easy to understand.
Weaknesses
- In Table 1, the performance of SiDR in parametric retrieval scenario does not surpass ANCE, even with the late parametric mechanism, yet the authors still report it as the best result with the bolded values.
- In Table 2, under the non-parametric retrieval scenario, SiDR fails to surpass BM25 on most of the datasets.
- The late parametric mechanism is not novel, and the experiments should include more comparisons to various methods in this category.
- Because of the comparison results with SiDR and ANCE/BM25, the experimental analysis should include an examination of the reasons for SiDR's better performance in the late parametric scenario.
Questions
- How does SiDR compare with other late parametric methods, and what are the reasons for its success in the late parametric scenario?
We thank the reviewer for their thoughtful feedback and for acknowledging the clarity of our explanations. We would like to address the concerns and questions below:
Weakness 1: In Table 1, the performance of SiDR in parametric retrieval scenario does not surpass ANCE, even with the late parametric mechanism, yet the authors still report it as the best result with the bolded values.
RW1: Thank you for pointing this out to us! We have corrected it in our revised manuscript (Table 1).
To further address the concern about our underperformance compared to ANCE, we conducted additional experiments. While ANCE uses in-training retrieval for negative mining, it requires periodic index rebuilding (every 10k batches), adding approximately 32 hours (8 hours per re-build * 4 times) for re-indexing in our setup. To match this increased training complexity, we used SiDR_full to build an embedding index once to retrieve negatives, and continued training for another 80 epochs. Below is the performance on NQ, including the full training and indexing costs.
| Model | Top1 | Top5 | Top20 | Cost |
|---|---|---|---|---|
| ANCE | - | 70.7 | 81.4 | 32h (for indexing only) |
| SiDR_full | 49.1 | 69.3 | 80.7 | 9h |
| SiDR_full (re-index for neg) | 53.7 | 72.2 | 86.0 | 25h |
These results demonstrate that our method, with added training complexity, achieves better performance than ANCE. Notably, our entire training process requires less time than ANCE's re-indexing alone.
Weakness 2: In Table 2, under the non-parametric retrieval scenario, SiDR fails to surpass BM25 on most of the datasets.
RW2: Thank you for sharing your concern. In our analysis in Appendix D, we demonstrate that learning plays a critical role in our effectiveness over BM25. The observed underperformance on BEIR can be attributed to the fact that there was no training performed for these datasets. Additionally, most BEIR datasets are labeled using a BM25-like retrieval system, making them inherently challenging to surpass - a concern also noted in Section 6 of BEIR's own paper. Our experiments consistently show that SiDR_beta outperforms BM25 in-domain, with larger improvements observed as the corpus size increases (e.g., Wiki21m has 21 million texts, significantly larger than those in BEIR).
Weakness 3: The late parametric mechanism is not novel, and the experiments should include more comparisons to various methods in this category.
RW3: In Appendix E, we include various late parametric methods that combine BM25 or SiDR_beta with more recent SOTA bi-encoder retrievers. Our findings indicate that using BM25 as the first-stage retriever, followed by a stronger SOTA retriever for re-ranking, can outperform SiDR_beta (m=100). However, SiDR_beta can also serve as the first-stage retriever, consistently achieving better performance.
Weakness 4: Because of the comparison results with SiDR and ANCE/BM25, the experimental analysis should include an examination of the reasons for SiDR's better performance in the late parametric scenario.
RW4: Thank you for raising this point. We have addressed your concerns as follows:
- Regarding the comparison to ANCE: In our RW1 above, we justified the effectiveness of our approach compared to ANCE in terms of cost, and conducted experiments showing that a similar cost investment in our method can lead to better results than ANCE.
- Regarding the comparison to BM25: In Appendix D, we demonstrate that the significant improvement over BM25 is primarily due to the learned query term weights. We attribute SiDR_beta's superior performance to its strong learning capabilities, which enable more effective contextualized query term weighting when the document side uses a BoT index.
- For the late parametric scenario, please refer to our RQ1 below.
Question 1: How does SiDR compare with other late parametric methods, and what are the reasons for its success in the late parametric scenario?
RQ1: Late parametric methods are two-stage approaches that utilize a first-stage retriever with a non-parametric index (e.g., BM25 or SiDR_beta), followed by a bi-encoder to rerank and cache the document embeddings. The success in the late parametric scenario (e.g., SiDR_beta (m=100)) can be attributed to the advantages at each stage over other alternatives.
- In the first stage, SiDR_beta outperforms BM25 largely due to the learned query term weights, which are contextualized and learned from tasks, as demonstrated in Appendix D.
- In the second stage, SiDR_full outperforms other neural retrievers with comparable training costs (e.g., DPR, VDR).
- In Appendix E, we show that using a stronger retriever in the second stage can lead to better results (e.g., BM25 + E5 outperforms SiDR_beta (m=100)). Yet, SiDR_beta remains an effective choice as the first-stage retriever, consistently delivering better performance than BM25 (e.g., SiDR_beta + E5 outperforms BM25 + E5).
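To make the two-stage flow concrete, here is a minimal late parametric sketch, assuming a precomputed binary BoT matrix (docs x vocab), a learned query embedding over the same vocabulary, and an `encode_passage` function for the second stage; all names are illustrative and not the authors' implementation.

```python
import numpy as np

def late_parametric_search(q_emb, bot_index, passages, encode_passage, m=100):
    """Stage 1: score the learned query embedding against the fixed binary BoT index.
    Stage 2: embed only the top-m candidate passages and re-rank them with full
    parametric embeddings (these embeddings can be cached for reuse)."""
    first_stage = bot_index @ q_emb                 # (num_docs,) non-parametric scores
    top_m = np.argsort(-first_stage)[:m]            # candidates from the BoT index
    cand_embs = np.stack([encode_passage(passages[i]) for i in top_m])
    rerank = cand_embs @ q_emb                      # parametric re-ranking scores
    return top_m[np.argsort(-rerank)]
```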
Thanks for the authors' responses, which have addressed my concerns. The proposed semi-parametric architecture is interesting and important for balancing the efficiency and effectiveness of information retrieval.
We sincerely thank Reviewer YeDM for their thoughtful feedback and for recognizing the value of our proposed semi-parametric architecture in balancing efficiency and effectiveness in information retrieval.
The paper focuses on sparse neural retrieval using a SPLADE-like model. It proposes to train a neural sparse bi-encoder V for both queries Q and documents D in such a way that it works well when computing similarity in the standard bi-encoder fashion, i.e., as the inner product of V(Q) and V(D) as well as through the inner product of V(Q) and the binary bag-of-tokens representation of a document.
This can be quite useful since it permits using a neural encoder together with non-parametric document representations (which do not require encoding) for effective and efficient retrieval.
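For concreteness, a minimal sketch of that dual scoring, assuming the encoder outputs a |vocab|-sized weight vector and the non-parametric side is a binary indicator over the same vocabulary; names and sizes are illustrative, not the paper's code.

```python
import numpy as np

VOCAB_SIZE = 30522  # assuming a bert-base-uncased vocabulary

def bot_vector(token_ids):
    """Binary bag-of-tokens: 1 for every vocabulary id that appears in the text."""
    v = np.zeros(VOCAB_SIZE, dtype=np.float32)
    v[list(set(token_ids))] = 1.0
    return v

def score_parametric(q_emb, d_emb):
    """Standard bi-encoder scoring: inner product of two learned vectors, V(Q)·V(D)."""
    return float(q_emb @ d_emb)

def score_semi_parametric(q_emb, d_token_ids):
    """Semi-parametric scoring: V(Q) against the binary BoT document representation,
    i.e. the sum of the learned query weights over the tokens present in the document."""
    return float(q_emb @ bot_vector(d_token_ids))
```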
Initially I had several concerns regarding the fairness of the comparison to BM25 (multi- vs. single-thread and running BM25 on GPU) as well as the possibility of re-ranking using a stronger model. However, additional experiments and clarifications resolved the issues.
There are also some clarity issues (see the weaknesses section), which I think could be resolved or mitigated in the final version. The paper is somewhat hard to read, but it's understandable with some effort.
I would like to emphasize that in the final version the authors need to compare against a fast implementation of BM25 (i.e., using PISA or similar) and make it clear that the comparison is vs. single-thread BM25 (this is only for efficiency evaluation; no need to re-run all experiments).
Last but not least, the fact that BM25 with re-ranking is a strong baseline is definitely worth highlighting and connecting to prior work, e.g.: Leonhardt, Jurek, et al. "Efficient neural ranking using forward indexes." Proceedings of the ACM Web Conference 2022. 2022.
Strengths
See the summary for more details.
- An interesting approach
- Semi-parametric index retrieval is easy to carry out on a GPU
- A substantial evaluation using the BEIR datasets (plus additional QA datasets)
- Promising results
Weaknesses
After discussion with authors, I have come to a conviction that the paper is generally solid, but presentation can be improved.
There are several examples of where the paper is hard to read:
- No explanation for the need of VDR (I still didn't quite get your explanations in the rebuttal).
- The whole section revisiting MLM is very confusing. You say "We provide insights into the consistencies between semi-parametric alignment and masked language model pre-training", but this is already a pretty confusing phrase. To begin with, you probably meant to say "connections". This is a small thing, but a lot of such small things here and there complicate reading. Compare, e.g., how this was introduced in the SPLADE paper: We describe in Section 3.1 how the Masked Language Modeling (MLM) head of Pre-trained Language Models can be used to represent tokens in a sequence as vectors in the vocabulary space.
- Then the whole section "PARAMETRIC AND NON-PARAMETRIC REPRESENTATION" is a bit of a slog to read. You need to explain what you mean by "parametric" vs "non-parametric" vs "semi-parametric" representations. Moreover, when you talk about parametric representations it's IMHO better to talk about *learned* token weights vs. token weights computed using a hand-crafted rule (as in BM25). Don't get me wrong, parametric vs non-parametric isn't wrong per se, but the distinction between parametric and non-parametric ML is a bit blurry. Plus, in the context of learned representations it is not frequently used IMHO.
BoT appears almost out of the blue. I think it was only mentioned in the MLM-section. IMHO, it's better to start by explaining that a document/query can be represented by a bag of words or tokens. In some cases these tokens are actual document/query tokens and in some cases they are "predicted" by the encoder model. Moreover, tokens can be weighted and non-weighted. Weights can come either from a model or from token-counting (as in BM25). This would have been a much less confusing explanation IMHO. Then, you can explain how weights are coming from an MLM head of a BERT model. This would have been much more logical, IMHO.
Questions
N/A
We thank the reviewer for appreciating our idea and for providing valuable feedback and suggestions. We would like to address the concerns and questions below:
Weakness 1: The paper is somewhat confusing and hard to follow (even for people with deep expertise on various retrieval approaches).
RW1: Thank you for your feedback. We value your input and would appreciate it if you could point out specific sections or aspects that you find confusing, so we can address them in detail.
Weakness 2 (1): The proposed semi-parametric approach SIDR_beta does not uniformly outperform BM25 and on average it is worse on BEIR.
RW2(1): Thank you for sharing your concern. In our analysis in Appendix D, we demonstrate that learning plays a critical role in our effectiveness over BM25. The observed underperformance on BEIR can be attributed to the fact that there was no training performed for these datasets. Additionally, most BEIR datasets are labeled using a BM25-like retrieval system, making them inherently challenging to surpass - a concern also noted in Section 6 of BEIR's own paper. Our experiments show that SiDR_beta consistently outperforms BM25 in-domain, with larger improvements observed as the corpus size increases (e.g., Wiki21M has 21 million texts, compared to thousands to millions in BEIR).
Weakness 2 (2): What happens if we take BM25 and re-rank using VDR or SPLADE or other neural bi-encoder.
RW2(2): In Appendix E, we include more late parametric baselines that combine BM25 or SiDR_beta with a more recent SOTA bi-encoder retriever. Our findings indicate that using BM25 as the first-stage retriever, followed by a stronger SOTA retriever (e.g., E5) for re-ranking, can outperform SiDR_beta (m=100). However, SiDR_beta can also serve as the first-stage retriever, consistently achieving better performance than BM25 at the first stage.
Weakness 3 (1): Latency comparison with BM25 is unfair. We are comparing CPU and GPU while claiming that BM25 cannot run on GPU. This is not true. GPUs are worse at sparse inner product compared to dense ones, but they can still do it.
RW3(1): Thank you for sharing the repo. To address your concern, we have conducted experiments using the implementation from the referenced repo. To avoid OOM issues (also noted in this issue: https://github.com/jxmorris12/bm25_pt/issues/5), we had to reduce the corpus size from 21 million to 1 million. Below are the search latencies for 3.6k queries on a 1-million corpus:
| Method | Search Latency |
|---|---|
| BM25 (CPU) | 6s |
| BM25 (GPU) | 2s |
| SiDR_beta (GPU) | 0.03s |
While it is true that BM25 can be executed on a GPU, the improvement in latency is limited. This is because inverted indexes primarily involve memory access and integer operations, which are generally not well-suited for GPUs. We have updated these results in Table 3 and Appendix B accordingly.
Weakness 3 (2): Moreover, the efficiency comparison needs to ensure we use a comparable CPU in terms of the price AND we use all threads. It is not clear to me if this is the case: the listed system uses 4 A100 GPUs which cost about 4x compared to the Intel Xeon Platinum 8358. Are four GPUs used during retrieval?
Question 1: How many threads does the retrieval use? Does retrieval on GPUs use all 4 GPUs or a single one?
RW3(2) & RQ1: Thank you for your suggestion and for introducing us to PISA, which we found to be convenient and efficient. To clarify, in our latency study, we used a single GPU for the GPU setting and a single thread for the CPU setting. We have included this clarification in our revised manuscript (Table 3 and Appendix B).
To better compare CPU and GPU indexing options in terms of cost, we performed indexing using all 64 threads on a CPU. For this comparison, we utilized BM25 from PISA and a rust-based tokenizer for SiDR_beta, both of which provide built-in multi-threading support.
| Method | Device | Indexing Latency |
|---|---|---|
| BM25 | single CPU, 64 thread | 0.1h |
| SiDR_beta | single CPU, 64 thread | 0.1h |
| SiDR_beta | single CPU, 1 thread | 0.5h |
| SiDR_full | single GPU | 23.7h |
Our results show that using all CPU threads significantly reduces indexing latency, validating our claim (lines 445-447) that the BoT index benefits from better parallelism on CPUs. In terms of price, a single A100 GPU costs roughly four times as much as an Intel Xeon Platinum 8358 CPU. Among existing neural IR systems, SiDR_beta with the BoT index can complete the indexing of the Wikipedia corpus at only 0.001 times the price (0.1h on a CPU versus 23.7 hours on a GPU).
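As a rough illustration of why BoT indexing parallelizes well on CPU, a sketch using the Rust-backed HuggingFace tokenizers library, whose `encode_batch` runs across CPU threads, to build a binary sparse index; the library choice and helper are our assumption, not necessarily the authors' implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

def build_bot_index(texts, vocab_size=30522):
    """Binary bag-of-tokens index: one sparse 0/1 row of token indicators per text."""
    encodings = tokenizer.encode_batch(texts)  # tokenization is multi-threaded in Rust
    indptr, indices = [0], []
    for enc in encodings:
        indices.extend(sorted(set(enc.ids)))
        indptr.append(len(indices))
    data = np.ones(len(indices), dtype=np.float32)
    return csr_matrix((data, indices, indptr), shape=(len(texts), vocab_size))

index = build_bot_index(["an example passage", "another passage to index"])
```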
Weakness 3 (3): Another unfairness "point" is that GPU has more limited memory than a CPU system, which further complicates comparison. You cannot see the difference when testing on tiny BEIR datasets, but this can be an issue when much larger datasets are used.
RW3(3): Thank you for sharing the concern with us!
- Regarding dataset size: we have conducted experiments on the Wiki21m corpus, which consists of 21 million text chunks -- a scale significantly larger than those in the BEIR benchmark. To the best of our knowledge, this is the largest corpus used in mainstream IR research.
- Regarding GPU memory usage: we agree with your observation and would like to highlight that the binary nature of our approach offers a distinct advantage compared to other neural options, as shown in Table 5. For 21 million text chunks, the BoT index requires only 10GB of GPU memory, which is manageable for most devices. In contrast, dense embedding indexing methods demand over 30GB, while sparse lexical embeddings require even more.
Weakness 4: The paper uses a seemingly SPLADE-like bi-encoder VDR, but it is not clear why it was chosen over SPLADE.
RW4: Our learning objective is directly modified from VDR. SPLADE excels at handling sparse representation learning with a FLOPs regularizer, while the cross-modal version of VDR uses a bag-of-tokens query to supervise sparse image representation learning from scratch. Our scenario, which involves aligning a query embedding with a bag-of-tokens document representation, aligns more closely with the VDR objective, making it a more suitable choice for our purpose.
Question 2: L146-150 the discussion of in-training retrieval is not understandable without explaining that in-training retrieval is required to mine hard negatives.
RQ2: Thank you for the suggestion! We have revised our manuscript to clarify that in-training retrieval is for mining hard negatives (line 148).
Weakness 2 (1): The proposed semi-parametric approach SIDR_beta does not uniformly outperform BM25 and on average it is worse on BEIR.
RW2(1): Thank you for sharing your concern. In our analysis in Appendix D, we demonstrate that learning plays a critical role in our effectiveness over BM25. The observed underperformance on BEIR can be attributed to the fact that there was no training performed for these datasets. Additionally, most BEIR datasets are labeled using a BM25-like retrieval system, making them inherently challenging to surpass - a concern also noted in Section 6 of BEIR's own paper. Our experiments show that SiDR_beta consistently outperforms BM25 in-domain, with larger improvements observed as the corpus size increases (e.g., Wiki21M has 21 million texts, compared to thousands to millions in BEIR).
Fair point, I will update the review.
Weakness 2 (2): What happens if we take BM25 and re-rank using VDR or SPLADE or other neural bi-encoder.
RW2(2): In Appendix E, we include more late parametric baselines that combine BM25 or SiDR_beta with a more recent SOTA bi-encoder retriever. Our findings indicate that using BM25 as the first-stage retriever, followed by a stronger SOTA retriever (e.g., E5) for re-ranking, can outperform SiDR_beta (m=100). However, SiDR_beta can also serve as the first-stage retriever, consistently achieving better performance than BM25 at the first stage.
Thanks a lot: I did look at the results. To clarify my initial concern: SIDR_full is a two-stage system. However, it looks like SIDR_beta is a stronger retriever than BM25 when SIDR_beta is finetuned in-domain. You can highlight that SiDR_beta + a more recent model OUTPERFORMS BM25 + a more recent model.
Weakness 3 (1): Latency comparison with BM25 is unfair. We are comparing CPU and GPU while claiming that BM25 cannot run on GPU. This is not true. GPUs are worse at sparse inner product compared to dense ones, but they can still do it.
RW3(1): Thank you for sharing the repo. To address your concern, we have conducted experiments using the implementation from the referenced repo. To avoid OOM issues (also noted in this issue: https://github.com/jxmorris12/bm25_pt/issues/5), we had to reduce the corpus size from 21 million to 1 million. Below are the search latencies for 3.6k queries on a 1-million corpus:
....
While it is true that BM25 can be executed on a GPU, the improvement in latency is limited. This is because inverted indexes primarily involve memory access and integer operations, which are generally not well-suited for GPUs. We have updated these results in Table 3 and Appendix B accordingly.
Point taken. I am not sure I agree with your interpretation though. BM25 is a sparse matrix multiplication. These operations aren't super great for a GPU but a GPU can do a decent job still. Anyways, we don't have capacity to figure out if the referenced implementation is suboptimal or not. You made a comparison and it showed your approach to be superior. I also took note that your method requires much less memory since token representations are binary.
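For readers following this exchange: scoring against a precomputed BM25 (or BoT) matrix does reduce to a sparse-dense matrix product, which a GPU can execute; below is a hedged PyTorch sketch with toy shapes, illustrative only and not the implementation benchmarked in the response above.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
num_docs, vocab_size = 1000, 30522

# Assumed precomputed doc-term matrix: nonzero entries hold BM25 term weights.
indices = torch.tensor([[0, 0, 1], [10, 42, 42]])   # (doc_id, token_id) coordinate pairs
values = torch.tensor([1.2, 0.7, 2.1])              # BM25 weights for those pairs
doc_term = torch.sparse_coo_tensor(indices, values, (num_docs, vocab_size)).to(device)

queries = torch.zeros(4, vocab_size, device=device)
queries[0, 42] = 1.0                                 # toy query containing token id 42

scores = torch.sparse.mm(doc_term, queries.T)        # (num_docs, num_queries) scores
top_docs = scores[:, 0].topk(10)                     # top-10 documents for query 0
```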
Weakness 3 (2): Moreover, the efficiency comparison needs to ensure we use a comparable CPU in terms of the price AND we use all threads. It is not clear to me if this is the case: the listed system uses 4 A100 GPUs which cost about 4x compared to the Intel Xeon Platinum 8358. Are four GPUs used during retrieval?
Question 1: How many threads does the retrieval use? Does retrieval on GPUs use all 4 GPUs or a single one?
RW3(2) & RQ1: Thank you for your suggestion and for introducing us to PISA, which we found to be convenient and efficient. To clarify, in our latency study, we used a single GPU for the GPU setting and a single thread for the CPU setting. We have included this clarification in our revised manuscript (Table 3 and Appendix B).
Thanks a lot. A single-thread CPU is a HUGE no-no. So, basically all BM25 latencies should have been divided by 32 or even more (if one used PISA instead of Pyserini/Lucene)? I am still confused about the updated Table 3. It says BM25 takes 40 seconds but 2 minutes overall? This only convinces me that the paper needs a thorough revision. These last-minute updates are not reliable.
To better compare CPU and GPU indexing options in terms of cost, we performed indexing using all 64 threads on a CPU. For this comparison, we utilized BM25 from PISA and a rust-based tokenizer for SiDR_beta, both of which provide built-in multi-threading support. Don't these results show BM25 indexing cost is <= SIDR indexing cost? BTW, a lot of people don't care much about indexing costs as long as they are reasonable.
Addressing Response 3: "BM25 as first stage & latency"
Following W2
Thanks a lot: I did look at the results. To clarify my initial concern. SIDR_full is two-stage system. However, it looks like SIDR_beta is a stronger retriever than BM25 when SIDR_beta is finetuned in-domain.
To clarify:
- SiDR_full is a one-stage system that operates like any neural retriever, using embeddings as an index and performing inner product of embeddings for search.
- SiDR_beta (m=100) is a two-stage system. It uses a BoT index for the first-stage retrieval (allowing for quick-start on unindexed data) and generates embeddings for the top-m passages for re-ranking. These passage embeddings can be cached and reused.
However, it looks like SIDR_beta is a stronger retriever than BM25 when SIDR_beta is finetuned in-domain. You can highlight that SiDR_beta + a more recent model OUTPERFORMS BM25 + a more recent model.
Thanks for the nice suggestion! We have highlighted this in Appendix E.
Following W3
So, basically all BM25 latencies should have been divided by 32 or even more (if one used PISA instead of Pyserini/Lucene)?
In our non-isolated environment, we observed a 6x speedup for BM25 when utilizing all threads, and both BM25 and SiDR_beta were evaluated under similar conditions and within a comparable timeframe. Regarding these results, we would like to respectfully clarify two points:
- For the comparison between BM25 and SiDR_beta: While both BM25 and SiDR_beta use tokenization-based indexing, they are expected to benefit similarly from implementation optimizations. Our primary focus is on presenting the methodology rather than competing on specific implementation differences.
- For the comparison between BM25 (or SiDR_beta) and neural retrievers: We suggest that readers use single-thread latency as a baseline to estimate the potential improvements achievable with multi-threading in their specific setups. As multi-threading performance can vary significantly depending on factors such as hardware configuration and system resource availability, single-thread testing is often used in prior IR research [1,2,3] to provide a consistent and comparable baseline.
I am still confused about the updated Table 3. It says BM25 takes 40 seconds but 2 minutes overall? This only convinces me that the paper needs a thorough revision. These last-minute updates are not reliable.
To clarify a misunderstanding: In Table 3, we updated the BM25 (GPU) search time, which takes 40 seconds. The value in parentheses refers to CPU time, while the value outside the parentheses refers to GPU time. This means that if the BM25 search is conducted on a CPU, it takes 2 minutes, but if conducted on a GPU, it takes 40 seconds.
Latency Comparison Summary
We sincerely appreciate the reviewer’s thoughtful concerns and points. Given the breadth and depth of the comparisons discussed, we believe it is beneficial to provide a clear and well-rounded conclusion to ensure clarity for future readers.
- Through extensive discussion of the comparison, including CPU and GPU costs, thread usage, and library selection, the key conclusion is that a BM25-like tokenization-based index enjoys better scalability on CPUs and lower costs. This perspective aligns with our motivation, as SiDR is a neural retriever that supports a tokenization-based index, allowing it to benefit from advantages similar to BM25's in comparison to other neural retrievers.
- Both SiDR_beta and BM25 utilize tokenization-based indexing with comparable complexity—a level significantly lower than that of neural retrievers. As our method is still in its early stages, we are grateful for the opportunity to benchmark it against widely-used and actively maintained advanced BM25 libraries. While such comparisons are valuable for benchmarking, we hope the broader context of the innovations and learnability introduced by our method will also be taken into account.
Reference
[1] Lin, Sheng-Chieh, and Jimmy Lin. "A dense representation framework for lexical and semantic matching." TOIS, 2023.
[2] Shen, Tao, et al. "Lexmae: Lexicon-bottlenecked pretraining for large-scale retrieval." ICLR, 2023.
[3] Mackenzie, Joel, Andrew Trotman, and Jimmy Lin. "Wacky weights in learned sparse representations and the revenge of score-at-a-time query evaluation." arXiv, 2021.
Regarding datasets size: we have conducted experiments on Wiki21m corpus, which consists of 21 million text chunks -- a scale significantly larger than those in the BEIR benchmark. To the best of our knowledge, this is the largest corpus used in mainstream IR research.
Note taken. However, 21 million text chunks is still a (relatively) small-size collection. GPUs have less memory, often much less memory, compared to a good CPU server. Imagine you shard your 10B document collection to make sure you can search it in full. You may end up needing 10x more GPU machines of comparable cost. At the same time, if you "throw" 10x more CPU machines at the problem you will be able to afford 10x higher throughput. So, suddenly there is a 10x boost for the CPU solution that your current analysis doesn't include.
To clarify: I do not suggest a fully-fledged comparison that would involve sharding and full cost-analysis: I think it's too much for the paper. However, I think you folks need to make a disclaimer (even in the abstract) that latency is compared under the assumption that both BM25 and SIDR indexes fit into a CPU/GPU memory.
And another disclaimer of course would be about using a single-thread CPU. Because realistically a high-end GPU server would have 100 cores (200 with hyperthreading) and even a modest one would have 30-50 full cores.
Our learning objective is directly modified from VDR.
I see that there's a reason, but I still don't fully comprehend it. The paper should be careful about explaining it though.
UPDATED to correct my MLM use mistake and to edit details on the introduction of BoT (but the description is still confusing, see my comment about how it's done in the SPLADE paper).
Weakness 1: The paper is somewhat confusing and hard to follow (even for people with deep expertise on various retrieval approaches).
RW1: Thank you for your feedback. We value your input and would appreciate it if you could point out specific sections or aspects that you find confusing, so we can address them in detail.
There are several examples:
- No explanation for the need of VDR (I still didn't quite get your explanations in the rebuttal).
- The whole section revisiting MLM is very confusing. You say "We provide insights into the consistencies between semi-parametric alignment and masked language model pre-training", but this is already a pretty confusing phrase. To begin with, you probably meant to say "connections". This is a small thing, but a lot of such small things here and there complicate reading. Compare, e.g., how this was introduced in the SPLADE paper: We describe in Section 3.1 how the Masked Language Modeling (MLM) head of Pre-trained Language Models can be used to represent tokens in a sequence as vectors in the vocabulary space.
- Then the whole section "PARAMETRIC AND NON-PARAMETRIC REPRESENTATION" is a bit of a slog to read. You need to explain what you mean by "parametric" vs "non-parametric" vs "semi-parametric" representations. Moreover, when you talk about parametric representations it's IMHO better to talk about *learned* token weights vs. token weights computed using a hand-crafted rule (as in BM25). Don't get me wrong, parametric vs non-parametric isn't wrong per se, but the distinction between parametric and non-parametric ML is a bit blurry. Plus, in the context of learned representations it is not frequently used IMHO.
BoT appears almost out of the blue. I think it was only mentioned in the MLM-section. IMHO, it's better to start by explaining that a document/query can be represented by a bag of words or tokens. In some cases these tokens are actual document/query tokens and in some cases they are "predicted" by the encoder model. Moreover, tokens can be weighted and non-weighted. Weights can come either from a model or from token-counting (as in BM25). This would have been a much less confusing explanation IMHO. Then, you can explain how weights are coming from an MLM head of a BERT model. This would have been much more logical, IMHO.
Thank you for taking the time to provide such a detailed and thoughtful response to strengthen our paper!
Addressing Response 1: "index size etc..."
Note taken. However, 21 million text chunks is still a (relatively) small-size collection. GPUs have less memory, often much less memory compared to a good CPU server.
We agree, and this is indeed an advantage of our method compared to existing neural retrievers. SiDR can utilize tokenization-based indexing, enabling it to leverage the efficiency and scalability of CPUs. This makes it a more practical solution for the mentioned limited-resource scenarios involving large corpora.
To clarify: I do not suggest a fully-fledged comparison that would involve sharding and full cost-analysis: I think it's too much for the paper. However, I think you folks need to make a disclaimer (even in the abstract) that latency is compared under the assumption that both BM25 and SIDR indexes fit into a CPU/GPU memory.
Thank you for the thoughtful clarification! We have clarified this in our latency comparison section (line 431).
And another disclaimer of course would be about using a single-thread CPU. Because realistically a high-end GPU server would have 100 cores (200 with hyperthreading) and even a modest one would have 30-50 full cores.
Point taken. We acknowledge that using a single-thread CPU does not reflect the realistic capabilities of modern servers, and we discuss this further in Response 3.
Addressing Response 2: issues with presentation
Given the tight rebuttal timeline, we have made adjustments to address your concern while avoiding substantial changes at this stage.
No explanation for the need of VDR (I still didn't quite get your explanations in the rebuttal).
Existing learned lexical retrievers (e.g., SparTerm [1], Splade [2], LexMAE [3], VDR [4]) do not vary significantly in architecture but differ primarily in their training processes. Specifically, VDR [4] aligns the non-parametric bag-of-token representation of text with the parametric embedding of images, an objective that significantly differs from other works (e.g., Splade) and has demonstrated effectiveness.
We build on VDR with a direct modification of its learning objective and refer readers to the VDR paper for detailed explanations of its architecture and loss design. Adapting Splade with a VDR-like objective might achieve similar results but could cause confusion about the differences between this approach and VDR.
The whole section of revisiting MLM seems to be unnecessary.
Section 3.1 aims to present the consistency between the MLM objective (Equation 1, line 186) and our semi-parametric alignment objective (Equation 2, line 223). This consistency is crucial for minimizing the optimization gap between pre-training and fine-tuning, as the tuning objective must be carefully designed rather than arbitrarily defined.
You write: "We inherit the VDR encoder (Zhou et al., 2024), which extends the conventional MLMs architecture with three modification.", but is it really an MLM architecture?
To the best of our knowledge, the VDR encoder (Zhou et al., 2024) is based on the MLM bert-base-uncased encoder, as explicitly stated in their paper.
I think the same approach is used in decoder-only LMs and the objective is the next-token-prediction. To reiterate, I don't see a connection and I suggest to not brand it as an MLM architecture.
We understand your point and have revised the statement to loosen the claim, framing decoder-only LLMs as a variant of MLM where the masked token is always the next token in the sequence (lines 159–161).
Then the whole section "PARAMETRIC AND NON-PARAMETRIC REPRESENTATION" is a bit of a slog to read. You need to explain what you mean by "parametric" vs "non-parameteric" vs "semi-parametric" representations. Moreover, when you talk about parametric representations it's IMHO better talk about *learned token weights.
However, in my opinion, it's much better to make a connection to the BoT representations rather than to an MLM objective. You can start by explaining that a document/query can be represented by a bag of words or tokens.
Thank you for the suggestions! We have clarified these concepts more explicitly at the beginning of Section 3.2 (lines 192–195).
Reference
[1] Bai, Yang, et al. "SparTerm: Learning term-based sparse representation for fast text retrieval." arXiv 2020.
[2] Formal, Thibault, Benjamin Piwowarski, and Stéphane Clinchant. "SPLADE: Sparse lexical and expansion model for first stage ranking." SIGIR 2021.
[3] Shen, Tao, et al. "Lexmae: Lexicon-bottlenecked pretraining for large-scale retrieval." ICLR 2023.
[4] Zhou, Jiawei, et al. "Retrieval-based Disentangled Representation Learning with Natural Language Supervision." ICLR 2024.
This paper extends the VDR encoder (which also seems to be under review), aligning masked term prediction and bag of tokens representations. The paper shows that hybrid retrieval outperforms either vector based or token based retrieval alone, and the same parametric representation used in vector based retrieval can be "aligned" with bag of tokens to perform non-parametric retrieval.
Strengths
The paper is well motivated. The problems identified with constructing and refreshing vector based document indices are consistent with prevalent challenges in the industry. While most SOTA retrieval systems, especially those used in the industry, are hybrid, re-using representations across the two forms of retrieval seems novel. There is a good set of baselines, though the most relevant baselines are those that perform hybrid retrieval with vector based and token based approaches. The paper will also benefit from inclusion of more qualitative analysis, especially on how the aligned BoT query representations tackle classical IR problems like polysemy and synonymy.
Weaknesses
One of the strengths of bag-of-token based retrieval is that it is easily interpretable. The paper misses an opportunity to demonstrate how going from a parametric representation (which is considered semantic retrieval) allows us to tackle the standard problems in IR such as polysemy and synonymy. How do the lack of weights (a la BM25) on the document side affect retrieval?
A nit: since the paper uses a hybrid retrieval system, comparison against either vector retrieval alone is not an apples to apples comparison. For example, in Table 1, this system seems to handily outperform BM25. But BM25 doesn't have the benefit of query side synonyms. While the index is non-parametric, the query is not (it's an "aligned version" of the parametric embedding). This doesn't invalidate the results. Rather it provides a channel for further investigation into the kinds of representations that are produced.
Comparison against other hybrid retrieval systems (Gao et al. 2021, Kuzi et al. 2020) would make the paper a lot stronger. (These are ones that come to mind, but there may be more).
Luyu Gao, Zhuyun Dai, Tongfei Chen, Zhen Fan, Benjamin Van Durme, and Jamie Callan. Complement lexical retrieval model with semantic residual embeddings. In European Conference on Information Retrieval, 2021. URL https://api.semanticscholar.org/CorpusID:232423090
Saar Kuzi, Mingyang Zhang, Cheng Li, Michael Bendersky, and Marc Najork. Leveraging semantic and lexical matching to improve the recall of document retrieval systems: A hybrid approach, 2020. URL https://arxiv.org/abs/2010.01195.
Questions
BM25 as a token retrieval baseline has stood the test of time, being competitive for three decades. Why do the authors think SiDR_beta handily outperforms BM25? What can we learn from this? Could this be because it's not an apples-to-apples comparison? Are there other underlying factors?
Why did the authors not consider hybrid retrieval systems a more natural baseline?
We thank the reviewer for recognizing the motivation of our work and the relevance of the challenges we address. We would like to address the concerns and questions below:
Weakness 1: The paper misses an opportunity to demonstrate how going from a parametric representation (which is considered semantic retrieval) allows us to tackle the standard problems in IR such as polysemy and synonymy. How do the lack of weights (a la BM25) on the document side affect retrieval?
RW1: This is an interesting point! The advantage of semantic (learned lexical) retrieval over BM25 lies in its ability to learn term weights and perform term expansion. In Appendix D, we conducted a comprehensive ablation study on the impact of term expansion and learned terms on both the query and index sides in our method, with the results summarized below.
| Model | Query Expand | Query Weight | Document Expand | Document Weight | Accuracy |
|---|---|---|---|---|---|
| BM25 (bert-base-uncased) | ✗ | ✓ | ✗ | ✓ | 21.9 |
| (Below) Ablation of SiDR_full on doc side | |||||
| SiDR_full | ✓ | ✓ | ✓ | ✓ | 49.1 |
| SiDR_full (w/o weight at doc) | ✓ | ✓ | ✓ | ✗ | 33.1 |
| SiDR_full (w/o expand at doc) | ✓ | ✓ | ✗ | ✓ | 38.9 |
| SiDR_full (w/o expand at doc, training) | ✓ | ✓ | ✗ | ✓ | 43.1 |
| SiDR_beta | ✓ | ✓ | ✗ | ✗ | 39.8 |
| (Below) Ablation of SiDR_beta on query side | |||||
| SiDR_beta (w/o weight at query) | ✓ | ✗ | ✗ | ✗ | 14.5 |
| SiDR_beta (w/o expand at query) | ✗ | ✓ | ✗ | ✗ | 34.3 |
| BoT overlap | ✗ | ✗ | ✗ | ✗ | 14.2 |
- When our index goes from parametric to bag-of-tokens (BoT), both term weighting and expansion play crucial roles on the document side.
- Interestingly, removing either component individually results in worse performance compared to removing both (i.e., SiDR_beta). This occurs because our training objective is designed to align query embeddings specifically with the BoT index. However, when training is adjusted to accommodate these variations, they can outperform SiDR_beta, highlighting the importance of consistency between training and the downstream task.
- When using a BoT index, we demonstrate that query term weighting is more important than term expansion. This indicates that our method learns more meaningful query term weights for tasks compared to BM25.
Weakness 2 (1): A nit: since the paper uses a hybrid retrieval system, comparison against either vector retrieval alone is not an apples to apples comparison. Comparison against other hybrid retrieval systems (Gao et al 2021, Kuzi et Al 2020) would make the paper a lot stronger.
RW2(1): We would like to clarify the difference between late parametric and hybrid retrieval. Late parametric methods are two-stage approaches that utilize a first-stage retriever with a non-parametric index (either BM25-like or SiDR_beta), followed by a second-stage bi-encoder (not cross-encoder) to rerank and cache the document embeddings.
In Appendix E, we provide further discussion and include more stronger late-parametric baselines. Our findings indicate that using BM25 as the first-stage retriever, followed by a stronger SOTA retriever for re-ranking, can outperform SiDR_beta (m=100). However, SiDR_beta can also serve as the first-stage retriever, consistently achieving better performance, albeit with an additional cost for embedding the query.
Weakness 2 (2): For example, in Table 1, this system seems to handily outperform BM25. But BM25 doesn't have the benefit of query side synonyms. While the index is non-parametric, the query is not (it's an "aligned version" of the parametric embedding).
RW2(2): Thank you for sharing this observation. Cost-effectiveness is indeed a critical factor when evaluating systems. Although SiDR_beta requires query embedding, its costs are significantly lower than those of existing neural retrievers. For instance, searching on NQ test splits, SiDR_beta requires only 0.01% of the embedding effort to achieve a 17.1% improvement over BM25, whereas existing neural retrievers require 100% of the embedding effort to achieve improvements of 26.4% to 35.2%. More details could be found in Appendix E.
Question 1 (1): BM25 as a token retrieval baseline has stood the test of time, being competitive for three decades. Why do the authors think SiDR_beta handily outperform BM25? What can we learn from this?
RQ1(1): In Appendix D, we demonstrate that the significant improvement over BM25 is primarily due to the learned query term weights. We attribute SiDR_beta's superior performance to the great learning capabilities of neural models and the foundational pre-trained MLMs, which allow for more effective contextualized query term weighting.
Furthermore, we would like to emphasize that with the growing use of LLM and RAG to handle more complex tasks, BM25's relevance may fall short. Our key advantage over BM25 lies in the ability to learn complex relevance, without the prohibitive costs of neural IR methods. Our goal is to support emerging scenarios where BM25 struggles and existing neural IR methods are too costly to deploy, rather than entirely replace BM25.
Question 1 (2): Could this be because it's not an apples to apples comparison? Are there other underlying factors?
RQ1(2): We understand your concern. To ensure fairness, in Appendix D, we conduct ablation studies to evaluate the effects of term weighting and expansion on both the query and document sides. Given that these term weights require neural embeddings, we further demonstrate in Appendix E that SiDR_beta achieves high cost-effectiveness compared to existing neural retrievers, requiring only 0.01% of the typical embedding effort. We hope this addresses your concern.
This paper's main contribution is a retrieval model where documents are represented by the presence (boolean) of tokens (i.e. a binary bag of tokens). The advantage of such an approach is that the index has to be computed just once (it does not depend on parameters). However, since the precision of such a model is low, the authors show that it can be successfully trained with a dense model (used as a second-stage ranker) - and combined with it at inference time. Experiments are conducted on the Wiki21m and BEIR benchmarks and retrieval latency is evaluated - showing that this approach offers a good compromise between effectiveness and efficiency.
Strengths
The main strength of the paper lies in the static representation of documents. As far as I know, this is the first model that relies on a simple index structure, that of representing the documents as a binary bag-of-tokens. This allows for potentially fast retrieval (although this can be debated, see weaknesses) of potential candidates that have then to be re-ranked.
Weaknesses
One of the main weaknesses is related to how the number of collisions (i.e. the number of documents for one token) increases with the size of the collection. It is unclear how this approach performs when the number of documents, their length, or both, increase. It would thus be important to investigate using the model on a larger collection (e.g. MS-MARCO).
Another point is that the authors state (l. 329) that SOA training "techniques are orthogonal to the retrieval model and have not been applied in our works.". It is not clear if this is truly the case here, and no experiments have been conducted to check the potential of e.g. simple techniques like knowledge distillation.
Other things:
- The authors propose to use the elu1p (p. 4) - it is not clear why softplus has not been used, the function is quite close to this and more "standard".
- The negative mining proposed for the Wikipedia collection is very specific to this collection (l. 292-293). In my opinion, this invalidates the results reported on the Wikipedia collection (table 1 p. 7) and would justify the use of the dismissed SOA training techniques.
Questions
- What is the purpose of section 3.1? I guess the argument is around the relationship between MLM and vocabulary expansion - see e.g. https://dl.acm.org/doi/10.1145/3634912
- Unless I miss something, there is no justification of the second part of L_semi−para, i.e. L(VBoT(q),Vθ(p)), since it is never used. And no ablation shows the importance of this factor.
- What is VDR_β in table 1? I could not find this model in the VDR paper.
- I do not agree on the difference in efficiency (lines 499-506) between your model and BM25. While it is true that leveraging GPU is useful in your case, BM25 could actually be implemented on GPU, so the reported speedup does not mean much.
Question 1: What is the purpose of section 3.1? I guess the argument is around the relationship between MLM and vocabulary expansion.
RQ1: In addition to discussing the relationship between MLM and vocabulary expansion, Section 3.1 lays the foundation for later sections that explore the consistency between our semi-parametric alignment (lines 218-219) and the upstream MLM pre-training (lines 183-184). We believe that minimizing the gap between the MLM pre-training objectives and the downstream semi-parametric alignment objective is crucial for tuning.
Question 2: Unless I miss something, there is no justification of the second part of L_semi−para, i.e. L(VBoT(q),Vθ(p)) since it is never used. And no ablation shows the importance of this factor.
RQ2: Thank you for pointing this out! We provide experiments on removing the second term of the semi-parametric loss L_semi−para, and present the top-1 retrieval accuracy results below:
| Model | NQ | TQA | WebQ |
|---|---|---|---|
| SiDR_full | 49.1 | 56.2 | 40.2 |
| SiDR_full (w/o second term) | 48.7 | 56.0 | 38.7 |
| SiDR_beta | 39.8 | 50.4 | 32.1 |
| SiDR_beta (w/o second term) | 37.1 | 46.8 | 30.4 |
The results indicate that removing the second term results in an average performance drop of 0.7 for the full parametric search and 2.6 for the beta search. This aligns with the findings in Appendix I of [2], where the authors find a beneficial impact of this term on text retrieval.
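For readers, a hedged sketch of how such a two-term objective can be written, assuming an in-batch contrastive loss; the exact formulation is given in the paper and the VDR reference [2], so treat this purely as an illustration.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.05):
    """In-batch contrastive loss: row i of `a` should match row i of `b`."""
    logits = a @ b.T / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)

def semi_parametric_loss(q_emb, p_emb, q_bot, p_bot):
    """First term aligns the learned query embedding with the binary BoT passage;
    the second term (the one ablated above) aligns the BoT query with the learned
    passage embedding."""
    return info_nce(q_emb, p_bot) + info_nce(q_bot, p_emb)
```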
Question 3: What is VDR_β in table 1? I could not find this model in the VDR paper.
RQ3: Apologies for the confusion. VDR_beta refers to directly using the VDR text-to-text model to perform beta search (i.e., perform search on a bag-of-tokens index). We have revised the manuscript to clarify this point (line 323).
Question 4: I do not agree on the difference in efficiency (lines 499-506) between your model and BM25. While it is true that leveraging GPU is useful in your case, BM25 could actually be implemented on GPU, so the reported speedup does not mean much.
RQ4: To address your concern, we conducted experiments using an open-source project (https://github.com/jxmorris12/bm25_pt/tree/main/bm25_pt) that supports BM25 search on GPUs. To enable BM25 GPU search in memory, we had to reduce the corpus size from 21 million to 1 million. Below are the latency results for 3.6k queries on a 1-million corpus:
| Method | Search Latency |
|---|---|
| BM25 (CPU) | 6s |
| BM25 (GPU) | 2s |
| SiDR_beta (GPU) | 0.03s |
While BM25 can be executed on a GPU, the latency improvement is limited. This is because inverted indexes primarily rely on memory access and integer operations, which are generally not well-suited for GPUs. We have updated these results in Table 3 and Appendix B accordingly.
To better assess our comparison with BM25, we have made the following efforts:
- We have included the BM25 GPU setting for comparison in Table 3, with additional details provided in Appendix B.
- In Appendix D, we conduct an ablation study on term weighting and term expansion, showing that the major improvement of SiDR_beta over BM25 comes from the learned query term weights when using a BoT index.
- In Appendix E, we compare different retrieval methods (semi-parametric vs. fully parametric) to demonstrate SiDR_beta's cost-effectiveness, requiring only 0.01% of the embedding effort compared to existing neural IR.
Reference
[1] Xiong, Lee, et al. "Approximate nearest neighbor negative contrastive learning for dense text retrieval.", ICLR, 2021
[2] Zhou, Jiawei, et al. "Retrieval-based Disentangled Representation Learning with Natural Language Supervision." ICLR, 2024.
Thanks for all the answers to my questions and sorry for having overlooked table 2.
I still find that the question of the number of tokens (which is related to the length of the text) is quite important and explains the weaker performance on MS-MARCO (compared to other baselines). I still think your approach is quite interesting, but feel that the ablation experiments (490-497) are not enough on this side – and it would have been enlightening to have more information on it.
Overall, the other answers clarified my doubts or misunderstandings, and I thus raise my overall rating.
We sincerely thank Reviewer PzGy for their insightful feedback and for raising the overall rating of our work. The reviewer’s suggestions regarding text length (number of tokens per text) and the corresponding ablation experiments are highly valuable, and we will carefully investigate these aspects to further improve our study moving forward.
We thank the reviewer for their meticulous review and constructive feedback! We would like to address the concerns and questions below:
Weakness 1: It is unclear how this approach performs when the number of documents, their length, or both, increase. It would be important thus to investigate using the model on a larger collection (e.g. MS-Marco).
RW1: Thank you for sharing these concerns with us.
- Regarding the number of documents: Our benchmark already includes the MS-MARCO collection (Table 2) and a larger Wiki21m collection (Table 1), which consists of 21 million text chunks, compared to MS-MARCO’s 8.8 million. To the best of our knowledge, this is the largest corpus used in mainstream IR research.
- Regarding text length: In our ablation study (lines 490-497), we experimented with a reconstructed version of the Wikipedia collection, using text chunks of varying lengths while controlling the total number of texts. The results show a performance drop from 39.8% to 37.5% when the downstream text length is inconsistent with the training collection. However, despite this drop, using the BoT index still outperforms the BM25 baseline, which achieves 22.7%.
Weakness 2: Another point is that the authors state (l. 329) that SOA training "techniques are orthogonal to the retrieval model and have not been applied in our works.". It is not clear if this is truly the case here, and no experiments have been conducted to check the potential of e.g. simple techniques like knowledge distillation.
RW2: We agree that the original claim might have been an overstatement and have weakened our claim in the updated version (lines 329-332). We categorize them as advanced baselines due to their significantly higher training costs associated with the additional techniques. Future work may explore their integration with our model to assess if there are potential benefits.
Weakness 3: The authors propose to use the elu1p (p. 4) - it is not clear why softplus has not been used, the function is quite close to this and more "standard".
RW3: While both functions map input values to (0, +inf), softplus is more prone to vanishing gradients, particularly for large negative values, as its gradient decreases towards zero. In contrast, elu1p maintains a non-zero gradient across the entire input range, which contributes to a more stable gradient flow during training. We conducted experiments substituting elu1p with softplus, and observed that this change leads to an average performance drop of 0.8% on the Wiki21m benchmark.
| Model | NQ | TQA | WebQ |
|---|---|---|---|
| SiDR (elu1p) | 49.1 | 56.2 | 40.2 |
| SiDR (softplus) | 48.2 | 55.1 | 39.7 |
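For clarity, a small sketch contrasting the two activations, assuming elu1p(x) = ELU(x) + 1 (a shifted ELU mapping inputs to (0, +inf)), which is how we read the function here; illustrative only.

```python
import torch
import torch.nn.functional as F

def elu1p(x):
    """Shifted ELU: elu1p(x) = ELU(x) + 1, mapping inputs to (0, +inf)."""
    return F.elu(x) + 1.0

x = torch.linspace(-6.0, 3.0, steps=10)
y_elu1p = elu1p(x)          # slope 1 for x > 0, slope exp(x) for x <= 0
y_softplus = F.softplus(x)  # log(1 + exp(x)); its slope is sigmoid(x)
```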
Weakness 4: The negative mining proposed for the Wikipedia collection is very specific to this collection (l. 292-293). In my opinion, this invalidates the results reported on the Wikipedia collection (table 1 p. 7) and would justify the use of the dismissed SOA training techniques.
RW4: We would like to justify our use of in-training retrieval based on training costs. SiDR is designed to support searching on a fixed index, making in-training retrieval inexpensive, adding at most one additional hour of training, as noted in Section 4.3. In contrast, other SOTA techniques incur significantly higher training costs. For example, ANCE [1] requires index rebuilding every 10k batches, adding 32 hours (8 hours per re-build * 4 times) of indexing with our computational resources, while E5 needs pre-training on millions of weak-positive pairs, which takes a week—both are significant compared to our total training time of just 9 hours.
We conducted additional experiments using SiDR, where we built the embedding index once to retrieve negatives, and continued training with these negatives for another 80 epochs, resulting in a total training time of 25 hours. Below are the results on the NQ dataset, with our costs including the full training and indexing periods.
| Model | Top1 | Top5 | Top20 | Cost |
|---|---|---|---|---|
| ANCE | - | 70.7 | 81.4 | 32h (for indexing only) |
| SiDR_full | 49.1 | 69.3 | 80.7 | 9h |
| SiDR_full (index once) | 53.7 | 72.2 | 86.0 | 25h |
These results demonstrate that our method, with added training complexity, achieves better performance than ANCE. Notably, our entire training process requires less time than ANCE's re-indexing alone.
We thank all the reviewers for their thoughtful and constructive feedback! We are pleased they found the work is well-presented, the problem we address is important, and the methods we proposed are well motivated.
Scope and Contribution
In this work, we present SiDR, a semi-parametric bi-encoder retriever that supports both an embedding-based index, as in neural retrievers, and a tokenization-based index, similar to BM25.
Based on reviewer feedback, we highlight two key features of our work to better outline its scope and contributions:
- SiDR supports tokenization as an index, with indexing complexity comparable to BM25 and the ability to learn. As RAG gains broader adoption, retrieval systems are applied to more complex tasks where BM25 falls short in capturing nuanced relevance and neural retrievers can be costly to deploy. SiDR offers a cost-efficient and learnable alternative for such scenarios.
- SiDR enables search on a fixed index within its training loop, addressing a common challenge of co-training retrievers with LLMs. While training retrieval-based LLMs in an end-to-end manner offers benefits, it requires costly and frequent re-indexing of datastores during training. SiDR addresses this issue by naturally supporting a bag-of-tokens index that remains fixed throughout training (see the sketch below).
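A hedged sketch of what mining hard negatives against such a fixed index can look like during training, assuming the binary BoT index and query embeddings from the earlier sketches; the helper is illustrative, not the released code.

```python
import numpy as np

def mine_hard_negatives(q_embs, bot_index, positive_ids, k=32):
    """Search the fixed binary BoT index with the current in-training query embeddings
    and keep top-scoring passages that are not the labeled positives as hard negatives.
    The index is built once by tokenization and never re-encoded during training."""
    scores = bot_index @ q_embs.T                   # (num_docs, batch) BoT relevance
    negatives = []
    for j, pos in enumerate(positive_ids):
        ranked = np.argsort(-scores[:, j])
        negatives.append([int(d) for d in ranked[: k + 1] if d != pos][:k])
    return negatives
```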
Major Updates
Here, we briefly summarize the major updates made to the manuscript in response to reviewer feedback.
- Appendix B: we include a BM25 baseline with search performed on GPU.
- Appendix D: we present an ablation study on term weighting and term expansion on query and document side.
- Appendix E: we present more late-parametric baselines, with cost-effectiveness analysis.
Common Questions
We address some common questions below and will provide more detailed responses in the individual responses.
- Further comparison to BM25 (in terms of success factors, cost-effectiveness, GPU/multi-thread utilization).
Given that our method uses a simple representation (i.e., bag-of-tokens) for the index, Reviewer PzGy and Reviewer 14wh have raised thoughtful concerns regarding comparisons with BM25, while Reviewer MXa4 and Reviewer YeDM have expressed interest in understanding the factors behind its effectiveness. To better address these concerns:
- In Appendix D, we present an ablation study on term weighting and term expansion in our method, showing that the learned query terms are the primary driver of improvement over BM25 when using a BoT index. The study also emphasizes the critical importance of training-inference consistency.
- In Appendix E, we present a cost-effectiveness analysis of our method compared to existing neural IR frameworks. While using a BoT index requires additional cost for embedding queries, this embedding effort is minimal—approximately 0.01% of what is required by existing neural systems.
- We include further comparisons with BM25, focusing on GPU utilization, multi-thread utilization, and different sources of implementation, to address each concern individually in the responses.
- Comparison of Training Techniques from a Training Complexity Perspective.
Reviewer PzGy raised the point that our in-training retrieval techniques justify the use of other SOTA training techniques (referred to as advanced baselines in our paper), and Reviewer YeDM noted our underperformance compared to ANCE, which uses the same in-training retrieval for negatives. We would like to justify these points from a training complexity perspective.
- We provide a comparison of our in-training retrieval with other SOTA techniques in terms of training cost. SiDR supports searching on a fixed index, making in-training retrieval inexpensive, adding at most one hour of training as noted in Section 4.3. In contrast, based on our estimates, ANCE's index rebuilding every 10k batches adds 32 hours with our computational resources, and E5's pre-training on millions of weak-positive pairs takes weeks—both are significant compared to our current training period of just 9 hours.
- We compare SiDR to ANCE in terms of effectiveness at a comparable training cost. We rebuilt SiDR_full parametric index once to mine hard negatives for continual training, and the results show that, with a similar investment in training, SiDR achieves better performance than ANCE.
- Additional Late-parametric Baselines.
Reviewers MXa4 and YeDM suggested incorporating additional baselines to strengthen the paper, which we fully agree with. In Appendix E, we provide additional baselines of late-parametric methods, along with a cost-effectiveness analysis. Our findings indicate that using BM25 as the first-stage retriever, followed by a stronger SOTA re-ranker (e.g., E5), can outperform SiDR_beta (m=100). However, SiDR_beta offers a distinct advantage by consistently achieving better in-domain performance as the first-stage retriever compared to BM25.
Hi, thanks a lot for the thoughtful answer. In my experience, at least with Lucene, the BM25 retrieval speed scales linearly with the number of cores as long as the index is "warmed up" (i.e., memory cached) or the index is fully loaded into memory. But it's ok to present results for using a single CPU thread as long as it's clear and proper disclaimers are made.
We sincerely thank Reviewer 14wh for their detailed review and thoughtful feedback. The reviewer’s insights on scalability and device considerations have provided valuable guidance for shaping our future implementations. We deeply appreciate their constructive input and support.
This paper introduces a new, efficient retrieval method, namely “semi-parametric disentangled retrieval framework (SiDR)”. The key idea is to have two encoders: (1) a VDR-based neural encoder (called parametric), and (2) a simple bag-of-words representation (called nonparametric). The neural encoder is trained in a way that these two encoders would align to each other. At test time, retrieval is done by using parametric encoder for the query and nonparametric encoder for the passage, optionally with reranking using parametric encoder for both. This brings benefits of both worlds, with indexing as efficient as nonparametric retrieval like BM25 and expressivity that is comparable to neural retrieval. Results on open-domain QA datasets and BEIR show the effectiveness of the method.
All reviewers acknowledged that this is the first model where the index building is as efficient as BM25 while the expressivity and performance are comparable to dense retrieval, and the benefit of this model is clear given the widely known importance of constructing and refreshing retrieval indexes.
Minor concerns were raised on the writing (e.g., method description and the section revisiting MLMs) and justification of certain modeling choices such as the use of elu1p for the parametric encoder.
Additional Comments from the Reviewer Discussion
Additional experiments suggested by reviewers were added to the paper.
Accept (Poster)