EHI: End-to-end learning of Hierarchical Index for Efficient Dense Retrieval
This work proposes a novel end-to-end method called EHI that learns both the embeddings and the approximate nearest neighbor search structure using provided training data.
Abstract
Reviews and Discussion
This paper tackles an important problem: the end-to-end optimization of (first-stage) neural information retrieval (IR) systems that takes the indexing part into account, i.e. that optimizes for search directly. The approach consists in associating with a query or a document both a dense representation and (for the indexing part) an identifier sequence. Each identifier sequence corresponds to a subset of documents, hence allowing efficient and effective search.
A specific loss is designed, based on two contrastive losses (one for identifier sequences and one for dense representations), as well as an "intra-leaf similarity" term that tries to spread out documents within the possible sequences. The authors also benefit from the tree of possible identifier sequences to provide strong negatives.
Experiments are conducted on three datasets (SciFact, FIQA, and MS Marco), showing improvements over a set of baselines including ColBERT, ANCE, Tas-B and dense models (with an HNSW index).
Strengths
The proposed end-to-end loss is interesting (even though it misses the inherent probabilistic nature of sequence generation), and results are promising. The negative sampling procedure is also of interest for training dense neural IR models.
Weaknesses
Related works and baselines
The main weakness of the paper is missing related works (both in the discussion and in the reported results).
First, there is a work very closely related that is based on product quantization:
- Zhan et al. (ACM WSDM 2022) Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval. https://doi.org/10.1145/3488560.3498443
There are also missing references to (more efficient) sparse neural IR models (which are also end-to-end):
- Lassance et al. (SIGIR 2021). Composite Code Sparse Autoencoders for First Stage Retrieval. https://doi.org/10.1145/3404835.3463066
- Formal et al. (SIGIR 2021). From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective. https://doi.org/10.1145/3477495.3531857
The authors compare to ColBERT but there was since then ColBERTv2 that performs much better.
Altogether, the paper misses strongly related approaches, and does not compare to state-of-the-art results for first-stage rankers (either dense or sparse).
Document structure
The document is not well structured - many figures and tables cited in the main text are in the appendix; the first 8 pages should be self-contained, which is not the case.
Indexing length
There is no discussion about an important aspect: the generated identifier sequence is inherently a probabilistic process. However, this is not taken into account in the loss formulation.
Negative sampling
The authors propose to leverage their indexing tree to sample better negatives, which is one of the advantages of their model. However, this aspect is only mentioned in passing, and the effect of the sampling is not studied. This is both a problem (since this might be the main explanation for the better performance: the authors should try to index using HNSW to measure this effect) and an opportunity (this paper could focus on this aspect, and separate the better learning strategy from the indexing/end-to-end problem).
Questions
- Please provide comparisons with state-of-the-art approaches
- The ablation study (for negative sampling) should be given
- Please make the paper self-contained. Appendices should only be used to provide more details about specific aspects
The authors compare to ColBERT but there was since then ColBERTv2 that performs much better.
- Ideally, we would like to compare against representation-based similarity models that output single vectors. These are the models where ANNS can be readily used, and EHI helps alleviate the limitations arising from this disjoint training paradigm. ColBERT, being a late-interaction similarity model, does not output a single vector representation and was only used as a standalone baseline. Note that it is not readily possible to add an ANNS structure on top of the ColBERT architecture. ColBERT-v2 follows a similar paradigm, and has a 6x storage cost as it stores one vector per token instead of one vector per passage [1]. We would like to stress that EHI (distilbert) can achieve performance similar to ColBERT-v2, within a 2% drop in MRR, with a model half the size, and achieves this with more than an 80% reduction in latency and no additional storage costs beyond the ANNS.
[1] Wang, Liang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. "Simlm: Pre-training with representation bottleneck for dense passage retrieval." arXiv preprint arXiv:2207.02578 (2022).
Altogether, the paper misses strongly related approaches, and does not compare to state-of-the-art results for first-stage rankers (either dense or sparse).
- We have added many SOTA baselines including ColBERT, SGPT, cpt-text, ANCE, and DyNNIBAL, and compare against these approaches in Appendix D. We have also compared against other SOTA results such as DSI and NCI in this regime, which are our closest related works, and have compared against their baselines as well. Some of the works you mention here, which focus on sparse representations and remove the use of ANNS, are tangential to our work, as further elaborated in our general comment. If the reviewer believes we have missed any other strongly related approaches, please do let us know.
There is no discussion about an important aspect: the generated identifier sequence is inherently a probabilistic process. However, this is not taken into account in the loss formulation.
- We respectfully disagree with this. Both the indexing similarity and intra-leaf similarity losses focus entirely on this identifier sequence. We define them formally in mathematical formulations, define the path embedding, express how each loss component affects training in the ablations, and further expand on this "path embedding"/"indexer sequence" in Appendix F.
The hard -ve sampling aspect is only mentioned in passing, and the effect of the sampling is not studied.
- We do discuss this in the ablations (see Figure 3c). Note that even without the use of negative samples, EHI outperforms the exact-search baseline on the R@100 metric by ~2%, which helps bolster the idea that training both the encoder and indexer in an end-to-end fashion is indeed helpful. We further expand on this negative sampling mechanism in Appendix F.1.
We hope that the rebuttal clarifies the questions raised by the reviewer. We would be very happy to discuss any further questions about the work, and would really appreciate an appropriate increase in the score if the reviewer’s concerns are adequately addressed to facilitate acceptance of the paper.
Thanks for your answers, which clarified the use of strong negatives – I had overlooked Figure 3c – and the probabilistic formulation – captured in Eq. (4), which should be clarified. What is the loss in that case?
I however can't entirely agree with several points, mostly related to the comparison with other works, which are more competing approaches than complementary or orthogonal ones; as such, the comparison in the paper should be made more explicit in terms of the efficiency/effectiveness tradeoff (since this is one of the interests of your paper).
Comparison with state-of-the-art retrievers
I disagree with your justification for ColBERT – if you add ColBERT to the table, why not use ColBERTv2, which performs better (there is no architectural change; ColBERTv2 employs a better training procedure with stronger negatives, as in your work)?
For SPLADE, here again, I find your justification a bit too hand-wavy – they are not orthogonal but competing approaches, and as such they should be part of your baselines. In both works, you can vary the threshold (e.g. see A Static Pruning Study on Sparse Neural Retrievers, https://dl.acm.org/doi/abs/10.1145/3539618.3591941). Also, the fact that baselines are missing is more a matter of completeness in many cases, as your approach is indeed competitive (but this does not prevent you from reporting and discussing related works).
Relationship with RepCONC
Here again, the works are not complementary but rather competing. RepCONC indeed tries to build a better ANN index, but in a different way than you do, i.e. using product quantization (which is learned alongside the model).
We would like to thank the reviewer for their follow-up comments, and are glad some of their concerns have been resolved. Please refer below for further clarifications on their respective questions.
the probabilistic formulation – captured in Eq. (4) that should be clarified. What is the loss in that case?
- Please refer to Eq. (2) for the definition of the base loss, which is a triplet loss. We tried to place this base function, used across all three loss terms, at the very beginning of Section 3.5, and are open to suggestions if the reviewer finds the formulation insufficiently highlighted.
Comparisons against ColBERT, and RepCONC
- As suggested, we have added comparisons against RepCONC, ColBERT-v2, and CCSA in the updated draft. We have also polished our draft to include these works in the related works section for completeness. We would like to again stress two points:
- RepCONC uses a disjoint ANNS in training, as highlighted in their Section 3.5. This algorithm would face similar limitations, such as misalignment and query distribution mismatch, as highlighted in our Introduction. Thus, we believed that RepCONC was orthogonal to EHI. However, as suggested by the reviewer, we have compared against RepCONC and report our results in Table 1 and Table 2.
- We have also added ColBERT-v2 in Table 1 and Table 2, but would like to stress that one would need to design a non-trivial ANNS to work with ColBERT-v2. Furthermore, unlike most of the baselines compared against in our work, ColBERT-v2 uses distillation from a cross-encoder and also uses re-ranking, whereas the other baselines compared in this work, along with EHI and other closely related works such as DSI [1] and NCI [2], do not.
[1] Yi Tay et al. Transformer memory as a differentiable search index. [2] Yujing Wang et al. A neural corpus indexer for document retrieval.
Comparisons against SPLADE
- We would like to thank the reviewer for pointing out works such as Static pruning on sparse neural retrievers, and acknowledge that it is possible for sparse IR models such as a pre-trained SPLADE model to operate at different compute budgets through this method (albeit trained disjointly leading to models similar to DE + ANNS). We have added comparisons against SPLADE in Table 1 and Table 2, and have acknowledged these works in the main text as well in the updated draft for completeness.
We hope that the rebuttal clarifies the questions raised by the reviewer and finds that our comparisons against other works such as RepCONC, ColBERT-v2, CCSA, and SPLADE, as many of these algorithms also work on the efficiency paradigm of retrieval, help place EHI in this broader literature. We would be delighted to discuss any further questions about the work. We would appreciate an appropriate increase in the score if the reviewer’s concerns are adequately addressed to facilitate acceptance of the paper.
We thank the reviewer for the kind comments and questions. We are happy to hear that the reviewer finds the overall loss, and idea interesting. However, we feel that some clarifications are necessary. Please refer below for the responses to specific questions.
General Clarification
We believe there is some mismatch between the core essence of our work, which the reviewer has correctly captured in their summary, and some of the follow-up questions and weaknesses. The reviewer is correct in their summary that we attempt to learn both the encoder and the ANNS structure in an end-to-end training paradigm. However, most of the weaknesses and questions are orthogonal, i.e., most works cited utilize product quantization (but still train a disjoint ANNS model), or are sparse neural IR models which aim to reduce FLOPS by creating sparser embeddings. The former is complementary since it does not have an e2e pipeline, and a disjoint ANNS is still being trained. The latter essentially tries to remove the ANNS completely and still achieve lower FLOPS due to the sparse latent representations. For these works, the model uses a regularizer to reduce FLOPS, and this leads to a model which is trained to use fewer FLOPS. If a practitioner wishes to reduce the FLOPS further than this, they would need to train another model from scratch. In contrast, EHI has elasticity: once trained, one can simply adjust the beam-size hyperparameter at inference to control the compute budget. Thus, these two lines of work (EHI and sparse neural IR), although both addressing the efficiency aspect, achieve this in completely different ways, where one trains the ANNS end-to-end and the other removes the ANNS altogether.
work very closely related that is based on product quantization
- We believe that this work [1] is complementary to EHI for the following reason. Our work is end-to-end learning of both the embedding space and the indexer (the encoder as well as the ANNS). The indexer itself is the ANNS structure, and we do not need another disjoint clustering approach. RepCONC, as introduced in the paper, quantizes document embeddings and uses a disjoint k-means step to generate n clusters, as explicitly stated in Section 3.5 of their paper: "Besides PQ, RepCONC employs the inverted file system (IVF) to accelerate vector search. After quantizing document embeddings, RepCONC uses k-means to generate 𝑛 clusters. Each document embedding belongs to the nearest cluster and is stored in the corresponding inverted list." However, we would also like to point out that RepCONC achieves MRR@10 values on the passage and TREC-DL tasks of 34.0% and 66.8%, which are significantly lower than EHI's. We have also added these numbers to our comparisons in the Appendix.
[1] Zhan et al. (ACM WSDM 2022) Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval.
There are also missing references to (more efficient) sparse neural IR models (which are also end-to-end)
- We believe there has been some misunderstanding about the term "end-to-end" as used in this paper (please refer to the general comment). For instance, [1], which introduces CCSA, explicitly mentions in the very first line of their paper that: "We propose a Composite Code Sparse Autoencoder (CCSA) approach for Approximate Nearest Neighbor (ANN) search of document representations based on Siamese-BERT models." As CCSA directly uses the document representations from the Siamese-BERT model, it is not end-to-end. It is also unclear why the reviewer mentions "more efficient" here, when the performance of CCSA as reported in Table 2 of their paper on the MSMARCO dev and TREC-DL benchmarks is 28.9% (MRR@10) and 58.3% (nDCG@10), which is significantly worse in comparison to EHI and the other benchmarks presented in the Appendix.
- Secondly, [2] aims to learn sparse representations to facilitate fewer FLOPS. We believe this to also be orthogonal, as each SPLADE++ model corresponds to one specific FLOPS usage, and is not as malleable as EHI, where the practitioner has the option to improve performance by increasing the budget. Again, it is not clear how the reviewer compares efficiency between the two models, where one – EHI – learns an encoder and ANNS in an end-to-end fashion, whereas the other – SPLADE++ – removes this ANNS structure and replaces it with sparse representations for learning. Furthermore, it is not clear how the KMR curve of SPLADE++ looks in comparison to EHI's. For instance, they use the metric of FLOPS, but what would the FLOPS be if one used the dense representation for retrieval?
[1] Lassance et al. (SIGIR 2021). Composite Code Sparse Autoencoders for First Stage Retrieval. [2] Formal et al. (SIGIR 2021). From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective.
We thank the reviewer for their prompt response.
The loss function – could you then justify why you use a max-margin loss in that case? Did you experiment with cross-entropy?
- We have fixed the typo in the mentioned equation and will ensure these typos do not occur in the final revision. Max-margin is necessary since, for a given query, we wish to send relevant documents to the same bucket as the query but push the irrelevant documents away from that bucket, thus achieving load balancing. We initially tried a cross-entropy loss, but note that no labels are available for the downstream clustering. Forcing a query and relevant document to reach the same bucket alone leads to a suboptimal solution where every query and document, regardless of content, could be indexed to the same bucket and achieve exact-search-like metrics, which would not be efficient. Thus, triplet-like losses are more suited for this indexing.
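For illustration, a minimal sketch of such a max-margin objective over the indexer's path/leaf probability vectors (the cosine similarity, the margin value, and all names here are illustrative assumptions, not the exact form of Eq. (2) in the paper):

```python
import torch
import torch.nn.functional as F

def indexing_triplet_loss(q_path: torch.Tensor, pos_path: torch.Tensor,
                          neg_path: torch.Tensor, margin: float = 0.3) -> torch.Tensor:
    # Pull the relevant document's path distribution toward the query's, and push the
    # irrelevant document's away by at least `margin` (hinge / max-margin, no cluster
    # labels needed); cross-entropy toward positives alone would admit the collapsed
    # solution where everything lands in one bucket.
    sim_pos = F.cosine_similarity(q_path, pos_path, dim=-1)
    sim_neg = F.cosine_similarity(q_path, neg_path, dim=-1)
    return F.relu(margin - sim_pos + sim_neg).mean()
```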
Thanks for updating the tables – it would be better to have the discussion and actual results of the SOA systems in the main text, rather than in the Appendix; as such, the paper reads as if it were hiding results under the carpet. Given the positioning of the paper, I think it would be better to argue why such an approach is worth pursuing – and I think this is true since the efficiency-effectiveness tradeoff for dense models is understudied.
- We thank the reviewer for their insightful comments. We have added a few lines to the related works section citing the works suggested by the reviewer, but have not been able to incorporate much more content expanding on the results of these models and on why efficiency-effectiveness tradeoffs are worth pursuing, due to page limitations as stated in our previous comment. As suggested, we have also incorporated some of the comparisons against SPLADE in the main text. We will expand on this further in the final version, as it requires significant re-writing to accommodate. As suggested by reviewer PLmq, we are planning to move the Tables to the Appendix anyway, but we will highlight these related works for completeness. We want to stress that we have no intention of sweeping these results under the carpet and are happy to include them.
SPLADE-Distill and other related models are based on distilbert-base-uncased, and are thus made of 66M parameters
- We thank the reviewer for pointing this out and apologize for the oversight. We have updated our draft to showcase the same.
For completeness, you also have approaches like "Learning to Tokenize for Generative Retrieval", which are working worse than your approach, but have the advantage of being fully discrete methods (rather than mixing dense and ID sequences).
- Thank you for the suggestion. We were aware of this work and noted that the MS MARCO dataset used in the paper is different from ours. The original MS MARCO dataset, as used in our paper, has 8.8M documents, but the one used in the work mentioned above only has 320K documents, as mentioned in Table 1 of their paper.
I will raise my rating to borderline since I think that the discussion resolved some of my concerns (but not all, I still think this paper can be improved a lot in terms of how the related works are discussed and compared with – I still don't agree with the way SOA results are put in the Appendix rather than in the main text)
- We hope the current main text is to their liking. As suggested by Reviewer PLmq, we plan to push the Tables to the Appendix and directly refer to the current Table 5, and Table 6 in the main paper. We would further expand in our final version and highlight the SOA methods you suggested.
We have edited the draft manuscript and have further highlighted SOA results in the main paper. Due to page constraints, we have not been able to expand further on the related works section, and will do so in the final version of the paper as it requires significant re-writing.
We are glad to find that the only remaining concern of the reviewer is the acknowledgment of some works in the main paper for completeness instead of the Appendix. We have edited our manuscript to reflect the same. Again, we would appreciate an appropriate increase in the score that reflects the same and facilitates acceptance of the paper.
We want to thank the reviewer for their time, elaborate discussions, and insightful comments, which have improved the quality of the paper. We have clarified various questions the reviewer asked and added comparisons against other works such as ColBERT-v2, SPLADE, RepCONC, and CCSA, as suggested by the reviewer for completeness. We will also attempt to elaborate on these methods in the main paper in the final draft; we have not yet been able to do this since it requires significant re-writing to adhere to the page constraints. These discussions have indeed been fruitful and helped strengthen our paper.
Since the author-reviewer discussion period is ending soon, we would appreciate an appropriate increase in the score that reflects the same and facilitate acceptance of the paper
The loss function
I asked the question since, unless I am wrong, the quantity defined in Eq. (4) is a probability distribution. Using Eq. (2) as the loss seems weird in that case (by the way, there is a missing parenthesis in the first eq. on p. 5) – and the same holds for the loss in (5). Could you then justify why you use a max-margin loss in that case? Did you experiment with cross-entropy?
Comparison with other models
Thanks for updating the tables – it would be better to have the discussion and actual results of the SOA systems in the main text, rather than in the appendix; as such, the paper reads as if it were hiding results under the carpet. Given the positioning of the paper, I think it would be better to argue why such an approach is worth pursuing – and I think this is true since the efficiency-effectiveness tradeoff for dense models is understudied.
Other things
- SPLADE-Distill and other related models are based on distilbert-base-uncased, and are thus made of 66M parameters
- For completeness, you also have approaches like "Learning to Tokenize for Generative Retrieval", which are working worse than your approach, but have the advantage of being fully discrete methods (rather than mixing dense and ID sequences).
- I will raise my rating to borderline since I think that the discussion resolved some of my concerns (but not all, I still think this paper can be improved a lot in terms of how the related works are discussed and compared with – I still don't agree with the way SOA results are put in the appendix rather than in the main text)
We would like to thank the reviewer for the elaborate discussions and the insightful questions. We believe that we have addressed all the concerns raised by the reviewer. We have also updated our manuscript and revised them as suggested by the reviewer in their recent response - resolving the last remaining concern the reviewer mentioned.
Since the author-reviewer discussion period is ending very soon, we would appreciate an appropriate increase in the score that reflects the same and facilitate acceptance of the paper. Again, thank you very much.
- This paper focuses on the task of learning representations (i.e. dense embeddings) for queries and documents/items jointly with an approximate nearest neighbor search index. This paper attempts to overcome limitations of existing approaches which build an approximate nearest neighbor search index after learning query and item embeddings.
- The proposed approximate nearest neighbor search index is a tree-based data structure in which each internal node contains a set of trainable parameters, and each query/document is routed to a set of leaf nodes using parameters at each internal node. The parameters of this tree-based data structure as well as parameters of query/document encoder models are learnt jointly.
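For concreteness, here is a minimal, illustrative sketch of the kind of learned routing structure described above (a toy implementation, not the paper's exact design: the single shared classifier per level, the one-hot path conditioning, and the greedy routing are simplifying assumptions):

```python
import torch
import torch.nn as nn

class ToyHierarchicalIndexer(nn.Module):
    """Toy B-way, H-level routing tree: one trainable classifier per level,
    conditioned on the embedding and a one-hot encoding of the path so far."""
    def __init__(self, dim: int, branching: int, height: int):
        super().__init__()
        self.branching, self.height = branching, height
        self.levels = nn.ModuleList(
            [nn.Linear(dim + level * branching, branching) for level in range(height)]
        )

    def forward(self, emb: torch.Tensor) -> int:
        """Greedily route a single embedding `emb` (shape [dim]) to a leaf id."""
        prefix = emb.new_zeros(0)   # one-hot path taken so far
        leaf = 0
        for classifier in self.levels:
            probs = torch.softmax(classifier(torch.cat([emb, prefix])), dim=-1)
            child = int(probs.argmax())          # training would use `probs` in the loss
            one_hot = emb.new_zeros(self.branching)
            one_hot[child] = 1.0
            prefix = torch.cat([prefix, one_hot])
            leaf = leaf * self.branching + child
        return leaf

# usage: route a 768-d embedding through a 2-level, 100-way tree (10,000 leaves)
indexer = ToyHierarchicalIndexer(dim=768, branching=100, height=2)
leaf_id = indexer(torch.randn(768))
```

At training time, the soft routing probabilities (rather than the greedy argmax) would be fed into the losses so that gradients reach both the routing parameters and the encoder, which is how the two sets of parameters can be learnt jointly.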
Strengths
- The idea of jointly learning a tree-based data structure with the parameters of the encoder model is interesting.
- The motivation for jointly learning the kNN search index with the parameters of the encoder is clear.
- The paper presents several ablations and analysis that help better understand various design choices.
Weaknesses
- Potentially missing baselines and related work
- Missing comparison with existing work that learns a kNN index jointly with the encoder model. Some examples are:
- Gupta, Nilesh, et al. "ELIAS: End-to-End Learning to Index and Search in Large Output Spaces." NeurIPS 2022
- Gupta, Gaurav, et al. "BLISS: A Billion scale Index using Iterative Re-partitioning." KDD 2022.
- The training strategy for baseline dual-encoder models is not clear. Some recent papers propose negative mining strategies that overcome the limitations of having to maintain an up-to-date kNN index during training. Here are some examples (some are already cited by the authors, but it is not clear why a direct comparison is omitted):
- Monath, Nicholas, et al. "Improving Dual-Encoder Training through Dynamic Indexes for Negative Mining." AISTATS 2023.
- Dahiya, Kunal, et al. "NGAME: Negative Mining-aware Mini-batching for Extreme Classification." WSDM 2023.
- Hofstätter, Sebastian, et al. "Efficiently teaching an effective dense retriever with balanced topic aware sampling." SIGIR 2021.
- Overall, it is not clear whether jointly learning the kNN index and encoder model is useful to get better negatives during training or whether it has advantages beyond just better hard negative mining. For this reason, a comparison with the related work mentioned above can serve to further strengthen the paper.
- While the paper proposes a trainable nearest neighbor search index and jointly learns the parameters of the encoder models and kNN index, it has some limitations similar to existing methods, such as:
- The documents need to be re-indexed every few epochs to mine hard negative documents for each query. It is not clear if this step is cheaper than building an index with off-the-shelf libraries like ScANN and FAISS.
- While the index contains trainable parameters at each internal node, the basic structure of the tree index i.e. height and branching factor is fixed.
- Joint training of encoder and nearest neighbor search index may not work very well in zero-shot settings where encoder models are trained on one domain and used on a new domain (with a new set of documents/items and without any training data).
- The encoder model is initialized using a variant of distill-bert which is already fine-tuned using labelled data from the MS-MARCO dataset, and MS-MARCO is one of the main large-scale evaluation datasets. Thus, it may not be fair to compare with baselines (previous papers) that use models such as distill-bert, bert or RoBERTa which are pre-trained only using self-supervised objectives.
Questions
- How are the baseline dual-encoder models trained? What negative mining strategy is used?
- Why is distill-bert further finetuned on the MS-MARCO dataset used instead of the vanilla distill-bert model? The distil-bert-cos-v5 model is finetuned on MS-MARCO as well as trained via distillation using teacher models such as cross-encoders. It may not be possible to have similar finetuned models for other domains or in general.
- What is the cost of training the proposed joint model and how does it compare with training dual-encoders separately with existing hard negative mining strategies?
- How is load balancing enforced/achieved? What part of the indexing or training step encourages better load balancing?
- In Sec 4.3, why is using beam-size = branching factor referred to as exact search? Are all documents visited in this setting? Also, how does beam-size = 0.1*branching_factor ensure that we search up to 10% of documents? My understanding is that beam-size = b means that we end up at b leaf nodes and then exhaustively rank all documents in those leaf nodes. So unless the tree is of height 1, setting beam-size = branching factor cannot mean that we are exhaustively searching over all documents.
- What does this line in Sec 3.6 mean?
- "… we now have a learned mapping of d × l, where d is the corpus size, and l is the number of leaves."
- How does using an off-the-shelf ANNS indexing method with the encoder model learnt with the proposed approach work?
- How is the percentage of visited documents controlled in the ScANN and FAISS-IVF methods? Also, why are graph-based kNN search methods not compared with?
- Discussion around inference latency
- I understand that it might not be possible to compete with highly-optimized kNN search methods such as ScANN and FAISS, but I think it is important to highlight any important differences between the proposed method and existing kNN indexing methods to better understand if one method is more scalable or amenable to parallelization-based optimization than others.
How does using off-the-shelf ANNS indexing method with the encoder model learnt with the proposed approach work?
- We thank the reviewer for this insightful question. We notice that off-the-shelf ANNS indexers on top of the encoder model learnt through EHI perform worse in comparison to the complete EHI module. At similar compute budgets of visiting 1% of documents, EHI outperforms the ScANN baseline by 4.6% (nDCG@10) on the MSMARCO TREC-DL benchmark, and by 1% on the nDCG@10 and MRR@10 metrics on the MSMARCO benchmark. We have added our findings along with KMR curves in Appendix E.8.
How is the percentage of visited documents controlled in ScANN and FAISS-IVF methods? Also, why are graph-based kNN search methods not compared with?
- ScANN and FAISS-IVF have arguments controlling the beam size, similar to the EHI formulation we coded. For each beam size, we compute how many documents were visited by inspecting the index structure. We have compared against graph-based ANNS such as HNSW; please refer to Table 1. Unfortunately, we have not been able to plot a KMR curve for these graph-based approaches, and could not plot them in the same figures as ScANN, FAISS-IVF, etc.
I think it is important to highlight any important differences in proposed method vs existing kNN indexing methods to better understand if one method is more scalable or amenable to parallelization-based optimization than others.
- This is absolutely true. Algorithms such as ScANN do use CPU parallelization in both index training and evaluation. In our code, we scale EHI with GPU parallelization. Once the index is built, we note that EHI can offer similar parallelization benefits as ScANN or FAISS, as one only needs to compute the index of the query, and similar CPU parallelization is possible. In terms of pure latency, we believe other optimizations and enhancements such as quantization were used in ScANN to speed up retrieval, and we believe EHI can also be scaled up with such optimizations.
We hope that the rebuttal clarifies the questions raised by the reviewer. We would be very happy to discuss any further questions about the work, and would really appreciate an appropriate increase in the score if the reviewer’s concerns are adequately addressed to facilitate acceptance of the paper.
Thank you for such a detailed response. I still feel that some of the concerns remain unaddressed.
(1) Use of distil-bert-cos-v5 model.
I am not sure why authors say that this is trained using self-supervised training objectives.
It clearly uses MS-MARCO queries and ground-truth positive passages and creates target labels using the cross-encoder (which is why I said that this is trained via distillation).
As shown in the training script, it uses scores from a cross-encoder to create target labels here:
https://huggingface.co/sentence-transformers/msmarco-distilbert-dot-v5/blob/main/train_script.py#L213
Also, based on the information here: https://www.sbert.net/docs/pretrained-models/msmarco-v5.html, it seems that this v5 model was obtained through several rounds of improvements (see the v2, v3 and v4 models), and some of the previous versions also used cross-encoder based distillation (e.g. the v3 models)
(2) Training strategy for baseline seems sub-optimal.
The training strategy for the baseline dual encoder model is an in-batch hard negative sampling approach.
As per the authors, the baseline DE model is trained using in-batch hard negative sampling. Why are other global hard negative mining strategies not used for training the baseline DE model? Such global hard negative mining strategies have been shown to considerably improve the quality of the DE model over using in-batch negatives. Some references are given below:
- Hofstätter, Sebastian, et al. "Efficiently teaching an effective dense retriever with balanced topic aware sampling." SIGIR 2021.
- Dahiya, Kunal, et al. "NGAME: Negative Mining-aware Mini-batching for Extreme Classification." WSDM 2023.
- Zhang, Wenzheng, and Karl Stratos. "Understanding hard negatives in noise contrastive estimation." NAACL 2021.
(3) Training cost of proposed approach vs other methods that use a separate ANNS for mining hard negatives.
- I think both EHI and these other methods that use a separate ANNS index (let's call them non-E2E for this discussion) need to refresh the index every few epochs, i.e. basically re-index the documents/items. Please clarify if I am making a mistake here.
- Can the authors present numbers comparing the re-indexing cost? Both EHI and non-E2E methods would need to re-compute document embeddings for all documents in the corpus, and then index them. Could you please present how much time it takes to compute updated document embeddings and how much time it takes to build the index once the embeddings are computed?
- Also, NGAME and DyNNIBAL seem fairly efficient, but it is not clear why these methods would not be able to leverage GPUs. There is also some recent work which uses GPUs for improving kNN indexing and inference efficiency.
(4) I feel ELIAS and BLISS are important baselines that will help to show the advantages of the proposed approach.
- Also, similar to BLISS, EHI also needs to alternate between learning the parameters of the index and updating the index. EHI maybe does it every few epochs, but it still does need to alternate.
- Could you please highlight some key differences between BLISS and EHI other than that EHI uses a tree-shaped index while BLISS uses a flat one-level index?
Overall, I don't feel that the concerns raised in my review have been completely addressed and I would like to maintain that the paper has some good ideas but it will perhaps need more revisions before it is ready for a conference.
Joint training of encoder and nearest neighbor search index may not work very well in zero-shot settings.
- That is correct. However, this is simply true for most current IR systems, even those which use a two-stage disjoint training paradigm. It would be interesting to study how few-shot learning works with EHI, and this is an interesting future direction.
The encoder model is initialized using a variant of distill-bert which is already fine-tuned using labelled data from MS-MARCO dataset.
- The distilbert-cos model we have used through SentenceBERT was also pre-trained only using self-supervised objectives [1, 2] with mined hard negatives. The comparisons against various baselines in Appendix D predominantly serve to showcase that the EHI training paradigm can compete against SOTA methods in the field, while the alignment with the ANNS is significantly improved, as showcased in Figure 2 and Appendix E.
[1] Reimers, Nils, and Iryna Gurevych. "Sentence-bert: Sentence embeddings using siamese bert-networks." arXiv preprint arXiv:1908.10084 (2019). [2] https://github.com/UKPLab/sentence-transformers/issues/1167
distil-bert-cos-v5 model is finetuned on MS-MARCO as well as trained via distillation using teacher models such as cross-encoders. It may not be possible to have similar finetuned models for other domains or in general.
- As explained in the official sbert repository [1], distilbert-cos-v5 does not use distillation and follows the standard self-supervised training with mined hard negatives. We note that initialization from this distilbert model does facilitate easier fine-tuning to other domains, as showcased by our experiments on non-MSMARCO datasets including SciFact, FiQA, and NQ-320K.
[1] https://github.com/UKPLab/sentence-transformers/issues/1167
What is cost of training proposed joint model and how does it compare with training dual-encoders separately with existing hard negative mining strategies?
- During training, EHI has up to a 5% increase in training time over the exact-search baseline on datasets such as SciFact and FiQA. We will also add training times in the final paper. We believe that other hard negative mining strategies the reviewer mentions, such as NGAME and DyNNIBAL, would intuitively have a higher overhead than EHI, as some of these approaches use kNN and other machine learning methods for training the clusters, which are typically run on CPU. Even if EHI and these negative sampling approaches were run on the same device, EHI should still be more efficient since we only perform inference for the negative mining step, unlike other negative mining strategies which require us to train and infer for each query.
How is load balancing enforced/achieved? What part of the indexing or training step encourages better load balancing
- The indexing similarity loss, which is a contrastive loss, helps with load balancing. This, along with hyperparameter tuning, leads to load balancing in EHI. See Figure 3a for ablations regarding this loss. The idea is that, while training, EHI will try to push documents not relevant to a query away from that query's bucket, thus avoiding the trivial solution where the majority of documents reach a single leaf (additional details in Appendix E.4, E.5).
In Sec 4.3, why is using beam-size = branching factored referred to as exact search? Also, how does beam-size = 0.1*branching_factor ensure that we search up to 10% documents?
- Your understanding is correct. We would like to add that SciFact, being a very small-scale dataset, did not require heights greater than 1, and hence we have mentioned that beam-size = branching factor is analogous to exact search. We will correct this notion and provide further clarity on it. Note that beam-size = 0.1*branching_factor does not ensure that we search up to 10% of the documents; this would only be true if we strictly enforced uniform clusters. In this work, we do not impose any such constraint, and find that EHI is able to learn near-uniform clusters, as showcased in Appendix E.4, E.5. We use the index structure to compute the fraction of documents visited and plot the KMR curves.
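For concreteness, a small sketch of how the fraction of visited documents can be read off the learned index structure (function and variable names are illustrative):

```python
import numpy as np

def fraction_visited(doc_leaf: np.ndarray, beam_leaves: set) -> float:
    # doc_leaf[i] is the leaf id the trained indexer assigns to document i;
    # beam_leaves is the set of leaf ids reached by a query's beam.
    visited = np.isin(doc_leaf, list(beam_leaves)).sum()
    return float(visited) / len(doc_leaf)

# e.g. 1M documents spread over 1,000 leaves, with a beam reaching 10 leaves
doc_leaf = np.random.randint(0, 1000, size=1_000_000)
print(fraction_visited(doc_leaf, set(range(10))))   # ~0.01 only if leaves are near-uniform
```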
“… we now have a learned mapping of d × l, where d is the corpus size, and l is the number of leaves.”
- During inference, the indexer serves to map documents (d) into buckets (l). This knowledge is represented in this section as a d × l mapping.
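A minimal sketch of materializing this d × l mapping as inverted lists, assuming a single flat leaf id per document (names are illustrative):

```python
from collections import defaultdict

def build_leaf_buckets(doc_ids, doc_leaves):
    # Materialize the learned d x l mapping as inverted lists (leaf id -> documents),
    # so that at query time only the buckets reached by the query's beam are scored.
    buckets = defaultdict(list)
    for doc, leaf in zip(doc_ids, doc_leaves):
        buckets[leaf].append(doc)
    return buckets
```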
We thank the reviewer for the kind comments and insightful questions. We are happy to hear that the reviewer finds the overall motivation, ablations and idea interesting. Please refer below for the responses to specific questions.
Missing comparison with existing work that learns a kNN index jointly with the encoder model.
- It should be noted that ELIAS and BLISS are tailored for the domain of extreme classification, while we focus on the dense information retrieval domain (note that label classifiers and pre-computing a kNN structure for datasets the size of MSMARCO are indeed a significant overhead). ELIAS has been cited in our work, and we do point out the limitations of ELIAS [1], which requires warm starting with a kNN-clustered index. Regardless, we report our findings on LF-AmazonTitles-131K (an XC dataset) below. Note that the below comparison is still unfair, as ELIAS utilizes millions of additional auxiliary parameters in its label classifiers, while EHI does not. Furthermore, EHI uses label features for encoding, while ELIAS does not. Although EHI outperforms ELIAS by 1.67%, it must be recognized that comparing against extreme-classification-based methods is unfair.
Furthermore, BLISS [2], as per our understanding, is an efficient ANN structure which enforces load balancing. Our goal in this work has been to unify the training of the encoder, which learns representations, and the ANN, which helps index documents for faster and more efficient retrieval. We are happy to add both of these comparisons to our paper if the reviewer thinks these comparisons are indeed fair.
| LF-AmazonTitles-131K | P@1 |
|---|---|
| ELIAS | 40.13 |
| EHI | 41.8 |
[1] Gupta, Nilesh, et al. "ELIAS: End-to-End Learning to Index and Search in Large Output Spaces." NeurIPS 2022 [2] Gupta, Gaurav, et al. "BLISS: A Billion scale Index using Iterative Re-partitioning." KDD 2022.
The training strategy for baseline dual-encoder models is not clear. Overall, it is not clear whether jointly learning the kNN index and encoder model is useful to get better negatives during training or whether it has advantages beyond just better hard negative mining.
- The training strategy for the baseline dual-encoder model is an in-batch hard negative sampling approach. Note that the primary purpose of our work, as the reviewer has rightfully captured in their summary, is the unified training paradigm of learning both the encoder and indexer without any warm starting. Note that dynamic negative mining is something EHI achieves almost for free, as the built indexer is also used for the ANNS (a small illustrative sketch follows the references below), and we expand on this further in Appendix F.1. Prior approaches including [1,2,3] do study this dynamic negative sampling approach as well, and it is complementary to our approach. However, the difference between EHI and these approaches lies in the fact that EHI does not need to build a new index for clustering every few steps, and the hierarchical tree index learned by EHI is used for the downstream ANNS task. This contrasts with works such as [1,2,3], where the index is used during training and hard negative mining, but is finally discarded. Note that we have compared against DyNNIBAL in Appendix D. Furthermore, as mentioned in our new experiments, EHI with its differentiable indexer achieves roughly a 2% improvement at lower compute budgets over off-the-shelf baselines such as ScANN and Faiss (see Appendix E.8).
[1] Monath, Nicholas, et al. "Improving Dual-Encoder Training through Dynamic Indexes for Negative Mining." [2] Dahiya, Kunal, et al. "NGAME: Negative Mining-aware Mini-batching for Extreme Classification." [3] Hofstätter, Sebastian, et al. "Efficiently teaching an effective dense retriever with balanced topic aware sampling."
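For illustration, a minimal sketch of this kind of index-based hard negative mining (the bucket-sampling scheme and all names are illustrative assumptions, not the paper's exact procedure):

```python
import random

def mine_hard_negatives(query_leaves, buckets, positives, k=4):
    # Hard negatives = documents that the current indexer places in the query's own
    # bucket(s) but that are not labelled relevant for that query.
    candidates = [doc for leaf in query_leaves
                  for doc in buckets.get(leaf, []) if doc not in positives]
    return random.sample(candidates, min(k, len(candidates)))
```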
It is not clear if this step is cheaper than building an index with off-the-shelf libraries like ScANN and FAISS.
- It is definitely cheaper, as we are simply using the same trained indexer for re-indexing. If we were to use some off-the-shelf indexer such as ScANN or FAISS, we would need to retrain the ANNS from scratch every once in a while. From our findings, we note that training the ANNS on MSMARCO is roughly 2x more expensive than just inference using the indexer of EHI.
While the index contains trainable parameters at each internal node, the basic structure of the tree index i.e. height and branching factor is fixed.
- This is absolutely true. These choices are left to the practitioner similar to the network size, shape, activation layers, and other intricacies of any deep learning algorithm. However, we note that EHI is fairly robust to these choices as showcased in Table 13.
Thank you for your detailed comments. We understand that some of the concerns were unanswered, and we clarify these below.
Use of the distilbert-cos-v5 model.
- We apologize for any misunderstanding here. We note that the framework is indeed supervised; we did not use the keyword distillation since there is no teacher-student framework used for training the sentence-bert model, and the goal has been to align queries with positive documents and away from negative documents. This is further bolstered by the fact that they use MarginMSELoss, as highlighted in the file the reviewer has mentioned. We did not use the word distillation as there are no objectives to align representations at a given layer between a teacher and student model. We do acknowledge that a different model was used for mining hard negatives, as mentioned earlier, which is similar to the usage of exhaustive hard negative mining approaches such as RocketQA [1], etc. We will explicitly mention this in the updated draft to avoid future confusion.
[1] Qu, Yingqi, et al. "RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering." arXiv preprint arXiv:2010.08191 (2020).
use of global hard negative mining strategies for DE training.
- Our main finding is that e2e learning is important, and EHI achieves e2e training and outperforms disjoint training paradigms. We believe the experiment suggested by the reviewer earlier – training a disjoint ANNS over the pre-trained EHI encoder – further bolsters this claim and helps alleviate this concern specifically. Even if both the e2e pipeline and the disjoint off-the-shelf indexer pipeline use the same embeddings, we find that the e2e training paradigm outperforms the latter. We notice that off-the-shelf ANNS indexers on top of the encoder model learnt through EHI (which does have a dynamic negative sampling scheme similar to DyNNIBAL, NGAME, etc.) perform worse in comparison to the complete EHI module (encoder with the indexer). At similar compute budgets of visiting 1% of documents, EHI outperforms the ScANN baseline by 4.6% (nDCG@10) on the MSMARCO TREC-DL benchmark, and by 1% on the nDCG@10 and MRR@10 metrics on the MSMARCO benchmark. We have added our findings along with KMR curves in Appendix E.8. Furthermore, other global hard negative mining strategies are complementary to the core message of our paper and can be applied on top of EHI as well. We also report the scores of EHI without hard negative mining in Figure 3c, and note a 2% improvement of the EHI model over the baselines. We will highlight this finding in the updated draft.
Can the authors present numbers comparing the re-indexing cost?
- In our multi-GPU setup, a forward pass through EHI computes both the document embeddings and the index. We note that the entire pipeline has a latency of 2 ms/document. The indexing part accounts for only 5% of this time (~0.1 ms/document), as the neural network used in the indexer is significantly smaller than the transformer model used for the encoding. Thus, the EHI pipeline adds little overhead over the base encoder.
I feel ELIAS and BLISS are important baselines that will help to show the advantages of the proposed approach. Could you please highlight some key differences between BLISS and EHI other than that EHI uses a tree-shaped index while BLISS uses a flat one-level index?
- Thank you for your suggestion. We will add our comparisons against ELIAS as suggested in the updated draft. The key difference from BLISS is not only the structure of the index, as the reviewer correctly points out, but also the fact that the encoder is not trained in BLISS. BLISS is a differentiable ANNS, but the pipeline is not e2e as the encoder remains frozen. We believe this to be the key difference, and the limitations of disjoint training paradigms we raised in the Introduction are not solved by BLISS.
We hope this clears some of the remaining concerns the reviewer might have had. We would be very happy to discuss any further questions about the work.
Thanks for providing further clarifications.
Here are some thoughts/follow-up questions
- It is interesting to see that EHI performs better than ScANN at smaller inference cost budgets. But the improvements are marginal in my opinion, and other ANNS algorithms should also be compared against in this figure to make a stronger case for EHI.
- Also, each routing step in EHI to get to the leaf nodes involves a forward pass through a neural model, while for other tree-based or graph-based indices the routing/navigation steps are mostly dot-products between two vectors, which is significantly cheaper than a forward pass through a neural model. I understand the reasons for comparing against the fraction of documents explored instead of inference latency are mainly that EHI is not as optimized (yet) as other ANNS search methods, but I feel we don't have enough data in Fig. 10 to conclude that EHI is better than a disjoint ANNS. Either other ANNS methods should be added (such as FAISS-HNSW, FAISS-IVF) or inference latency should be used instead of % documents explored.
- Most of the main experiments in this paper compare against a weak DE baseline that is trained only on in-batch negatives and not on global hard negatives.
- I feel this is misleading and an unfair comparison, as state-of-the-art DE baselines are not trained just using in-batch negatives. For instance, exact search with the baseline DE gets recall@100 = 54 in Table 9, while Thakur et al. (2021) report recall@100 > 58 for various DE models which are trained with hard negatives. I am not sure if the base DE in this paper and in Thakur et al. (2021) are similarly parameterized, but my main point is that the baseline DE model in this paper is not representative of the state-of-the-art.
- This raises a question -- as a practitioner, would one prefer to use EHI or train DE with hard negatives mined using an ANNS approach. So far the only argument in favor of EHI is that the index and encoder models are trained jointly but it is not clear if it really beats DE trained with hard negatives + separate ANNS index.
- I am curious to see how training time compares for EHI and DE + ANNS approaches.
- My understanding is that DE + ANNS (let's call them disjoint for this discussion) need to re-compute document embeddings and create a new ANNS index on the document embeddings every few epochs to support mining hard negatives. Even for EHI, the document embeddings need to be re-computed and the documents need to be re-indexed every few epochs. And this reindexing is critical even for EHI to get hard negatives, as just using in-batch negatives leads to a significant drop in performance as shown in Fig. 3c. Thanks for reporting per-document routing time, but I was looking for the total time spent in this re-embedding and re-building-the-index step.
- If this re-embedding + re-building-the-index overhead is significant for disjoint approaches, then there are approaches like NGAME and DyNNIBAL which can reduce this overhead significantly. To the best of my understanding, reported numbers from the DyNNIBAL paper are not directly comparable with those reported in this paper due to the difference in encoder model. This paper uses distil-bert-v5 models and DyNNIBAL uses roberta.
Thakur, Nandan, et al. "Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models." arXiv preprint arXiv:2104.08663 (2021).
Even for EHI, the document embeddings need to be re-computed and the documents need to be re-indexed every few epochs. And this reindexing is critical even for EHI to get hard negatives, as just using in-batch negatives leads to a significant drop in performance as shown in Fig. 3c. Thanks for reporting per-document routing time, but I was looking for the total time spent in this re-embedding and re-building-the-index step.
- Thank you for highlighting this. We would like to stress two points here:
- EHI supports lazy updates to the index, where for a given batch of training points we only compute their latest index. Such a technique is not trivially possible for DE+ANNS techniques.
- EHI does not have to recompute index parameters from scratch, which is what DE+ANNS techniques would do. For example, with IVF, DE+IVF would need to recompute clusters for the latest set of embeddings. In contrast, we continuously train our index parameters. This bears out in our wall-clock comparisons as well [1]. On the scale of approximately 9 million documents, we note that EHI-based indexing completes in 25 minutes with the help of our multi-GPU pipeline. Of these 25 minutes, only 2-3 minutes are used for the indexing, and the remainder is used for the computation of dense embeddings.
[1] Karpukhin, Vladimir, et al. "Dense passage retrieval for open-domain question answering." arXiv preprint arXiv:2004.04906 (2020).
If this re-embedding + re-building the index overhead is significant for disjoint approaches then there are approaches like NGAME and DyNNIBAL which can reduce this overhead significantly. To the best of my understanding, reported numbers from DyNNIBAL paper are not directly comparable with those reported in this paper due to difference in encoder model. This paper uses distil-bert-v5 models and DyNNIBAL uses roberta.
- The re-embedding + re-building-the-index overhead is indeed significant, as stated in our response to the previous query. We do acknowledge that NGAME and DyNNIBAL might reduce some of this overhead, but this is not completely clear as latency numbers for DyNNIBAL are not mentioned. Furthermore, the training time of EHI on the LF-AmazonTitles-131K dataset is roughly similar to the training time of NGAME as reported in [1]. However, we would like to stress that our core message about the limitations of the disjoint training paradigm still remains, and the primary message of our paper supporting the need for e2e training paradigms holds true. Furthermore, we would like to stress that our comparisons against various SOTA models (some more than 10x our model size), including DyNNIBAL, help showcase that EHI is competitive against SOTA models and that e2e training pipelines help align these representations with the downstream clustering, as shown across our findings as well as the newer experiments added during the rebuttal.
[1] http://manikvarma.org/downloads/XC/XMLRepository.html
We hope that the rebuttal clarifies the follow-up questions raised by the reviewer. We would be very happy to discuss any further questions about the work, and would really appreciate an appropriate increase in the score if the reviewer’s concerns are adequately addressed, and revisions have been made to facilitate acceptance of the paper. We will make sure to highlight these fixes in the updated manuscript as suggested by the reviewer.
We again thank the reviewer for taking the time and effort engaging in this dialogue and helping improve our work.
On the scale of approximately 9 million documents, we note that EHI based indexing completes in 25 minutes with the help of our multi-GPU pipeline. Of these 25 minutes, only 2-3 minutes are used for the indexing, and the remaining is used in the computation of dense embeddings.
- If each document encoding takes 2 ms, then 9 million documents should take roughly 5 hours. I understand that this can be parallelized on multiple GPUs. Is that how you get 25 minutes? Also, each document takes 0.1 ms for indexing, so 9 million should roughly take 30 minutes, right?
- Can you please report the time taken to build separate ANNS indices such as ScANN, FAISS-IVF, and HNSW used in your experiments? Also, some of these can benefit significantly from the use of GPUs. As reported in Table 5 (Xiong et al., 2020), ANNS building time is merely 10s, which is negligible compared to document embedding time.
- Also, what does per-epoch training time look like? And how many GPUs are used for training? How does it compare with DE (with in-batch negatives) and DE with ANNS-based hard negatives?
Xiong, Lee, et al. "Approximate nearest neighbor negative contrastive learning for dense text retrieval." arXiv preprint arXiv:2007.00808 (2020).
Re: Comparison with SoTA models
- Sorry for the confusion, but when I said the baselines are not SoTA, I meant that the DE + ANNS baseline used in Figure 2 is not SoTA, because the DE is trained with just in-batch negatives whereas SoTA training strategies for DE models use ANNS to get hard negatives.
- Are there numbers where you started with encoder = msmarco-distilbert-cos-v5 and then trained with global hard negatives using ANNS methods? This is the baseline that would be really important to understand whether end-to-end training is something that a practitioner would want to do, or whether the existing approach of DE (trained with hard negatives) + ANNS is good enough or better.
- It is difficult to draw conclusions from Tables 1 and 2 as the other baselines use different encoder models, so it is not a useful comparison to evaluate whether the proposed end-to-end training is better than existing training strategies.
- Also, the base encoder model used here is msmarco-distilbert-cos-v5, which already beats most of the baselines in Table 5 even without any further end-to-end training. As per numbers reported here, msmarco-distilbert-cos-v5 gets MRR@10 of 33.75 and EHI in Table 5 gets MRR@10 of 33.8. So, the SoTA results in Table 5 are because you started with a better encoder model, i.e. msmarco-distilbert-cos-v5, and it may not hold if you initialized with, say, bert or roberta. Please correct me if I am making a mistake in this comparison.
- For this reason, I suggested in my initial review that the base encoder model choice is perhaps not fair, and the authors should consider using bert, roberta, or distil-bert models.
We thank the reviewer for their questions, and appreciate the elaborate discussion. Please find below clarifications for each of these queries.
If each document encoding takes 2ms, then 9 million documents should take roughly 5 hours. I understand that this can be parallelized on multiple GPUs. Is that how you get 25 minutes? Also, each document takes 0.1 ms for indexing, so 9 million should roughly take 30 minutes, right?
- Yes - the reviewer is correct in their understanding that we can compute the index in 25 minutes because of our multi-GPU setup. We note that at 0.1 ms per document, indexing roughly 9 million documents should theoretically take around 15 minutes without any parallelization, not 30 minutes.
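For reference, here is a back-of-the-envelope sketch of this arithmetic in Python (a minimal illustration only: the per-document latencies and GPU count are the figures quoted in this thread, and the assumption of perfect linear scaling across GPUs is ours):

```python
# Back-of-the-envelope estimate of the encoding/indexing times discussed above.
# Assumes perfect linear scaling across GPUs and one document per forward pass;
# batched inference amortizes the per-document cost further in practice.
num_docs = 9_000_000
encode_ms_per_doc = 2.0   # encoder forward pass, single example
index_ms_per_doc = 0.1    # routing one document through the learned index
num_gpus = 8

encode_min_1gpu = num_docs * encode_ms_per_doc / 1000 / 60   # ~300 min (~5 h)
encode_min_ngpu = encode_min_1gpu / num_gpus                  # ~37.5 min
index_min = num_docs * index_ms_per_doc / 1000 / 60           # ~15 min

print(f"encoding: ~{encode_min_1gpu:.0f} min on 1 GPU, ~{encode_min_ngpu:.1f} min on {num_gpus} GPUs")
print(f"indexing: ~{index_min:.0f} min without parallelization")
```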
Can you please report the time taken to build separate ANNS indices such as ScANN, FAISS-IVF, and HNSW used in your experiments? Also, some of these can benefit significantly from the use of GPUs. As reported in Table 5 (Xiong et al., 2020), ANNS building time is merely 10s, which is negligible compared to document embedding time.
- We would again like to stress that this comparison might not be fair. In our implementation, we note that building the ScANN index takes anywhere between 20-30 minutes, but this might be because we are not taking advantage of GPUs. FAISS-IVF again takes 40-60 minutes to train and re-index, but this again might be because we use CPUs alone. Due to technical difficulties, we were unable to get FAISS to work on GPUs.
Also, what does the per-epoch training time look like? And how many GPUs are used for training? How does it compare with DE (with in-batch negatives) and DE with ANNS-based hard negatives?
- Per-epoch training time for the scale of MSMARCO is roughly 2-3 minutes per epoch, and we use 8 GPUs for training. DE (in-batch + hard negatives) also takes roughly 2-3 minutes per epoch, which is roughly similar to DE with ANNS-based hard negatives and slightly more expensive than DE with in-batch negatives alone.
Are there numbers where you started with encoder = msmarco-distilbert-cos-v5, and then trained with global hard negatives using ANNS methods? This is the baseline that would be really important to understand if end-to-end training is something that a practitioner would want to do or if existing approaches to use DE (trained with hard negatives) + ANNS is good enough or better.
- To alleviate this concern, we ran an experiment where the EHI indexer mines hard negatives, and used these to train the distilbert model. We note similar numbers as reported in our paper, with the baseline achieving 33.77 MRR@10. This would follow our consistent trend supporting e2e learning over DE (trained with hard negatives) + ANNS. We believe this alleviates some of the concerns surrounding this topic.
It is difficult to draw conclusions from Table 1 and 2 as the other baselines use different encoder models so it is not a useful comparison to evaluate if the proposed end-to-end training is better than existing training strategies.
- We thank the reviewer for pointing this out, and would like to clarify these here, and highlight these points in the updated drafts as well.
- We would like to stress that the fair comparison for the MSMARCO baseline is Figure 2, where the encoders and training mechanisms are held constant and only the ANNS is varied, and where the baselines are representative of SOTA as noted by the reviewer. In the case of EHI, the training mechanism includes the ANNS, whereas in the other DE+ANNS approaches the training paradigm is disjoint in nature. Gains at lower compute budgets in this scenario showcase the primary message of EHI: the e2e training paradigm leads to better alignment between the encoder embeddings and the downstream clustering task.
- We would also like to highlight that Table 1 and Table 2 are meant to show the lay of the land: EHI, which uses the distilbert-cos encoder, is better than quite a few existing off-the-shelf techniques -- techniques that often also leverage full finetuning on the same dataset, unlike EHI, which is parameter-efficient and outperforms models with 10x-100x our model complexity that were trained on the same dataset with exhaustive hard-negative mining, etc. We also add the model complexity of the models we compare against in Table 1, Table 2, and others to showcase this trend.
- We also observe that, at a small scale, fine-tuning only the last layer is as effective as full fine-tuning. This choice is complementary, however, and any gains from full finetuning will carry over to EHI, as showcased below in our findings on the SciFact dataset.
| SciFact | R@100 |
|---|---|
| Finetune last layer | 95 |
| Train all layers | 94.77 |
Again, we would like to thank the reviewer for engaging in elaborate discussions with us over this work. We hope this alleviates some of their concerns.
Per-epoch training time for the scale of MSMARCO is roughly 2-3 minutes per epoch,
MS-MARCO contains 500K queries. Using 2ms per forward-pass (as reported by authors), even a forward pass to compute 500K query embeddings and also at least 500K positive items/document embeddings will take 2x500Kx2ms = 2000s ~= 33 minutes on a single GPU and at least 4 minutes when using 8 GPUs. And this is just the forward pass; the backward pass to compute gradients will take even longer. Please correct me if I am making a mistake here. Not sure how the authors achieve 2-3 minutes per epoch.
To alleviate this concern, we ran an experiment where the EHI indexer mines hard negatives, and used these to train the distilbert model. We note similar numbers as reported in our paper, with the baseline achieving 33.77 MRR@10.
The distil-bert-cos-v5 model already has MRR@10 of 33.75 on MS-MARCO as per numbers reported here. So does this mean that your finetuning does not lead to any improvement? Maybe I am missing something here? Also, should these be compared with 33.8 for EHI in Table 1? In that case, even EHI does not lead to improvement over the distil-bert-cos-v5 model.
Also, the base encoder model used here is msmarco-distilbert-cos-v5 which already beats most of the baselines in Table 5 even without any further end-2-end training. As per numbers reported here, msmarco-distilbert-cos-v5 gets MRR@10 of 33.75 and EHI in table 5 gets MRR@10 of 33.8. So, SoTA results in Table 5 are because you started with a better encoder model i.e. msmarco-distilbert-cos-v5, and it may not hold if you initialized with say bert or roberta. Please correct me if I am making a mistake in this comparison. For this reason, I suggested in my initial review that the base encoder model choice is perhaps not fair, and authors should consider using bert, roberta or distil-bert models.
Could you please respond to this comment from my previous post?
Here are my final suggestions to the authors to improve this paper. I don't think it is fair to ask, or even possible, to fix these things during the rebuttal period, but I hope it will be helpful for improving subsequent revisions of this work.
- In Figure 2, compare with DE trained with hard negatives and not just DE trained with in-batch negatives
- Use bert, roberta, or distil-bert (or another generic pretrained model) instead of distil-bert-cos-v5, as distil-bert-cos-v5 has been heavily finetuned over MS-MARCO.
- ANNS building time seems a little off for million-scale datasets, and given that the proposed approach uses GPUs, it is not fair to compare against CPU-based ANNS methods when these can be run over GPUs.
Thanks for the quick response, and follow-up questions. Please find below clarifications for the follow up questions.
The improvements are marginal in my opinion, and other ANNS algorithms should also be compared against in this figure to make a stronger case for EHI. Either other ANNS methods should be added (such as FAISS-HNSW, FAISS-IVF) or inference latency should be used instead of % documents explored.
- Thank you for your suggestion. We would like to stress that improvements on the order of 4.6% and 1% are indeed statistically significant in the MSMARCO experiments, as further expanded in our statistical significance tests in Appendix E.3. Furthermore, in our comparisons of EHI against FAISS-IVF at similar compute budgets, we note similarly significant improvements of 1.82% (nDCG@10) on the MSMARCO TREC-DL benchmark when visiting only 1% of the documents, and improvements of 2% on the MRR@10 metric on the MSMARCO benchmark at similar compute budgets. We are currently running comparisons against FAISS-IVF with larger beam sizes and will update Figure 10 with the results at the earliest. Furthermore, comparing inference latency would be ideal only if all approaches ran on the same architecture; ScaNN runs on CPU, using increased parallel processing to compute nearest neighbors. Recall vs. documents visited generally correlates strongly with latency and other throughput metrics [1], so for ease of comparison we stick with this metric.
[1] Guo, Ruiqi, et al. "Accelerating large-scale inference with anisotropic vector quantization." International Conference on Machine Learning. PMLR, 2020.
the routing/navigation steps are mostly like dot-products between two vectors which is significantly cheaper than forward pass through a neural model.
- Thank you for highlighting this. We would like to point out that EHI's routing and scoring are also primarily based on cheap dot products. In particular, in ScaNN or IVF, given a query embedding q, the leaf node is selected by argmax_j <q, c_j>, where c_j represents a cluster centroid (about 7000 centroids in MSMARCO). Similarly, in EHI, we first compute Wq, where W is roughly just a 768x768 matrix for single-height indexers, and then compute argmax_j (U Wq)_j over a similar number of branches (7000 branches in MS MARCO). So the amount and type of computation at each node is almost the same, without any significant difference.
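To make this comparison concrete, here is a minimal numpy sketch of the two routing rules (the dimensions follow the discussion above, but the random matrices, variable names, and single-height setting are illustrative assumptions rather than the trained models):

```python
import numpy as np

d, num_leaves = 768, 7000
rng = np.random.default_rng(0)
q = rng.standard_normal(d)                  # query embedding from the encoder

# IVF / ScaNN-style routing: pick the leaf whose centroid has the largest
# dot product with the query.
centroids = rng.standard_normal((num_leaves, d))
ivf_leaf = int(np.argmax(centroids @ q))

# EHI-style routing (single-height indexer): one ~768x768 affine transform,
# then a logit layer over the ~7000 branches -- still dense matrix-vector
# products of roughly the same size as the centroid scoring above.
W = rng.standard_normal((d, d))
U = rng.standard_normal((num_leaves, d))
ehi_leaf = int(np.argmax(U @ (W @ q)))

print(ivf_leaf, ehi_leaf)   # both rules cost O(num_leaves * d) per query
```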
So far the only argument in favor of EHI is that the index and encoder models are trained jointly but it is not clear if it really beats DE trained with hard negatives + separate ANNS index.
- Thank you for highlighting this. We would like to stress that the numbers the reviewer mentions are comparisons on the FIQA dataset - a smaller, toy dataset whose findings in the main paper are ablations of EHI to understand the effect of hard negatives and of the various components of the loss function, which hold water. We would kindly refer the reviewer to the experiments on MSMARCO, TREC DL, and NQ320K, where the baseline comparisons are against significantly larger models than EHI that are robust and representative of SOTA (see Table 1, Table 2, Table 5, Table 6, Table 7).
It is important to stress that both DSI [1] and NCI [2], our most closely related prior works, train an e2e model (albeit with some warm-starting), follow a similar paradigm and experimental workflow, and do not use global hard-negative strategies, which are complementary to the core message of our work. Furthermore, unlike NCI, which required warm-starting, query generation, a PAWA decoder, and many other additions, EHI achieves superior performance over the SOTA NCI model (~11x our model size) on the NQ320K dataset without these complementary techniques. Furthermore, on MSMARCO, we compare against various works which are considered SOTA, including ANCE [3] and TAS-B [4], which use hard negatives, and other late-interaction SOTA methods such as ColBERT [5]. We also compare against the SOTA DSI approach and showcase the efficacy of EHI over prior closely related e2e frameworks at this scale.
Note that ANNS performance is upper-bounded by exact search, and the exact-search metrics reported in our findings are the oracle metrics. Thus, training e2e not only leads to better performance at lower budgets, but also leads to improved exact-search performance, as showcased in our findings.
[1] Yi Tay et al. Transformer memory as a differentiable search index. [2] Yujing Wang et al. A neural corpus indexer for document retrieval. [3] Xiong, Lee, et al. Approximate nearest neighbor negative contrastive learning for dense text retrieval. [4] Nandan Thakur et al. Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models. [5] Khattab, Omar et al. Colbert: Efficient and effective passage search via contextualized late interaction over bert.
We want to thank the reviewer for their time, elaborate discussions, and insightful comments, improving the quality of the paper. We have polished our draft - adding comparisons against disjoint DE+ANNS training (w/ hard negatives), comparisons against ELIAS, and clarifying other aspects surrounding implementation details, negative mining, etc., as requested. We will also follow up with the requested changes of moving Table 1 and Table 2 to the Appendix and clarifying more aspects of the additional experiments the reviewer has graciously suggested. These have indeed been fruitful and helped strengthen our paper.
Since the author-reviewer discussion period is ending soon, we would appreciate an appropriate increase in the score that reflects the same and facilitates acceptance of the paper.
EDIT: As suggested by Reviewer 9k3r, we did re-write parts of the paper and moved Table 1 and Table 2 to appendix as suggested. We also added comparisons against the ELIAS benchmarks, and will further polish the paper to involve other discussions we have had with Reviewer PLmq in the final version. Thanks again for the elaborate discussions.
My point is that during finetuning (for results in Fig 2), you only use in-batch negatives to finetune the DE+ANNS baseline while EHI uses better negatives, and EHI resulting in better performance might be attributed to this. It would make for a stronger argument in favor of EHI if it showed improvement over the DE+ANNS baseline even when the DE is finetuned using hard negatives in Figure 2. There is some preliminary evidence that joint indexing is better than disjoint indexing in Fig 10 but, as I said previously, more thorough empirical investigation is needed in that figure to make a case for EHI. For instance, it might be possible that disjoint DE+ANNS training (w/ hard negatives) followed by disjoint ANNS at inference time might beat or be just as good as EHI.
- Thank you for your valuable insights. Please find below two reasons why we believe the concern you have raised ("it might be possible that disjoint DE+ANNS training (w/ hard negatives) followed by disjoint ANNS at inference time might beat or be just as good as EHI") has been resolved over our productive discussions; we will work on highlighting these points in the main paper of the final manuscript.
- Our findings attached in our Official Comment by Authors, where we showcase that models on MS MARCO trained with mined hard negatives achieve similar results to the fine-tuned MS MARCO model. Intuitively, we attribute this behavior to the fact that the model has converged. Thus, this showcases not only that the e2e training paradigm leads to statistically significant improvements over the baseline, but also that it shows significant improvements in comparison to off-the-shelf indexers at lower compute regimes. Comparisons against encoders which have not converged might not be optimal, and we believe this experiment and its findings help bolster EHI and the need for an e2e training paradigm.
- Our findings attached in our Response to Reviewer PLmq (Part 3), where we showcase the following experiment: given a model trained using dynamic hard-negative mining (similar to DyNNIBAL, NGAME, or EHI), the exact-search metrics are identical (MRR@10 of 33.8%). We note that even given the same SOTA architecture, training a disjoint ANNS at inference time does lead to significant performance drops in comparison to EHI when working at low compute budgets. We believe that the experiment suggested by the reviewer is a strong one, highlighting two important factors of this work: (1) training in a disjoint fashion leads to the encoder learning embeddings that are unaware of the downstream clustering task, and thus to a misalignment; this experiment re-confirms this motivation. (2) We showcase that, even given the same encoder, aligning the encoder with the ANNS through e2e training such as EHI does show statistically significant improvements, consistent with our motivation.
- We believe the above two experiments constitute the closest apples-to-apples comparison possible in this domain (same # parameters, training regime, no query generation, PAWA decoder, etc.). Any complementary hard-negative training or other techniques added to DE models can also be applied to EHI, and our findings reported in the above two experiments, as well as in the paper, would hold water.
The above points bolster the significant evidence that e2e training is helpful, and both experiments suggested by the anonymous reviewer helped improve the paper significantly.
The main contribution that this paper makes is that ANNS and DE models should be trained jointly as opposed to using a disjoint ANNS index. And I think it will benefit to focus on this aspect without focusing too much on whether you have SoTA numbers or not. You can keep your experiments with (distil-bert-cos-v5) to show that you also get SoTA numbers but this can be in the appendix.
- Thank you for your valuable insights. As suggested, we will stick to highlighting our comparisons against DE+ANNS approaches, which we concur are the main findings of our paper, and move the comparisons against SOTA architectures to the Appendix to avoid any confusion.
We thank the reviewer for their time, and their valuable suggestions over the past week. We believe that the added experiments suggested by the anonymous reviewer help address their concern which we will highlight and bring to the main paper in the final version (as this requires some re-writing due to page constraints). Furthermore, we will also move the comparison to SOTA tables to the Appendix in the final version as suggested. We would be very happy to discuss any further questions about the work, and would really appreciate an appropriate increase in the score if the reviewer’s concerns are adequately addressed to facilitate acceptance of the paper.
MS-MARCO contains 500K queries. Using 2ms per forward-pass (as reported by authors), even a forward pass to compute 500K query embeddings and also at least 500K positive items/document embeddings will take 2x500Kx2ms = 2000s ~= 33 minutes on a single GPU and at least 4 minutes when using 8 GPUs. And this is just the forward pass; the backward pass to compute gradients will take even longer. Please correct me if I am making a mistake here. Not sure how the authors achieve 2-3 minutes per epoch.
- MSMARCO contains 500K training pairs, not 500K unique queries (see Table 3). Note that batches are created at the query level, and any repetitions of a query across its various relevant documents are captured, so the 2x cost the reviewer adds over 500K is not accurate. We do observe per-epoch training times in the range of 2-3 minutes, as mentioned earlier.
The distil-bert-cos-v5 model already has MRR@10 of 33.75 on MS-MARCO as per numbers reported here. So does this mean that your finetuning does not lead to any improvement? Maybe I am missing something here? Also, should these be compared with 33.8 for EHI in Table 1? In that case, even EHI does not lead to improvement over the distil-bert-cos-v5 model.
- This is not the case. The improvements over distilbert-cos-v5 are also reported in Table 10 and Table 11, and we use the distilbert-cos-v5 encoder explicitly because it is representative of the SOTA architecture. These models are heavily converged, and improvements on exhaustive and competitive benchmarks are indeed challenging. We furthermore show that the improvements of EHI over such strong encoders, which have been trained with exhaustive hard-negative mining, are statistically significant (see Appendix E.3). This is precisely why we stress Figure 2, where comparisons of EHI against SOTA architectures such as distilbert-cos-v5 hold value.
Could you please respond to this comment from my previous post?
- Please refer to the final points of our prior response, which answer this very question. We broke our reasoning into two parts: 1) it is Figure 2 that needs to be stressed, as its baseline is representative of SOTA and was trained with hard negatives; 2) Table 1 and Table 2 aim to show the lay of the land and showcase that EHI is indeed competitive against various SOTA architectures while remaining significantly better than off-the-shelf indexers such as ScaNN, FAISS, etc.
In Figure 2, compare with DE trained with hard negatives and not just DE trained with in-batch negatives
- Thank you for the suggestion. However, note that in our comparisons in Figure 2, the baseline algorithms have been trained with hard negatives, and we are fine-tuning them. In particular, as stated in the experiments run and reported in our response, we notice little improvement, as the models are indeed converged.
Use bert, roberta, or distil-bert (or another generic pretrained model) instead of distil-bert-cos-v5, as distil-bert-cos-v5 has been heavily finetuned over MS-MARCO.
- Thank you for the suggestion. However, training over base bert or base roberta might not be fair, as we would not be able to compare against architectures representative of SOTA, as highlighted above. This is precisely why the Figure 2 results are of value: we showcase statistically significant improvements of EHI over heavily finetuned models such as distilbert, which is representative of SOTA, as the reviewer notes.
ANNS building time seems a little off for million-scale datasets, and given that the proposed approach uses GPUs, it is not fair to compare against CPU-based ANNS methods when these can be run over GPUs.
- This is precisely why reporting wall-clock time as a throughput metric is not preferred in the research community [1]. Metrics such as latency depend heavily on the type of machine (GPU, CPU, TPU), the language of the code, and various other factors. Thus, it is common to report the scale of the model (in terms of parameters), and especially in the efficiency-oriented IR literature, metrics such as the number of documents visited or FLOPS are the preferred metrics for comparison [2][3]. Note that this places all algorithms on a similar playing field and is the fair experiment to report, as highlighted above.
[1] Guo, Ruiqi, et al. "Accelerating large-scale inference with anisotropic vector quantization." International Conference on Machine Learning. PMLR, 2020. [2] Formal et al. (SIGIR 2021). From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective. [3] Lassance, Carlos, et al. "A Static Pruning Study on Sparse Neural Retrievers." arXiv preprint arXiv:2304.12702 (2023).
We thank the reviewer for their time and understand that they have a few remaining suggestions. However, we note that some of these choices were made explicitly to showcase certain properties of EHI and to ensure fair comparisons.
Thanks for your prompt response.
I understand the reason for using distil-bert-cos-v5 as the base DE model to initialize all models (both EHI and baseline DE+ANNS) and acknowledge that this is a strong baseline. My point is that during finetuning (for results in Fig 2), you only use in-batch negatives to finetune the DE+ANNS baseline while EHI uses better negatives, and EHI resulting in better performance might be attributed to this. It would make for a stronger argument in favor of EHI if it showed improvement over the DE+ANNS baseline even when the DE is finetuned using hard negatives in Figure 2. There is some preliminary evidence that joint indexing is better than disjoint indexing in Fig 10 but, as I said previously, more thorough empirical investigation is needed in that figure to make a case for EHI. For instance, it might be possible that disjoint DE+ANNS training (w/ hard negatives) followed by disjoint ANNS at inference time might beat or be just as good as EHI. I don't think we have strong empirical evidence in favor of EHI that it is really the approach that beats disjoint training approaches.
Another point is that while starting with a very strong encoder model (distil-bert-cos-v5) means that you can get SoTA numbers, these comparisons are not fair or useful in practice (because of the difference in encoders). The main contribution that this paper makes is that ANNS and DE models should be trained jointly as opposed to using a disjoint ANNS index. And I think it will benefit to focus on this aspect without focusing too much on whether you have SoTA numbers or not. You can keep your experiments with (distil-bert-cos-v5) to show that you also get SoTA numbers but this can be in the appendix.
We would like to thank the reviewer for the elaborate discussions and the insightful questions. We believe that we have addressed all the concerns raised by the reviewer over the last week during discussion. We have also updated our manuscript and revised them as suggested by the reviewer in their recent response.
Since the author-reviewer discussion period is ending very soon, we would appreciate an appropriate increase in the score that reflects the same and facilitates acceptance of the paper. Again, thank you very much.
The paper proposed a new model architecture for embedding-based retrieval, trying to resolve the issue that traditional product quantization leads to suboptimal retrieval performance due to its design. Overall the idea is original and well structured, and the math is relatively clear and sound. From the results presented by the authors, their method shows a decent margin over the baseline models. I have pointed out several minor issues and suggestions below.
Strengths
The main contributions / original ideas from my point of view are:
- e2e joint training of queries and documents that eliminates the product quantization step in the document embedding generation process
- the careful and solid loss function design based on the goal of EBR, with ablation study (I personally like this part most)
These two points are original and well founded, and resolve real issues that the baseline designs have. Theoretically the assumption also makes perfect sense, in that PQ is a compromise given the design and the problem being addressed, while a probability distribution over the paths of a tree should be superior.
The loss function design over the traditional dot product (and others) is a really good example of how an optimization goal should be constructed given a real-world problem, especially given the ablation study.
Weaknesses
Some suggestions from my perspective
- clear assumption - my first intuition after reading the paper is that the story it tries to deliver lacks a good assumption/description of the motivation. The work is an incremental work built on past models, and the authors claim that e2e training is better. However, the baseline models it compares with are also trained e2e; the pipeline is disjoint only because the inference process, i.e. the ANNS, requires document embeddings to be calculated first. I would phrase the advantage as recapturing the accuracy lost during ANNS, etc.
- hard-negative mining - the process is only mentioned in a few words, for "fast convergence" purposes. I feel this is a really important topic when we choose to do so, and it possibly causes some performance loss if not done correctly. More details about how the authors do it are necessary.
- lack of clarity in model details - it may be due to the page limit, but the model description is not clear enough. The softmax in the tree, the path embedding, the W_{h+1}, U_{h+1}, and p^{h+1} do not look that straightforward without more clarification or a diagram of the training and inference process, etc.
- hyperparameters - EHI has a lot of hyperparameters as well, even compared with the baseline models. How are they selected? The ones in the loss, in the model architecture, etc. would have a big impact if not chosen correctly, IMHO.
Questions
See the weaknesses section above.
We thank the reviewer for the kind comments and insightful questions. We are happy to hear that the reviewer finds the idea well founded, and ablations supporting the various design choices of EHI. Please refer below for the responses to specific questions.
clear assumption - my first intuition after reading the paper is that the story it tries to deliver lacks a good assumption/description of the motivation.
- Thank you for pointing this out. We note that modern retrieval systems have a disjoint learning process that often results in representations that are sub-optimal for downstream clustering. This is primarily because the encoder is not aware that the embeddings need to be clustered, which leads to a misalignment of representations. To overcome this limitation, end-to-end training is desired: it not only recaptures the accuracy lost during ANNS (supported by our findings in Figure 2 and Appendix E.8), but also leads to better learned representations (supported by our findings in Figure 2, Figure 3, and Appendix E). We will further highlight this in the introduction of our draft.
The work is an incremental work built on past models, and authors claim that e2e training is better.
- Please refer above for the motivation for why e2e training is preferred. We also showcase this with findings on various benchmarks, as highlighted in Figure 2 and Appendix E. We would like to point out that EHI is, to the best of our knowledge, the first work to solve the problem of training an encoder and ANNS in an end-to-end fashion without using warm-started k-NN models to facilitate learning the indexer. What we highlight in this work is an alternative training paradigm to do so, and we respectfully disagree with the statement that EHI is an incremental work built on past models.
hard-negative mining - More details about how the authors do it are necessary.
- The idea is that for a given positive document (d+) for a query (q), you would like to sample other documents that reach the same bucket as d+ (and thus share some similarities with d+) but are not considered relevant to q per the ground-truth data. These documents are treated as hard negatives for the query q (see the sketch below). Algorithm 1 gives a more formal description of this hard-negative mining, and we have added an explicit paragraph highlighting this in the main paper (see Appendix F.1). An ablation with and without hard-negative mining is provided in Figure 3c.
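A minimal sketch of this leaf-based hard-negative sampling is shown below; the dictionaries doc_to_leaf, leaf_to_docs, and relevant are hypothetical stand-ins for the indexer's bucket assignments and the ground-truth qrels, not the paper's actual data structures:

```python
import numpy as np

def mine_hard_negatives(query_id, pos_doc_id, doc_to_leaf, leaf_to_docs,
                        relevant, num_negatives=5, rng=None):
    """Sample hard negatives for one (query, positive document) pair:
    documents that landed in the same leaf as the positive document but
    are not marked relevant to the query."""
    rng = rng or np.random.default_rng()
    leaf = doc_to_leaf[pos_doc_id]
    candidates = [d for d in leaf_to_docs[leaf]
                  if d != pos_doc_id and d not in relevant[query_id]]
    if not candidates:
        return []
    size = min(num_negatives, len(candidates))
    return list(rng.choice(candidates, size=size, replace=False))

# Toy usage: query 0 is relevant only to "d0"; "d1" and "d2" share its leaf.
doc_to_leaf = {"d0": 3, "d1": 3, "d2": 3, "d3": 7}
leaf_to_docs = {3: ["d0", "d1", "d2"], 7: ["d3"]}
relevant = {0: {"d0"}}
print(mine_hard_negatives(0, "d0", doc_to_leaf, leaf_to_docs, relevant))
```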
lack of clarity in model details
- For a given query/document, the root node uses the embedding information alone, and we use a linear layer (W_1) to transform the embedding received from the encoder into another 768-dimensional embedding. Having these affine transformations at each height is essential, as we would like to focus on different aspects of the same query/document at each height when deciding which child node the item should be routed to. This is followed by another linear layer (U_1) that predicts logits for the children of the node (2 in this case). This, followed by the softmax in the tree, yields the probability distribution (p^1) deciding which node to go towards.
The logic at the following height follows roughly the same process as the root node, with one significant difference: the affine transformation at this height takes both the embedding and the one-hot version of the probabilities at the previous height as input when making its predictions. During training, we place a 1 at the index with the highest probability and 0 otherwise (for instance, if p^1 = [0.7, 0.3], its one-hot version is [1, 0]). Once the conditional probability at the next height is computed, we multiply it by p^1 to obtain the actual probability of reaching nodes 3 and 4 (these numberings follow the same logic as used in Figure 1). The final path embedding is the concatenation of these probability distributions across heights, as depicted in Figure 14.
During inference, the indexer follows a similar setup where, instead of doing only one forward pass with the highest probability, we do b forward passes at a given height with the top-b highest probabilities. We have added a separate section in Appendix F describing the details of the EHI indexer further with the help of figures, as requested. A minimal sketch of this forward pass is provided below.
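Here is a minimal numpy sketch of that forward pass for a height-2 indexer with branching factor 2; the parameter names (W1, U1, W2, U2) mirror the description above, but the shapes, random initialisation, and greedy (beam size 1) routing are illustrative assumptions, not the exact implementation:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def path_embedding(x, params, branch=2):
    """Forward pass of a toy height-2 EHI-style indexer on an encoded
    768-d query/document embedding `x`, returning the dense path embedding."""
    W1, U1, W2, U2 = params["W1"], params["U1"], params["W2"], params["U2"]

    # Root: affine transform of the embedding, then a softmax over children.
    p1 = softmax(U1 @ (W1 @ x))                       # P(child | root)

    # Height 2: the transform also sees a one-hot of the previous decision.
    one_hot = np.eye(branch)[np.argmax(p1)]
    p2_cond = softmax(U2 @ (W2 @ np.concatenate([x, one_hot])))
    p2 = p1[np.argmax(p1)] * p2_cond                  # joint leaf probability

    return np.concatenate([p1, p2])                   # concatenated path embedding

rng = np.random.default_rng(0)
d, b = 768, 2
params = {"W1": 0.01 * rng.standard_normal((d, d)),
          "U1": 0.01 * rng.standard_normal((b, d)),
          "W2": 0.01 * rng.standard_normal((d, d + b)),
          "U2": 0.01 * rng.standard_normal((b, d))}
print(path_embedding(rng.standard_normal(d), params))  # 4 values: [p^1, p^2]
```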
hyperparameters - how are they selected?
- The most essential hyperparameters, per our findings, were the weights of the individual loss terms and the learning rate. Tuning involved selecting a held-out subset of the training dataset and systematically exploring different combinations of hyperparameter values to find those that perform best on this held-out set. Hyperparameters of the model architecture, such as height, branching factor, etc., are indeed left to the practitioner's requirements, as we notice roughly similar performance across the various ablations depicted in Table 13.
We hope that the rebuttal clarifies the questions raised by the reviewer. We would be very happy to discuss any further questions about the work, and would really appreciate an appropriate increase in the score if the reviewer’s concerns are adequately addressed to facilitate acceptance of the paper.
I have read all the comments and other reviewers opinions and decide to keep my scores unchanged.
This paper introduces EHI for efficient dense retrieval in semantic search. The traditional two-stage process of dense embedding-based retrieval, involving contrastive learning for embedding and approximate nearest neighbor search for retrieval, can lead to suboptimal performance due to misalignment of representations and a lack of consideration for the query distribution. EHI addresses these issues by jointly learning the encoder and the ANNS structure, utilizing a dense path embedding to capture the position of a query or document in the tree. Experimental results demonstrate that EHI outperforms existing state-of-the-art (SOTA) methods on various benchmarks, showcasing significant improvements in retrieval accuracy with a reduced computational budget. Key contributions include the proposal of EHI as the first end-to-end learning method for dense retrieval, offering a paradigm shift by integrating the encoder and ANNS in a single pipeline. The paper also compares EHI with other techniques in the context of representation learning, approximate nearest neighbor search, and encoder-decoder models for semantic search.
Strengths
The paper introduces a fresh approach which cleverly tackles the limitations of traditional retrieval methods. Within a fairly simple framework, it utilizes a dense path embedding, making it efficient at keeping track of where information sits in the search structure. The experiments show that it outperforms the other methods, proving that it is effective even under a limited compute budget. By proposing the first end-to-end learning method of this kind, it changes how this kind of search is done. The paper does a good job comparing the approach with other methods, showing why it is a better choice for semantic search and retrieval.
The paper presents the performance metrics evaluated on the MS MARCO dev dataset and the NQ320k dataset, demonstrating that EHI performs competitively against various other methods. It achieves competitive results such as an MRR@10 of 33% and 39% and an R@10 of 89% and 96% on the respective datasets. The results also highlight the superiority of EHI over methods like BM25, DyNNIBAL, and DeepCT.
Weaknesses
While the paper presents a comprehensive comparison with several state-of-the-art methods, a potential limitation lies in the absence of a comparison with some of the best-performing ranking models, which could provide a more complete understanding of EHI's relative performance. Additionally, the paper acknowledges the current lack of a deeper theoretical understanding of path embeddings, potentially limiting the in-depth comprehension of the model's inner workings. Furthermore, although scalability is briefly discussed, a more detailed analysis of EHI's performance on extremely large-scale datasets would strengthen its practical applicability. The paper suggests future research directions for enhancing the generalization of tail queries, yet it does not extensively elaborate on how EHI specifically addresses this challenge, warranting a more in-depth exploration of this aspect.
Questions
In terms of previous works, how do you compare EHI to some of the matrix factorization methods that learn query and document representations simultaneously [1, 2, 4]? And can you apply this method to cross-lingual retrieval as well [3]?
Overall, a well written and enjoyable paper
[1] Zamani, Hamed, et al. "Pseudo-relevance feedback based on matrix factorization." Proceedings of the 25th ACM international on conference on information and knowledge management. 2016. [2] Kim, Donghyun, et al. "Convolutional matrix factorization for document context-aware recommendation." Proceedings of the 10th ACM conference on recommender systems. 2016. [3] Dadashkarimi, Javid, et al. "An expectation-maximization algorithm for query translation based on pseudo-relevant documents." Information Processing & Management 53.2 (2017): 371-387. [4] Gao, Yang, et al. "Jointly learning topics in sentence embedding for document summarization." IEEE Transactions on Knowledge and Data Engineering 32.4 (2019): 688-699
We thank the reviewer for the kind comments and insightful questions. We are happy to hear that the reviewer found our paper enjoyable to read. Please refer below for the responses to specific questions.
absence of a comparison with some of the best-performing ranking models.
- Our work predominantly focuses on the task of retrieval for a given compute budget, and an off-the-shelf addition of ranking models could further improve the MRR metrics of our models. Note that comparing against ranking models would be unfair in the current setup, as EHI uses no ranking model other than a simple inner-product similarity heuristic. Thus, EHI is complementary to ranking models.
future research directions for enhancing the generalization of tail queries, yet it does not extensively elaborate on how EHI specifically addresses this challenge.
- In the concluding paragraph, where we talk about future works, we mention using other techniques, such as RGD [1], to enhance the generalization of tail queries as a potential future direction. EHI does not specifically address this challenge, and we are working on this extension.
[1] Kumar, Ramnath, Kushal Majmundar, Dheeraj Nagaraj, and Arun Sai Suggala. "Stochastic Re-weighted Gradient Descent via Distributionally Robust Optimization." arXiv preprint arXiv:2306.09222 (2023).
In terms of previous works how do you compare EHI to some of the matrix factorization methods that learn query and document simultaneously?
- [3] uses matrix factorization for document summarization and is a tangential work, as our task at hand is retrieval. Works such as [1, 2], although focusing on retrieval, primarily learn the first stage of the retrieval process (embeddings of queries/documents) and do not build an approximate nearest neighbor search (ANNS) structure, which is primarily used to improve efficiency at test time. Furthermore, approaches such as matrix factorization face the cold-start problem, where the addition of new documents or queries would require re-training. EHI alleviates this limitation since we compute dense embeddings through a deep neural network, so new queries or documents can easily be incorporated, which is crucial for the practical usability of these methods. EHI primarily focuses on unifying the training paradigm of the encoder and the ANNS, and is thus different from these matrix factorization approaches.
[1] Zamani, Hamed, et al. "Pseudo-relevance feedback based on matrix factorization." Proceedings of the 25th ACM international on conference on information and knowledge management. 2016. [2] Kim, Donghyun, et al. "Convolutional matrix factorization for document context-aware recommendation." Proceedings of the 10th ACM conference on recommender systems. 2016. [3] Gao, Yang, et al. "Jointly learning topics in sentence embedding for document summarization." IEEE Transactions on Knowledge and Data Engineering 32.4 (2019): 688-699
can you apply this method on cross lingual retrieval as well?
- Currently, we have predominantly focused on monolingual benchmarks. However, extending to cross-lingual retrieval should be reasonably straightforward for two-tower models, and this could be an exciting future direction.
We hope that this rebuttal further solidifies your positive outlook on the paper and we are happy to discuss if you need any further clarifications to facilitate acceptance of the paper.
We thank the reviewers for their time and effort in reviewing our work. We are glad all the reviewers agree that the paper is exciting and the problem is indeed important.
- We have made minor edits, adding more details on some aspects of negative mining, the indexer, and the path embedding, as suggested by Reviewer vMMh (Appendix F).
- Over our elaborate discussions with Reviewer PLmq, we are glad that their concerns have for the most part been resolved, and we have updated the draft as mentioned in our response. The experiments requested by the reviewer, including comparisons against ELIAS and comparisons with stand-alone indexers where the EHI encoder is fixed, further strengthen our paper.
- Finally, as suggested by Reviewer 9k3r, we have added all the comparisons and acknowledged several SOTA works in the related work and results sections, which remained the primary concern about our work. We will further polish and expand on these comparisons in the final version, as this requires significant re-writing to adhere to the page limit at the moment.
Since the author-reviewer discussion period is ending very soon, we would appreciate any engagement with our responses and an increase in scores that reflects the same, to facilitate acceptance of the paper. Again, thank you very much.
This paper focuses on the task of learning dense embeddings for queries and documents/items jointly with an approximate nearest neighbor search (ANNS) index. It uses a standard dual encoder model for embedding queries and documents while learning an inverted file index (IVF)-style tree structure for efficient ANNS. The reviewers generally think that the idea of joint learning is interesting, that the motivation is clear, and that the work makes valuable contributions to an important problem. They also engaged extensively in discussions with the authors during the rebuttal period and provided many comments. There are still major concerns after the discussion, which mainly lie in the experiments and in the discussion of and comparison with related work. The paper could be significantly improved by taking into account all reviews and comments from this submission and thoroughly discussing all related work (and, if not comparing with it, explaining why).
Why not a higher score
Please see the weaknesses summarized above.
Why not a lower score
N/A
Reject