Relevance-Based Embeddings for Efficient Relevance Retrieval
The idea is to describe each query (item) by its relevance for a set of support items (queries) and use these new representations to obtain query (item) embeddings.
Abstract
Reviews and Discussion
The paper proposes to distill an expensive neural model into multiple query-side and item-side embeddings, or "relevance embeddings", utilizing a retrieve-and-rerank setup. The authors show that the proposed neural similarity model has good approximation properties (when relying on the universal approximation properties of the MLPs used in (2)). The authors additionally conduct experiments showing how to improve embedding selection on top of a baseline selection strategy in (Morozov & Babenko, 2019; Yadav et al., 2022).
Strengths
- Constructing relevance function and its approximations is an important topic and applicable to multiple areas like search, language modeling and recommendations.
- The authors demonstrate improvements vs support embedding selection strategies used in prior work, i.e., AnnCUR (Morozov & Babenko, 2019; Yadav et al., 2022).
- The experiments were conducted on diverse datasets in entity linkage, question answering, and recommendation systems.
Weaknesses
W1. The concept of "relevance-based embedding" seems identical to the multi-embedding/multi-vector line of research done in prior work (e.g., MIND, ColBERT, Multi-CLS BERT), which the submitted paper neither compared with nor even cited. Moreover, the paper's main conclusion reached in Section 3 seems identical to that of multiple recent works based on multi-embeddings. A few relevant examples include:
- (RecSys) Li et al. Multi-Interest Network with Dynamic Routing for Recommendation at Tmall. 2019. Cen et al. Controllable Multi-Interest Framework for Recommendation. KDD'20. Ding et al. Efficient Retrieval with Learned Similarities. 2024.
- (NLP) Khattab et al. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR'20. Santhanam et al. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. NAACL'22. Chang et al. Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling. ACL'23. Dhulipala et al. MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings. 2024.
- The paper's main theoretical result in Section 3 - "in any relevance function can be well estimated using only such individual vectors" (based on multiple embeddings) - is in principle similar to the conclusion reached by Ding et al. in Efficient Retrieval with Learned Similarities. 2024, which also proves that multi-embeddings can be used to approximate arbitrary relevance functions, albeit with a different proof construction and a slightly different neural architecture. Additionally, Dhulipala et al. MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings. 2024 also shows that for a widely used relevance function in ColBERT, a single very high-dimensional vector can be constructed to approximate the relevance function, and should also be discussed as related work.
W2. Evaluations in the paper are performed against non-standard and/or relatively weak baselines.
- ZESHEL (5 out of 9 scenarios reported) is not commonly used in relevance retrieval scenarios for either RecSys or NLP (see example papers above).
- RecSys (3 out of 9 scenarios reported) are a) based on proprietary, non-reproducible datasets, and b) the teacher model (gradient boosting model) is not considered state-of-the-art for recommendation systems.
- Published recommendation literature over the last 1-2 years has used cross encoders (acknowledged by authors in intro L47-48), deep stacked networks (e.g., Wang et al. HoME: Hierarchy of Multi-Gate Experts for Multi-Task Learning at Kuaishou. 2024), and/or generative approaches (e.g., Zhuo et al. Learning Optimal Tree Models Under Beam Search. ICML'20, Gao et al. Deep Retrieval: Learning A Retrievable Structure for Large-Scale Recommendations. 2020).
- Additionally the experiment setup used goes against authors' own claim in abstract "... The relevance function is usually an expensive neural similarity model"
- QuestionAnswering (last 1 of the 9 scenarios) - the paper similarly failed to compare with recent state-of-the-art approaches on MS MARCO, including Cao et al. Autoregressive Entity Retrieval. ICLR'21, Ni et al. Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models. ACL'22, Ni et al. Large Dual Encoders Are Generalizable Retrievers. EMNLP'22, Wang et al. Neural Corpus Indexer for Document Retrieval. NeurIPS'22, Sun et al. Learning to Tokenize for Generative Retrieval. NeurIPS'23, etc. In fact, the baseline the authors used is from 2019.
Overall, more solid experiments on standard datasets are needed to justify whether the baselines used are still relevant and to contextualize the contributions in this paper.
W3. Random choice is a weak baseline for multi-embeddings given that most prior work in this area already utilizes co-trained query- and item- vectors. See references in W1.
W4. To achieve universal approximation property of MLPs (used in Section 3), we need a very wide hidden layer. More experiments are needed to justify how to trade off quality vs efficiency of the proposed approach for retrieval in practical scenarios, against standard baselines.
Questions
My main suggestion is to a) properly contextualize this work w.r.t. prior work on multi-embedding / multi-vector retrieval (W1), and b) conduct experiments against recent baselines on more commonly used benchmark datasets (W2) and in particular add efficiency experiments (W4). Happy to revise the score if a) and b) are addressed sufficiently.
RecSys (3 out of 9 scenarios reported) are a) based on proprietary, non-reproducible datasets
As we write in line 377, all these datasets will be published after the legal approval.
b) the teacher model (gradient boosting model) is not considered state-of-the-art for recommendation systems.
We note that GBDT models are known to be state-of-the-art (or at least competitive) for tabular data, see, e.g., McElfresh D. et al. “When do neural nets outperform boosted trees on tabular data?” NeurIPS 2023. We consider heavy rankers that are currently used in real recommender systems. Here the features are indeed tabular (the mixture of category, textual and numerical features) and the model we consider is empirically shown to achieve the best results among many competitors. Moreover, this GBDT model also uses neural network scores as its features and thus is quite strong.
Published recommendation literature over the last 1-2 years has used cross encoders (acknowledged by authors in intro L47-48), deep stacked networks (e.g., Wang et al. HoME: Hierarchy of Multi-Gate Experts for Multi-Task Learning at Kuaishou. 2024), and/or generative approaches (e.g., Zhuo et al. Learning Optimal Tree Models Under Beam Search. ICML'20, Gao et al. Deep Retrieval: Learning A Retrievable Structure for Large-Scale Recommendations. 2020).
Could you please clarify how these works weaken the contribution of our study? It is not clear from the comment. Note that in our paper, we do not aim at inventing new CE approaches but use an arbitrary CE instead.
Additionally the experiment setup used goes against authors' own claim in abstract "... The relevance function is usually an expensive neural similarity model"
Note that 6 out of 9 datasets use neural networks as similarity models. In the remaining 3 RecSys datasets we use GBDT models that aggregate many features, including heavy neural-network-based rankers as features. Thus, we cannot agree that the claim in the abstract is not correct.
QuestionAnswering (last 1 of the 9 scenarios) - the paper similarly failed to compare with recent state-of-the-art approaches on MS MARCO
As discussed above, our approach is orthogonal to the developments of complex CE and DE approaches. Our main idea is that we can use an arbitrarily powerful CE and use its relevances for effective and efficient retrieval.
Overall, more solid experiments on standard datasets are needed
Could you please specify which datasets you are referring to in this comment? We believe that the list of datasets we use in our study is quite extensive (we evaluated our approach on a larger list of datasets compared to previous related research).
W3. Random choice is a weak baseline for multi-embeddings given that most prior work in this area already utilizes co-trained query- and item- vectors. See references in W1.
As discussed above, we note that our study does not use multi-embeddings.
W4. To achieve universal approximation property of MLPs (used in Section 3), we need a very wide hidden layer. More experiments are needed to justify how to trade off quality vs efficiency of the proposed approach for retrieval in practical scenarios, against standard baselines.
In our case, 50K parameters are empirically shown to be enough, which is still several orders of magnitude smaller than any DE used in our experiments.
Thank you for a detailed review and suggested references.
Let us note that the mentioned works use a different setup compared to our work and thus we do not use them as baselines in our paper.
Recall that our setup is the following. We assume that there is a heavy ranker (cross encoder, CE) that provides relevances for query-item pairs. This heavy ranker can be an arbitrary complex model and is a standard ingredient of a recommender system. Due to the dataset size, CE cannot be computed for all user-item pairs. A standard approach then is to approximate CE with a dual encoder (DE) that embeds queries and items to a space where relevances can be (approximately) computed via, e.g., the dot product. The top items retrieved via the approximate relevance are then re-ranked by CE. We improve over this standard approach (and simplify it) by creating more efficient representations of queries and items (via support items and support queries, respectively) that do not require training a (potentially complex) DE model (RBE is several orders of magnitude smaller than DE in the number of trainable parameters). Thus, we propose to simplify the Retrieve-and-Rerank scheme using the CE model that must be trained anyway.
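To make this pipeline concrete, here is a minimal sketch in Python/NumPy. The helper names (`cross_encoder`, `query_adapter`, `item_adapter`) and the plain dot-product scoring are illustrative assumptions for exposition, not the exact implementation from the paper.

```python
import numpy as np

def precompute_item_embeddings(items, support_queries, cross_encoder, item_adapter):
    # Query-independent: each item is described by its CE relevances to the fixed
    # support queries and mapped to an embedding by a small adapter (done once, offline).
    rel = np.array([[cross_encoder(q, it) for q in support_queries] for it in items])
    return np.array([item_adapter(v) for v in rel])             # (n_items, d)

def rbe_retrieve_and_rerank(query, items, item_emb, support_items,
                            cross_encoder, query_adapter, rerank_budget):
    # Online: |S_I| CE calls produce the query's relevance vector and embedding.
    query_rel = np.array([cross_encoder(query, it) for it in support_items])
    query_emb = query_adapter(query_rel)                        # (d,)

    # Approximate relevance is a single dot product per item; take a shortlist.
    shortlist = np.argsort(-(item_emb @ query_emb))[:rerank_budget]

    # Spend the remaining CE budget re-ranking the shortlist with the heavy model.
    return sorted(shortlist, key=lambda i: cross_encoder(query, items[i]), reverse=True)
```

The key point of the sketch is that only the (already trained) CE and two small adapters are needed; no separate DE model is trained.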
Let us now go through the suggested references.
- Li et al. (2019) propose “a multi-interest extractor” that complicates the scheme with auxiliary steps and structures (see, e.g., Figure 7, while we use the given model from the ranking service and a small (50K parameters) adapter without auxiliary entities).
- Cen et al. (2020) cite the above paper and also propose a “multi-interest extraction module”.
- Ding et al. (2024) propose to embed queries and documents with multiple trainable embeddings (Figure 1). We instead assume that we are given a powerful cross-encoder and use only its scores to embed queries and items.
To sum up, these three papers propose to complicate the retrieval stage by obtaining multiple trainable representations. In contrast, our approach is based on the fact that there is a heavy, powerful ranker, which is a part of every production system; this ranker can be arbitrary, and we use only this model to embed queries and items.
- Khattab et al. (2020), see Figure 2, propose the late interaction (d) method. In contrast, both DE and RBE are Representation-based Similarity (a) methods (according to the authors’ classification). In particular, to get a list of candidates for the final re-ranking, our method can utilize any near neighbor search method.
- Santhanam et al. (2022) is another variant of the late interaction approach.
- Dhulipala et al. (2024) is a follow-up of the two works above that also complicates the standard DE-based scheme, while we aim at simplifying it.
- Chang et al. (2023): this paper seems to be misplaced. Could you please clarify which part of this work is relevant to our study?
We can add some of the above references to the related work, where we discuss different types of approaches to relevance-based search. However, we do not compare with them since our setup is very different. In this regard, we follow previous studies and experimental setup used in a series of recent papers by Yadav et al. (but we extend the list of datasets used to further support our conclusions).
The paper's main theoretical result in Section 3 - "in any relevance function can be well estimated using only such individual vectors" (based on multiple embeddings) is in principle similar to the conclusion reached by Ding et al. in Efficient Retrieval with Learned Similarities. 2024, which also proves that multi-embeddings can be used to approximate arbitrary relevance functions, albeit with different proof construction and a slightly different neural architecture.
Our approaches, assumptions and theoretical results are very different. Note that we do not use multiple embeddings, while Ding et al. (2024) do not use CE-based representations. Could you please give us more detail about which part of the analysis you find similar? Then we will be able to answer the question in more detail.
Additionally, Dhulipala et al. MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings. 2024 also shows for a widely used relevance function in COLBERT, a single very high-dimensional vector can be constructed to approximate the relevance function, and should also be discussed as related work.
A similar comment as above applies to this work: the theoretical result concerns a different function with a different input so we cannot directly relate this result to our work.
ZESHEL (5 out of 9 scenarios reported) is not commonly used in relevance retrieval scenarios for either RecSys or NLP
Note that the two works most relevant to our study, i.e., AnnCUR (Yadav, EMNLP 2022) and AXN (Yadav, ICLR 2024) use these five datasets, so we follow the same setup (with some additional datasets).
- RBE vs the multi-embedding line of work (MIND, ColBERT, + many discussed in the initial review): RBE first computes a) the distances between query embedding and multiple item embeddings and b) the distances between item embeddings and multiple query embeddings, so in some sense RBE computes 2xN distances whereas prior work computes NxN distances (assuming equal number of multi-embeddings on the query- and item- side), and RBE utilizes a different similarity function vs prior work. I do agree that the selection of fixed support embeddings can make multi-embedding like techniques more efficient, and the paper's approach from a distillation point of view has novel ideas, but multi-embedding literature still seems like a highly relevant line of work (main difference being 2N vs NxN, different ways to combine distances, etc.) that's completely missed by the current paper.
- Experiments: the only commonly used publicly available dataset across the 9 reported is MS MARCO (the other datasets are not commonly used and can be hard to contextualize/interpret for a general academic audience). Here ColBERTv2 (2021) reported 86.8 R@50 on the dev set whereas AnnCUR achieves .5522 @100 and RBE .6022 @100 (from Table 2 in the paper). Could you explain this gap? Or what would be the results on MS MARCO if we were to use ColBERTv2 (or more recent generative retrieval approaches) as the teacher model for RBE?
- Missing comparison with distillation-based approaches: Reading through the rebuttal and the paper a bit more, I feel the paper's focus is perhaps more on distillation in the retrieval stage, in which case comparison with alternatives TwinBERT (CIKM'20), Trans-Encoder (ICLR'22) etc might be needed. It is also well-known in IR that distillation by itself boosts retrieval stage's performance (see eg ColBERTv2 paper, Table 4). So having direct comparison with CE->DE distillation is very beneficial for readers to understand RBE's contribution formulation wise.
Thank you for your involvement in the discussion!
RBE vs the multi-embedding line of work (MIND, ColBERT, + many discussed in the initial review): RBE first computes a) the distances between query embedding and multiple item embeddings and b) the distances between item embeddings and multiple query embeddings, so in some sense RBE computes 2xN distances whereas prior work computes NxN distances (assuming equal number of multi-embeddings on the query- and item- side), and RBE utilizes a different similarity function vs prior work. I do agree that the selection of fixed support embeddings can make multi-embedding like techniques more efficient, and the paper's approach from a distillation point of view has novel ideas, but multi-embedding literature still seems like a highly relevant line of work (main difference being 2N vs NxN, different ways to combine distances, etc.) that's completely missed by the current paper.
Could you please clarify what you mean by “distance” and “N”? We want to address this concern properly, but we are not sure that we understand it correctly due to terminology.
If by distance you mean the distance between a query and item embeddings, then for RBE only one dot-product is required for relevance computation for every query-item pair.
If by distance you mean CE calls, then RBE requires |S_I| CE calls to produce the query embedding, whereas item embeddings are pre-computed since they do not depend on a given query. Also, if by N you mean |S_I| (or |S_Q|), then even for N = 100 (as in our case) the difference between 2N and NxN CE calls is significant for practical applications.
Also, let us note that a fundamental difference between complex late interactions and split models (like DE or RBE) is that in the latter case one can build an HNSW index (or any other NNS index) and search for the (approximate) nearest items in O(log n) distance calls, where n is the number of items. Importantly, RBE allows for directly using this approach, while more complex methods would require additional developments.
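As an illustration of this last point, the precomputed RBE item embeddings can be indexed with any off-the-shelf NNS library. The snippet below uses `hnswlib` with inner-product similarity purely as an example; the random stand-in embeddings, the index parameters, and the shortlist size are illustrative assumptions, not values prescribed by our method.

```python
import numpy as np
import hnswlib

# Stand-ins for precomputed RBE embeddings (random here only to make the snippet run).
item_embeddings = np.random.rand(10000, 128).astype(np.float32)   # (n_items, d)
query_embedding = np.random.rand(128).astype(np.float32)          # built from |S_I| CE calls

n_items, d = item_embeddings.shape
index = hnswlib.Index(space='ip', dim=d)              # inner-product similarity
index.init_index(max_elements=n_items, ef_construction=200, M=16)
index.add_items(item_embeddings, np.arange(n_items))
index.set_ef(100)                                      # query-time accuracy/speed trade-off

labels, _ = index.knn_query(query_embedding, k=100)    # shortlist for the final CE re-ranking
```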
Please, let us know if we misunderstood the comment, we are happy to discuss further.
At this point, although there is indeed some similarity between our work and multi-embedding approaches, we continue to think the relation to these papers is somewhat similar to the relation to other papers dealing with candidate generation for recommendation and search engines. As we wrote, we plan to extend our related work section with a broader discussion.
Experiments: the only commonly used publicly available dataset across the 9 reported is MS MARCO (the other datasets are not commonly used and can be hard to contextualize/interpret for a general academic audience). Here ColBERTv2 (2021) reported 86.8 R@50 on the dev set whereas AnnCUR achieves .5522 @100 and RBE .6022 @100 (from Table 2 in the paper). Could you explain this gap? Or what would be the results on MS MARCO if we were to use ColBERTv2 (or more recent generative retrieval approaches) as the teacher model for RBE?
Our work improves the stage of generating candidate items for further re-ranking by a heavy CE model. Thus, the quality of all the approaches (DE/RBE/AXN) is measured by how accurately they find top documents according to the relevance computed via the CE model, as we write in the definition of HitRate. Therefore, our number cannot be compared to those mentioned in the question.
In the paper, we mostly focus on the RecSys datasets and added this QA dataset for the purpose of data-type diversity. Also, let us note that on the new RecSys datasets we are competing with strong dual encoders actually used in production systems, while on the remaining datasets our main competitor is AnnCUR since the comparison of AnnCUR with dual encoders was already conducted in previous studies.
Missing comparison with distillation-based approaches: Reading through the rebuttal and the paper a bit more, I feel the paper's focus is perhaps more on distillation in the retrieval stage, in which case comparison with alternatives TwinBERT (CIKM'20), Trans-Encoder (ICLR'22) etc might be needed. It is also well-known in IR that distillation by itself boosts retrieval stage's performance (see eg ColBERTv2 paper, Table 4). So having direct comparison with CE->DE distillation is very beneficial for readers to understand RBE's contribution formulation wise.
Indeed, our approach is similar to the distillation of CE->DE with the idea that it is redundant to teach two models and that it is possible and useful to try to combine these models. However, our approach has significant differences that do not allow us to call it distillation:
CE->DE distillation means that there are still two heavy (comparable in orders of magnitude of trainable parameters) models and usually one cannot distill into a small model without a significant quality decrease. However, in our approach, RBE has ~50K trainable parameters, whereas DE (RecSys/RecSysLT) has ~300M and CE has more than 1B (since it aggregates multiple DEs with different targets and architectures).
CE->DE distillation means that there are still two different model architectures - it may not seem to be a big problem for text QA models, but for recommendations with a wide variety of image/textual/statistical/sequential features it is a nightmare: the most useful features for CE are features about query and item interactions (e.g., how long a user Q played a game E, how long they played games with the same genre as E, etc.), but such features are unavailable for DE (since they are split models). Our approach has no such problems since we basically call the same CE model and use its scores to approximate the other CE scores.
Simply put, in distillation, CE and DE interact only at the learning stage, whereas in RBE they exchange knowledge during run time.
The focus on CE as the golden standard for the retrieval stage actually represents another flaw in the evaluation (and maybe formulation) of the paper - it is well known that a complex neural ranking architecture (e.g., CE) does not necessarily improve performance of the retrieval stage. See, for example, Rendle et al.'s classical work "Neural Collaborative Filtering vs. Matrix Factorization Revisited."
Given this, comparing relative performance on the datasets is often not meaningful by itself, and I strongly suggest the authors conduct experiments on commonly used datasets (e.g., MS MARCO) with standard baselines (e.g., DPR for DE, ColBERTv2 for late-interaction multi-embedding models, NCI etc. for generative retrieval) to enable the community to contextualize and interpret the results.
The rest of the points I made in the initial review stand too, but I would appreciate it if the authors could improve the experiments first to make the discussion more meaningful.
Experimental setup
Regarding our experimental setup, we would like to note the following:
- In the paper, we assume the standard pipeline typically used in practice: we are given a large set of items, then the first stage (candidate selection) extracts P items, and then the final ranking of these P items returns the best T of them. In our setup, we assume that the final ranking is performed by some (arbitrary) CE. The RBE approach affects the candidate selection stage. Let us explain why it is meaningful to evaluate candidate selection approaches by how well they approximate CE. Assume that RBE finds some “good” item that is (mistakenly) not so good according to CE because CE is not ideal. Then, the final re-ranking obtained by CE will not put this item in the list of top-T items since CE would score other items higher.
- In fact, both production services that provide the data for the RecSys datasets use a similar experimental setup to evaluate candidate selection approaches for a given CE.
- Our experimental setup directly follows the pipeline used by Yadav et al. (2022). A similar setup is also used by Morozov & Babenko (2019). It is quite typical to evaluate candidate selection approaches by how well they approximate the final ranking model. Another example where “the effectiveness of the first cascade” is measured as its agreement with a heavy ranker is Wang et al. "Fast first-phase candidate generation for cascading rankers," SIGIR 2016.
- Evaluating the final performance is indeed essential when one proposes a new heavy ranker CE itself, but in our setup, we assume that CE is strong, arbitrary, and fixed.
Other raised points
We would be grateful if you could clarify some of the points raised in the original review (see our questions regarding the relation to the mentioned theoretical results in our first reply and regarding the comparison with multi-embeddings in our second reply).
Thank you!
Could you please clarify what you mean by “distance” and “N”?
RBE computes the distance from the query embedding to N support item embeddings and the distance from the item embedding to N support query embeddings.
Please reread some of the multi-embedding papers I referred to in the initial review. The difference between them and RBE is a) RBE improves efficiency by using fixed support items/queries, b) RBE uses arbitrary neural functions to transform the obtained distances, followed by an inner product, whereas prior work used fixed or learned neural architectures.
I would also like to confirm that I've read the authors' rebuttal, and would like to keep ratings as is.
This paper explores relevance-based embeddings (RBE), utilizing support items and support queries to construct relevance vectors upon which the embeddings are trained. The paper provides a rigorous proof showing that the functions used to transform the relevance vectors can be arbitrary under certain conditions. Building on the previous method of AnnCUR, the paper proposes some heuristics for selecting support items and queries. Experiments show that the proposed approach can improve model performance compared to the dual encoder method.
Strengths
- A generalization of AnnCUR's approach to using support items has been introduced.
- Heuristics are provided to select support items. Their effectiveness has been validated by experimental results.
- Detailed explanation of the method and the experiments.
Weaknesses
- The motivation behind CUR decomposition is to enhance performance while maintaining the model's efficiency as much as possible. Therefore, providing the actual runtime of the relevant models in the experiments is essential. Currently, only some analysis is included.
- Presentation: the method descriptions in Table 1 and separating the l2-greedy approach from other methods may lead to misunderstandings.
- Some experimental settings need to be explained more clearly, for example, why the AXN model was chosen, as well as how the dimensions in θI were set, etc. (though these functions are proven to be arbitrary).
Questions
- What is the time consumed by each model in Tables 3, 4, 5, and 6? The RBE method requires |S_I| CE calls for each inference, which may represent a significant overhead compared to other methods.
- I do not quite understand the statement “DE uses |S_I| more CE calls for the re-ranking” in line 470. Does the Dual Encoder actually use CE calls? Also, how many CE calls does AXNDE make? Why adding 100 items is considered a fair comparison? Will these additional 100 items be re-ranked using the cross-encoder model? If it's simply expanding the evaluation range, it seems that the choice of 100 as the number is not reasonable.
- Why does Table 1 use the QA.small dataset while Table 2 uses the QA dataset?
Thank you for your feedback! We address the questions and concerns below.
providing the actual runtime of the relevant models in the experiments is essential. Currently, only some analysis is included.
In our setup, we assume that the most time-consuming operation is the cross-encoder (CE) computation which is the case for real production systems, including CE and DE used within RecSys/RecSysLT/RecSys2 datasets. The computation of CE is time-consuming since the model itself is complex and also some features used by CE may require additional computations (e.g., some of them are neural-network-based or even DE scores itself). Thus, in each of the experiments, the number of CE computations is fixed for all the models (and is specified in the text). We think that measuring complexity in the number of CE computations is an objective way to compare the methods in our setup. It's worth noting that the neural model used for RBE is significantly cheaper in terms of the number of trainable parameters compared with DE and thus for a fixed number of CE calls RBE-based approach cannot be slower than the DE-based. We do not report actual running time since it may depend on various factors not related to the approach itself (like hardware setup, code optimization, near neighbors search algorithm, etc.). We can add this information to Appendix if required, but re-running all the experiments might take significant time (reported experiments were carried out on different hardware, thus should be restarted).
The RBE method requires |S_I| CE calls for each inference, which may represent a significant overhead compared to other methods.
The RBE method requires |S_I| CE calls for computing the relevance vector. However, all the methods use the same number of CE calls in total. I.e., if |S_I| CE calls are spent on the query embedding, then we use |S_I| fewer CE calls for the final re-ranking for RBE.
I do not quite understand the statement “DE uses |S_I| more CE calls for the re-ranking” in line 470.
As written above, all algorithms use the same number of CE calls in total. If RBE uses |S_I| CE calls for the query embedding and then k CE calls for the re-ranking, then the total budget is |S_I| + k, which means that DE is allowed to use |S_I| + k CE calls for the re-ranking.
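For concreteness, here is a minimal sketch of this budget accounting; the helper name and the total budget of 300 CE calls are illustrative examples, not values from the paper.

```python
def ce_budget_split(total_ce_budget, num_support_items):
    """How a fixed per-query CE budget is spent by RBE vs. DE in our comparison."""
    rbe_embedding_calls = num_support_items                  # CE(q, s) for every s in S_I
    rbe_rerank_calls = total_ce_budget - num_support_items   # shortlist re-ranked by the CE
    de_rerank_calls = total_ce_budget                        # DE spends the whole budget on re-ranking
    return rbe_embedding_calls, rbe_rerank_calls, de_rerank_calls

# Illustrative example: a budget of 300 CE calls with |S_I| = 100 support items.
print(ce_budget_split(300, 100))   # -> (100, 200, 300)
```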
Does the Dual Encoder actually use CE calls?
Yes, following the standard approach described in lines 13-18 in the abstract, at the final stage, the top candidates given by the dual encoder are re-ranked with the CE model.
Also, how many CE calls does AXNDE make?
The total number of CE calls is equal for all the algorithms. In the case of AXNDE, there are several iterations and the budget for CE calls is equally divided between the iterations.
Why adding 100 items is considered a fair comparison? Will these additional 100 items be re-ranked using the cross-encoder model? If it's simply expanding the evaluation range, it seems that the choice of 100 as the number is not reasonable.
These 100 (or, in general, |S_I|) items are used by RBE to compute query embeddings. As written above, the cost of computing these values is taken into account in our comparison and thus we believe that the comparison is fair.
We hope that our responses clarify the details of our comparison, but if any aspects are unclear, we will be happy to continue the discussion.
Why does Table 1 use the QA.small dataset while Table 2 uses the QA dataset?
By using QA.Small in Table 1, we can test more methods including those that do not scale well.
W2: Presentation: the method descriptions in Table 1 and separating the l2-greedy approach from other methods may lead to misunderstandings.
Could you please clarify what can lead to misunderstanding in the table? We will fix this in the revised version of the paper.
W3: Why the AXN model was chosen?
In Yadav et al. (2024) it is claimed that AXN outperforms AnnCUR, so we decided to add this method to our comparison.
W4. How were the dimensions in θI chosen?
The length of the RBE embeddings was chosen to match the length of the DE embeddings in order to provide adequate comparison in Tables 3, 4, 5, 6.
I acknowledge that I have read the responses and would like to keep the ratings.
The authors propose an improvement over AnnCUR in which matrix factorization is replaced with neural embeddings. The authors also prove that the proposed model family extends AnnCUR and can approximate any continuous function on the joint (compact topological) space of items and queries. The authors also provide an empirical study of how anchor items and queries should be chosen. A good variety of methods is explored: four reasonably strong heuristics as well as a more theoretically well-justified method. The authors experiment with a standard dataset used in the baseline (ZeShEL) as well as a standard retrieval dataset (MS MARCO) and datasets from real-world production systems. Experimental results show that beyond a certain budget, the proposed method outperforms the dual encoder and a strong AnnCUR implementation.
Strengths
Originality: Authors extend AnnCUR into neural embedding models. The extension of matrix factorization into neural embeddings has been extensively studied in the last 10 years and seen a good success, which makes the proposed approach well-motivated.
Quality: The authors make both theoretical and empirical contributions. With the theoretical work, the authors provide a universal approximation guarantee for AnnCUR, and then further prove a stronger result for the proposed RBE method. The proof is done at a very abstract level (compact topological space of items and queries) and can be instantiated for a wide variety of settings. The authors also provide strong empirical evidence by 1) considering strong baseline methods for both anchor selection and embedding models, and 2) experimenting across diverse datasets, not only ZeShEL used in the previous work, but also MS MARCO as well as recommender tasks from production systems.
Significance: The authors' extension of AnnCUR will open up wide possibilities, as the framework is generic and can accommodate arbitrary neural networks. Subsequent work may further extend it by studying variations of neural architecture choices.
Clarity: High-level implications of the theory are well-described, although some abstraction would benefit from more elaboration (see Weaknesses section). The proposed algorithmic framework is also well-motivated by discussing limitations of the AnnCUR baseline.
Weaknesses
It is unclear how well the algorithm scales to a larger number of items. The datasets used in the experiments have a relatively small number of items (at the scale of 10K), with the exception of Military and QA. Comparisons against baseline embedding algorithms are only done on the RecSys datasets, which have a small number of items. My concern is that as the number of items grows, a larger number of anchors must be chosen. This will in turn make the method less practical compared to simpler dual encoder methods. Even with only 10K-scale RecSys datasets, dual encoders outperform at a budget of 100 top items. The authors don't seem to be using knowledge distillation to further improve the dual encoders, which is a standard technique; when knowledge distillation is used, the benefit of the proposed method would only show up for even larger compute budgets, which may not be practical in many applications. If the comparison between DE and RBE were run on datasets with a larger number of items, this would strengthen the paper even if RBE doesn't outperform, because it would inform readers when to use RBE and when not to. Even better, a scaling study of DE and RBE with respect to the number of items would provide an even stronger guideline for practitioners.
While I value that the mathematical guarantees are proved at a quite abstract level, there are opportunities to improve the communication of the abstraction. More specifically, rather than simply saying that the query and item spaces are assumed to be compact topological spaces, it would be great to explain what it means for the spaces of queries and items to be compact topological spaces. Interesting instantiations of the abstraction should also be discussed, for example, the popular Euclidean embeddings of queries and items. Such elaborations will help practitioners understand how they could leverage the proposed abstract framework in their modeling efforts.
Questions
In Line 166: Authors suggest CUR requires calls to cross-encoder, but I believe it only requires calls because C and R matrices are randomly selected rows & columns? I can see part can be reduced to in RBE, so I see there can be an improvement of efficiency, but it doesn't seem to be as drastic difference as authors argue.
Exactly in what sense is Theorem 3.2 stronger than Theorem 3.1: is it that the relevance function is only expected to be continuous and doesn't need to be integrable anymore? The authors mention the guarantee is stronger, but don't explicitly mention in what sense it is stronger, hence it would be useful to elaborate.
We would like to thank the reviewer for the feedback and positive assessment of our work!
Regarding scalability and practical applicability, we fully agree that it is an essential question. We would like to note the following:
- Importantly, in practice, one does not have to choose between DE and RBE. As we write in line 300, for RBE, relevance vectors can be enriched with the features of the original query (or item). If a DE is available, it can also be easily combined with RBE as additional features.
- We also discuss various scalability hints in Appendix E.
- We’ve also conducted experiments and compared the results on QA.Small vs QA datasets. We observed that the results did not change significantly after we increased the dataset x10.
- Finally, it seems to be very challenging to objectively compare RBE and DE. While the RBE approach is relatively simple and requires only the CE scores, there are many different DE approaches, they require document and item features, and the best option significantly depends on a particular application.
In Line 166: Authors suggest CUR requires calls to cross-encoder, but I believe it only requires calls because C and R matrices are randomly selected rows & columns? I can see part can be reduced to in RBE, so I see there can be an improvement of efficiency, but it doesn't seem to be as drastic difference as authors argue.
Yes, the number of calls can be rewritten in this way; however, as we write in lines 162-163, for AnnCUR the set of queries used for the approximation coincides with the set of support queries, while for RBE the former can be much larger than the latter. When the two sets coincide, the bounds in the question are equivalent.
Exactly in what sense is Theorem 3.2 stronger than Theorem 3.1
Theorem 3.2 uses fewer assumptions (e.g., the function is only required to be continuous) and also states stronger convergence (uniform convergence instead of L2-convergence).
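For intuition, the difference between the two convergence modes can be written as follows, where f denotes the relevance function, f_n its approximation, and μ a finite measure on the query-item space; the exact statements, assumptions, and constants are those in the theorems, so this is only an illustration of the two notions.

```latex
% L2 convergence (the type of guarantee in Theorem 3.1):
\int_{\mathcal{Q} \times \mathcal{I}} \left( f(q, i) - f_n(q, i) \right)^2 \, d\mu(q, i) \;\longrightarrow\; 0
% Uniform convergence (the type of guarantee in Theorem 3.2); on a compact domain with a
% finite measure this implies L2 convergence, hence it is the stronger notion:
\sup_{(q, i) \in \mathcal{Q} \times \mathcal{I}} \left| f(q, i) - f_n(q, i) \right| \;\longrightarrow\; 0
```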
Why would any user of CUR set the query set equal to the set of support queries? That seems to defeat the purpose of doing CUR since no subsampling is done on the query part. At the least, using this extreme case to characterize the time complexity of CUR doesn't seem to be fair.
In this comment, we mean the following. For RBE, one can sample some (sparse) query-item pairs to train the function approximating the relevance. Also, for RBE the set of queries used for training the model can be larger than the set of support queries. For the CUR approximation, we need to know the complete (dense) matrix. And if there are some queries that are not among the support queries, these queries are not used at all for the CUR approximation. Thus, if by the query set we mean the set of queries used for "training" the approximated relevance, then we can say that it equals the set of support queries for the CUR approximation. Yes, the support queries can be chosen or sampled from some larger set of queries, we just did not consider this larger set to be the query set for the CUR approximation. We now understand that the terminology can be confusing in this phrase and will make it clear in the text that we do not have to use all the available queries in the CUR decomposition. Thanks a lot for bringing this to our attention.
I see, thanks for the clarification, the original writing makes more sense to me now. I think it's still useful to distinguish the full query set from the support queries because the whole query set can still be used for training the cross-encoder model, while a small set of support queries can be sufficient for the CUR approximation. I do agree RBE has the benefit of leveraging a larger set of queries in a scalable way and appreciate that point. It just seems that characterizing this point by equating the two sets is not the ideal way.
This paper analyzes and improves AnnCUR, which is a published method to obtain and use a cross-encoder for retrieval (not just for reranking but as the main retrieval model). AnnCUR obtains query embeddings for a query by using a randomly selected support set of items and computing the query's relevance with each of those items (by using a cross-encoder). The authors of this paper propose relevance-based embeddings (RBE) which improves upon AnnCUR by better selecting a specific support set of items and by adding trainable parameters.
Strengths
The strengths are as follows:
- From the experiments, it looks like the proposed method always outperforms the baseline AnnCUR.
- For selecting the support items, the authors tested many simple heuristics and a theoretically justified approach.
Weaknesses
Here are some weaknesses:
- There is no timing information for how long it takes to run any of the method variants or the baselines.
- It's not clear what the architecture of the lightweight neural networks for RBE is? How many parameters are in this neural network?
- The writing is not very clear. Here are some specific issues:
- It's not clear what the main difference between the result in Theorem 3.1 and Theorem 3.2 is.
- It's not clear how the theory in the first part informs the method in the second part.
- Lots of terms are undefined. What is the quantity on line 272? What notation is used for the actual and predicted relevance? Explicitly specify what P and T are in tables 4 and 5.
- The notation in section defining the metric HitRate(P,T) is also very hard to read.
- How does this method perform on other retrieval benchmarks like BEIR and NQ?
Questions
Here are some questions:
- Will the RecSys and RecSys2 datasets be publicly released?
- Why is RBE so much better on Recsys2 when compared to any other dataset?
- The CUR approximation requires computing the dense relevance matrix. This is very expensive. RBE doesn't need to compute this matrix, right? Does that mean RBE can scale to larger datasets?
- For the l2-greedy approach, isn't it very expensive to optimize over all possible choices of S_I? How long does this take in practice?
Thank you for the review and detailed feedback! We address the questions and concerns below.
There is no timing information for how long it takes to run any of the method variants or the baselines.
In our setup, we assume that the most time-consuming operation is cross-encoder (CE) computation which is the case for real production systems. The computation of CE is time-consuming since the model itself is complex and also some features used by CE may require additional computations (e.g., some features in the RecSys dataset are scores produced by auxiliary neural networks). Thus, in each of the experiments, the number of CE calls is fixed for all the models (and is specified in the text). We think that measuring complexity in the number of CE computations is an objective way to compare the methods in our setup. We do not report actual running time since it may depend on various factors not related to the approach itself (like hardware setup, code optimization, near neighbors search algorithm, etc.). We can add this information to Appendix if required, but re-running all the experiments might take significant time (reported experiments were carried out on different hardware, thus should be restarted).
It's not clear what the architecture of the lightweight neural networks for RBE is? How many parameters are in this neural network?
It is a simple feed-forward MLP with two hidden layers and ELU activations (see line 997). It has about 50K trainable parameters. Note that dual encoders used in practice are typically significantly more complex, e.g., DE from RecSys has ~300M parameters.
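For illustration, here is a minimal PyTorch sketch of an adapter with this shape (two hidden layers with ELU activations). The widths and output dimension below are illustrative placeholders chosen to land near ~50K parameters, not the exact configuration from the paper.

```python
import torch.nn as nn

class RBEAdapter(nn.Module):
    """Maps a relevance vector (CE scores against the support set) to an embedding."""
    def __init__(self, num_support=100, hidden=128, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_support, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, emb_dim),
        )

    def forward(self, relevance_vector):
        return self.net(relevance_vector)

# Parameter count with these illustrative sizes:
# 100*128+128 + 128*128+128 + 128*128+128 ≈ 46K trainable parameters.
```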
It's not clear what the main difference between the result in Theorem 3.1 and Theorem 3.2 is.
Theorem 3.1 has stronger requirements and weaker convergence. Trainable embeddings allow us to soften the requirements (e.g., the relevance function can be any continuous function) and prove stronger convergence guarantees (uniform convergence instead of L2 convergence). We will add a comment about this to the text.
It's not clear how the theory in the first part informs the method in the second part.
Could you please clarify what parts of the text you are referring to in this comment? In short, Section 3.2 discusses and formalizes the previously proposed CUR approximation. While the CUR approximation is not novel, the analysis is. Section 3.3 introduces trainable embeddings and shows that they allow us to obtain better convergence guarantees. Our approach uses such trainable embeddings and is motivated by this result.
Lots of terms are undefined. What is the quantity on line 272?
The size of S_I, see line 151.
What notation is used for the actual and predicted relevance?
In line 320, the first symbol denotes the actual relevance and the second one the predicted relevance. We will specify this in the text, thanks for bringing this to our attention.
Explicitly specify what P and T are in tables 4 and 5.
P and T are used to compute HitRate, as specified in the table itself. These parameters range from 100 to 900 (the first row of the table).
The notation in section defining the metric HitRate(P,T) is also very hard to read.
The name HitRate itself is intuitive: it is the hit rate, which means that we retrieve the top-P items according to the predicted relevance and check how many of them are in the true top-T list. When used with only one argument, HitRate is computed for P = T. Do you have any specific questions or suggestions regarding this notation? We are happy to discuss and clarify in the text.
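For reference, here is a minimal sketch of this metric; the normalization by T below is our reading of the definition, and the paper's exact formula is in the text.

```python
def hit_rate(predicted_scores, true_scores, P, T=None):
    """HitRate(P, T): fraction of the true top-T items (by CE relevance) found among
    the top-P items retrieved by the predicted (approximate) relevance."""
    T = P if T is None else T                       # single-argument form: T = P
    order_pred = sorted(range(len(predicted_scores)), key=lambda i: -predicted_scores[i])
    order_true = sorted(range(len(true_scores)), key=lambda i: -true_scores[i])
    top_p_pred, top_t_true = set(order_pred[:P]), order_true[:T]
    return sum(i in top_p_pred for i in top_t_true) / T
```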
How does this method perform on other retrieval benchmarks like BEIR and NQ?
We have not experimented with these datasets. Note that our experiments cover all datasets used in the related work by Yadav et al. (2022) and also include the QA dataset based on MS MARCO and three new datasets based on real recommender services.
Will the RecSys and RecSys2 datasets be publicly released?
Yes, as written in line 377, the data will be released with the final version of the paper.
Why is RBE so much better on Recsys2 when compared to any other dataset?
Note that RecSys2 significantly differs from the other datasets. In particular, it is obtained from a different source compared to RecSys/RecSysLT datasets and thus some difference in performance can be expected. However, we have not investigated this subject deeper. Note that on Yugioh and QA, the difference is also noticeable.
The CUR approximation requires computing the dense relevance matrix. This is very expensive. RBE doesn't need to compute this matrix, right? Does that mean RBE can scale to larger datasets?
Yes, as mentioned in line 1114, RBE with learnable document vectors can be trained similarly to DE on sparse data and thus can scale to larger datasets. We will add a more detailed comment to the text.
For the l2-greedy approach, isn't it very expensive to optimize over all possible choices of S_I? How long does this take in practice?
Note that l2-greedy is a greedy algorithm, thus it does not perform an exhaustive search over all possible choices of S_I. As written in lines 256 and 266, it greedily selects items one by one. A detailed procedure of how each item is selected is described in Appendix B. In the l2-greedy experiments, the choice of support items takes tens of minutes on our hardware, but we have not measured or optimized it as we used different hardware for different experiments. Also, l2-greedy is a part of the pre-processing stage and thus does not affect the query time.
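To give a rough illustration of the kind of procedure involved: the exact selection criterion and its efficient implementation are described in Appendix B, so the criterion below (least-squares reconstruction error of the training relevance matrix) is only a simplified stand-in, not the paper's algorithm.

```python
import numpy as np

def greedy_support_selection(R, k):
    """Greedily pick k support items (columns of R), where R is the (n_queries, n_items)
    matrix of CE relevances on training data. At each step, add the column that most
    reduces the L2 error of reconstructing R from the selected columns
    (an illustrative stand-in for the l2-greedy criterion)."""
    selected = []
    for _ in range(k):
        best_j, best_err = None, np.inf
        for j in range(R.shape[1]):
            if j in selected:
                continue
            cols = R[:, selected + [j]]                       # (n_queries, len(selected)+1)
            coef, *_ = np.linalg.lstsq(cols, R, rcond=None)   # least-squares reconstruction
            err = np.linalg.norm(R - cols @ coef)
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)
    return selected

# Example on a tiny random relevance matrix:
R = np.random.rand(50, 20)
print(greedy_support_selection(R, 3))
```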
Dear Reviewer Ujqc,
We would like to ask whether our response addresses your concerns. Also please note that we updated the paper taking into account the comments of the reviewers. If you have any additional questions or concerns, we will be happy to address them.
The authors developed an algorithm that learns to embed queries using query relevances to a fixed, randomly chosen support set. In my opinion, this area is well studied, and the authors have missed a whole genre of research on memory-based architectures.
- End-To-End Memory Networks, NIPS 2015.
- OAK: Enriching Document Representations using Auxiliary Knowledge for Extreme Classification, ICML 2024
Strengths
I do not see any major contributions by this paper.
Weaknesses
The authors need to work on the literature review: they have missed an important area of work called memory networks, which also relies on a fixed set of vectors to augment the query and generate the query representation. This makes the baselines and comparison incomplete.
Questions
Kindly compare with the memory network literature and its state of the art. Also compare with retrieval-augmented retrieval approaches like OAK for a comprehensive comparison.
Thank you for suggesting these references. However, while these works may seem similar to our study (due to somewhat similar keywords), they are not relevant to our setup.
Recall that our setup is the following. We assume that there is a heavy ranker (cross encoder, CE) that provides relevances of query-item pairs. This heavy ranker can be an arbitrary complex model and is a standard ingredient of a recommender system. Due to the dataset size, CE cannot be computed for all user-item pairs. A standard approach then is to approximate CE with a dual encoder (DE) that embeds queries and items to a space where relevances can be (approximately) computed via, e.g., the dot product. We improve over this standard approach (and simplify it) by creating more efficient representations of queries and items (via support items and support queries, respectively) that do not require training a (potentially complex) DE model (RBE is much smaller, several orders of magnitude in the number of trainable parameters).
Now, consider the work by Sukhbaatar et al. (2015). Here, supporting facts are used but they significantly differ from our support items. Indeed, supporting facts are obtained externally and they are different for different queries. In our case, support items are taken from a given set of items and they are fixed for all the queries (thus, they allow us to create a common representation for all queries). See Figure 2 and Appendix B in Sukhbaatar et al (2015) for illustrations.
Regarding Mohan et al. (2024), this work uses external information to enrich document representations, while we only use CE which is a part of any recommender system. As above, this external information is different for different documents, while we use a fixed set of 100 queries to describe each document. See the abstract and Table 2 of Mohan et al. (2024) for details.
In summary, we cannot relate the referred papers to the setup considered in this paper (both general setup and algorithm details). We will be happy to provide a more detailed discussion and comparison if you could give us more details on why these works are relevant and in what form they can be applied to our setup.
If you have any feedback, concerns, or questions regarding our work, we will be happy to address them.
Dear Authors,
- I suggest you read OAK's contribution more carefully: it uses a memory bank similar to your approach, and what you store in that memory bank is totally up to you. I would like to see a comparison between OAK and your approach.
- Memory networks use a fixed support set, similar to Sukhbaatar et al. (2015).
- Kindly provide a comparison of model parameters between your approach, the memory network approach, and OAK.
Dear Reviewer,
Thank you for your involvement in the discussion! We are happy to address your concerns, but still have several questions to clarify first.
Summary
In short, the review mentions two papers for which the problem setup, experimental setup, and data used differ from what we do in our work. You argue that the approaches in these papers are directly relevant to our work and could be applied in our setup. After reading the papers, we do not see how to do it directly: we have some ideas about how the approaches could be adapted to our setup or how some of their ideas could be used, but we see the adaptation of the referred approaches as a subject of additional investigation. Our discussion could be more constructive if you could describe how you see the application of these approaches to our setup (with our input data, task, and experimental setup). We will then be happy to address the questions in more detail and possibly conduct additional experiments.
Details
Now we discuss the two mentioned papers in more detail. Since we are not sure how exactly you suggest adapting the approaches to our setup, we make some guesses below. Thus, our discussion may not agree with what you suggest - if so, please correct us.
OAK
First, consider the paper by Mohan et al. (2024). Let us discuss why we believe that there are some important differences. Below we quote the authors of Mohan et al. (2024).
In this paper, we propose a novel framework, Online Auxiliary Knowledge (OAK), which harnesses auxiliary information linked to the document to improve XC accuracy.
Note that we don’t have any external or auxiliary information.
Training for OAK involves three stages: (1) Linker training (2) OAK pretraining (which involves training the augmentation block as shown in the figure) (3) OAK finetuning. In the first stage, the linker module in OAK is trained to link documents with relevant AKPs from a large pool using an existing XC method.
Note that in our work we do not have a linker, which is a major ingredient of the approach in Mohan et al. (2024).
Another essential difference preventing the direct comparison is that the authors solve the classification problem, whereas we solve the ranking problem.
Despite the differences listed above, we did our best to think of how and what part of the approach from Mohan et al. (2024) could be used in our setup.
Consider Figure 1 in Mohan et al. (2024). Since we have support items for queries and support queries for items, at least we have to use two document-like parts with two bases, two linkers, two combiners, which seems to be a different architecture compared to what is shown in Figure 1. Thus, let us assume that we have trainable embeddings for items and use the AK bank only for queries.
Our best guess is the following (referring to Figure 1):
- Say that “document” is our “query”;
- Say that “label” is our “item”;
- Now ”Auxiliary Knowledge” is a set of all items (it is not auxiliary, but it is our most probable assumption);
- The linker should find the most relevant items to the query.
If this reformulation is what is suggested, then we see the following issues:
- The problem of finding the most relevant items (the original task) seems to be reduced to the same problem of finding the most relevant items but by the linker;
- Moreover, finding the most relevant items from the AK bank seems to require an additional index for fast nearest neighbor search;
- If instead of the whole set of items AK bank contains only 100 support items, then no linker is needed but then we lose another important ingredient of the approach.
To us, the adaptation to our setup does not look straightforward. Thus we kindly ask you to provide more details on how the approach applies to our setup.
Memory networks
Now, consider the paper by Sukhbaatar et al. (2015). The authors write
Our model takes a discrete set of inputs x1, ..., xn that are to be stored in the memory, a query q, and outputs an answer a.
The capacity of memory is restricted to the most recent 50 sentences.
Our approach does not take any external inputs to be stored in memory, and this is not needed in our setup since queries are independent (one user's preferences will not help to determine other users' interests). We calculate CE(q, S_I) scores for each query independently and do not store them.
With this paper, the setup seems to be very different and we struggle to adapt it to our case. Thus, as above, we kindly ask you to provide more detail on how exactly you suggest to apply their approach to our setup.
Model size
Finally, addressing your question about the number of parameters, we note that our architecture is a simple feed-forward MLP with two hidden layers, it has only about 50K trainable parameters.
I disagree with the authors' analysis. Reasons are as follows: as stated in lines 140-143, relevance-based embeddings are generated using a pre-selected set of items, and vice versa.
- This pre-selected set essentially becomes your auxiliary metadata. You can choose any linker of your choice; you will anyhow need a mapping (linker) from items/queries to the corresponding pre-selected set.
- Similarly, in a memory bank you can keep the pre-selected set as the memory in the network.
I am keeping my scores.
- Thanks
Following the reviews and discussions, we've updated the paper by making edits and clarifications. In particular,
- We extended the discussion of distillation approaches (lines 98-102). We are ready to extend the related work further by adding more of the mentioned references after the discussion with the reviewers (in our replies, we have asked for some clarifications to properly address the reviewers' comments);
- Added a comment about the difference between Theorems 3.1 and 3.2 (lines 206-211);
- Specified the number of parameters of RBE (line 1025);
- Added some fixes/clarifications following the reviews.
Summary: The paper proposes a method to efficiently retrieve items using scores from an arbitrary (possibly expensive and non-factorized) relevance model. The main idea is to describe queries/users by their relevance to some pre-selected support passages/items and to describe passages/items by their relevance to some support queries/users. Further, the authors show that relevance-based embeddings are powerful enough to approximate any complex similarity model. Finally, experiments are carried out on ZESHEL, MS MARCO, and the RecSys datasets.
Strengths:
- Interesting approach of relevance based embeddings
- Strong performance on ZESHEL
Weakness:
- Lukewarm response from all but one reviewer
- Scalability of the proposed approach is not clear (cost grows linearly with the number of anchors)
- All datasets studied in this paper are very small (even for MS MARCO, only a subset is used)
- Theoretical results are somewhat weak and basically adapt the universality of MLPs to the paper's setup
- Not using high quality heavy ranker models in experiments
- Unclear experimental setup for QA tasks: if the heavy ranker is a DE model, it is not clear what the utility of the proposed method is. For example, in the QA tasks, all-mpnet-base-v2 is used as the heavy ranker model, which is a DE model. Is the proposed method faster than a kNN search? It seems not, based on the algorithm.
- Runtime/efficiency comparisons not thoroughly presented
- Presentation issues: Reviewers found description of RBE's lightweight neural networks not clear as well as distinction between Thm 3.1 and 3.2
Decision: While the work shows promise and proposes an interesting approach, unfortunately, the paper can't be accepted in its current form and addressing all the concerns would warrant another round of reviewing. While it is unreasonable to ask comparison to Colbert or Memory network for this paper, other comments such as use of SoTA heavy ranker, metrics, etc remain valid.
Additional Comments from the Reviewer Discussion
We thank the authors and reviewers for engaging during the discussion phase towards improving the paper. Below are some of the highlights:
- Relationship to multi-embedding literature, e.g. Colbert
- Authors correctly clarified fundamental differences in approach and setup
- While some similarities exist, RBE has distinct goals and methodology
- Relationship to memory network literature
- Authors correctly clarified fundamental differences in approach and setup
- While some similarities exist, RBE has distinct goals and methodology
- Choice of datasets and baselines (Multiple reviewers)
- Authors explained following setup from prior work
- Response partially addressed concerns but more standard benchmarks could help
- Theoretical results and comparisons
- Authors explained differences from prior work
- Some points remained unclear but core contributions were defended
Reject