PaperHub

ICLR 2024 · Withdrawn · 5 reviewers
Average rating: 4.6/10 (scores: 5, 3, 5, 5, 5; min 3, max 5, std. dev. 0.8)
Average confidence: 3.8

Retrieving Texts by Abstract Descriptions

OpenReview · PDF
Submitted: 2023-09-24 · Updated: 2024-03-26
TL;DR

We define the notion of description-based similarity and train a retrieval model for that type of similarity.

Abstract

Keywords
similarity, descriptions, LMs, retrieval

Reviews and Discussion

Review (Rating: 5)

The paper addresses the challenge of locating texts in a large document collection based on abstract descriptions of their content. The authors argue that current text embeddings and semantic search solutions are inadequate for this task, as they lack a well-defined notion of similarity. They propose a new model that significantly improves retrieval by utilizing a consistent and well-defined similarity based on abstract descriptions. The model is trained using positive and negative pairs sourced through prompting an LLM. The authors highlight the limitations of existing search techniques, including keyword-based retrieval, dense similarity retrieval, QA-trained dense retrieval, and query-trained dense retrieval. They emphasize the need for a specific type of similarity, referred to as description-based similarity, which captures the relation between abstract descriptions and concrete instances within documents. They demonstrate the effectiveness of their proposed model in retrieving relevant texts based on abstract descriptions and suggest that their approach can enhance knowledge discovery in various data-intensive domains, including legal, medical, and scientific research. Overall, the paper emphasizes the importance of a well-defined similarity measure for effective semantic search and presents a novel approach that leverages the strengths of LLMs to achieve a retrieval task that is not feasible using traditional text generation capabilities.
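(For illustration: a minimal sketch of the dense-retrieval-by-description setup the review describes, using the sentence-transformers library. This is not the authors' code; the encoder name and the tiny corpus are placeholders.)

```python
# Minimal sketch of retrieval by abstract description (not the authors'
# implementation). Assumes the sentence-transformers library; the encoder
# name and the two-sentence corpus are illustrative placeholders.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "Messi scored his 700th career goal in the match against Atletico.",
    "The court ruled in favor of the plaintiff in the patent dispute.",
]
description = "A professional athlete reaching a career milestone."

corpus_emb = encoder.encode(corpus, convert_to_tensor=True)
query_emb = encoder.encode(description, convert_to_tensor=True)

# Rank corpus sentences by cosine similarity to the abstract description.
for hit in util.semantic_search(query_emb, corpus_emb, top_k=2)[0]:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```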

Strengths

They evaluate the effectiveness of their proposed "Abstract-sim" model in sentence retrieval based on abstract descriptions, comparing it with several baseline models. The evaluation includes both human and automatic evaluations. For the human evaluation, the researchers conducted a crowd-sourced evaluation of retrieval performance for 201 random descriptions, comparing their model with several strong sentence encoder models. The results of the human evaluation indicate that the "Abstract-sim" model outperforms the baselines significantly, with an average of close to 4 out of 5 sentences deemed relevant for the query, while the baseline models had significantly lower performance, ranging between 1.61 and 2.2 sentences. The automatic evaluation was carried out to assess the model's robustness to misleading results. The authors generated a dataset of valid and invalid descriptions, and their model demonstrated superior performance in terms of precision at various retrieval points, with the largest disparity observed at precision@1. Their model achieved a precision@1 score of 85%, compared to approximately 73% for the strongest baseline model. The paper emphasizes the potential of leveraging large language models for generating tailored training datasets, despite their limitations in direct retrieval tasks. Their results indicate that the proposed model, trained on a dataset specifically tailored to the task, performs significantly better than standard sentence-similarity models.
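(For illustration: the precision@k metric referenced above, in a minimal hedged sketch; the IDs and labels are made up, not the paper's data.)

```python
# precision@k: the fraction of the top-k retrieved sentences that are
# relevant to the description. IDs and labels below are illustrative.
def precision_at_k(ranked_ids, relevant_ids, k):
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k

ranked = ["s3", "s1", "s7", "s2"]   # retrieval order, best first
relevant = {"s3", "s2"}             # ground-truth relevant sentences
print(precision_at_k(ranked, relevant, k=1))  # 1.0
print(precision_at_k(ranked, relevant, k=4))  # 0.5
```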

Weaknesses

Lack of comparisons with state-of-the-art retrieval models and neural ranking models, e.g.:

Guo, Jiafeng, et al. "A deep look into neural ranking models for information retrieval." Information Processing & Management 57.6 (2020): 102067.

Mitra, Bhaskar, and Nick Craswell. "Neural models for information retrieval." arXiv preprint arXiv:1705.01509 (2017).

Questions

I think the baselines for their algorithm are pretty simple.

Comment

We appreciate your detailed evaluation of our submission and the constructive feedback you provided. The recognition of the improved results we find in both human and automatic evaluations motivates the need to tailor models to our notion of similarity. We agree that this work underscores the potential of utilizing large language models for crafting tailored training datasets. Regarding the issue you raised with the baselines, we appreciate the suggestion to evaluate on additional existing encoders. Please see our general response (“Baselines”) for additional results on 4 existing SOTA models.

Review (Rating: 3)

This paper introduces a novel retrieval task designed to find sentences that exemplify the "instance-of" property relative to a given query. To achieve this, the paper constructs a dataset using a large language model and utilizes this dataset to develop a dense retrieval model. Experimental results on a manually constructed dataset demonstrate that the proposed dense retriever outperforms baseline models.

Strengths

The paper effectively delineates the problem at hand by highlighting its distinction from existing research. This clear exposition aids readers in comprehending the subject matter. Furthermore, the employment of crowd-workers to curate a new retrieval dataset is commendable, as it promises to significantly benefit future search research.

Weaknesses

The main concerns regarding this paper are:

  • While the paper emphasizes that the retrieval based on description-based similarity is different from the existing retrieval, the description-based similarity also belongs to the similarity between texts. Existing methods measure this text similarity through learning from query-document (sentence) pairs, while the proposed method learns from query (description)-sentence pairs. Thus, from this viewpoint, the proposed approach seems to address a specific instance of text-to-text similarity rather than introducing a fundamentally new form of similarity-based search.

  • The retrieval in this paper appears specialized in a specific domain and not universally applicable. It would be better to explain in detail the actual application that requires this proposed retrieval.

  • While the paper innovates by introducing a new dataset, the retriever itself lacks novelty. Essentially, it is the same as the existing methods that train encoders, previously used in dense retrieval, and subsequently use nearest-neighbor search techniques.

  • Recent dense retrieval research has seen the emergence of diverse encoders and similarity techniques, such as ColBERT and PLAID. It's necessary for this paper to evaluate the efficacy of its proposed method by incorporating a variety of encoders and similarity metrics in the experiments.

Questions

Q1: Generating data with GPT often leads to the issue of hallucination. How was this tackled in this study?

Comment

We appreciate your engagement with our work and would like to address the claim that our proposed approach merely addresses a specific instance of text-to-text similarity rather than introducing a fundamentally new form of similarity-based search.

"The retrieval in this paper appears specialized in a specific domain and not universally applicable": Indeed, we do not aim to propose a "universally applicable" solution, and we doubt that such a solution exists. Which current similarity approach would you consider universal in the sense you mention? The very core of this paper is pointing out major limitations in the corpus-based, "universal" notion of similarity, which is ill-defined. We empirically show that models trained without an explicit notion of similarity fail to generalize to the very natural use case of retrieving text by abstract description. As you noted, "the description-based similarity also belongs to the similarity between texts". We did not claim otherwise. This work is focused on similarity between texts, and our contribution is in introducing and formally defining a unique notion of similarity which is both useful and unaddressed by current approaches.

The retriever itself lacks novelty: We clarify that our primary contribution lies not in the intricacies of modeling but in drawing attention to a critical limitation in existing methodologies. True innovation, in our view, at times does not come from improving conventional modeling techniques. Rather, it involves reevaluation of implicit, unexplored blind spots within existing frameworks. Our work addresses the inadequacy of the current understanding of similarity, which we posit is a substantial and crucial contribution, equal in importance to advancements on the modeling side. We respectfully disagree with the assumption that innovation must come in the form of a new model. Please also refer to the general response for additional explanation about the innovation in this work.

Different encoders: Please note that ColBERT is a late-interaction model, whose application is orthogonal to our approach. Please see our general response for results on 4 additional SOTA encoders.

Hallucination in GPT-generated descriptions: The concern is justified. However, our dataset is not a goal by itself. Rather, it is used to train an encoder, which we then thoroughly evaluate in both human and automatic evaluation. As such, even if the dataset contains biases of all sorts (which it probably does, like any other dataset), it is still useful for tackling the problem of enabling search by abstract descriptions. At the same time, please note that, as we describe in the paper, we did conduct a human evaluation study of the generated descriptions, focusing on their faithfulness to the original text. As we report in the paper, this human evaluation indicated that the descriptions are largely of high quality and are faithful to the text. Could you please explain whether there are unconvincing aspects in our evaluation?

Comment

Thank you for your response. My primary concern, regarding the novelty of the proposed method for addressing the task, remains unresolved. While the paper mentions how this method differs from existing ones, the actual approach to the solution appears to be merely fine-tuning using the constructed dataset. In light of this concern, I have decided to maintain my original score.

Review (Rating: 5)

This paper posits that the similarity reflected in embeddings is often ill-defined and inconsistent, which can be suboptimal for various practical use cases. To address this issue, the paper adopts a novel approach. It leverages off-the-shelf large language models to generate multiple descriptions for a given document. Subsequently, it conducts sentence retrieval tasks based on these descriptions to enhance the retrieval task's ability to capture abstract semantic information. As no suitable public dataset is available for the new retrieval settings, the authors introduce a new dataset for training and evaluating their model. The results on this proposed dataset indicate that the trained model outperforms the baselines in both human evaluations and automatic assessments.
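(For illustration: a hedged sketch of how positive and misleading descriptions might be collected from an LLM. The paper used GPT-3; the model name, client API, and prompt wording below are illustrative assumptions, not the authors' actual setup.)

```python
# Hedged sketch of LLM-based description generation. The model name and
# prompt wording are illustrative assumptions, not the paper's prompts.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_descriptions(sentence: str, kind: str = "valid") -> str:
    instruction = (
        "Write three short abstract descriptions that this sentence is an instance of."
        if kind == "valid"
        else "Write three short abstract descriptions that are topically similar "
             "but do NOT correctly describe this sentence."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"{instruction}\n\nSentence: {sentence}"}],
    )
    return response.choices[0].message.content

print(generate_descriptions("Messi scored his 700th career goal.", kind="valid"))
```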

Strengths

  1. This paper offers a fresh perspective on the traditional retrieval task, highlighting the limitations of term-based and vector-based matching approaches. It introduces a novel description-based matching approach and enumerates its advantages over these traditional methods.

  2. To validate the effectiveness of the proposed method, the authors construct a new dataset based on descriptions using Wiki data. They employ off-the-shelf large language models for extensive data collection and annotation, underscoring the rigor and comprehensiveness of their approach.

  3. The paper is exceptionally well-written, ensuring that it is easily accessible and comprehensible for readers, making it a valuable contribution to the field.

Weaknesses

  1. I totally agree that the term-based and vector-based retrieval frameworks are not perfect and may lead to problems in practice. However, I wonder whether the proposed description-based framework is really new, because, as mentioned in the paper, the authors just modify the dataset and change the meaning of relevance. Moreover, the model used in the paper is also a vector-based method.

  2. Using a large language model to generate the training dataset is risky in two ways. First, the generated descriptions may not cover all aspects of a given document, so some information about the document may be missing. Second, they may contain duplicate aspects of a document, so that after model training, some aspects are strengthened or biased by the data.

  3. It may not be a fair comparison between the proposed method and the baselines. As mentioned in Weakness 1, the meaning of relevance is changed. The proposed method, trained and evaluated on the same data distribution, is evidently at an advantage over models tested on an OOD distribution.

Questions

  1. It is an interesting paper that expands the view of relevance. However, the major concern is that the formulation of the description-based framework is also weak and lacks theoretical support, just like the vector-based one. I think how to formulate description-based relevance is the vital problem for the next version.
  2. How can we make sure that abstract descriptions are what we actually need in practical search? A statistical study could be included as supporting evidence.
  3. The comparison between the proposed method and the baselines should be fairer. Furthermore, some strong dense retrieval baselines should also be included in the experiments:
  • ANCE: Approximate nearest neighbor negative contrastive learning for dense text retrieval
  • BERM: BERM: Training the Balanced and Extractable Representation for Matching to Improve Generalization Ability of Dense Retrieval
  • TAS-B: Efficiently teaching an effective dense retriever with balanced topic aware sampling.
  • Contriever: Unsupervised dense information retrieval with contrastive learning.
Comment

We appreciate your engagement with our submission and the constructive feedback provided. The recognition of our paper's fresh perspective, novel description-based matching approach, rigor, and comprehensiveness is encouraging. We look forward to addressing the concerns you raised:

Risks of Using Large Language Models: The concern you raise is justified. However, our dataset is not a goal by itself. Rather, it is used to train an encoder, which we then thoroughly evaluate in both human and automatic evaluation. As such, even if the dataset contains biases of all sorts (which it probably does, like any other dataset, human-generated or not), it is still useful for tackling the problem of enabling search by abstract descriptions. Please note that as we describe in the paper, we did conduct a human evaluation study of the generated descriptions, focusing on their faithfulness to the original text. This human evaluation indicated that the descriptions are largely of high quality and are faithful to the text. Furthermore, human evaluation is also performed to evaluate the results retrieved by the final model trained on the GPT-generated data, and this model is significantly better than any of the baselines (figure 2).

Fairness in Comparison with Baselines: We emphasize that our modification of the dataset is a deliberate choice to highlight the limitations of existing frameworks and address the specific challenges of abstract semantic information retrieval. We see it as essential to point out that SOTA models fail to capture this notion of similarity, despite being trained on orders of magnitude more data. In other words, we empirically show the limitations of existing SOTA models in capturing the exact sense of similarity we are arguing for. Please see also “innovation” under the general response.

Theoretical Support for Description-Based Relevance and Practical Need for Abstract Descriptions: Our emphasis is on the conceptual innovation rather than the modeling side: we explicitly state in our introduction that our goal is to introduce a unique form of similarity—description-based similarity. This entails defining a specific notion of retrieval by similarity, and describing its usefulness in certain information-seeking scenarios. Please also refer to the sections “innovation” and “Use Cases of Description-Based Search” in our general response.

Inclusion of Strong Dense Retrieval Baselines: We added 4 SOTA text encoders to the automatic evaluation - please refer to the general response under “Baselines”.

How to make sure abstract description is what we actually need in practical search? The idea of performing a user study that empirically evaluates the kinds of queries end-users are interested in is highly interesting. Such a study has the potential to uncover specific use cases where the ability to search by abstract descriptions is particularly beneficial. We believe, however, that it is beyond the scope of this work: we do discuss in detail several naturally occurring scenarios where such ability is warranted, and the primary focus of our work is to showcase the limitations of existing methods in fulfilling this requirement and to propose a language model-assisted technique to address it. While a user study could offer supplementary insights, we view it as a distinct contribution, separate from the specific goals outlined in our present work.

Comment

Thank you for the detailed response. I understand your contribution to the task design and the benchmark construction. It seems you found some weaknesses in current retrieval models. However, my major concern is the solution to this problem: simply fine-tuning on the proposed dataset does not convince me that this is a hard task worth further exploration.

I suggest that, if the paper focuses on the task design and the benchmark construction, you should devote more space to analyzing the problem and its cause rather than to the performance comparison.

Based on the major concern mentioned above, I will not change my score.

Review (Rating: 5)

The authors in this paper propose a different search approach by retrieving sentences based on abstract descriptions of their content. The authors demonstrate the shortcomings of the current methods of text embeddings and propose a method to improve them. The authors created a dataset using LLMs to capture the notion of similarity and use it to train an encoder whose representations are better than the state-of-the-art. Specifically, the authors used GPT-3 to generate positive and misleading descriptions for sentences from the English Wikipedia dataset. The authors utilize a pretrained sentence embedding model and fine-tune it with contrastive learning to train their model for the task of aligning sentences with their descriptions. They use two encoders – one as a sentence encoder and the other as a description encoder. Limitations of the approach were not discussed in the paper.
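(For illustration: a minimal sketch of the two-encoder contrastive setup the review describes, with a sentence encoder and a description encoder trained so that aligned pairs score higher than in-batch negatives. The base model, pooling, and temperature are illustrative assumptions, not the paper's configuration.)

```python
# Dual-encoder contrastive sketch (illustrative, not the authors' code).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"  # placeholder base model
tok = AutoTokenizer.from_pretrained(name)
sent_enc = AutoModel.from_pretrained(name)   # encoder for sentences
desc_enc = AutoModel.from_pretrained(name)   # separate encoder for descriptions

def embed(encoder, texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state      # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)     # mean pooling over tokens
    emb = (hidden * mask).sum(1) / mask.sum(1)
    return F.normalize(emb, dim=-1)

sentences = ["Messi scored his 700th career goal.",
             "The court ruled in favor of the plaintiff."]
descriptions = ["An athlete reaching a career milestone.",
                "A legal decision in a civil case."]

s, d = embed(sent_enc, sentences), embed(desc_enc, descriptions)

# In-batch contrastive loss: each description's positive is its aligned
# sentence; the other sentences in the batch serve as negatives.
logits = d @ s.T / 0.05                              # temperature is a guess
loss = F.cross_entropy(logits, torch.arange(len(sentences)))
loss.backward()
```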

Strengths

  1. The authors propose a novel approach of retrieving by abstract descriptions instead of the regular search methods.
  2. The authors have used both human evaluation and automatic evaluation to evaluate the proposed model.

Weaknesses

  1. In the abstract, did you mean “inconsistent” instead of “non-consistent”? In the Introduction, "This make the” --> “This makes the”, "well defined” --> “well-defined”. There are several such grammatical errors, and it would benefit the authors to run the text through any of the free grammar tools available. Also, the authors can recheck the camel cases of sentences and sub-headers (full stop or no full stop?).
  2. What are the different use cases of the proposed description-based search in documents? The authors can discuss some different case studies or use cases to convince readers.

Questions

Comment

We appreciate your review of our submission and the constructive feedback provided. We highly value the recognition of our proposed novel approach to retrieval by description, which is distinct from regular search methods. Additionally, we appreciate the acknowledgment of our comprehensive evaluation strategy, which includes both human and automatic evaluation.

Grammatical Errors: Thanks for pointing out grammatical errors. We apologize for any oversight in our proofreading. In our revised manuscript, we addressed the issues you raised, and we will revise the manuscript again for the CR version.

Use Cases of Description-Based Search: Please refer to the general response under “Use Cases of Description-Based Search” and "Innovation" for a detailed response.

Comment

Thank you for responding to the question raised in the weaknesses. I have read through the general response and while I am not thoroughly convinced with the use cases (the ones mentioned in the response are generic, I was hoping to see something more specific, like a case study), I will retain my score because the work is novel and can contribute to the domain with some minor changes addressed.

Comment

Thanks for the response. Can you please clarify what you mean by a case study in this context?

Additionally, could you please clarify if there are any additional issues that require our attention? In your response you said you believe the paper is novel and can contribute to the field, but please note that your current score means the paper is below the acceptance threshold.

Review (Rating: 5)

The paper defines a new task that retrieves text based on abstract descriptions. The specific kind of similarity between text and abstract description is defined, and hand-curated examples were used in the instructions to the LLM to generate training data. The proposed method works better than other sentence/text retrievers trained with the general definition of sentence similarity on the test data designed for this task.

Strengths

The text and abstract description similarity is a very interesting type of similarity and would be of value to the information retrieval field. I think the strength of the paper is the design of the prompts to gather text/description pairs that satisfy the definition.

Weaknesses

The paper proposed an interesting new task. I'm confident it will be useful for some applications or existing retrieval applications. However, the paper didn't explore in much depth what would benefit from this new task.

Also, I would think it is pretty straightforward to see that the proposed method would outperform a general-purpose retrieval or sentence-similarity model. Those methods are not trained or fine-tuned on the same training data, which defines the relationship between a sentence and its abstraction.

Questions

Questions: How does precision@k decrease as k increases, especially for the proposed method?

Suggestions: I think this is an interesting task with value, but it is worth exploring what end tasks would benefit from this new task, or how this task poses challenges to existing retrieval models, if any.

Minor typos:

  • page 8. Settings. "invalid-recall@k" is missing the @ sign.
Comment

We appreciate the insightful feedback you provided, particularly the acknowledgment of the description-based similarity as an interesting type of similarity and its potential value to the information retrieval field. We refer to the specific issues you raised:

Exploration of Applications: We appreciate your acknowledgment of the interesting nature of our proposed task but recognize the need for a more detailed exploration of its potential applications. Please refer to the general response under “Use Cases of Description-Based Search” and under “Innovation” for a detailed response.

Expected Outperformance of the Proposed Method: Our emphasis is on drawing attention to the need to retrieve by description, and showcasing the limitations of existing encoders in this task. We find it essential to test models on our data to showcase the limitations of SOTA general-purpose encoders in capturing the notion of similarity we argue is useful. Their failure is a nontrivial finding, as these models were trained on orders of magnitude more data. It is not possible to highlight this inherent limitation of existing models without evaluating them on data tailored for our notion of similarity. Please refer also to our explanation under “innovation” in the general response.

How does precision@k decrease as k increases? Indeed, as we note in the paper, the improvement over the baselines is smaller when k increases. Our hypothesis is that this is a by-product of the standard contrastive learning objectives we employ, which focus on the top result and do not directly optimize the tail of the distribution (in other words, the loss takes into account a single positive at a time). We agree this is an interesting phenomenon to study in future work (in particular: can we design contrastive learning methods that take the tail of the distribution into account?).
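For concreteness, the standard in-batch contrastive objective alluded to above has the following form (our notation; the paper's exact objective may differ):

```latex
% InfoNCE-style loss for description d_i with its single aligned positive
% sentence s_i; the other in-batch sentences s_j act as negatives.
\mathcal{L}_i = -\log
  \frac{\exp\!\left(\operatorname{sim}(d_i, s_i)/\tau\right)}
       {\sum_{j=1}^{B} \exp\!\left(\operatorname{sim}(d_i, s_j)/\tau\right)}
```

Only the single positive appears in the numerator, so the gradient rewards placing it at the top of the ranking but imposes no direct constraint on the ordering of the remaining candidates, which is consistent with the smaller gains at larger k.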

Comment

We thank all reviewers for their feedback. They commend the paper's innovative approach to text and abstract description similarity (Reviewer SM29), the human and automatic evaluation we employ (Reviewer 34Co), the fresh perspective it offers on the traditional retrieval task and the clear writing (Reviewer s5K1), and its clear exposition and dataset curation (Reviewer Uira). Finally, Reviewer nR3E highlights the effectiveness of our model in description-based retrieval, emphasizing its practical impact in leveraging large LMs for tailored training datasets. Here we provide a concise response to recurring issues raised in the reviews. For convenience, we also respond to all individual claims below.

Innovation: Our novel contribution does not lie in model architecture, but in task definition and tailored data collection. We believe that contributions in ML extend beyond proposing better architectures or novel loss functions. Innovation can manifest in illuminating blind spots and implicit assumptions that impede progress in the current literature on text encoding, in the form of an ill-defined notion of similarity that commonly-used models optimize. The core objective of our paper is to introduce a fresh perspective on similarity—one that pertains specifically to the alignment between descriptions and corresponding texts. Notably, our findings reveal that even SOTA models trained on datasets orders of magnitude larger than ours struggle to generalize to description-based similarity, a surprising and nontrivial finding. Comparing existing models on data tailored to our notion of similarity is essential for showcasing the limitations of these models.

Use Cases of Description-Based Search: We envision the task of retrieving texts by their description to be directly useful on its own for many end-uses, in particular information seeking by experts in large document collections. This is true regardless of any relevance for an existing NLP end-task. A second-order user-study can be useful, but is not in the scope of the current work.

In the introduction section, we discuss the use cases of our proposed approach in information-seeking scenarios. Its usefulness is particularly highlighted in scenarios where experts aim to locate domain-specific relevant text in large corpora. For instance, a legal researcher can inquire about "precedents for intellectual property disputes"; an environmental scientist can explore the "impact of deforestation on local ecosystems"; and in educational research, a query like "innovative teaching methods for mathematics" allows researchers to find relevant materials without having to give a particular example of existing methods. Natural language descriptions provide an effective way for domain experts to search over large corpora of natural language texts.

Baselines: we acknowledge the importance of comparing against diverse baselines. While it is impractical to encompass all encoder models documented in the literature, we have conducted a comprehensive evaluation against E5, HyDE, and Instructor—acknowledged as state-of-the-art general-purpose retrieval models at the time of the submission, as per the MTEB benchmark. Responding to the reviewers' request for a more extensive evaluation, we have expanded our automatic evaluation to include 4 additional models: Contriever [1], GTE-large [2], BGE-en [3], and Ember [4] (the last 3 are the highest-ranked available models on MTEB; the Contriever model is highly popular and was proposed by Reviewer s5K1). The results of this evaluation are now presented in the revised paper (Figure 3 and Figure 4).

Notably, the 4 new models exhibit both higher valid recall and higher invalid recall, implying a tendency to rank both positive sentences and misleading negative sentences highly. Indeed, the precision@k results are notably low, indicating an inability to distinguish between positive instances and challenging negatives. This set of additional results emphasizes a consistent pattern across current state-of-the-art models—they struggle to generalize to the concept of description-based similarity. The observed limitations underscore the need to collect data tailored for our notion of similarity.
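(For illustration: the valid-recall@k / invalid-recall@k pattern discussed above, in a minimal hedged sketch; the IDs and labels are made up, not the paper's data.)

```python
# recall@k computed separately over the valid and the misleading (invalid)
# sentence sets; IDs and labels below are illustrative.
def recall_at_k(ranked_ids, target_ids, k):
    return len(set(ranked_ids[:k]) & target_ids) / len(target_ids)

ranked = ["s1", "s4", "s2", "s3"]   # retrieval order, best first
valid = {"s1", "s2"}                # sentences matching the description
invalid = {"s4"}                    # misleading, superficially similar sentences

print(recall_at_k(ranked, valid, k=2))    # 0.5 -> valid-recall@2
print(recall_at_k(ranked, invalid, k=2))  # 1.0 -> invalid-recall@2 (lower is better)
```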

[1] Izacard, Gautier, et al. "Unsupervised dense information retrieval with contrastive learning." arXiv preprint arXiv:2112.09118 (2021).

[2] Li, Zehan, et al. "Towards general text embeddings with multi-stage contrastive learning." arXiv preprint arXiv:2308.03281 (2023).

[3] Zhang, Peitian, et al. "Retrieve Anything To Augment Large Language Models." arXiv preprint arXiv:2310.07554 (2023).

[4] https://huggingface.co/llmrails/ember-v1