EAReranker: Efficient Embedding Adequacy Assessment for Retrieval Augmented Generation
Abstract
Reviews and Discussion
Summary
The paper introduces EAReranker, an efficient embedding-based framework for assessing document adequacy in RAG. EAReranker aims to address challenges of traditional reranking approaches by evaluating document utility without accessing original text content. It uses a decoder-only architecture for predicting adequacy directly from embedding vectors. Experiments on four benchmarks show performance competitive with existing rerankers.
Strengths and Weaknesses
Strength
S1. Adequacy is an overlooked and timely research area in the applications of RAGs. I appreciate that the paper focuses on this important topic.
S2. The proposed EAReranker method is sound. The proposed framework to assess adequacy looks useful to the community. The experiments are thorough, and the findings are promising.
S3. The paper is well-written and easy to follow.
Major Weakness
W1. The framework for assessing adequacy could be strengthened with more detailed explanations. For instance, the rationale behind the four dimensions of adequacy assessment needs clarification. Are these dimensions based on prior research? What is the reasoning for choosing six distinct bins with varying ranges?
W2. The challenges of existing approaches are (1) computational overhead, (2) dependency on plain text, (3) lack of refined content value assessment. The proposed method only addresses (1) and (3). Does EAReranker support non-text data? Additionally, if users must train their own EAReranker, this could reduce efficiency compared to methods that don't require training.
Minor Weakness
M1. The text in figure 1 is too small to see.
M2. RAG and embedding search work on multimodal content as well. (L39)
Questions
- How is the framework for assessing adequacy developed? Are these dimensions based on prior research? What is the reasoning for choosing six distinct bins with varying ranges?
- Does EAReranker support non-text data?
- Are users expected to train their own EAReranker? Or the trained EAReranker will be available for users?
Limitations
yes
Justification for Final Rating
Most of my questions have been answered. Given that my original rating was already positive, I will be maintaining it.
Formatting Issues
n/a
We are grateful to the reviewer for their thoughtful feedback and positive assessment of our work. We appreciate the recognition that adequacy is "an overlooked and timely research area" and that our proposed method is "sound" with "promising" findings. Below, we address the questions and concerns raised.
W1: Rationale for the Adequacy Assessment Framework
Thank you for this important question, which highlights a core contribution of our work. The reviewer asks for a more detailed explanation of our adequacy assessment framework, specifically the rationale behind the four dimensions and the choice of six bins with varying ranges.
1. On the Rationale for the Four Dimensions of Adequacy:
Our adequacy assessment framework is designed to move beyond traditional "relevance" and evaluate a document's utility as context for a generative model. This requires a multi-faceted view. The four dimensions—Verifiability, Need Coverage, Evidence Completeness, and Structure Suitability—were systematically developed based on an analysis of failure modes in RAG systems and a synthesis of principles from established research in information quality and argumentation theory.
- Verifiability: This dimension is inspired by the literature on fact-checking and information reliability [1]. An unverifiable document, even if topically relevant, poses a high risk of causing the LLM to hallucinate or generate misinformation.
- Need Coverage & Evidence Completeness: These dimensions are adapted from concepts in question answering and argumentation mining [2, 3]. A document might be relevant but incomplete, addressing only a fraction of the user's query (
Need Coverage) or presenting claims without sufficient supporting arguments (Evidence Completeness). These are common reasons for unsatisfactory RAG outputs. - Structure Suitability: This dimension addresses a practical but critical aspect of data quality and its impact on LLM behavior. Poorly structured documents can actively introduce misinformation; for example, erroneous OCR or incorrect parsing of multi-column PDFs can concatenate unrelated sentences, creating false statements. Furthermore, LLMs are susceptible to the style and quality of their context. When presented with long, poorly structured content, they are more likely to generate responses of similarly low quality. Conversely, a document with a structure aligned with the user's needs (e.g., a well-formed table for a comparative query) makes it easier for the LLM to produce a high-quality, well-formatted response. This is especially true for smaller, less capable LLMs, which may struggle to follow complex formatting instructions but can more easily leverage a well-structured source document.
2. On the Rationale for the Six Bins:
The six-bin structure with non-uniform score ranges was designed to provide both interpretable categories and operational utility for RAG systems. The bin boundaries are strategically placed to create actionable thresholds, a design choice informed by empirical analysis during our dataset annotation pilot studies.
- [0.90, 1.00] (Precise Adequacy) & [0.75, 0.90) (High Adequacy): This high-end separation distinguishes "perfect" context from "excellent" context. Empirically, documents providing fully comprehensive and perfectly structured answers are rare, justifying a narrow top bin.
- [0.50, 0.75) (Middle Adequacy) & [0.25, 0.50) (Marginal Relevance): The core of the distribution lies here. The 0.5 threshold is a critical pivot point, separating documents that are fundamentally useful from those that are only peripherally relevant.
- [0.10, 0.25) (Weak Relevance) & [0.00, 0.10) (Irrelevance): This low-end separation helps distinguish documents with a faint, almost useless signal from those that are completely off-topic. The 0.10 threshold acts as a "hard filter" to discard content that offers no value.
- The fine-grained splits at the extremes—[0.90, 1.00] (Precise Adequacy) and [0.00, 0.10) (Irrelevance)—serve a key purpose in training. By isolating the absolute best and worst cases, we provide the model with clearer, more definitive guidance signals at the poles of the adequacy spectrum.
- The Irrelevance bin ([0.00, 0.10)) is designated for samples that are completely unrelated, akin to the random negative samples often used in training traditional rerankers. By grouping them into this narrow, low-value range, we reduce the model's sensitivity to minute differences between them (e.g., the distinction between a score of 0.05 and 0.04 becomes less significant in the loss calculation). This encourages the model to allocate its capacity more effectively toward a more critical task: distinguishing between documents with varying degrees of utility in the higher adequacy ranges ([0.10, 1.00]), which often correspond to the "hard negatives" that challenge traditional models.
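For illustration, a minimal sketch of how a raw adequacy score maps onto these six bins. The bin names and boundaries follow the description above; the function name and implementation are illustrative, not the paper's code:

```python
ADEQUACY_BINS = [
    (0.90, 1.00, "Precise Adequacy"),
    (0.75, 0.90, "High Adequacy"),
    (0.50, 0.75, "Middle Adequacy"),
    (0.25, 0.50, "Marginal Relevance"),
    (0.10, 0.25, "Weak Relevance"),
    (0.00, 0.10, "Irrelevance"),
]

def bin_label(score: float) -> str:
    """Map an adequacy score in [0, 1] to its semantic bin.

    The upper bound is exclusive for every bin except the top one,
    so a perfect score of 1.0 still falls into "Precise Adequacy".
    """
    for low, high, label in ADEQUACY_BINS:
        if low <= score < high or (score == 1.0 and high == 1.0):
            return label
    raise ValueError(f"adequacy score out of range: {score}")
```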
W2: Support for Non-Text Data and Training Efficiency
1. Extensibility to Multimodal Data:
An important strength of the EAReranker framework is its inherent extensibility to non-text (multimodal) data. Our model operates exclusively on embedding vectors. Therefore, as long as a query and a document (be it an image, audio clip, or table) can be mapped to fixed-dimensional vectors by a capable multimodal embedding model, EAReranker can assess its adequacy without any architectural changes. This is a significant advantage over plaintext rerankers, which are fundamentally text-dependent.
There are two primary pathways to achieving this:
- Native Multimodal Training: One could construct a multimodal adequacy dataset using appropriate multimodal LLMs for annotation, and then train a natively multimodal EAReranker from scratch.
- Zero-Shot/Few-Shot Transfer: A more resource-efficient approach would be to leverage the existing text-based EAReranker. By using a strong multimodal embedding model (which maps text and images to a shared semantic space), the text-trained EAReranker could likely perform a form of zero-shot transfer, assessing image adequacy based on the semantic proximity of its embedding to relevant text embeddings. This could be further enhanced with a small amount of multimodal labeled data for fine-tuning.
2. Justifying the One-Time Training Effort:
The reviewer brings up the valid consideration of the initial training cost. We see this as an upfront investment that unlocks significant and continuous gains in inference efficiency.
- A "Train-Once, Use-Many" General-Purpose Model: For a given embedding model, EAReranker only needs to be trained thoroughly once. The resulting model can then be deployed as a general-purpose adequacy assessor by any user of that specific embedding model. This eliminates the need for each user to undergo the training process themselves.
- Amortized Costs in Production: For organizations that require custom fine-tuning on proprietary data, the training effort is a one-time or infrequent cost. In contrast, the computational overhead of plaintext rerankers is incurred with every query and scales with document length and traffic volume. Over time, the consistent, low-latency, and memory-stable inference provided by EAReranker (as shown in Table 6) results in a more scalable, predictable, and cost-effective production system, meaning the initial training cost is quickly amortized.
M1 and M2: Figure Text Size and Multimodal Content
Regarding M1, we agree with the reviewer that the text in Figure 1 was too small. We have remade the figure with a larger, more legible font in the revised paper to improve readability.
Regarding M2, thank you for this correction. RAG is not limited to text. We have corrected the statement on Line 39 and elsewhere to reflect that embeddings and RAG systems are increasingly used for multimodal content. This correction also strengthens our argument regarding EAReranker's applicability to non-text data, as discussed in our response to W2.
We hope these clarifications and the corresponding revisions to our paper have adequately addressed the reviewer's concerns. We thank the reviewer again for their thoughtful and valuable feedback.
References:
[1] Thorne J, Vlachos A, Christodoulopoulos C, et al. "Fever: a large-scale dataset for fact extraction and verification." NAACL 2018.
[2] Wachsmuth H, Trenkmann M, Stein B, et al. "A review corpus for argumentation analysis." Lecture Notes in Computer Science 2014.
[3] Wadden D, Lin S, Lo K, et al. "Fact or fiction: Verifying scientific claims." EMNLP 2020.
Thank you for the detailed response! The rationale for the adequacy assessment framework and the pathways to multimodal data are very informative. I recommend adding them to the camera-ready draft. Thanks!
Most of my questions have been answered. Given that my original rating was already positive, I will be maintaining it.
The paper points to the issues of high computational overhead, dependency on plain text, and insufficient assessment of document value in current RAG systems. To address these issues, the authors propose to use a decoder-only architecture that takes embedding vectors as input, where the embedding vectors are encodings of the input produced by a pretrained embedding model. The paper also claims to have constructed a new adequacy assessment metric.
Strengths and Weaknesses
Strengths
- The paper considers an important problem
- New dataset is curated
Weaknesses
- The paper keeps talking about high-level ideas but does not provide a concrete methodology to address the discussed issues. As an example, see Section 3.1, which is again a repetition of what was said in the Introduction, but it is not clear how the goals are actually achieved. I also do not see any mathematical description of the adequacy metric; the descriptions are at an extremely high level. Algorithm 1, which seems to be important in calculating some metric, is not even explained in the text.
- One of the issues considered is privacy. I do not see how just using pretrained encoders to separately extract embeddings solves the issue. What is the threat model there? What if the attacker also has access to the pre-trained encoder? Why is using a pretrained encoder safer than passing plain text through the whole system?
- I do not see much contribution of the work except a new dataset. Using embeddings instead of plain text is straightforward. It is not even directly clear how it solves the issues of computational overhead and privacy.
Questions
In addition to my questions posed in the weaknesses, please explain exactly how your proposed architecture mitigates the issues of computational overhead and privacy preservation.
Describe the adequacy metric mathematically. How does each component relate to and assure the four requirements specified in Section 4.1?
Limitations
No
Justification for Final Rating
Based on the authors' response, I have slightly increased my score. Unfortunately, it still remains below the acceptance threshold, as key details are missing from the paper. I believe that after the addition of these details the paper would require a new round of review.
Formatting Issues
None
We sincerely thank you for your time and for providing us with insightful and constructive feedback on our paper. Your comments have highlighted areas where further clarification can strengthen the paper. We have carefully considered each point and provide detailed responses below.
W1: Clarity of Methodology
We provide the following concrete details to clarify our methodology.
- Formal Problem Characterization: We will move beyond the high-level comparison and provide a formal problem characterization. We define the task mathematically, shifting from the traditional reranking function $f(q, d)$, which operates on the plaintext query $q$ and document $d$, to our proposed adequacy assessment model $g(e_q, e_d)$, which operates exclusively on the embedding vectors $e_q$ and $e_d$. We will explicitly state the four essential capabilities our model must possess to achieve this: (1) extracting fine-grained semantic information from fixed-dimension vectors; (2) operating exclusively on embeddings; (3) evaluating multi-dimensional adequacy beyond simple relevance; and (4) maintaining computational efficiency invariant to document length.
- Detailed Explanation of Algorithm 1: Algorithm 1 implements a hierarchical validation mechanism to ensure annotation quality:
- Process: It begins by scoring a query-document pair with an initial set of 4 LLMs. It calculates the mean score and checks if all individual scores fall within a tight tolerance of the mean (e.g., 0.2).
- Combinatorial Validation: If consistency is not achieved, the algorithm progressively incorporates scores from additional LLMs (up to a total of m). The core innovation lies in its combinatorial approach: at each step, it systematically evaluates all possible subsets of 4 model scores (Combinations(S, 4)) to find any self-consistent group.
- Rationale: This method is more robust than a simple average, as it can identify a reliable consensus even if a minority of models produce outlier scores. This ensures the high quality and consistency of the annotated dataset used to train EAReranker.
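As an illustration of this validation logic, here is a minimal sketch; the function and variable names are ours rather than the paper's, while the tolerance and group size follow the description above:

```python
from itertools import combinations
from statistics import mean

def consensus_adequacy(llm_scores, tolerance=0.2, group_size=4):
    """Sketch of the hierarchical validation described for Algorithm 1.

    `llm_scores` is an ordered list of adequacy scores, one per LLM: the
    first `group_size` entries come from the initial annotators, the rest
    from fallback models consulted only when consensus fails.
    Returns the consensus mean, or None if no self-consistent group exists.
    """
    pool = []
    for score in llm_scores:
        pool.append(score)
        if len(pool) < group_size:
            continue
        # Combinatorial validation: check every subset of `group_size` scores.
        for subset in combinations(pool, group_size):
            center = mean(subset)
            if all(abs(s - center) <= tolerance for s in subset):
                return center  # self-consistent group found
    return None  # sample is discarded when no group agrees
```

For instance, with initial scores [0.82, 0.78, 0.45, 0.80] the outlier 0.45 blocks consensus, but adding a fifth score of 0.76 lets the subset {0.82, 0.78, 0.80, 0.76} pass with mean 0.79.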
W1 and Q2: Mathematical Description of Adequacy Metric
We provide a more specific definition and examples of our adequacy assessment framework.
- Multi-Dimensional Adequacy Framework: We formalize adequacy through four pillars: Verifiability, Need Coverage, Evidence Completeness, and Structure Suitability. We clarify that our approach does not score each dimension independently but rather instructs the LLMs to holistically evaluate adequacy by simultaneously considering these factors, as they are inherently interconnected in practical RAG applications. For example, a document that has high Need Coverage but low Verifiability (i.e., it addresses the query but with unsubstantiated claims) would receive a low overall adequacy score. Table 1 links score ranges to the required qualitative levels on these dimensions.
- Mathematical Formulation of Score Calibration: We will detail the two-stage process for generating the final ground-truth score $s_{\text{final}}$.
- Stage 1: LLM-based Binning: A score is generated via Algorithm 1, which determines the semantic bin and its score range $[b_{\text{low}}, b_{\text{high}})$.
- Stage 2: Within-Bin Calibration: To enhance score granularity, we calibrate the score within the bin using signals from a set of plaintext rerankers. This is formalized by the equation below:

  $s_{\text{final}} = b_{\text{low}} + (b_{\text{high}} - b_{\text{low}}) \cdot r$

  Here, $[b_{\text{low}}, b_{\text{high}})$ is the score range of the bin assigned by the LLM, and $r \in [0, 1]$ is a normalized, composite score from multiple plaintext rerankers. This equation demonstrates how we preserve the LLM's high-level adequacy judgment (the bin) while injecting fine-grained ranking information.
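A minimal sketch of this within-bin calibration, assuming the composite reranker signal is simply the mean of individual reranker scores already normalized to [0, 1]; the function name and this averaging choice are illustrative assumptions:

```python
def calibrate_within_bin(bin_low: float, bin_high: float,
                         reranker_scores: list[float]) -> float:
    """Stage 2: map a normalized composite reranker signal into the
    LLM-assigned bin [bin_low, bin_high), preserving the bin decision
    while adding fine-grained ordering inside it."""
    r = sum(reranker_scores) / len(reranker_scores)  # composite signal (assumed mean)
    r = min(max(r, 0.0), 1.0)                        # clamp to [0, 1]
    return bin_low + (bin_high - bin_low) * r

# e.g., a document binned into Middle Adequacy [0.50, 0.75) with strong
# reranker agreement ends up near the top of that bin:
score = calibrate_within_bin(0.50, 0.75, [0.91, 0.88, 0.86])  # ~0.7208
```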
W2 and Q1: Computational Overheads and Privacy Preservation
- Explanation of Computational Overhead Mitigation:
- Constant Complexity: Traditional cross-encoder rerankers process concatenated `[CLS] query [SEP] document` sequences. Their computational complexity and memory usage scale with document length. In contrast, EAReranker's input is a fixed-length sequence of expanded embeddings, completely decoupled from the original document's length.
- Empirical Evidence: As shown in Table 6 of our main paper, this leads to constant VRAM usage (~550MB) for EAReranker, whereas plaintext models can consume up to 8441MB for long documents. This provides clear, practical mitigation of the overhead problem.
- Explanation of Privacy Preservation:
- Threat Model & Architecture: We consider a scenario where the inquirer and the data provider are two non-trusting parties. Our method decouples the document embedding process from the reranking service. A user can convert sensitive documents to embeddings locally using a public pre-trained encoder. Only the fixed-size, anonymized embedding vectors are sent to the remote reranking service.
- Security Guarantees: An embedding is a lossy, one-way representation. Reconstructing long, complex, or private text from its embedding is computationally infeasible, even if an attacker possesses the same encoder model.
- New Experimental Results: To empirically validate this, we conducted inversion attacks on our query dataset using state-of-the-art methods: Vec2Text, TEIA, and GEIA. The Rouge-L scores of the recovered text were 0.2873, 0.1969, and 0.3180, respectively, indicating very low semantic overlap and failed reconstruction. This demonstrates that sensitive raw data, which would be fully exposed with plaintext rerankers, remains protected in our framework.
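To make the decoupled deployment concrete, here is a hedged sketch of the client-side flow; the `embed` callable (any public pre-trained encoder such as bge-m3, returning a plain list of floats), the service endpoint, and the response schema are hypothetical placeholders rather than an API of the paper:

```python
import requests  # any transport to the remote reranking service would do

def rerank_privately(query, documents, embed, service_url):
    """Sketch of the privacy-preserving flow: the data owner embeds
    plaintext locally with a public pre-trained encoder and sends only
    fixed-size vectors to the remote adequacy-assessment service."""
    q_vec = embed(query)                    # local; plaintext never leaves
    d_vecs = [embed(d) for d in documents]  # typically pre-computed offline
    resp = requests.post(service_url, json={
        "query_embedding": q_vec,
        "doc_embeddings": d_vecs,
    })
    scores = resp.json()["adequacy_scores"]  # hypothetical response field
    return sorted(zip(documents, scores), key=lambda p: p[1], reverse=True)
```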
W3: Contribution of the Work
- Novel Architecture: A naive embedding-based model would lack the capacity for fine-grained assessment. Our contribution is a specialized architecture with two non-trivial components (see the sketch after this list):
- Embedding Dimension Expansion (Section 5): Our technique of transforming single vectors into rich sequential representations (`e'_{q,i} = W_{q,i}e_q + b_{q,i}`) is a novel method that allows a Transformer to "unfold" and deeply analyze the dense information within embeddings. Our ablation study (Table 5) shows this is critical, with its removal causing a 3.70% drop in ACC25.
- Bin-Aware Weighted Loss (Section 5): Our custom loss function, which specifically penalizes predictions crossing semantic bin boundaries, is essential for training a model that aligns with the operational needs of RAG. Its removal causes a 2.67% drop in ACC25.
- Principled Adequacy Framework: Our work is among the first to move beyond simple "relevance" to formally define and operationalize a multi-dimensional concept of "adequacy" for RAG. Our rigorous annotation methodology (Section 4) and the resulting large-scale dataset are significant contributions that enable future research.
- SOTA-Level Performance with High Efficiency: Demonstrating that an embedding-only model can achieve performance within 0.54% of state-of-the-art plaintext rerankers (Table 3) while being 2-3x faster with constant memory usage (Table 6) is a significant and non-obvious result.
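To illustrate the embedding dimension expansion referenced above, a minimal PyTorch sketch of the projection `e'_{q,i} = W_{q,i}e_q + b_{q,i}`; the dimensions, token count, and class name are illustrative assumptions rather than the paper's configuration, and the bin-aware loss is not shown:

```python
import torch
import torch.nn as nn

class EmbeddingExpansion(nn.Module):
    """Expand a single embedding into a sequence of `num_tokens` vectors via
    independent affine projections, e'_{q,i} = W_{q,i} e_q + b_{q,i}, so that
    a downstream Transformer can attend over the expanded sequence."""

    def __init__(self, embed_dim: int = 1024, num_tokens: int = 16):
        super().__init__()
        self.projections = nn.ModuleList(
            nn.Linear(embed_dim, embed_dim) for _ in range(num_tokens)
        )

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (batch, embed_dim) -> (batch, num_tokens, embed_dim)
        return torch.stack([proj(e) for proj in self.projections], dim=1)

# The query and document embeddings would each be expanded this way before
# being fed to the stacked decoder layers that regress the adequacy score.
expander = EmbeddingExpansion()
e_q = torch.randn(2, 1024)   # batch of 2 query embeddings
print(expander(e_q).shape)   # torch.Size([2, 16, 1024])
```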
We thank the reviewer for their critical assessment, which has helped us clarify the foundations and technical innovations of our work. We have endeavored to address each concern with concrete details and empirical evidence. We believe these clarifications substantiate the novelty and significance of our contributions to the field.
References:
[1] Morris J, Kuleshov V, Shmatikov V, et al. "Text Embeddings Reveal (Almost) As Much As Text." EMNLP 2023.
[2] Huang Y H, Tsai Y, Hsiao H, et al. "Transferable Embedding Inversion Attack." ACL 2024.
[3] Li H, Xu M, Song Y. "Generative Embedding Inversion Attack to Recover the Whole Sentence." Findings of ACL 2023.
Dear Authors,
Thanks for the elaborate response. While I appreciate the provided details, I still believe that the paper requires another round of full review. At the moment I will keep my score but look closely at other reviewers' reviews for the possibility of increasing the score.
The authors propose a methodology for ranking measuring both the relevance and adequacy of documents to be input to a reader in a RAG pipeline. They propose an annotation process and a predictor for this adequacy often finding that their approach can be more efficient while maintaining effectiveness.
Strengths and Weaknesses
Strengths
- The problem addressed is important as cutoff values presented in works become increasingly shallow.
- Writing is clear.
- Several benchmarks are utilized across multiple domains.
Weaknesses
- The authors should clarify their definition of adequacy; the notion of a comprehensive set of information would appear to relate more closely to recall. The use of accuracy doesn't appear to assess the criteria from which this approach is motivated.
- The methodology in dataset curation appears to apply a point-wise scoring method for how comprehensive each text is, as opposed to each ranking; this feels largely identical to the criteria used in relevance judgements (see Umbrela [1] for automated judgements and DL-19 [2] for human criteria).
- The authors note a multifaceted evaluation process; however, it is ambiguous how effectiveness is further decomposed beyond accuracy-based measures.
- The work is effectively an adapter and thus should be compared against such methods; see preTTR [3] or, more generally, model adapters [4, 5].
References:
[1]: Upadhyay, S., Pradeep, R., Thakur, N., Craswell, N. and Lin, J., 2024. Umbrela: Umbrela is the (open-source reproduction of the) bing relevance assessor. arXiv preprint arXiv:2406.06519.
[2]: Craswell, N., Mitra, B., Yilmaz, E., Campos, D. and Voorhees, E.M., 2020. Overview of the TREC 2019 deep learning track. arXiv preprint arXiv:2003.07820.
[3]: MacAvaney, S., Nardini, F.M., Perego, R., Tonellotto, N., Goharian, N. and Frieder, O., 2020, July. Efficient document re-ranking for transformers by precomputing term representations. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 49-58).
[4]: Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M. and Gelly, S., 2019. Parameter-efficient transfer learning for NLP. Proceedings of the 36th International Conference on Machine Learning, PMLR 97:2790-2799. Available from https://proceedings.mlr.press/v97/houlsby19a.html.
[5]: Zhang, R., Han, J., Liu, C., Gao, P., Zhou, A., Hu, X., Yan, S., Lu, P., Li, H. and Qiao, Y., 2023. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199.
Questions
The pipeline relies entirely on LLM agreement; no human annotation is described. Was any manual spot-checking performed to confirm bin labels, especially for boundary cases?
Could the authors rerun key experiments with multiple seeds and provide 95% confidence intervals or paired tests against baselines?
Algorithm 1 accepts a subset when all four scores lie within 0.2 of the mean. Why was 0.2 chosen, and how robust is the dataset size/quality if this tolerance is tightened or relaxed?
Limitations
yes
Justification for Final Rating
I feel the authors have made a good effort to address my concerns
Formatting Issues
None
W1, W2 and W3: Clarification of "Adequacy" versus "Relevance"
We thank the reviewer for raising this critical point. We acknowledge the need to better articulate the distinction between the "relevance" commonly measured in IR and the "adequacy" we propose for RAG systems.
The core distinction is that relevance measures topical alignment, while adequacy measures a document's functional utility for a downstream generation task. A document can be highly relevant but inadequate. For example, consider a medical query about a specific disease treatment. A short patient case file might be highly relevant from a keyword perspective, but a comprehensive medical guideline or a well-structured wiki page would be far more adequate, providing the necessary background, evidence, and detailed information for a generative model to produce a reliable and complete answer.
Our concept of adequacy is designed to capture this utility. As another example, for a complex query like "What are the latest regulatory compliance requirements for international data transfers across EU, UK and APAC regions for financial services?", a relevant document might simply discuss "EU data transfer regulations." However, an adequate document must be verifiable, comprehensive enough to cover the multifaceted aspects of the query (EU, UK, APAC, financial services), and structured in a way that the RAG model can synthesize a correct and actionable response.
Our methodology, as detailed in Section 4.1 of our submission, operationalizes this by evaluating documents on their ability to provide verifiable, comprehensive, and well-structured information. The examples in Table 2 ("Comparative Case Study") further illustrate this. For the query "Introduce China's capital," the document that simply repeats the query ("Introduce China's capital") achieves perfect cosine similarity (1.0) but is correctly assigned a very low adequacy score (0.1740) because it provides zero utility. Conversely, a detailed paragraph receives a high adequacy score (0.9261) even with a lower similarity score. This demonstrates that our adequacy assessment successfully moves beyond simple topical relevance.
W4: Comparison to Model Adapters
We appreciate the reviewer pointing out these valuable related works. We agree that EAReranker shares the goal of parameter efficiency with adapters and query-time efficiency with methods like preTTR. We would like to clarify the key distinctions that position our work as a novel approach.
-
Comparison with general model adapters: While conceptually similar in aiming for efficiency, our approach differs fundamentally in its architecture and point of application.
- Traditional adapters are small neural modules inserted between the layers of a large pre-trained model (e.g., BERT). They are trained while the base model's weights remain frozen.
- EAReranker, in contrast, is a standalone, post-hoc module. It operates on the output of a pre-trained embedding model (i.e., the embedding vectors) and does not require access to the internal architecture of the embedding model itself. This makes our approach more flexible and modular, as it can be paired with any black-box embedding model without modification.
-
Comparison with preTTR: preTTR is an excellent work on optimizing transformer re-rankers, and we thank the reviewer for the reference. Both preTTR and our work use pre-computation to accelerate query processing. However, they operate on different input modalities, which leads to different trade-offs:
- preTTR operates on text. It pre-computes contextualized representations for each token in a document. At query time, it loads these numerous token vectors and executes the remaining transformer layers on the combined query-document text. Its computational cost is still dependent on the length of the document.
- EAReranker operates on embeddings. It works with a single, fixed-size vector for the entire document. This fundamental design choice provides two key advantages, which are central to our paper's contributions:
- Constant Computational Profile: As shown in Table 6, EAReranker's inference time and memory usage (∼550MB) are constant and independent of the original document's length. This is a critical feature for systems handling documents of highly variable lengths, where text-based models like preTTR would see fluctuating resource usage.
- Inherent Privacy: By relying solely on embedding vectors, our method never requires access to the original plaintext. This makes it suitable for privacy-sensitive applications where raw content cannot be processed by a third-party reranker.
While a direct experimental comparison is challenging due to these architectural differences, we believe this highlights the unique value proposition of our embedding-only approach. We will add a more detailed discussion in the appendix to better contextualize EAReranker with respect to these important efficiency-focused methods.
Q1: Human Annotation and Spot-Checking
Thank you for this important question. We apologize for the omission in our description. While our goal was to build a scalable annotation pipeline, the process was not entirely automated and did include a manual verification step.
The authors performed manual spot-checking on approximately 1,000 samples generated by the LLM scoring framework. This process was crucial for validating the quality of the labels and refining our methodology. The insights gained from this manual review helped us iterate on and improve the prompts used for the LLMs (as described in Section 4.2) to ensure they captured the nuances of adequacy more accurately. This human-in-the-loop step was vital for ensuring the overall quality and reliability of our final dataset.
Q2: Statistical Significance
This is an excellent suggestion to enhance the rigor of our findings. We confirm that the results reported in the paper are stable across multiple runs. To demonstrate this, we trained our best-performing model variant (EAReranker with bge-m3 embeddings) five times with different random seeds. The mean and standard deviation for our primary adequacy metrics are reported below:
| Metric | Mean (%) | Std. Dev. (%) |
|---|---|---|
| ACC25 | 84.28 | 0.31 |
| LACC@25 | 86.12 | 0.28 |
| LACC@10 | 92.85 | 0.24 |
As shown, the standard deviations are very low, confirming that our training process is robust and the reported results are reliable and representative.
Q3: Hyperparameter in Algorithm 1
This is a very insightful question about a key hyperparameter in our data annotation pipeline. The tolerance threshold of 0.2 was chosen empirically to balance dataset quality and scale.
Our semantic adequacy bins, as shown in Table 1, have a width of 0.15 to 0.25. A tolerance of 0.2 ensures that the scores from the four LLMs in an accepted group are highly likely to fall within the same semantic bin, preventing major disagreements from polluting the label. We found that a stricter tolerance (e.g., <0.1) significantly reduced the dataset size by rejecting many valid but nuanced cases, while a looser tolerance (e.g., >0.3) introduced noticeable noise from inter-bin disagreements that harmed model performance. Therefore, the 0.2 threshold represents a pragmatic choice to maintain high inter-rater agreement at the semantic bin level, which is critical for our task, without overly constraining the size of the training data.
We believe these clarifications address the reviewer's concerns and underscore the contributions of our work. We are confident that our paper presents a novel and practical solution for an important problem in RAG systems. We thank the reviewer again for their time and valuable feedback.
Thank you for thoroughly addressing my comments. I still have concerns over the definition of adequacy and would like to discuss the notion of relevance. If we look at modern annotation, a perfectly relevant document is usually something wholly dedicated to the information (or generation need if the user is an LLM); a related text may incidentally have a lexical match, but this is merely a common heuristic. I think the authors' notion lies somewhere between relevance and the idea of a static ranking (think PageRank or authority indicators). If the authors could clarify this point (and potentially incorporate some of this discussion into the manuscript), I feel that my concerns are adequately addressed. Based on this, I would raise my score.
We are very grateful to the reviewer for their continued engagement and for providing this opportunity to further clarify our core concepts. The reviewer's question hits upon a crucial and nuanced point that distinguishes our work and is central to its contribution. We agree that a clearer articulation of this, incorporated into the manuscript, will significantly strengthen the paper.
Our concept of adequacy is designed to complement, not replace, the established notion of relevance. A document can be highly relevant yet functionally inadequate for fulfilling a user's underlying goal, especially in systems like Retrieval-Augmented Generation (RAG).
The core distinction is illustrated with examples where a document is undeniably relevant but varies greatly in its functional utility.
Example 1: Analytical Tasks (From Data to Insight)
Query: "Explain the key financial risks associated with investing in emerging market equities for a portfolio manager."
- Document A (High Relevance, Moderate Adequacy):
A quantitative assessment for a portfolio manager shows that the key financial risks in emerging market equities are currency fluctuation, political instability, and liquidity, with a 5% monthly Value at Risk of 15%.
- Document B (High Relevance, High Adequacy):
A guide for a portfolio manager explaining key financial risks: 1) Currency Risk: How a volatile exchange rate can diminish returns, even if stock prices rise. 2) Political Risk: How unstable governance can lead to sudden policy shifts, citing past crises as examples.
Both documents are highly relevant to the query. However, Document A presents information as raw data, requiring significant interpretation to form an explanation. Document B is structured for explanation, providing the definitions and context that directly facilitate the generation of a high-quality answer.
Example 2: Operational Tasks (From Information to Action)
Query: "Provide a guide on how to set up a secure private Docker registry using TLS for a DevOps engineer."
- Document A (High Relevance, Moderate Adequacy):
For a secure private Docker registry, a DevOps engineer must edit the `config.yml` to set the `tls.certificate` and `tls.key` paths. This is a primary step for enabling TLS. After saving the changes, the registry daemon must be restarted for them to take effect.
- Document B (High Relevance, High Adequacy):
This guide provides a step-by-step setup for a secure private Docker registry: Step 1) Generate TLS Certificates: Run the `openssl` command. Step 2) Configure Registry: Point to your generated files in `config.yml`. Step 3) Restart: Safely restart the Docker container.
Again, both documents are relevant. But Document A is a descriptive reference that forces the user to infer the correct sequence of actions. Document B's explicitly structured, step-by-step format provides an unambiguous and complete process, making it far more adequate for fulfilling the user's request for a "guide."
These examples illustrate that adequacy assesses the functional utility of information. It evaluates whether the content's presentation and structure are optimized for fulfilling the user's implicit task.
Adequacy vs. Static Authority & Relevance
The reviewer correctly intuited a connection to concepts like PageRank. To clarify our contribution, we present adequacy as a distinct, complementary dimension of assessment.
Here is a summary of these crucial distinctions:
| Feature | Relevance | Static Authority (e.g., PageRank) | Adequacy (Our Work) |
|---|---|---|---|
| Primary Goal | Measure topical alignment between query and document. | Measure query-independent, global importance of a document. | Measure a document's functional utility in fulfilling the user's underlying task. |
| Core Question | Is this document about what the user asked? | Is this document generally authoritative or trustworthy? | Does this document provide the necessary reliable, complete, and well-structured information to achieve the user's goal? |
| Key Signals | Lexical & Semantic: Keyword importance (e.g., TF-IDF, BM25), semantic similarity/match. | External Graph & User Behavior: Link graphs, click-through rates, domain reputation. | Intrinsic Content & Structure: Verifiability, Need Coverage, Evidence Completeness, and Structure Suitability. |
| Judgment Basis | Topical Match: Assesses how well the document's subject matter aligns with the query's topic based on its content signals. | Global Importance: Assesses the document's authority within the broader corpus, independent of any specific query. | Task Fulfillment: Assesses how well the document's content and presentation enable the completion of the query's implicit task (e.g., "explain," "guide"). |
| Applicability | Universal in information retrieval systems. | Primarily in large, interconnected corpora like the web. | Especially critical for advanced systems like RAG, which perform tasks beyond simple document retrieval. |
Relevance is a broad category that leverages both lexical signals (like BM25) and semantic signals. In contrast, Adequacy introduces a new set of intrinsic signals defined in our paper—Verifiability, Need Coverage, Evidence Completeness, and Structure Suitability—which are specifically designed to evaluate a document's fitness for fulfilling a complex user task. This makes the two concepts methodologically distinct.
Thank you again for your insightful and constructive feedback. It has been instrumental in helping us refine and strengthen our core claims. We will incorporate the key clarifications from this discussion into our final manuscript to make the distinction of our work clearer. We appreciate your valuable time and consideration of our work.
This paper introduces a novel embedding-based re-ranking method (adequacy as the score to re-rank), EAReranker, to address three challenges that traditional re-ranking approaches face: computational overhead scaling with document length, dependency on plain text, and assessment of document value beyond relevance metrics. EAReranker first expands the embeddings from an existing embedding model (such as bge-m3) using multiple independent projection layers, and then trains L stacked Transformer Decoder layers with the adequacy scores obtained from multiple LLMs through their adequacy assessment annotation framework. The effectiveness of EAReranker is demonstrated through examining its ranking performance (on 2 English retrieval datasets from BEIR and 2 Chinese retrieval datasets from C-MTEB) and its adequacy assessment accuracy (on the curated datasets based on bge-m3-data).
Strengths and Weaknesses
Strengths
- The proposed method is novel and inspiring. The three challenges that EAReranker aims to address are practical in real-world scenarios.
- Evaluation results on 4 retrieval datasets are convincing. EAReranker's ranking performance is competitive with those of superior plain-text re-rankers.
- The paper is well-written, and the structure of the paper is clear. The figures presented in the paper are also helpful for clearly understanding the proposed method.
Weaknesses
- The experimental settings on in-domain / out-of-domain are not presented in detail. I'm worried about the generalizability of EAReranker. Many existing works [1,2] usually use MSMARCO [3] as the training dataset and use the other 14 datasets in BEIR [4] as the evaluation datasets. I'm curious about the performance of EAReranker under this setting. Another option is using the well-trained EAReranker to perform evaluation on some other OOD datasets, such as AIR-Bench [5].
- One main challenge that EAReranker aims to address is the computational overhead scaling with document length. The 4 retrieval datasets used in the experiment all have relatively short documents in their corpora. MLDR [6] and NarrativeQA [7] are two long-document retrieval datasets used in the experiments of BGE-M3 [8]. Experimental results on these datasets would increase the soundness of the paper.
- The ablation study of the "within-bin score calibration methodology" introduced in Section 4.3 is missing.
[1] RepLLaMA: https://arxiv.org/pdf/2310.08319
[2] Llama2Vec: https://aclanthology.org/2024.acl-long.191.pdf
[3] MSMARCO: https://arxiv.org/pdf/1611.09268
[4] BEIR: https://arxiv.org/pdf/2104.08663
[5] AIR-Bench: https://arxiv.org/pdf/2412.13102
[6] MLDR: https://huggingface.co/datasets/Shitao/MLDR
[7] NarrativeQA: https://arxiv.org/pdf/2310.19923
[8] BGE-M3: https://arxiv.org/pdf/2402.03216
Questions
Please refer to the weaknesses.
Limitations
Yes.
Justification for Final Rating
I have no further technical questions to the paper. I am inclined to rate it as a weak accept as reflected by my original score.
Formatting Issues
No
W1: Generalizability on Out-of-Domain Datasets
Following your excellent suggestion, we conducted zero-shot evaluations on AIR-Bench, a challenging benchmark for real-world retrieval systems. We evaluated on wiki_zh and healthcare_en tasks following the official BM25 initial retrieval protocol:
Table 1: Zero-shot Evaluation on AIR-Bench (NDCG@10)
| Model | wiki_zh | healthcare_en |
|---|---|---|
| bge-reranker-v2-m3 (Plaintext) | 63.51 | 53.76 |
| gte-reranker-base (Plaintext) | 64.16 | 47.16 |
| Cosine (bge-m3) | 63.52 | 49.05 |
| Cosine (gte-base) | 61.86 | 47.48 |
| EAReranker (bge-m3) | 64.76 | 52.81 |
| EAReranker (gte-base) | 63.03 | 47.62 |
These results demonstrate EAReranker's strong generalization capability across domains and languages. Our embedding-only approach remains competitive with plaintext rerankers while preserving the efficiency and privacy advantages detailed in our paper. This addresses the generalizability concern by showing consistent performance on unseen OOD datasets.
W2: Performance on Long-Document Retrieval
Thank you for this critical suggestion. We have conducted additional experiments on NarrativeQA, a benchmark specifically designed for evaluating long-document retrieval:
Table 2: Ranking Performance on NarrativeQA (NDCG@10)
| Model | NDCG@10 |
|---|---|
| Cosine (bge-m3) | 48.72 |
| EAReranker (bge-m3) | 57.41 |
These results confirm that EAReranker effectively handles long documents while maintaining its key efficiency advantage. In RAG systems, document embeddings are typically pre-computed and stored in vector databases. EAReranker leverages these existing embeddings, eliminating the need to re-process long texts for each query. Unlike plaintext rerankers that incur computational costs proportional to document length, EAReranker maintains constant inference time and memory usage regardless of the original document size, while still achieving strong ranking performance.
W3: Ablation Study for Score Calibration
We appreciate this observation and have conducted the requested ablation study comparing our calibrated scores against uncalibrated LLM scores:
Table 3: Ablation Study of Score Calibration Methodology
| Training Data | ACC25 | LACC@10 |
|---|---|---|
| Calibrated Scores (full model) | 84.28% | 92.85% |
| Uncalibrated LLM Scores | 82.95% (-1.33) | 91.98% (-0.87) |
The results confirm that our calibration methodology provides a more effective training signal. This enables the model to better distinguish fine-grained differences in document adequacy within semantic bins, improving both accuracy and filtering capability.
We thank the reviewer again for their valuable feedback. The additional experiments have strengthened our work by validating EAReranker's generalization ability, long-document performance, and calibration methodology effectiveness. These results have been included in the appendix of our revised paper.
This paper introduces EAReranker, a novel and efficient embedding-based framework for assessing document adequacy in RAG systems. It effectively addresses critical challenges of traditional reranking: computational overhead scaling with document length, dependency on plain text, and insufficient assessment beyond simple relevance. Reviewers liked its effectiveness in achieving competitive performance with state-of-the-art plaintext rerankers while maintaining constant memory usage and processing 2-3x faster.
The authors were very diligent in their rebuttal. They addressed questions about performance on out-of-domain datasets (AIR-Bench) and long-document retrieval (NarrativeQA). The authors also provided extensive new experiments demonstrating strong zero-shot generalization and effective handling of long documents, significantly strengthening these points.
There were also requests for more detailed explanations of the algorithm, mathematical clarity for adequacy scoring, and ablation for score calibration, as well as statistical significance. Authors provided in-depth explanations, mathematical formulations, a key ablation study showing calibration methodology effectiveness, and statistical robustness tests.
There were still some open questions about clarity and some missing details. A lot of this was addressed in the rebuttal and I hope the authors carry over all of their nice work from the rebuttal into the next version of the paper.