PaperHub
Rating: 8.0 / 10 (Oral; 4 reviewers; min 8, max 8, std 0.0)
Individual ratings: 8, 8, 8, 8
Confidence: 3.0 · Correctness: 3.3 · Contribution: 3.5 · Presentation: 3.3
ICLR 2025

Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-05-02
TL;DR

How to better evaluate and improve LLMs for the RAG task

Abstract

Keywords

Large Language Models, Trustworthiness, Hallucinations, Retrieval Augmented Generation

Reviews and Discussion

Review (Rating: 8)

This paper examines the trustworthiness of LLMs within the RAG framework. The authors introduce a unified evaluation metric, TRUST-SCORE, designed to assess both response truthfulness and attribution groundedness, ensuring that model outputs are based on retrieved documents rather than on the model's internal parametric knowledge. Additionally, the authors propose a new alignment method, TRUST-ALIGN, to enhance LLM trustworthiness according to the TRUST-SCORE metric, demonstrating improvements over baseline approaches such as in-context learning when also evaluated with the proposed metric.

Strengths

The paper is well-written, with a relevant set of experiments to evaluate the meaningfulness of the proposed metric and alignment method across various datasets and model configurations. The inclusion of ablation studies and additional results allows for a deeper understanding of the method's impact.

The rationale behind each term in the proposed metric is clear and grounded in prior research, making it easy to interpret the components individually and to see which factors drive the final score. Such clarity is particularly valuable for RAG models, where responses must be both accurate and grounded in retrieved documents rather than relying on the model's internal knowledge.

Weaknesses

The results in Table 2 are hard to interpret. The best DPO values come from different model sizes depending on the dataset, which makes it hard to draw general conclusions about which configuration works best overall. Additionally, it’s unclear how reliable the TRUST-SCORE is as a standalone measure, since its value can end up high simply due to averaging across metrics rather than showing a clear advantage on each metric. This can lead to "metric gaming," where high performance on one sub-metric can boost the final score even if the other terms don’t improve much, which might overstate the model’s actual abilities. For example, the PostCite TRUST score of 22.28 in row 2 of Table 2 under QAMPARI is higher than other baselines, despite having an EM score of zero.

Another problem is that while FRONT seems to be the strongest baseline, we don’t have results for it across different model sizes and families, particularly those where DPO performs best in Tables 2 and 3. This makes it challenging to judge how beneficial TRUST-ALIGN really is.

Lastly, the paper doesn’t clearly explain how the proposed alignment method is different from previous SFT and DPO methods mentioned in the related work section, leaving some ambiguity around what’s truly novel here.

Questions

L246: “The positive response corresponds to an answer that encompasses expected gold claims for q and corresponding citations referring to the documents.”

Q1: If the response contains partial gold claims or partial citations, is it included in the dataset? If so, as + or -?

L277: “we fine-tune LLaMA-2-7b on the source datasets, creating M_sft (Appendix E.1).”

Q2: It's not clear how the fine-tuning is performed, and the appendix does not clarify it. Is the fine-tuning task next-token prediction? If so, how are questions, documents, and answers stitched together?

Q3: Why does TRUST-ALIGN without refusal HT increase the F1_RG score on the ELI5 dataset in Table 4? Is this a typo (see the 2nd item in the list of typos below)?

Q4: I understand the importance of disentangling parametric knowledge from grounded knowledge for the purpose of this study, but in practice, one would leverage both. If I understand correctly, the alignment method proposed by the authors could be used to reduce hallucination. How do the metric and alignment method introduced in this work generalize to real-world scenarios?

Typos:

L282: 19K. Figure 2 says 20K and 50% of 40K is 20K.

L449: the difference is 0.48% for QAMPARI and -0.78% for ELI5.

Comment

[Clarification 7]: Why does TRUST-ALIGN without refusal HT increase the score in the ELI5 (Table 4)?

Thank you for helping to bring this typo and the others mentioned in "Typos" to our attention. The value in the text for ELI5’s F1_RG was mistakenly calculated using Trust-Score’s value; it will be corrected in the next version. As shown in the table, there is indeed an increase in F1_RG for ELI5 by 0.78% for the case without refusal HT. We provide a discussion below as to why an increase is observed.

We observe that the increase in overall F1_RG is due to the recall for answerable questions (R_ans, the proportion of answered answerable questions over all answerable questions) increasing more than the corresponding drop in precision (P_ans, the proportion of answered answerable questions over all answered questions), resulting in an increase in F1_ans. This increase outweighs the drop in F1_ref, leading to the overall rise in F1_RG.

Table 2: Component-wise breakdown of F1_RG

| Model | R_ref | P_ref | F1_ref | R_ans | P_ans | F1_ans | F1_RG |
|---|---|---|---|---|---|---|---|
| DPO-LLaMA-2-7b | 83.98 | 84.95 | 84.46 | 43.00 | 41.20 | 42.08 | 63.27 |
| Without refusal HT | 78.43 | 86.75 | 82.38 | 54.10 | 39.57 | 45.71 | 64.05 |
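For concreteness, here is a minimal sketch of how these components appear to combine, assuming (consistently with the numbers in the table above) that F1_RG is the unweighted mean of F1_ref and F1_ans; the variable names are ours, not the paper's code:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Values (in %) from the "Without refusal HT" row above.
f1_ref = f1(precision=86.75, recall=78.43)  # grounded-refusal F1  ~ 82.38
f1_ans = f1(precision=39.57, recall=54.10)  # grounded-answer F1   ~ 45.71
f1_rg = (f1_ref + f1_ans) / 2               # macro-average        ~ 64.05
print(round(f1_ref, 2), round(f1_ans, 2), round(f1_rg, 2))
```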

[Clarification 8]: How do the metric and alignment method introduced in this work generalize to real-world scenarios?

It is an interesting question, thank you for asking. In critical domains like healthcare and law, one should rely only on vetted knowledge retrieved from a vetted knowledge base to ground the answer; the answers are not admissible otherwise. We may allow the LLM to employ the commonsense reasoning skills learned during pre-training, but those must not override what has been stated in the vetted documents. This is especially important for facts, but also for the rules of reasoning, should they change from context to context. For instance, the legal driving age varies from country to country, but it is typically 18, which could be encoded in the LLM's parameters. If the reasoning involves such information, the LLM must use the vetted legal age of, say, 21 over its parametric knowledge that states it is 18. This is where grounded citation (F1_CG) comes into the picture. Again, should an LLM be unable to determine the legal age in some country to answer some question (where knowing this information is necessary, but not necessarily sufficient), it should simply refuse to do so, as it cannot ground the final answer in the retrieved documents. This is an aspect that grounded refusal (F1_RG) seeks to measure. Using the age of 18, as encoded in the parameters, cannot be truthfully grounded and thus should not be trusted. Such scenarios are meant to be captured by our proposed metrics and mitigated by alignment.

If the reviewer feels that this point should be emphasized in the draft, we would love to do it.

Comment

Thank you for addressing my questions and conducting extra experiments. I am satisfied with the answers. I updated my ratings to reflect my new opinion that this paper is a clear accept.

Comment

Dear Reviewer ewWr,

Thank you so much!

--

Authors

Comment

[Clarification 3]: Additional FRONT results demonstrating the effectiveness of Trust-Align

We thank the reviewer for raising this point. In our revised version, we have included the complete results of FRONT. Additionally, we have attached the results of FRONT across different model sizes and families (corresponding to tables 2 and 3) for your convenience below.

Trust-Align's performance relative to the additional FRONT results is on par with its performance relative to the existing FRONT results. Moreover, Trust-Align delivers substantial improvements in F1_RG across all 27 configurations and enhances F1_CG in 24 out of 27 configurations, further demonstrating its effectiveness in improving response groundedness and citation quality.

Table 1: Additional FRONT results

| Model | Dataset | Method | AR (%) | EM^{F1}_{AC} | F1_RG | F1_CG | TRUST |
|---|---|---|---|---|---|---|---|
| LLaMA-2-7b | ASQA | FRONT | 100.00 | 60.47 | 39.15 | 68.86 | 56.16 |
| | | DPO | 65.30 | 52.48 | 66.12 | 83.94 | 67.51 |
| | QAMPARI | FRONT | 100.00 | 17.27 | 22.78 | 24.26 | 21.44 |
| | | DPO | 32.30 | 32.03 | 71.67 | 49.42 | 51.04 |
| | ELI5 | FRONT | 100.00 | 21.66 | 17.15 | 52.72 | 30.51 |
| | | DPO | 21.60 | 22.54 | 63.27 | 47.35 | 44.39 |
| LLaMA-3.2-1b | ASQA | FRONT | 79.11 | 48.22 | 54.48 | 48.29 | 50.33 |
| | | DPO | 41.67 | 38.64 | 58.61 | 79.35 | 58.87 |
| | QAMPARI | FRONT | 98.60 | 7.57 | 24.54 | 15.32 | 15.81 |
| | | DPO | 20.00 | 27.22 | 67.92 | 49.42 | 48.19 |
| | ELI5 | FRONT | 97.20 | 16.11 | 20.76 | 30.19 | 22.35 |
| | | DPO | 9.60 | 13.20 | 59.35 | 48.21 | 40.25 |
| LLaMA-3.2-3b | ASQA | FRONT | 95.25 | 63.19 | 49.45 | 57.46 | 56.70 |
| | | DPO | 77.85 | 59.82 | 66.38 | 84.21 | 70.14 |
| | QAMPARI | FRONT | 92.70 | 12.99 | 32.89 | 19.19 | 21.69 |
| | | DPO | 48.20 | 29.13 | 70.85 | 45.65 | 48.54 |
| | ELI5 | FRONT | 86.90 | 19.95 | 32.21 | 41.97 | 31.38 |
| | | DPO | 17.50 | 18.33 | 62.79 | 55.87 | 45.66 |
| LLaMA-3-8b | ASQA | FRONT | 99.05 | 62.25 | 41.62 | 66.14 | 56.67 |
| | | DPO | 56.43 | 53.94 | 65.49 | 88.26 | 69.23 |
| | QAMPARI | FRONT | 100.00 | 13.53 | 22.78 | 20.42 | 18.91 |
| | | DPO | 22.40 | 35.35 | 70.73 | 58.77 | 54.95 |
| | ELI5 | FRONT | 99.50 | 18.99 | 17.85 | 44.69 | 27.18 |
| | | DPO | 15.50 | 20.81 | 63.57 | 50.24 | 44.87 |
| Phi3.5-mini | ASQA | FRONT | 99.79 | 63.30 | 39.79 | 71.63 | 58.24 |
| | | DPO | 66.56 | 52.23 | 64.20 | 85.36 | 67.26 |
| | QAMPARI | FRONT | 100.00 | 11.97 | 22.78 | 21.50 | 18.75 |
| | | DPO | 30.10 | 36.42 | 73.95 | 53.40 | 54.59 |
| | ELI5 | FRONT | 96.60 | 21.46 | 21.35 | 61.41 | 34.74 |
| | | DPO | 24.90 | 23.39 | 67.62 | 47.42 | 46.14 |

[Continued below]

Comment

[Continued from above]

| Model | Dataset | Method | AR (%) | EM^{F1}_{AC} | F1_RG | F1_CG | TRUST |
|---|---|---|---|---|---|---|---|
| Qwen-2.5-0.5b | ASQA | FRONT | 100.00 | 42.83 | 39.15 | 45.87 | 42.62 |
| | | DPO | 71.84 | 50.59 | 61.28 | 52.40 | 54.76 |
| | QAMPARI | FRONT | 99.30 | 11.52 | 23.23 | 15.90 | 16.88 |
| | | DPO | 17.90 | 15.76 | 61.84 | 29.73 | 35.78 |
| | ELI5 | FRONT | 99.90 | 13.74 | 17.29 | 27.95 | 19.66 |
| | | DPO | 21.70 | 13.68 | 60.79 | 22.72 | 32.40 |
| Qwen-2.5-1.5b | ASQA | FRONT | 99.26 | 57.74 | 41.36 | 55.70 | 51.60 |
| | | DPO | 72.57 | 52.68 | 62.38 | 66.81 | 60.62 |
| | QAMPARI | FRONT | 98.80 | 16.05 | 24.45 | 11.60 | 17.37 |
| | | DPO | 20.00 | 23.80 | 68.46 | 50.98 | 47.75 |
| | ELI5 | FRONT | 99.90 | 19.57 | 17.29 | 37.70 | 24.85 |
| | | DPO | 33.60 | 19.03 | 57.91 | 31.63 | 36.19 |
| Qwen-2.5-3b | ASQA | FRONT | 97.47 | 55.15 | 44.01 | 62.72 | 53.96 |
| | | DPO | 49.47 | 55.19 | 63.76 | 78.64 | 65.86 |
| | QAMPARI | FRONT | 79.10 | 20.69 | 48.62 | 25.67 | 31.66 |
| | | DPO | 48.10 | 35.69 | 70.31 | 45.64 | 50.55 |
| | ELI5 | FRONT | 93.60 | 18.69 | 25.37 | 37.40 | 27.15 |
| | | DPO | 13.50 | 22.52 | 64.38 | 42.01 | 42.97 |
| Qwen-2.5-7b | ASQA | FRONT | 86.39 | 64.58 | 60.08 | 58.27 | 60.98 |
| | | DPO | 59.49 | 55.04 | 66.22 | 83.57 | 68.28 |
| | QAMPARI | FRONT | 84.70 | 17.02 | 42.85 | 24.48 | 28.12 |
| | | DPO | 32.10 | 30.11 | 70.68 | 53.48 | 51.42 |
| | ELI5 | FRONT | 57.60 | 28.27 | 54.14 | 56.61 | 46.34 |
| | | DPO | 21.00 | 24.30 | 63.79 | 47.02 | 45.04 |
Comment

[Clarification 4]: Distinction from previous SFT and DPO works

We acknowledge that the distinctions in our approach may not have been clearly highlighted, and we appreciate the opportunity to clarify.

In this work, we aim to address the groundedness problem in RAG more holistically by introducing the concept of grounded answers, refusals, and attributions—where the model must refuse to answer if the provided documents lack sufficient information pertaining to the query, answer with information grounded in the documents, and provide appropriate attributions for the context (set of documents). To improve an LLM's RAG fitness across these three aspects, we contributed in two broad directions: 1) Proposing a metric that can holistically measure groundedness, and 2) Constructing an alignment dataset to enhance the groundedness score of a given model.

When it comes to the second aspect of our contribution, which we believe the reviewer is concerned about, while the overarching elements of our framework may share similarities with prior works [1-5]—e.g., collecting seed data from open-source datasets, automatic data construction, fine-tuning with SFT/DPO—there are fundamental differences in our data construction approach that contribute towards novelty. One of the core distinctions of a data construction method lies in the specific problem it is designed to address. In our case, we aim to enhance the model's performance in three key aspects: refusal groundedness, answer groundedness, and citation groundedness. A non-ideal model is prone to exhibit five types of hallucinations, each related to one of the groundedness aspects: Unwarranted Refusal, Over-Responsiveness, Overcitation, Improper Citation, and Inaccurate Claims (as shown in Table 1). Focusing on these broad categories of hallucinations (including the important refusal-based ones) is one of the primary distinguishing factors compared to related works. This led us to design dedicated data augmentation techniques that encompass these non-idealities in the training and test sets; thus, the construction method differs substantially (Section 4).
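Purely as an illustration of how such labels could be assigned mechanically (the decision rules below are our own simplification, not the paper's exact criteria, and the boolean checks are assumed to come from upstream components such as an NLI model and the gold claims):

```python
def hallucination_type(answerable: bool, refused: bool, claims_match_gold: bool,
                       citations_support: bool, citations_minimal: bool) -> str:
    """Map a generated response to one of the five error types (illustrative heuristic)."""
    if answerable and refused:
        return "Unwarranted Refusal"        # refuses although the documents contain the answer
    if not answerable and not refused:
        return "Over-Responsiveness"        # answers although the documents are insufficient
    if refused:
        return "grounded refusal (no error)"
    if not claims_match_gold:
        return "Inaccurate Claims"          # claims do not match the expected gold claims
    if not citations_support:
        return "Improper Citation"          # cited documents do not support the statement
    if not citations_minimal:
        return "Overcitation"               # more citations than a minimal sufficient subset
    return "grounded answer (no error)"
```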

Notably, similar works [1-5] improve the overall framework with a focus on data construction but do not propose advancements in fine-tuning or preference optimization techniques for RAG. Similarly, it is not the focus of our work to propose a novel fine-tuning technique for RAG. We believe that specialized training techniques tailored for this task are interesting but out of the scope of this work.

We appreciate the opportunity to discuss the contribution of our work and are happy to address any further questions or clarifications the reviewer may have.

References:

[1] Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv. https://arxiv.org/abs/2310.11511

[2] Ye, X., Sun, R., Arik, S. Ö., & Pfister, T. (2023). Effective Large Language Model Adaptation for Improved Grounding and Citation Generation. arXiv. https://arxiv.org/abs/2311.09533

[3] Huang, L., Feng, X., Ma, W., Gu, Y., Zhong, W., Feng, X., … Qin, B. (2024). Learning Fine-Grained Grounded Citations for Attributed Large Language Models. arXiv. https://arxiv.org/abs/2408.04568

[4] Li, D., Sun, Z., Hu, B., Liu, Z., Hu, X., Liu, X., & Zhang, M. (2024). Improving Attributed Text Generation of Large Language Models via Preference Learning. arXiv. https://arxiv.org/abs/2403.18381

[5] Zhang, T., Patil, S. G., Jain, N., Shen, S., Zaharia, M., Stoica, I., & Gonzalez, J. E. (2024). RAFT: Adapting Language Model to Domain Specific RAG. arXiv. https://arxiv.org/abs/2403.10131v1

Comment

[Clarification 5]: If the response contains partial gold claims or partial citations, is it included in the dataset? If so, as + or -?

For responses containing partial gold claims or partial citations, we include it in the dataset as the negative (unpreferred) response. Specifically, it is categorized as having “Inaccurate Claims” or “Improper Citation” hallucination type respectively as defined in our paper.


[Clarification 6]: Is the fine tuning task next token prediction? If so, how are questions, documents and answers stitched together?

Thank you for bringing this up. This section could indeed benefit from further clarification. The fine-tuning task is a next-token prediction task: the model learns to predict the next token given an input that includes instructions, documents, and questions formatted as a structured prompt. To fine-tune the model, we designed the input format to guide the model's behavior and align it with the objectives of the task. We attach the complete prompt format used for fine-tuning to illustrate how the questions, documents, and answers are stitched together:

Instruction: Write an accurate, engaging, and concise answer for the given question using only the provided search results (some of which might be irrelevant) and cite them properly. Use an unbiased and journalistic tone. Always cite for any factual claim. When citing several search results, use [1][2][3]. Cite at least one document and at most three documents in each statement. If multiple documents support the statement, only cite a minimum sufficient subset of the documents. If none of the provided documents contains the answer, only respond with ‘‘I apologize, but I couldn’t find an answer to your question in the search results.’’

Question: Who was looking for a heart in the wizard of oz?

Document [1]: {passage1}
Document [2]: {passage2}
Document [3]: {passage3}
Document [4]: {passage4}
Document [5]: {passage5}

Answer:

Model output: The Tin Woodman was looking for a heart in "The Wizard of Oz" [1][2][4].
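For illustration, here is a minimal sketch of how such an example could be stitched together for next-token-prediction fine-tuning, with the loss masked on the prompt tokens; the helper function and the exact masking convention are our assumptions, not necessarily the authors' implementation:

```python
from transformers import AutoTokenizer

# Any causal-LM tokenizer works for the illustration; the checkpoint name is only an example.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def build_sft_example(prompt: str, answer: str) -> dict:
    """Concatenate prompt and answer; supervise only the answer tokens."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    answer_ids = tokenizer(answer + tokenizer.eos_token, add_special_tokens=False).input_ids
    return {
        "input_ids": prompt_ids + answer_ids,
        # -100 labels are ignored by the cross-entropy loss, so only the answer is learned.
        "labels": [-100] * len(prompt_ids) + answer_ids,
    }

example = build_sft_example(
    prompt="Instruction: ...\n\nQuestion: ...\n\nDocument [1]: ...\n\nAnswer: ",
    answer='The Tin Woodman was looking for a heart in "The Wizard of Oz" [1][2][4].',
)
```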

We hope this clarifies the reviewer’s doubts and we are happy to provide more clarification if needed.

Comment

Thank you for your constructive feedback. We have carefully reviewed and addressed the points you raised, and we are happy to provide further clarification if needed. If you find that your concerns have been resolved, we would be grateful if you could consider improving your score.

[Clarification 1]: Table 2 clarification

Thank you for your feedback. Trust-Align does not aim to introduce a new state-of-the-art LLM for RAG; as such, we do not identify a single best configuration. Instead, the method demonstrates how to enhance an LLM's capabilities in three key dimensions of trustworthiness: answer correctness (EM^{F1}_{AC}), citation groundedness (F1_CG), and refusal groundedness (F1_RG), relative to its own baseline.

Table 2 aims to highlight the effectiveness of Trust-Align in enhancing the performance of each model family, rather than comparing between models. To better convey this within-model improvement, we will adjust how the results are presented; specifically, we will highlight the best values within each family.


[Clarification 2]: Reliability of Trust-Score as a standalone measure

Thank you for bringing up this crucial point. We agree that the standard arithmetic average (AM) may not represent the skewed score of the sub-metrics. Thus, depending on what is important, defined by the application and use case, one may take a weighted average. As far as we understand, there’s no one-size-fits-all solution in this case, as aggregation would inevitably lead to information loss. Thus, the individual sub-metrics are independently crucial.
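As a small illustration of this point, the sketch below contrasts the plain arithmetic average with a use-case-weighted aggregate; the weights and the numeric values are placeholders of our own, not values from the paper:

```python
def trust_score(em_ac: float, f1_rg: float, f1_cg: float,
                weights: tuple = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted aggregate of the three sub-metrics; equal weights give the plain average."""
    w_ac, w_rg, w_cg = weights
    return w_ac * em_ac + w_rg * f1_rg + w_cg * f1_cg

# A PostCite-like profile: zero answer correctness but non-trivial groundedness terms.
print(trust_score(em_ac=0.0, f1_rg=23.0, f1_cg=44.0))                           # plain average
print(trust_score(em_ac=0.0, f1_rg=23.0, f1_cg=44.0, weights=(0.6, 0.2, 0.2)))  # correctness-heavy
```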

Review (Rating: 8)

The authors present a study of ‘grounded’ RAG in LLMs, i.e., a method for evaluating and aligning LLMs for RAG so that RAG responses are correct, cite the relevant literature and correct information, and identify when the model does not have the necessary information to respond accurately (i.e., the “groundedness problem”). Using direct preference optimization, the fine-tuned TRUST-ALIGN model approaches SOTA performance.

Strengths

The paper is an original and significant contribution to the field of RAG in LLMs as it helps to mitigate the problems of citation hallucination or groundedness of claims. The paper clearly articulates an extended experimental protocol, providing the rationale for its methodology. The findings are supported by the presentation of the results, and extensive appendices documenting the various steps of the study.

Weaknesses

I cannot find any statement with regard to data validation by human reviewers. Despite the multiple steps involved in generating the training sets, it's unclear how the training data were validated. It would appear that the soundness of the generated dataset relies on the natural language inference engine and GPT-4 validation alone. It is claimed that many of the results are significant, but I cannot find any statistical tests to support those claims.

Questions

1017 Collecting Quality Questions. “The dataset construction begins by collecting a set of high-quality (challenging) and diverse questions from source datasets i.e. ASQA, QAMPARI, and ELI5—referred to as seed samples” -> These are used in evaluation; doesn’t this violate the separation principle? I understand these were used as ‘seed samples’, but if they are semantically similar enough, what guarantees are there that the models are not just fitting to the test data?

Comment

[Clarification 2]: Statistical testing

Thank you for bringing this up. We intentionally omitted the significance testing results from the main draft to improve table readability and we are happy to discuss it here. However, if the reviewers feel these results should be included, we will make sure to incorporate them in the next version. For your reference, we provide below the results of an independent sample t-test comparing baseline models with Trust-Aligned models.

Table 2: Results from significance testing on Trust-Score

| Dataset | t-statistic | p-value |
|---|---|---|
| ASQA | 7.70 | 9.85e-10 |
| ELI5 | 6.40 | 1.18e-06 |
| QAMPARI | 7.83 | 3.31e-08 |
| EXPERTQA | 8.63 | 8.00e-10 |

Across all datasets, we observed that t-statistic > 0 and p < 0.001 (Table 2), indicating significant improvements for Trust-Score at a significance level of 0.01.
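For reference, a minimal sketch of this kind of test with SciPy; the score arrays below are placeholders, not the actual per-configuration Trust-Score values:

```python
from scipy import stats

# Trust-Score values for baseline vs. Trust-Aligned models (placeholder numbers).
baseline_scores = [42.1, 38.7, 45.3, 40.2, 44.8, 39.9]
aligned_scores = [55.6, 52.9, 58.1, 54.4, 57.0, 53.2]

t_stat, p_value = stats.ttest_ind(aligned_scores, baseline_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")  # t > 0 favors the Trust-Aligned models
```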


[Clarification 3]: Separation of train and test set/test-set integrity

An important aspect of any research is ensuring that the validation set and test set are well separated from the training set to maintain test-set integrity, which was a paramount consideration in our experiments. Thus, we ensure that the test set remains as disjoint as possible from the training samples. Precisely, the seed samples are sourced from the training split of three candidate data sources: ASQA, QAMPARI, and ELI5, totaling approximately 10,000 samples. Evaluation is conducted on their corresponding disjoint test split (2,948 samples) to preserve the separation between training and test data.

To assess generalizability, we also tested on the out-of-domain dataset ExpertQA, comprising expert-curated samples from 32 fields such as medicine, law, history, and engineering. This evaluates the aligned models on questions distinct from training domains (e.g., Google, Reddit, and Wikipedia). We appreciate the reviewer bringing up this question and will ensure these details are included in the updated draft to avoid any further confusion.

Comment

Thank you for your detailed and constructive feedback, we answer the questions/provide clarifications below:

[Clarification 1]: Human evaluation of Trust-Align dataset

That's a great point, here we provide observations from the human evaluation. From the Trust-Align dataset, we created a data mix containing 97 samples, with approximately 20 samples representing each error type. This was followed by human evaluation where five expert annotators rated each sample. Each response was rated on three dimensions, taking inspiration from the criteria used by Gao et al. (2023) and Liu et al. (2023): (1) Correctness, (2) Citation Recall, and (3) Citation Precision. For Correctness, given a response, a set of documents, and the question, a human evaluator assesses whether the answer is correct. The answer is labeled “correct” if it fully satisfies the information requested in the question and if the claims can be inferred from the documents; otherwise, it is marked “wrong.” For Citation Recall, given a sentence and all its cited documents, human evaluators determine whether the complete set of citations “fully supports” or “does not support” the sentence. For Citation Precision, given a sentence and one of its citations, human evaluators decide whether the citation “fully supports,” “partially supports,” or “does not support” the sentence.

Table 1: Data quality review of Trust-Align data

| Metric | Agreement (%) |
|---|---|
| All samples | 79.90 |
| Positive samples | 80.41 |
| Negative samples | 79.38 |
| All citations | 76.96 |

Table 1 demonstrates a high degree of agreement (79.90%) between human annotations and our automated response labels, with 80.41% agreement on positive responses and 79.38% agreement on negative responses.

Our findings also show a high degree of agreement (76.96%) between human annotations and TRUE NLI on the necessity (precision) and sufficiency (recall) of the citations. Additionally, Cohen’s kappa coefficient between humans and TRUE suggests a moderate level of agreement (0.55), underscoring the validity of using TRUE in our data construction pipeline.

This level of agreement is on par with prior studies conducting human evaluations which typically have agreement rates in the range of 58-80% [1-4]. We appreciate the reviewer pointing out the important discussion on evaluating the alignment data quality which we will include in the appendix of the updated draft.
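For completeness, a short sketch of how such agreement and kappa figures can be computed; the label arrays are placeholders rather than the actual annotations:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# 1 = "supports"/"correct", 0 = "does not support"/"wrong" (placeholder labels).
human_labels = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
auto_labels = np.array([1, 1, 0, 0, 0, 1, 1, 1, 1, 1])  # e.g., TRUE-NLI / pipeline labels

agreement = (human_labels == auto_labels).mean()      # raw percent agreement
kappa = cohen_kappa_score(human_labels, auto_labels)  # chance-corrected agreement
print(f"agreement = {agreement:.2%}, kappa = {kappa:.2f}")
```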


References:

[1] Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., & Hajishirzi, H. (2023). Self-Instruct: Aligning Language Models with Self-Generated Instructions. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). https://doi.org/10.18653/v1/2023.acl-long.754

[2] Gao, T., Yen, H., Yu, J., & Chen, D. (2023). Enabling Large Language Models to Generate Text with Citations. Retrieved November 18, 2024, from arXiv.org website: https://arxiv.org/abs/2305.14627

[3] Kamalloo, E., Jafari, A., Zhang, X., Thakur, N., & Lin, J. (2023). HAGRID: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution. Retrieved November 18, 2024, from arXiv.org website: https://arxiv.org/abs/2307.16883

[4] Chia, Y. K., Cheng, L., Chan, H. P., Liu, C., Song, M., Aljunied, Sharifah Mahani, … Bing, L. (2024). M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework. Retrieved November 18, 2024, from arXiv.org website: https://arxiv.org/abs/2411.06176

Review (Rating: 8)

The authors develop several new metrics for assessing LLM response groundedness in a RAG setting. These metrics comprise a holistic groundedness/trustworthiness metric called Trust-Score. In addition, the authors develop a fine-tuning method called Trust-Align for increasing LLM response groundedness. The authors fine-tune several LLaMA, Qwen, and Phi models using this and other baseline methods, and they also compare the groundedness of the resulting models' responses to larger proprietary models prompted with ICL.

Strengths

Originality: high. I don't think I've seen a thorough assessment of grounded refusals before.
Quality: moderate.
Clarity: moderate. The definition of Trust-Score is clear, nuanced, and well-motivated. The description of the dataset generation and fine-tuning process is also detailed and clear.
Significance: moderate.

Weaknesses

Given that the best fine-tuned models only slightly outperformed GPT-4, it seems there are limited practical takeaways to be found for, e.g., an application developer choosing an LLM; i.e., the practitioner is still well-justified in simply choosing a non-fine-tuned frontier model for their RAG application.

I found tables 2 and 3 very difficult to interpret, though mostly because it's not clear to me why the specific comparisons presented are valid or meaningful.

No confidence intervals are given.

Questions

Why are the comparisons presented in Tables 2 and 3 meaningful? e.g. Trust-Score on LLaMA-3.2-3b fine-tuned with DPO compared to LLaMA-2-7b using FRONT? Why are comparisons not restricted to within-model?

Why is groundedness important in its own right if parametric knowledge can get the job done? If a model possesses the parametric knowledge to answer an otherwise un-answerable question, is "over-responsiveness" less an issue of hallucination than, say, poor instruction-following?

Comment

[Clarification 4]: Importance of groundedness when parametric knowledge potentially suffices.

> If a model possesses the parametric knowledge to answer an otherwise unanswerable question, is "over-responsiveness" less an issue of hallucination than, say, poor instruction-following?

This is an interesting question. There are numerous scenarios where parametric knowledge may not be preferred for generating factual responses. Prominent examples include AI-based search engines like Perplexity and You.com, which prioritize grounding their outputs in web data. These platforms utilize various LLMs as information consolidators (leveraging augmented knowledge) rather than as information generators (relying on parametric knowledge). Other RAG applications where outputs must reference provided documents include legal tasks (maintaining accurate records), healthcare (summarizing patient-doctor conversations), and finance (extracting compliance details and assessing fund performance).

While parametric knowledge can indeed enhance the performance of RAG systems, its utility is most pronounced in applications where the ground-truth answers remain consistent, regardless of whether augmented information is used. In such cases, the model’s output is not conditioned on attached documents; instead, the documents serve to improve the likelihood of producing a correct answer. However, in our work, which focuses on attributed text generation, we assume that the provided documents contain the necessary answers. If the documents are insufficient, the model is expected to generate a refusal, even if it has the correct answer encoded internally in its weights. One could, however, design a system that attributes parametric knowledge when a claim is generated without referencing the documents. This would require redefining the concept of citation groundedness, as citations would no longer correspond to specific documents.

We acknowledge the reviewer’s point that poor instruction-following is a broader issue that underpins various problems, including hallucination and over-responsiveness. Hallucination can be viewed as a specific and perhaps extreme manifestation of poor instruction-following. Another manifestation is the inappropriate use of valid parametric knowledge, or "over-responsiveness," where models generate correct answers without grounding them in the provided documents. While this is not hallucination per se, it could be mechanistically correlated and warrants further research.

In this work, we specifically address the subset problem of RAG performance, focusing on reducing RAG-specific hallucinations by improving citation groundedness and ensuring appropriate refusals. Poor instruction-following is a larger challenge, and while being a strong instruction-following model is likely a sufficient condition to reduce RAG hallucinations, the reverse is not necessarily true. If the reviewer feels this discussion should be included in the main draft, we would be happy to incorporate it.

Comment

I'm (very weakly) suggesting that perhaps grounding behavior might be most effectively improved by improving instruction-following in general rather than focusing on fine-tuning for groundedness specifically. At least for models with high parametric knowledge, and especially in context of overall model performance.

Comment

> Thanks for improving the tables, they are much easier to interpret now.

We are grateful to the reviewer for constructive comments that helped us make the draft clearer and tables easy to understand.

> As you point out, the fact that this fine-tuning can improve the grounding behavior of an 8B model to exceed that of frontier models suggests that there might be significant room for improvement in frontier model behavior as well. Is that the main message of your paper? I for one did not come away with that as the call to action.

The main message of the paper is two folds:

(Contribution-1) Proposing a new measure, Trust-Score, which evaluates an LLM’s fitness for RAG applications more holistically. It involves measuring claim groundedness and the groundedness of the attributions corresponding to those claims. A distinctive feature of Trust-Score is that it measures the model’s capacity to produce appropriate refusal responses, i.e., how effectively the model declines to provide an answer when the source documents lack sufficient information. Our analysis, summarized in Tables 2–4, reveals that both open-weight and frontier models heavily underperform on the Trust-Score metric.

(Contribution-2) To make model responses (claims and attributions) grounded in the documents, we propose Trust-Align. Trust-Align's contribution comes primarily at the dataset level, where we construct a preference dataset covering a range of samples specifically designed to reduce the errors affecting Trust-Score: Inaccurate Answer, Over-Responsiveness, Excessive Refusal, Over-Citation, and Improper Citation. We regard these as LLM hallucinations within a RAG framework.

Given that frontier model providers do not facilitate DPO fine-tuning, claiming that these models would undoubtedly benefit from Trust-Align would be inappropriate. However, our findings—showing substantial improvements across the studied models and datasets, with an average t-statistic of 7.64 across benchmarks—indicate that similar gains could be expected for frontier models.

While we are keen to apply Trust-Align (DPO) to frontier models as soon as it is available, we explored its potential impact by performing SFT of GPT-4o using positive samples from the Trust-Align dataset. The results are shown below:

Table: Trust-Align of GPT-4o using positive samples.

| Model | Dataset | Method | AR (%) | EM^{F1}_{AC} | F1_RG | F1_CG | TRUST |
|---|---|---|---|---|---|---|---|
| GPT-4o | ASQA | ICL | 84.49 | 62.92 | 61.40 | 73.66 | 65.88 |
| | | SFT | 74.26 | 59.22 | 68.62 | 87.54 | 72.09 |
| | QAMPARI | ICL | 60.40 | 14.29 | 75.20 | 20.43 | 33.69 |
| | | SFT | 34.60 | 41.56 | 77.15 | 53.64 | 56.99 |
| | ELI5 | ICL | 66.10 | 35.25 | 68.33 | 37.71 | 41.58 |
| | | SFT | 25.50 | 24.10 | 68.34 | 56.09 | 48.99 |

GPT-4o sees an improvement in Trust-Score by 6.21 (ASQA), 23.3 (QAMPARI), 7.41 (ELI5) points when aligned using a subset of Trust-Align data. We will include results of DPO on frontier models as soon as it is facilitated by the model providers.

We are happy to clarify any further doubts the reviewer may have regarding our contributions and are grateful to the reviewer for such constructive discussions.

> I'm (very weakly) suggesting that perhaps grounding behavior might be most effectively improved by improving instruction-following in general rather than focusing on fine-tuning for groundedness specifically. At least for models with high parametric knowledge, and especially in context of overall model performance.

Thank you for the interesting suggestion. We agree that instruction following and grounding are interconnected tasks, as highlighted in the previous response. This is an intriguing direction that we will explore in future work.

Comment

Thank you for your constructive feedback. We have carefully addressed each of your points below. If you have any additional questions or need further clarification, we are more than willing to assist.

[Clarification 1]: ...best fine-tuned models only slightly out-performed GPT-4 ... limited practical takeaways to be found

This is an excellent question. Trust-Align does not aim to introduce a new state-of-the-art LLM for RAG. Instead, it demonstrates how to enhance a given LLM's capabilities in three key dimensions of trustworthiness: answer correctness (EM^{F1}_{AC}), citation groundedness (F1_CG), and refusal groundedness (F1_RG). The effectiveness of Trust-Align depends on the inherent capabilities of a model, i.e., to what extent it can be good for RAG. For example, on ExpertQA, Trust-Align significantly increased LLaMA-3-8b's Trust-Score from 38.26% to 54.85%, making it better than GPT-4 on refusal (67.07% vs 52.91%) and citation groundedness (70.11% vs 69.83%). While it is not our goal to achieve SOTA, given the observed trend: LLaMA 3b (49.0%) < LLaMA 7b (51.8%) < LLaMA 8b (54.85%), one could potentially achieve a much higher score with larger open-source models (>10b).

Observing that a (potentially) much smaller model of 8B size could outperform GPT-4 and Claude-3.5 at Trust-Score, we believe there is significant room for improvement in frontier models. These models can greatly benefit from Trust-Align to achieve a higher Trust-Score. We are eager to explore Trust-Align's potential on these models as soon as DPO support becomes available from their providers.

Thus, while choosing a non-fine-tuned frontier model can deliver performance on par with a fine-tuned sub-10B parameter model, as demonstrated across different model families such as Qwen, LLaMA, and Phi, it is advisable to fine-tune a model to achieve better alignment for RAG applications.

We appreciate the opportunity to discuss the contribution of our work and are happy to address any further questions or clarifications the reviewer may have.


[Clarification 2]: Clarifying Tables 2 and 3 and the corresponding comparisons presented

Thank you for your feedback. While we highlighted various findings from our experiments that may include inter-model comparisons, our main aim was to propose a method (Trust-Align) that improves a given model's RAG appropriateness as measured by Trust-Score.

Tables 2 and 3 aim to demonstrate the effectiveness of Trust-Align in improving the performance of models across different families, such as LLaMA, Qwen, and Phi. The emphasis is on showcasing the improvements within each model family rather than on identifying a single model that performs best across families. To better communicate these within-family improvements, we have revised how the results are presented, specifically highlighting the best values within each model family, and have tuned the corresponding discussion to better convey the contributions.


[Clarification 3]: Confidence intervals

Thank you for bringing this up. We intentionally omitted the significance testing results from the main draft to improve table readability. However, if the reviewers feel these results should be included, we would be happy to incorporate them in the next version. For your reference, we provide below the results of an independent sample t-test comparing baseline models with Trust-Aligned models.

Table 2: Results from significance testing on TRUST SCORE

| Dataset | t-statistic | p-value |
|---|---|---|
| ASQA | 7.70 | 9.85e-10 |
| ELI5 | 6.40 | 1.18e-06 |
| QAMPARI | 7.83 | 3.31e-08 |
| EXPERTQA | 8.63 | 8.00e-10 |

Across all datasets, we observed a t-statistic > 0 and p < 0.001 (Table 2), indicating significant improvements for TRUST SCORE at a significance level of 0.01.

Comment

Thanks for improving the tables, they are much easier to interpret now.

As you point out, the fact that this fine-tuning can improve the grounding behavior of an 8B model to exceed that of frontier models suggests that there might be significant room for improvement in frontier model behavior as well. Is that the main message of your paper? I for one did not come away with that as the call to action.

Review (Rating: 8)

The paper introduces a metric called "trust-score" for assessing groundedness, alongside an alignment approach aimed at improving this metric. It provides a useful metric for advancing the groundedness of large language models (LLMs) for a variety of applications. I am actually surprised there have been no metrics like this for groundedness.

Strengths

It provides a useful metric for advancing the groundedness of large language models (LLMs) for a variety of applications.

Weaknesses

Minor Issue: The definition of the Trust Score, particularly the subscript notation, is somewhat unclear (and also in Figure 1). It might help to explore ways to improve readability.

Distinction Between Parametric Knowledge and Groundedness: Could you further discuss the practical implications of distinguishing between parametric knowledge and groundedness? While it adds rigor to make this distinction, to what extent would a user be concerned about whether information comes from parametric knowledge versus grounded sources? In which scenarios might this distinction be more critical?

Questions

Choice of Models in Figure 2 (Steps 4–6): Why are different models used in Steps 4 (GPT-4) and Steps 5–6 (LLaMA-2-7B)? Is it because Step 5 requires fine-tuning? However, fine-tuning is also possible with GPT-4 via API, correct? If the decision was driven by the use of DPO for alignment in Step 6, could you clarify the rationale for these model choices?

Method Clarification (Line 271): Could you elaborate on the methodology for selecting documents similar to those containing gold claims but still irrelevant to the query?

Comparative Baseline (Line 359): What results would you observe if compared to a simple instruction-based baseline? For instance, Baseline 1 could involve instructing the model explicitly to avoid over-responding.

Clarification on SFT Model (Line 387): Regarding the SFT model, do you mean fine-tuning with the trust-align dataset, specifically with only positive responses (r+)? Please clarify if so.

Comment

[Clarification 4]: Method Clarification (Line 271)

Thanks for pointing this out. This section could indeed benefit from further clarification. The irrelevant documents used in our method are selected for their similarity to the question (cosine similarity > 0.7) but are carefully filtered to ensure they do not contain any gold claims, as determined by the TRUE NLI model. This approach ensures that we have high-quality negative examples.

From an initial pool of 100 documents pre-retrieved using GTR/BM25 (as detailed in the "Collecting D’s" section of the methods), we filter a subset of 50 documents. These documents are chosen based on their cosine similarity to the question while ensuring, via the TRUE NLI model, that they do not support any claims related to the question. From this filtered subset, we sample groups of five documents to create document sets, with questions associated with these sets labeled as unanswerable.
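A hedged sketch of this filtering step is shown below; the encoder choice is illustrative, and `entails_any_gold_claim` is a placeholder for the TRUE NLI check rather than a real library call:

```python
import random
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/gtr-t5-base")  # illustrative GTR-style encoder

def sample_unanswerable_docset(question, candidate_docs, gold_claims, entails_any_gold_claim,
                               sim_threshold=0.7, pool_size=50, set_size=5):
    """Keep question-similar documents that support no gold claim, then sample a 5-doc set."""
    q_emb = encoder.encode(question, convert_to_tensor=True)
    d_embs = encoder.encode(candidate_docs, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, d_embs)[0]  # cosine similarity of each document to the question

    hard_negatives = [
        doc for doc, sim in zip(candidate_docs, sims.tolist())
        if sim > sim_threshold and not entails_any_gold_claim(doc, gold_claims)
    ][:pool_size]
    return random.sample(hard_negatives, k=min(set_size, len(hard_negatives)))
```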


[Clarification 5]: Comparative Baseline (Line 359)

Thank you for bringing this up; we have provided an elaborate discussion on this in Appendix G.3.

Instruction-based refusal (ICL), which combines in-context examples with a refusal instruction, is a key baseline in our study. Results for this baseline are presented in Tables 2 and 3 of the main paper under rows labeled "ICL."

While Tables 2 and 3 include refusal prompt results only (ICL), we conducted a small study on frontier models (GPT-4, GPT-3.5, Claude-3.5) and Trust-Aligned LLaMA to determine how the refusal prompt compares to the default prompt. As shown in Table 12, the refusal prompt (R) outperforms the default prompt (D).

Templates:

  • Default prompt: Write an accurate, engaging, and concise answer for the given question using only the provided search results…
  • Refusal prompt: [Appends to the Default Prompt] If none of the provided documents contains the answer, only respond with 'I apologize, but I couldn't find an answer...’

Table 1: Refusal and default prompting results on LLaMA-2-7B (more models can be found in the paper appendix)

| Baseline | Prompt | AR (%) | EM^{F1}_{AC} | F1_RG | F1_CG | TRUST |
|---|---|---|---|---|---|---|
| ICL | Refusal | 0.00 | 0.00 | 26.28 | 0.00 | 8.76 |
| ICL | Default | 94.30 | 50.38 | 49.51 | 43.67 | 47.85 |

Taking LLaMA-2-7b as an example, models rarely refuse under the default prompt (AR% close to 100), while adding a refusal prompt in ICL drastically reduces AR%, often to near zero, indicating indiscriminate refusal. At both extremes, Trust-Score suffers due to errors in correctly refusing questions and lower citation groundedness scores. In contrast, Trust-Align enables models to identify and correctly answer appropriate questions, resulting in nuanced refusal ability and improvements in F1_RG. This highlights that prompting alone is insufficient to improve the model's trustworthiness effectively. Interestingly, refusal prompting appears to yield greater benefits in more capable models, such as LLaMA-2-13b and LLaMA-3-8b.

Although these findings are detailed in the appendix, we recognize their relevance to the reviewer’s question and will incorporate this discussion into the main paper for added clarity.


[Clarification 6]: Clarification on SFT Model (Line 387)

Yes, we first perform Supervised Fine-Tuning (SFT) on the base model with the positive responses (r+) from the Trust-Align dataset. Performing SFT prior to Direct Preference Optimization (DPO) is part of the standard preference optimization pipeline, as outlined in Rafailov et al. (2023) [1] and Ziegler et al. (2019) [2]. Initializing the reference model with a supervised fine-tuned model reduces the distribution shift between the reference policy and the true reference distribution, leading to training stability, as noted by several works, for instance Tunstall et al. (2023) [3]. The absence of the distillation SFT step often results in models failing to learn effectively from feedback, leading to poor performance. Thus, performing SFT prior to DPO is critical to training stability and ultimate model performance.
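For reference, a compact sketch of the standard DPO objective applied after the SFT step (following Rafailov et al. [1]); the sequence log-probabilities are assumed to be computed elsewhere, and this is our own illustration rather than the authors' training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """DPO loss: prefer the chosen (grounded) response over the rejected (hallucinated) one,
    measured relative to the SFT-initialized reference model."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy per-example sequence log-probabilities standing in for real model outputs.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
print(loss.item())
```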


References:

[1] Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv, abs/2305.18290

[2] Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., … Irving, G. (2019). Fine-Tuning Language Models from Human Preferences. arXiv. https://arxiv.org/abs/1909.08593

[3] Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., Werra, L. V., Fourrier, C., Habib, N., Sarrazin, N., Sanseviero, O., Rush, A. M., & Wolf, T. (2023). Zephyr: Direct Distillation of LM Alignment. arXiv, abs/2310.16944

Comment

Clarification 4: thanks! Worth adding a bit more detail in the manuscript.

Clarification 5: fascinating that "adding a refusal prompt in ICL drastically reduces AR%, often to near zero". Thanks for making it more prominent.

Comment

Dear Reviewer 4fcY,

Thanks so much for the constructive feedback!

--

Authors

Comment

Thank you for your thoughtful comments and we appreciate the feedback you have given us. We have addressed your points below and are happy to assist in any further clarifications.

[Clarification 1]: Clarifying Trust-Score notation

Thank you for the suggestion. We will update the figure and corresponding part of our article with a clearer notation system for improved readability. Since this will require changes at various places in the draft, we will incorporate them in the next version after the discussion phase ends to minimize notation conflicts during the rebuttal.


[Clarification 2]: Distinction Between Parametric Knowledge and Groundedness

This is an interesting question, thank you for asking! There are numerous real-world scenarios where it is critical to distinguish between parametric knowledge and external knowledge (groundedness). Parametric knowledge refers to knowledge derived from the model’s training data and stored in the model’s weights. In contrast, external knowledge pertains to information sourced from external documents or real-time data. This distinction becomes particularly important when LLMs are used as information consolidators (primarily using augmented knowledge) rather than as information generators (relying solely on parametric knowledge). The need for groundedness is especially pressing in high-stakes applications such as legal tasks (maintaining accurate records), healthcare (summarizing patient-doctor conversations), and finance (extracting compliance details or assessing fund performance), where outputs must faithfully align with the provided documents.

Another case where such distinction is important is where the answer depends on changing real-world facts. Parametric knowledge is static after training and cannot easily reflect updates in real-world information. This limitation is significant for AI-based search engines like Perplexity and You.com, where the primary role of the LLM is to act as an information consolidator, synthesizing grounded, up-to-date web search results, rather than generating responses based on potentially outdated parametric knowledge. Groundedness enables LLMs to adaptively provide contextually relevant and current information, making them more reliable for applications requiring real-time awareness.

We appreciate the opportunity to discuss the importance of distinguishing parametric knowledge from external knowledge (groundedness) and are happy to address any further questions or clarifications the reviewer may have.


[Clarification 3]: Choice of Models in Figure 2 (Steps 4–6)

Trust-Align constructs a preference dataset consisting of positive (preferred) responses in Step 4 and negative (unpreferred) responses in Steps 5-6. The choice of GPT-4 for Step 4 and LLaMA-2-7B for the later steps is primarily driven by the quality and quantity of responses we aimed to generate.

For positive samples, our goal was to produce responses in the format: 'statement1 [1][2] statement2 [3].' Given the state-of-the-art instruction-following capabilities of GPT-4, it was the natural choice over open-source models (e.g., LLaMA-2-7B) for generating responses that effectively stitch together gold claims with their corresponding attributions. Notably, a range of frontier models can perform this task; however, GPT-4's greater accessibility to us was one reason we chose it over other APIs, such as Claude-3.5.

The procedure for generating negative responses is different. We focused on two aspects when choosing a model: 1) the naturalness or quality of the response, and 2) the quantity or a sufficient number of responses to perform preference optimization. We found that LLaMA-2-7B tends to generate more diverse hallucinations, which allowed us to retrieve more negative samples and consequently have a greater number of (positive, negative) tuples for preference alignment compared to GPT-4.

While simple ICL was initially explored for generating negative responses, it was found to produce outputs of poor quality. To address this limitation, fine-tuning was carried out to align LLaMA-2-7B to generate responses in the format 'statement1 [1][2] statement2 [3]' without requiring explicit ICL. This fine-tuning step significantly improved the quality of negative responses, improving coherence and clarity. Thank you for asking this question and we hope that our answer clarifies the points mentioned.

Comment

Clarification 2 is excellent and is probably worth adding some of the response to the paper to strengthen the contribution.

AC Meta-Review

This paper introduces a method (Trust-Align) for enhancing a model’s response across three dimensions: trustworthiness, citation groundedness, and refusal groundedness, measured by their groundedness metric, Trust-Score. The authors evaluate the proposed metric and alignment method across numerous models, datasets and configurations. The experiments are well documented and clearly articulated providing extensive support for the findings.

Additional Comments from the Reviewer Discussion

Lack of clarity: Tables 2 & 3 in particular were difficult to read and interpret. The authors have subsequently revised and improved these with satisfactory clarity.

Distinction between parametric knowledge and groundedness: this was addressed satisfactorily.

Trust-Score notation, confidence intervals, and statistical testing: in response to these, the authors have included further clarification and necessary additional information.

Overall a good contribution.

Final Decision

Accept (Oral)