Entropy Reveals What You Know: An Entropy-Guided Method for Enhancing the Reliability of Large Language Models
Abstract
Reviews and Discussion
In this paper, the authors propose a new method called SREF to enhance the reliability of LLMs and reduce hallucinations. The method involves rephrasing questions to introduce diverse responses based on the same knowledge, followed by concatenating these QA pairs as references for the final answer. Extensive experiments are conducted on several models and datasets, demonstrating that SREF achieves the best mean performance. The authors also correlate entropy with consistency and KL divergence, showing significant relationships between these metrics.
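For concreteness, here is a minimal sketch of the pipeline as we understand it from the paper's description (our own pseudocode, assuming a generic `generate` completion helper; not the authors' implementation):

```python
# Sketch of the SREF pipeline as described above (illustrative only).
# `generate(prompt)` is a hypothetical stand-in for any LLM completion call.
def sref_answer(question, generate, n_rephrasings=3):
    # Step 1: rephrase the question to elicit diverse views of the same knowledge.
    rephrasings = [
        generate(f"Rephrase the question without changing its meaning: {question}")
        for _ in range(n_rephrasings)
    ]
    # Step 2: answer each rephrased question independently.
    qa_pairs = [(q, generate(f"Answer concisely: {q}")) for q in rephrasings]
    # Step 3: concatenate the QA pairs as references for the final answer.
    references = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    prompt = (
        f"References:\n{references}\n\n"
        f"Question: {question}\n"
        "Answer the question using the references; reply 'Unsure' if they conflict."
    )
    return generate(prompt)
```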
Strengths
- This paper is the first to correlate the use of LLMs as references with entropy, thereby establishing a connection with consistency and KL divergence. The experimental results show that the entropy reduction obtained by using LLMs as references correlates positively with increased consistency and KL divergence (a toy illustration of these quantities follows this list).
- Comprehensive experiments are conducted from the perspectives of factuality and consistency across three datasets and twelve LLMs, providing strong evidence for the effectiveness of SREF.
- The paper is well-written and easy to follow.
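As referenced in the first strength, a toy illustration (ours, not from the paper) of how entropy, majority-vote consistency, and KL divergence over sampled answers relate; the paper may define these quantities differently:

```python
import math
from collections import Counter

def distribution(answers):
    # Empirical distribution over distinct answer strings.
    counts = Counter(answers)
    total = len(answers)
    return {a: c / total for a, c in counts.items()}

def entropy(answers):
    # Shannon entropy (bits) of the answer distribution.
    return -sum(p * math.log2(p) for p in distribution(answers).values())

def consistency(answers):
    # Fraction of samples agreeing with the majority answer.
    return Counter(answers).most_common(1)[0][1] / len(answers)

def kl_divergence(p_answers, q_answers, eps=1e-9):
    # KL(P || Q) between the with-reference and vanilla answer distributions.
    p, q = distribution(p_answers), distribution(q_answers)
    support = set(p) | set(q)
    return sum(p.get(a, eps) * math.log2(p.get(a, eps) / q.get(a, eps)) for a in support)

vanilla = ["Paris", "Lyon", "Paris", "Marseille"]
with_refs = ["Paris", "Paris", "Paris", "Lyon"]
print(entropy(vanilla), entropy(with_refs))          # entropy should drop
print(consistency(vanilla), consistency(with_refs))  # consistency should rise
print(kl_divergence(with_refs, vanilla))             # distribution shift
```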
Weaknesses
- The statement, "With the influence of R, y will be 'Unsure' when R is inconsistency with x, see Eq.7 for details," requires clarification. If R contradicts x, will the LLM necessarily produce an "Unsure" answer? More explicit reasoning is needed here.
- Although SREF performs well for small-scale models like Llama and Mistral, its performance on larger models such as GPT-4o is less satisfactory, in some cases even decreasing performance. The reason for this should be clarified. Additionally, the variation in SREF's performance across different datasets raises questions about its generalizability.
- In Figure 3, some points originally answered as "Unsure" become inaccurate under SREF. Previous research has suggested that LLMs do not exhibit the same level of self-knowledge as humans: for genuinely unknown knowledge, LLMs may respond as if they are certain, and in such cases adding more references could lead to increased hallucinations. This implies that SREF can effectively address "known knowns" and "known unknowns," but may worsen performance on "unknown unknowns" and other ambiguous cases, hindering its broader application.
- To better demonstrate why SREF diverges from the vanilla setting, a cross-dataset error analysis would be helpful. This would provide insight into specific areas where SREF succeeds or fails, allowing for a deeper understanding of its effectiveness.
Questions
Typo: in "With the influence of R, y will be 'Unsure' when R is inconsistency with x, see Eq.7 for details," "inconsistency" should be replaced with "inconsistent."
This paper observes that large language models (LLMs) carry vast amounts of knowledge but face issues with factual consistency and truthfulness, leading to unreliable responses and application risks. To improve model reliability, the authors introduce SREF, an entropy-based self-reference approach that enhances reliability by increasing consistency in answers to known facts and by encouraging refusal to answer uncertain questions.
Strengths
- The method is simple and straightforward: it samples question rephrasings to generate relevant knowledge and reasons over them, improving on the consistency of traditional self-correction. The approach is applicable to both open-source and closed-source models.
- The authors provide theoretical support for their approach from the perspective of entropy.
- The experimental design is relatively comprehensive, validating the effectiveness of SREF. The discussion is also interesting.
Weaknesses
- SREF shows performance limitations on strong models: the vanilla setting even outperforms SREF on the NQ and TQ tasks. Furthermore, averaging results (the Mean metric) across all models in Tables 1 and 2 is problematic, as it appears to downplay the performance decline observed on strong models.
- In addition, compared to vanilla and traditional self-correction methods, the multi-round generation approach significantly increases the computational burden in terms of FLOPs. I encourage the authors to provide a comparison of FLOPs budgets between SREF and other self-correction techniques.
- Self-correction and self-consistency methods are typically employed to optimize answers in reasoning tasks [1] [2]. However, evaluating only the NQ, TQ, and HQ commonsense QA datasets is notably limited. I recommend that the authors include reasoning datasets such as GSM8K and MATH to demonstrate the scalability and robustness of their method.
[1] S3c-Math: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners
[2] DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning
Questions
Please refer to the weaknesses section.
- Why does SREF fail on strong models, especially given that strong models are recognized to have broader self-knowledge that should improve the quality of the generated self-references? The authors are asked to explain.
The authors introduce an entropy-guided framework that leverages models' understanding of rephrased questions as relevant references to improve answer consistency for known facts and to encourage refusal of uncertain questions. Experiments on several LLMs show that SREF yields more reliable results in terms of task performance and consistency.
Strengths
The experiments cover 12 LLMs across 3 datasets, which is a good range.
Weaknesses
The proposed method lacks novelty. See related work such as 1) https://aclanthology.org/2023.findings-emnlp.1032/, 2) https://arxiv.org/pdf/2406.02543, and 3) https://arxiv.org/pdf/2402.00367. Additionally, the motivation, especially in the introduction, isn't clearly articulated. It's unclear how limitations in intrinsic self-correction led to the decision to leverage the model's internal knowledge to create relevant references.
The experiments are inadequate. First, convincing related baselines are missing: 1) https://aclanthology.org/2023.findings-emnlp.1032/, 2) https://arxiv.org/pdf/2210.01296. Also, the evaluation of uncertainty is insufficient; a higher frequency of 'unsure' outputs does not necessarily indicate well-assessed uncertainty. A better evaluation would be to check whether the model's accuracy is higher when it does not output 'unsure.'
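To make the suggested check concrete, a minimal sketch (our own illustration; the function and variable names are hypothetical, and exact-match comparison is assumed):

```python
# Selective accuracy: score only the questions the model actually attempts,
# i.e., where it does not answer 'Unsure'.
def selective_accuracy(predictions, gold):
    attempted = [(p, g) for p, g in zip(predictions, gold)
                 if p.strip().lower() != "unsure"]
    coverage = len(attempted) / len(predictions)
    accuracy = (sum(p == g for p, g in attempted) / len(attempted)
                if attempted else 0.0)
    return coverage, accuracy

# A well-calibrated abstention policy should raise accuracy on the attempted
# subset relative to overall accuracy, not merely produce more 'Unsure' outputs.
```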
Questions
- Why rely on distractors and a multiple-choice setup to measure consistency? A more direct approach would be to measure the consistency of multiple generated answers (see the sketch after these questions).
- Could you explain in detail how Equation 7 determines whether the model will be 'unsure' about a question?
- The question in Figure 2 is itself ambiguous, requiring more than just factual knowledge. Could you provide examples where the question is unambiguous and SREF still shows improved performance?
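Regarding the first question above, a minimal sketch of the direct consistency measurement we have in mind (our own illustration; `normalize` is a hypothetical helper):

```python
from itertools import combinations

def normalize(ans):
    # Hypothetical normalization: lowercase, strip whitespace and a trailing period.
    return ans.strip().lower().rstrip(".")

def pairwise_consistency(sampled_answers):
    # Fraction of answer pairs that agree after normalization; 1.0 means all
    # sampled answers are identical, i.e., maximal self-consistency.
    answers = [normalize(a) for a in sampled_answers]
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

print(pairwise_consistency(["Paris", "paris.", "Lyon"]))  # 1 of 3 pairs agree
```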
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.