PaperHub
Rating: 5.0 / 10 (Poster; 4 reviewers; min 4, max 7, std 1.2)
Individual ratings: 5, 4, 7, 4
Confidence: 3.8 | Correctness: 2.3 | Contribution: 2.5 | Presentation: 2.8
NeurIPS 2024

SLED: Self Logits Evolution Decoding for Improving Factuality in Large Language Models

OpenReview | PDF
Submitted: 2024-05-11 · Updated: 2025-01-25
TL;DR

Self Logits Evolution Decoding (SLED) improves LLM factuality through better decoding.

Abstract

Keywords
Large Language Models

Reviews and Discussion

Official Review (Rating: 5)

This work proposes self-evolution decoding, a method to improve LM's factuality without using external knowledge or fine-tuning data. Specifically, the differences between each layer’s logits and the final layer’s logits are utilized to approximate the gradient, which are further used to estimate the inner knowledge of LMs. Finally, the estimated "inner" distribution is utilized to adjust the model's final outputs. Experiments on various tasks show the effectiveness of the proposed method.

Strengths

  • The paper is generally easy to understand.
  • The proposed method is shown to provide good results on a variety of tasks.

Weaknesses

  • I think some of the approximations might be questionable. First, in Section 2.2, I'm not sure whether it is reasonable to use the logit differences to estimate the gradient: logits are unconstrained, while the scale of the gradients is constrained within 1. Moreover, in Section 2.3, I'm not sure why the estimations are broken down for each item in the vocabulary and why the weights for different layers are directly aggregated and normalized with cosine similarities. More ablation studies should be provided to verify some of these choices.

Questions

  • I'm wondering if there can be good methods to evaluate LM's factuality in more realistic open-ended generation tasks, which are the main scenarios where we hope to improve LM's factuality and consistency.

Limitations

Limitations have been discussed.

Author Rebuttal

Dear Reviewer,

Thank you so much for taking the time to provide your feedback. Your comments and suggestions are invaluable to us. We appreciate the opportunity to address your concerns regarding the approximation used in our approach, and we are also including additional results to support our methodology.

"Q: logits are unconstrained, while the scale of the gradients are constrained within 1"

Thank you for your insightful comment regarding the unconstrained nature of the logits $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$ and the constrained nature of the gradients $(p_1 - t_1, p_2 - t_2, ..., p_i - t_i, ..., p_d - t_d)$. We appreciate the opportunity to address your concern.

In fact, this is an important consideration in our methodology. Our approach does not rely on the magnitudes of $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$; rather, it utilizes this difference to approximate the direction of the gradient vector. Specifically, we do not employ the expression $\mathcal{L}^{(n)} - \mathcal{L}^{(N)} \approx (p_1 - t_1, p_2 - t_2, ..., p_i - t_i, ..., p_d - t_d)$ to directly estimate $(t_1, t_2, ..., t_i, ..., t_d)$.

Instead, our method involves increasing the cosine similarity between $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$ and the vector $(p_1 - t_1, p_2 - t_2, ..., p_i - t_i, ..., p_d - t_d)$, i.e., $\mathrm{CosineSimilarity}[\mathcal{L}^{(n)} - \mathcal{L}^{(N)}, (p_1 - t_1, p_2 - t_2, ..., p_i - t_i, ..., p_d - t_d)]$, to estimate $(t_1, t_2, ..., t_i, ..., t_d)$.

This approach aligns the directions of the gradients rather than their magnitudes, thereby addressing the issue of their different scales.
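For completeness, the reason the gradient takes the form $(p_1 - t_1, p_2 - t_2, ..., p_d - t_d)$ is the standard softmax-KL identity: with $p = \mathrm{softmax}(\mathcal{L})$ and a fixed target distribution $t$, the term $\sum_i t_i \log t_i$ is constant in $\mathcal{L}$, so

$$\frac{\partial}{\partial \mathcal{L}_j}\, KL\big(t \,\|\, \mathrm{softmax}(\mathcal{L})\big) = \frac{\partial}{\partial \mathcal{L}_j}\Big(-\sum_i t_i \log p_i\Big) = -\sum_i t_i\,(\delta_{ij} - p_j) = p_j - t_j .$$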

"Q: why the estimations are broken-down for each item in the vocabulary"

Thank you for your inquiry. We are happy to clarify your concerns. Given that the vocabulary of a typical large language model (LLM) often exceeds 10,000 items (d > 10k), the $(t_1, t_2, ..., t_i, ..., t_d)$ we need to estimate constitutes a high-dimensional vector. Estimating this vector in its entirety for each vocabulary item simultaneously would involve significant computational overhead.

By breaking down the estimation process, we can focus only on the plausible words in the vocabulary (those with the top-k highest probabilities in the original output) and then estimate their corresponding $t_i$ one by one. This selective approach reduces computational costs and avoids the noise introduced by less significant words.

Therefore, this breakdown is strategically designed to minimize computational load. For the sake of conciseness of the presentation, we move the discussion of computational considerations to Section 2.4 and still use d instead of k here.

"Q: why aggregated and normalized"

Normalization is applied individually to each layer $n$ as described in line 115. This is necessary because the vector components $(\bar{t}^{(n)}_1, \bar{t}^{(n)}_2, ..., \bar{t}^{(n)}_i, ..., \bar{t}^{(n)}_d)$ do not inherently sum to one. Therefore, we should normalize these components to achieve an estimation, $\bar{t}^{(n)}$, of the inner distribution for each layer $n$.

Aggregations are conducted across the $\{\bar{t}^{(n)}\}$ for all the layers.

Contrary to direct aggregation, after obtaining a normalized $\bar{t}^{(n)}$ for each layer, the aggregation process is not straightforward. The weights $w^{(n)} = \sum_i^d \bar{t}^{(n)}_i$ in line 116 indicate the degree to which the logits $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$ and the normalized vectors $(p_1 - \bar{t}^{(n)}_1, p_2 - \bar{t}^{(n)}_2, ..., p_i - \bar{t}^{(n)}_i, ..., p_d - \bar{t}^{(n)}_d)$ are well aligned in terms of cosine similarity. A larger weight suggests that the estimations from $\bar{t}^{(n)}$ are more reliable, and thus we assign greater weight in the aggregation process to such layers. Conversely, layers with smaller weights are assigned lesser importance in the aggregation.
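For illustration only (not our exact implementation), the sketch below shows how such a per-token, cosine-similarity-based estimate and a layer-weighted aggregation could be wired together. The one-hot candidates $e_i$, the clamping of negative similarities, and the uniform fallback are simplifying assumptions for the sketch rather than the exact formulas of lines 115-116:

```python
import torch
import torch.nn.functional as F

def estimate_inner_distribution(layer_logits_list, k=5):
    """Illustrative sketch: score each top-k token by the cosine similarity between
    the layer-wise logit difference and the direction (p - e_i), normalize per layer,
    then combine layers with weights proportional to their total alignment."""
    final = layer_logits_list[-1]                       # final-layer logits L^(N), shape [vocab]
    p = torch.softmax(final, dim=-1)                    # output distribution p
    topk = torch.topk(p, k).indices                     # restrict estimation to the top-k tokens

    per_layer_estimates, layer_weights = [], []
    for layer_logits in layer_logits_list[:-1]:
        diff = layer_logits - final                     # L^(n) - L^(N): proxy for the gradient direction
        scores = torch.zeros(k)
        for j, idx in enumerate(topk):
            e_i = torch.zeros_like(p)
            e_i[idx] = 1.0                              # candidate inner distribution concentrated on token idx
            scores[j] = F.cosine_similarity(diff, p - e_i, dim=0)
        scores = scores.clamp(min=0.0)                  # keep only positively aligned candidates (assumption)
        w = scores.sum()                                # layer weight: overall alignment of this layer
        t_bar = scores / w if w > 0 else torch.full((k,), 1.0 / k)  # uniform fallback if nothing aligns
        per_layer_estimates.append(t_bar)
        layer_weights.append(w)

    weights = torch.stack(layer_weights)
    weights = weights / weights.sum().clamp(min=1e-8)   # normalize the layer weights
    inner = sum(w * t for w, t in zip(weights, per_layer_estimates))
    return topk, inner                                  # estimated inner distribution over the top-k tokens
```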

We include a new ablation study to support our proposed method following your suggestion. In one such study, we deviated from our established processes by crudely scaling the $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$ and simply averaging these scaled differences across different layers, a method we denote as "ablation1." It bypasses the steps of breakdown, normalization, and aggregation that we have discussed above. The results show that our method achieves better results.

| | Factor | TruthfulQA (MC1) | TruthfulQA (MC2) | TruthfulQA (MC3) |
| --- | --- | --- | --- | --- |
| llama2-7B-chat + ablation1 | 62.73 | 33.66 | 39.83 | 31.47 |
| + SED | 65.16 | 37.08 | 63.86 | 32.90 |
| llama2-13B-chat + ablation1 | 66.29 | 37.33 | 45.0 | 31.98 |
| + SED | 67.06 | 37.09 | 63.75 | 32.60 |

"Q: more realistic open-ended generation tasks"

Thank you for this valuable suggestion. We have conducted additional experiments on more realistic open-ended generation datasets: HotpotQA, Natural Questions (NQ), and TriviaQA (Trivia). We adopt additional evaluation metrics: Exact Match (EM) and F1.

| Model | HotpotQA EM | HotpotQA F1 | NQ EM | NQ F1 | Trivia EM | Trivia F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 2 7B chat | 19.6 | 20.1 | 21.8 | 20.4 | 44.4 | 44.3 |
| + DoLa | 20.4 | 21.3 | 23.5 | 21.5 | 45.2 | 45.3 |
| + SED (ours) | 20.9 | 21.5 | 24.4 | 22.2 | 47.6 | 46.3 |
| Llama 2 13B chat | 23.8 | 21.7 | 33.1 | 28.9 | 63.0 | 60.9 |
| + DoLa | 24.5 | 23.2 | 33.1 | 28.9 | 63.2 | 61.5 |
| + SED (ours) | 25.0 | 24.5 | 34.6 | 31.6 | 63.3 | 62.2 |

The results show that our method improves the performance in more realistic open-ended generation tasks.

Sincerely,

Authors

Comment

Dear Area Chair,

Thank you so much for your time and efforts in facilitating the review and discussion process. We really appreciate it.

Dear Reviewer,

We sincerely thank you for reviewing and discussing our paper. Your suggestions have been incredibly helpful in enhancing our work. In our rebuttal, we have provided detailed explanations addressing your concerns, including more details on methodology, additional ablation studies, and further results on more realistic open-ended generation scenarios. We will definitely incorporate all the discussions and new results in our revision. Thank you so much for your valuable suggestions! Should you have any questions, whether about the methodology or if you need further explanations or additional results, do not hesitate to raise them. We are committed to resolving any issues to your satisfaction. We understand that you are very busy, so we deeply appreciate your time and effort. Thank you so much!

(We have noticed that sometimes the formulas in our rebuttal may not display correctly on OpenReview. A refresh of the browser may resolve them. However, if you continue to face difficulties or need more detailed explanations of these formulas, please let us know. We are prepared to provide all necessary support to make our research clear and understandable.)

Sincerely,

Authors

Comment

I thank the authors for the responses and the extra results. Some of my concerns have been resolved and I have adjusted my scores correspondingly. Nevertheless, I still keep the borderline decision, especially after reading other reviewers’ comments, and I think it seems that many concerns still remain.

Comment

Hello Reviewer 1uAz,

Please take a moment to read and acknowledge the author's response to your review.

Thanks, Area Chair

Comment

Dear Reviewer,

Firstly, we sincerely thank you for acknowledging our responses and the additional results we provided. Your adjusted score and your valuable suggestions are greatly appreciated and motivate us to improve our manuscript further. Thank you so much!

Respectfully, we wish to further alleviate your concerns by reiterating that we have addressed all concerns raised by the other reviewers. Unfortunately, we have not yet received feedback from some reviewers on our rebuttal, but this absence does not imply that we have not addressed the concerns. For instance:

  1. As highlighted by Reviewer 12rF, we have successfully addressed the concerns regarding additional computational costs and further theoretical analysis. Considering that Reviewer Mgbd also raised similar issues, we have resolved Reviewer Mgbd's main concerns as well.

  2. Although error bars and statistical significance were not discussed in the most relevant literature [1,2,3,4], following Reviewer Mgbd’s advice, we still provided those discussions to show our method’s superiority.

  3. We have included more detailed explanations of our methodology to clarify aspects that were previously questioned.

Respectfully, we disagree with Reviewer Xg4n’s perspective. While we understand the concerns, it is standard practice for most papers to include detailed explanations and new results during the rebuttal phase. Considering that we have managed to complete the primary rebuttal within the 6000-character limit set by NeurIPS, we believe our revisions are not so significant as to warrant the rejection of our paper. This is consistent with common academic standards and does not deviate from what is typically expected in similar submissions, as evidenced by past proceedings of the conference.

We understand that you are very busy, and we deeply appreciate your time and effort. Thank you so much! Respectfully, we hope not to leave you with the impression that we have not addressed other reviewers' concerns, considering that we have indeed provided detailed explanations and additional results, and received positive feedback from Reviewer 12rF.

Thank you once again for your valuable suggestions and guidance! We really appreciate it!

Sincerely,

Authors

[1] Yung-Sung Chuang et al., "Dola: Decoding by contrasting layers improves factuality in large language models," 2024.

[2] Kenneth Li et al., "Inference-time intervention: Eliciting truthful answers from a language model," 2023.

[3] Shiqi Chen et al., "In-context sharpness as alerts: An inner representation perspective for hallucination mitigation," 2024.

[4] Yue Zhang et al., "Alleviating hallucinations of large language models through induced hallucinations," 2023.

Official Review (Rating: 4)

In this work, the authors present a decoding method called Self-Evolution Decoding (SED) to enhance factuality. During the decoding process, SED first estimates the “inner knowledge distribution,” representing the knowledge the model “knows,” by analyzing the difference between the top-layer logits and intermediate layers. This estimated inner knowledge distribution is then used to adjust the LLM output logits, steering the model’s behavior towards greater factuality. Experiments across several datasets and LLMs demonstrate that SED significantly improves the LLM’s factuality.

Strengths

Originality: The approach of narrowing the gap between model generation and internal knowledge is innovative.

Significance: Enhancing the factuality of LLMs is a crucial research problem, and the positive results demonstrate the effectiveness of the proposed method.

Weaknesses

  1. The motivation for this method is somewhat problematic. In section 2.2, the authors claim that the initial layers predominantly encode “lower-level” information while the later layers capture more “semantic” information. However, this does not imply that their difference is a good approximation of the gradient of KL divergence. If you believe this approximation is accurate, why not directly apply this approximated gradient to Equation 3?

  2. The experiments primarily focus on task performance but lack an investigation into how this method works. For example, in lines 145-147, the authors claim that the approximated P_inner cannot be directly used as it is not perfect. I believe presenting experimental results would be more convincing than a verbal analysis.

  3. The paper is not well-structured, and the concepts are confusing. Please refer to the questions below for more details.

Questions

  1. The notation $\mathcal{L}$ for the logits is confusing, as it is normally used to represent a loss function.

  2. Notation is not consistent. In line 69, P refers to the probability distribution, whereas in line 71 P_L denotes the logits distribution. Is this a typo?

  3. Equation 2 is confusing: I initially thought it represented the L^n - L^N introduced in line 90, but there is no description of this equation. I spent quite a while understanding that it is just a derivation of the gradient of the KL function and is not related to line 90.

  4. In Line 107, "In this context" and "In this formulation" duplicate.

Limitations

The authors acknowledge that the approximated P_inner is not perfect, but they do not provide sufficient experimental results to support this claim. Including more detailed experiments that explore the accuracy and limitations of this approximation would strengthen their argument.

Author Rebuttal

Dear Reviewer,

Thank you so much for taking the time to provide your feedback. Your comments and suggestions are invaluable to us, especially regarding our methodologies and presentations. We appreciate this opportunity to address your concerns.

"However, this does not imply that their difference is a good approximation of the gradient of KL divergence. If you believe this approximation is accurate, why not directly apply this approximated gradient to Equation 3?"

First Issue: Suitability of the Approximation

We are grateful for your insightful comments and are eager to further explain this aspect. Regarding why we interpret the difference $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$ as a suitable approximation for the gradient of the KL divergence, the core reason lies in our claim that the KL divergence from the real distribution to $P_{\mathcal{L}^{(N)}}$ is smaller than that to $P_{\mathcal{L}^{(n)}}$. Thus, $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$ serves as an approximate gradient.

We can verify this claim directly because:

$$KL(P_{real} \parallel P_{\mathcal{L}}) = CE(P_{real}, P_{\mathcal{L}}) - H(P_{real})$$

Here, CE represents cross-entropy and H denotes entropy. By comparing the cross-entropy across different layers, as illustrated in Figure 3 (which was unfortunately not referenced in our original manuscript), we find that the final layer exhibits a smaller loss value compared to earlier layers. This observation makes sense because, as discussed, the final layer directly engages with real-world labels through cross-entropy during training, making it more accurate. Additionally, as you mentioned, the final layer contains more “semantic” information, making it closer to the real-world labels.
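Since $H(P_{real})$ does not depend on the layer, the comparison of cross-entropies transfers directly to the KL divergences:

$$CE(P_{real}, P_{\mathcal{L}^{(N)}}) < CE(P_{real}, P_{\mathcal{L}^{(n)}}) \;\Longrightarrow\; KL(P_{real} \parallel P_{\mathcal{L}^{(N)}}) < KL(P_{real} \parallel P_{\mathcal{L}^{(n)}}).$$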

Thus, $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$ is a good approximation of the gradient of $KL(P_{real} \parallel P_{\mathcal{L}})$. When estimating the gradient of $KL(P_{inner} \parallel P_{\mathcal{L}})$, as per Equation 3, this gradient should be close to the gradient of $KL(P_{real} \parallel P_{\mathcal{L}})$ to benefit the decoding. Hence, we utilize $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$ as the source for approximation.

Second Issue: Application to Equation 3

Regarding why we do not directly apply $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$ in Equation 3, it is important to consider that while $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$ is unconstrained, the gradients estimated in Equation 2 (e.g., $p_1 - t_1, p_2 - t_2, ..., p_i - t_i, ..., p_d - t_d$) are constrained within 1. Thus, direct substitution could lead to a mismatch in magnitudes. Proper normalization and subsequent aggregation of estimations from different layers are precisely what our method addresses in Section 2.3. Our approach does not naively scale to normalize; it provides a more interpretable and computationally efficient method, aligning the directions of the gradients rather than their magnitudes to address their different scales.

To further address your concerns, we include a new ablation study by directly scaling the $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$ and simply averaging these scaled differences across different layers, a method we denote as "ablation1" in the following table.

"The authors claim that the approximated P_inner cannot be directly used as it is not perfect. I believe presenting experimental results would be more convincing than a verbal analysis.

Thank you so much for your suggestion. We include the corresponding ablation study following your suggestion and denote it as "ablation2". In this study, we directly use $P_{inner}$.

| | Factor | TruthfulQA (MC1) | TruthfulQA (MC2) | TruthfulQA (MC3) |
| --- | --- | --- | --- | --- |
| llama2-7B-chat + ablation1 | 62.73 | 33.66 | 39.83 | 31.47 |
| + ablation2 | 63.59 | 25.21 | 51.09 | 26.25 |
| + SED | 65.16 | 37.08 | 63.86 | 32.90 |
| llama2-13B-chat + ablation1 | 66.29 | 37.33 | 45.0 | 31.98 |
| + ablation2 | 66.70 | 27.05 | 52.72 | 28.46 |
| + SED | 67.06 | 37.09 | 63.75 | 32.60 |

"The notion of logits L\mathcal{L} "

We appreciate this suggestion. To avoid confusion, we will replace it in our revision.

"In line 69, Is this a typo?"

We apologize for this inconsistency. Yes, it was a typo. In our revision, we will clarify that $P_L$ in line 71 indeed refers to the probability distribution derived from the logits, maintaining consistency throughout the document.

"Equation 2 is confusing"

We apologize for any confusion caused by Equation 2. We have removed the subscript $\mathcal{L}$ from $\mathcal{P}_{\mathcal{L}}$ to clarify that it represents a general probability distribution, not specifically linked to 'logits' as previously implied. Additional explanations will be added to elucidate that this equation is a derivation of the gradient of the KL divergence function and is unrelated to the discussions around line 90.

"In Line 107, 'In this context' and 'In this formulation' duplicate."

Thank you for pointing out the redundancy. We will revise this part to enhance clarity and avoid duplication.

Lastly, we would like to express our heartfelt gratitude for the time and effort you have dedicated to reviewing our paper. We deeply appreciate your guidance and assure that all new results and findings will be included during our revision. Thank you!

Sincerely,

Authors

Comment

Dear Reviewer,

Thank you so much for your time and engagement in our discussion. We noticed an issue with the display of mathematical symbols in the "First Issue: Suitability of the Approximation" section of our rebuttal. Therefore, we would like to re-explain with the correct display to ensure it is easy to read.

"However, this does not imply that their difference is a good approximation of the gradient of KL divergence. If you believe this approximation is accurate, why not directly apply it to Equation 3?"

First Issue: Suitability of the Approximation

Regarding why we interpret the difference $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$ as a suitable approximation for the gradient of the KL divergence, the core reason lies in our claim that the "KL divergence from the real distribution to the final layer's logits distribution" is smaller than the "KL divergence from the real distribution to the early layers' logits distributions", which means $KL(P_{real} \parallel P_{\mathcal{L}^{(N)}}) < KL(P_{real} \parallel P_{\mathcal{L}^{(n)}})$. Thus, based on our experience with gradient descent algorithms, we adopt $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$ to serve as an approximation for the gradient direction.

We think the above claim makes sense because we notice that the cross-entropy satisfies $CE(P_{real}, P_{\mathcal{L}^{(N)}}) < CE(P_{real}, P_{\mathcal{L}^{(n)}})$. Then, based on the following relationship between the KL divergence and the cross-entropy:

$$KL(P_{real} \parallel P_{\mathcal{L}}) = CE(P_{real}, P_{\mathcal{L}}) - H(P_{real})$$

(H denotes entropy), we can derive $KL(P_{real} \parallel P_{\mathcal{L}^{(N)}}) < KL(P_{real} \parallel P_{\mathcal{L}^{(n)}})$.

As for the reason why $CE(P_{real}, P_{\mathcal{L}^{(N)}}) < CE(P_{real}, P_{\mathcal{L}^{(n)}})$: first, we verify this by empirically comparing the cross-entropy across different layers. As illustrated in Figure 3 (we will add the missing references in our revised manuscript), we find that the final layer exhibits a smaller CE loss value compared to earlier layers. This observation makes sense because, as discussed, the final layer directly engages with real-world labels through cross-entropy during training, making it more accurate. Additionally, as you mentioned, the final layer contains more "semantic" information, making it closer to the real-world labels.

Based on the above discussion, we think $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$ is a good approximation of the gradient of $KL(P_{real} \parallel P_{\mathcal{L}^{(N)}})$. When estimating the gradient of $KL(P_{inner} \parallel P_{\mathcal{L}^{(N)}})$ in Equation 3, this gradient should be close to the gradient of $KL(P_{real} \parallel P_{\mathcal{L}^{(N)}})$ to benefit the decoding, because the inner knowledge $P_{inner}$ should be close to real-world knowledge to avoid making errors. Hence, we utilize $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$ as the source for approximation.

Second Issue: Application to Equation 3

Regarding why we do not directly apply $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$ in Equation 3, it is important to consider that while $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$ is unconstrained, the gradients of the KL divergence (e.g., $p_1 - t_1, p_2 - t_2, ..., p_i - t_i, ..., p_d - t_d$ in Equation 2) are constrained within 1. Thus, direct substitution could lead to a mismatch in magnitudes. Proper normalization and subsequent aggregation of estimations from different layers are exactly what our method addresses in Section 2.3. Our approach does not naively scale to normalize; it provides a more interpretable and computationally efficient method, aligning the directions of the gradients rather than their magnitudes to address their different scales.

To further address your concerns, we include a new ablation study by directly scaling the $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$ and simply averaging these scaled differences across different layers, a method we denote as "ablation1" in the following table. The result shows that the direct application of $\mathcal{L}^{(n)} - \mathcal{L}^{(N)}$ in Eq. 3 is not as effective as our method.

| | Factor | TruthfulQA (MC1) | TruthfulQA (MC2) | TruthfulQA (MC3) |
| --- | --- | --- | --- | --- |
| llama2-7B-chat + ablation1 | 62.73 | 33.66 | 39.83 | 31.47 |
| + SED | 65.16 | 37.08 | 63.86 | 32.90 |
| llama2-13B-chat + ablation1 | 66.29 | 37.33 | 45.0 | 31.98 |
| + SED | 67.06 | 37.09 | 63.75 | 32.60 |

Lastly, we sincerely thank you for reviewing and discussing our paper. Your valuable suggestions have greatly enhanced our methodology presentation, particularly the ablation studies, which make our paper more comprehensive. We deeply appreciate your guidance and will incorporate all the above discussion during our revision. Should you have any further comments or questions, feel free to raise them. We are committed to addressing any concerns.

Sincerely,

Authors

Comment

Thank you for the detailed response, which addresses my concerns. As a result, I have slightly increased my assessment. However, I still believe this paper requires significant editing to include the necessary discussions and further polishing.

Comment

Dear Reviewer,

Firstly, we are very grateful for your time and timely response. We are pleased to have addressed most of your concerns. Respectfully, we wish we could further address your concerns regarding the extent of modifications required.

  1. Methodology: We will integrate key formulas and discussions seamlessly into the current content, ensuring that these crucial elements are highlighted effectively.

  2. New Results and Ablation Studies: We will prioritize the enhancements you suggested by emphasizing these new results and moving less critical details to the appendix.

  3. Feedback from Other Reviewers: We will ensure that the most important results are included in the main text. If space limitations necessitate placing some results in the appendix, we will ensure they are clearly cited in the main text, providing explicit references to their exact locations in the appendix.

Respectfully, considering that we can complete the primary rebuttal within the 6000-character limit set by NeurIPS and that most papers require presenting more detailed discussions and results in their rebuttals, we hope to assure you that the extent of the modifications will be manageable and not as significant as perceived.

Unfortunately, we are unable to update the file with the latest edits during the rebuttal and discussion period. Should you have any further concerns, we can attempt to show the edited sections, especially the methodology part, directly in the "Official Comments" here to further resolve your concerns. We really appreciate your understanding.

Thank you once again for your time and efforts. We truly appreciate it and look forward to resolving your concerns further.

Sincerely,

Authors

Comment

Dear Area Chair,

Thank you so much for your time and efforts in facilitating the review and discussion process. We really appreciate it.

Dear Reviewer,

We sincerely thank you for reviewing and discussing our paper. Your suggestions have been incredibly constructive in enhancing our work. In our rebuttal, we have provided detailed explanations addressing your concerns, including further clarification on the motivation and presentation of our methodology, and additional ablation studies to support our methods, following your valuable suggestions. We will definitely incorporate all the discussions and new results in our revision. Thank you so much for your valuable suggestions! Should you have any questions, whether about further explanations, additional results, or any other aspects that you find unclear, do not hesitate to raise them. We are committed to resolving any issues to your satisfaction. We understand that you are very busy, so we deeply appreciate your time and effort.

(We have noticed that sometimes the formulas in our rebuttal may not display correctly on OpenReview. A refresh of your browser may resolve them. However, if you continue to face difficulties or need more detailed explanations of these formulas, please let us know. We are prepared to provide all necessary support to make our research clear and understandable.)

Sincerely,

Authors

Comment

Hello Reviewer Xg4n,

Please take a moment to read and acknowledge the author's response to your review.

Thanks, Area Chair

Official Review (Rating: 7)

This paper introduces Self-Evolution Decoding (SED), a novel decoding strategy aimed at enhancing the factual accuracy of large language models (LLMs) without the need for external knowledge bases or additional fine-tuning. SED optimizes the outputs of LLMs by refining the logits from the final layer through the inherent self-evolution of hidden states, effectively reducing hallucinations and refocusing the probability mass on factual responses. Evaluations on several benchmarks, including TruthfulQA and FACTOR, demonstrate that SED outperforms existing methods like DoLa, achieving up to a 10% improvement in factual accuracy. The method is also compatible with other factuality-enhancing techniques, further boosting their effectiveness. While the empirical results are promising, the paper notes some computational overhead and calls for further theoretical analysis to better understand the mechanisms behind SED's success.

Strengths

  • Novelty: The paper presents a novel decoding strategy, SED, which improves the factual accuracy of large language models without requiring additional fine-tuning or external knowledge bases. This approach fills a crucial gap in the existing methodologies for improving the reliability and truthfulness of LLM outputs.

  • Comprehensive evaluation: The effectiveness of SED is validated across multiple benchmarks such as TruthfulQA, FACTOR, StrategyQA, and GSM8K. The results show that SED outperforms existing methods like DoLa and other baseline strategies, demonstrating significant improvements in factual accuracy and overall performance.

  • Compatibility: A notable strength of SED is its compatibility with other factuality-enhancing methods. The paper demonstrates how SED can be integrated with methods like Inference-Time Intervention and Activation Decoding.

Weaknesses

  • Lack of theoretical analysis: While the empirical results are robust, the paper lacks a rigorous theoretical analysis to explain why SED improves the factual accuracy of LLMs. A better understanding of the underlying mechanics and theoretical justification for the approach would strengthen the contribution.

  • Experimental reproducibility and statistical significance.

  • Computation efficiency: Although SED improves factual accuracy, this comes at the cost of increased computational complexity compared to methods like DoLa. The paper mentions that SED operates slightly slower, which could be a drawback for applications requiring real-time performance. Further benchmarking on computational costs and scalability would be useful for assessing the practical applicability of SED.

Questions

Could you provide more insights or theoretical justifications for the functioning and effectiveness of SED? Are there any specific properties of the inner knowledge distribution ($P_{inner}$) that you believe contribute to the success of SED?

Limitations

Computational inefficiency is the biggest limitation of the proposed method.

Comment

Dear Reviewer,

Thank you very much for your time and supportive comments. We appreciate your suggestions and are committed to improving our paper to meet your expectations.

"A better understanding of the underlying mechanics and theoretical justification for the approach would strengthen the contribution."

We are grateful for your advice and plan to incorporate the following analyses to enhance the understanding of our approach:

The principal insight is that pre-trained LLMs exhibit variations in token distributions across different layers, particularly when comparing the output layer with the earlier layers. We have discovered that contrasting the early layers with the final layer can yield a more factual distribution over specific tokens.

We provide a demonstration in the attached PDF to further reveal how the SED mechanism benefits from this approach.

  1. By contrasting the final layer with each of the early layers, we estimate an inner distribution: $\bar{t}^{(n)} = \frac{1}{w^{(n)}} (\bar{t}^{(n)}_1, \bar{t}^{(n)}_2, ..., \bar{t}^{(n)}_i, ..., \bar{t}^{(n)}_d)$. This is more precise and can be demonstrated by comparing Figure 1(a) and Figure 2(a) in the attached PDF. Our analysis in Section 2.4, Question 1, also delves deeper into this aspect. For most early layers, the estimated inner distribution tends to assign a higher probability to the correct tokens. However, DoLa's estimates are imprecise, leading to a heavy reliance on the selection of candidate layers. As shown in Figures 2(a) and 2(b), DoLa's choice to contrast the final layer with the zeroth layer results in both incorrect and correct tokens having the same probability.

  2. For different layer estimates, when we ensemble different layers, we do not simply average them. Instead, our SED method determines weights by calculating the cosine similarity, identifying layers with imprecise estimates. Thus, their influence is diminished, as illustrated in Figures 1(a) and 1(b), where the weights are reduced for layers that misestimate the inner distribution.

  3. Our approach integrates the inner distribution with the original distribution, as opposed to DoLa's method, which simply replaces the original distribution with the inner one. This integration is discussed in Section 2.4, Question 2.
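As a rough illustration of the integration in point 3 (a sketch only, not the exact form of Equation 3), one can think of a small gradient-style update on the final-layer logits that pulls the output distribution toward the estimated inner distribution; the single step and the step size here are assumptions:

```python
import torch

def evolve_logits(final_logits, inner_dist, step_size=0.1):
    # Illustrative single gradient step that decreases KL(inner_dist || softmax(logits)).
    # The gradient of that KL with respect to the logits is softmax(logits) - inner_dist,
    # so the update nudges the original logits toward the estimated inner distribution
    # instead of replacing the original distribution outright.
    p = torch.softmax(final_logits, dim=-1)
    return final_logits - step_size * (p - inner_dist)
```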

"Further benchmarking on computational costs and scalability would be useful for assessing the practical applicability of SED. " Thank you so much for your suggestions. We have benchmarked our method against baseline models and Dola, and our approach does not significantly increase computational time—less than a 10% increase.

Latency (ms/token):

| Model & Methods | DoLa | SED (topk=5) | SED (topk=20) | SED (topk=50) |
| --- | --- | --- | --- | --- |
| LLaMA-2-7B | 29.93 | 30.41 | 31.15 | 32.70 |
| LLaMA-2-13B | 39.57 | 39.61 | 41.14 | 43.30 |
| LLaMA-2-70B | 136.42 | 138.33 | 140.24 | 143.12 |

This efficiency is largely due to several factors:

  • Optimized Operations: Most operations, including the calculation of gradients and cosine similarity, are accelerated using PyTorch. This optimization is crucial for maintaining low computational overhead.
  • Vectorized Operations: By utilizing vector operations extensively, we avoid excessive reliance on for-loops, which enhances the computation speed.

Despite the increase in computational load with a larger top-k, as demonstrated in our parameter analysis in Figure 5, a large top-k is unnecessary. In fact, increasing top-k can introduce more noise, reducing the effectiveness of our model. Therefore, our approach maintains a balanced computational overhead, making it feasible for practical applications.
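As a concrete (hypothetical) example of the vectorization mentioned above, not our actual code, the per-layer, per-candidate cosine similarities can be computed in a single broadcasted PyTorch call rather than with nested Python loops:

```python
import torch
import torch.nn.functional as F

def layerwise_cosine_scores(layer_logits, final_logits, k=5):
    # layer_logits: [n_layers, vocab] (early layers); final_logits: [vocab]
    p = torch.softmax(final_logits, dim=-1)
    topk = torch.topk(p, k).indices
    diffs = layer_logits - final_logits                   # [n_layers, vocab]: L^(n) - L^(N)
    e = torch.zeros(k, p.numel())
    e[torch.arange(k), topk] = 1.0                        # one-hot candidates for the top-k tokens
    directions = p.unsqueeze(0) - e                       # [k, vocab]: p - e_i for each candidate
    # broadcasted cosine similarity over every (layer, candidate) pair
    scores = F.cosine_similarity(diffs.unsqueeze(1), directions.unsqueeze(0), dim=-1)
    return topk, scores.clamp(min=0.0)                    # [n_layers, k]
```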

We really appreciate your suggestions which are very important to our work. We are committed to incorporating these expanded discussions and findings in our revised manuscript to provide a more comprehensive understanding of our method. Thank you so much for your time and efforts.

Sincerely,

Authors

Comment

Hello Reviewer 12rF,

Please take a moment to read and acknowledge the author's response to your review.

Thanks, Area Chair

Comment

Thank you for the response and additional results. The authors have addressed my concern. I believe my ratings are still fair and decide not to change my scores.

Comment

Dear Area Chair,

Thank you so much for your time and efforts in facilitating the review and discussion process. We really appreciate it.

Dear Reviewer,

We sincerely thank you for your time and efforts in reviewing and discussing our paper. Your comments and your suggestions are incredibly constructive in enhancing our work. We will definitely incorporate all the discussions and new results in our revision following your suggestions. Thank you so much!

Sincerely,

Authors

Official Review (Rating: 4)

It introduces a novel decoding strategy named Self-Evolution Decoding (SED) aimed at enhancing the reliability and truthfulness of Large Language Models (LLMs). Unlike methods that depend on external knowledge bases or additional fine-tuning, SED is an intrinsic optimization technique that capitalizes on the self-evolution of LLMs' hidden states. The method refines the output during inference, akin to continued training, which improves accuracy and interpretability without sacrificing natural language fluency.

Strengths

The SED method is an original contribution that tackles the issue of factuality in LLMs by introducing a new decoding approach. This is a novel way to enhance outputs without relying on external data or model retraining.

The concept of optimizing an implicit objective function using the self-evolution of LLMs is creative and presents a new angle for improving model outputs during inference.

Weaknesses

While the empirical results are positive, there is not enough theoretical analysis provided to support the method's effectiveness.

The paper does not report error bars or measures of statistical significance, which are important for understanding the variability of the results.

SED may introduce additional computational overhead during inference, which could be a concern for real-time applications.

Questions

Please see weaknesses.

Limitations

While SED has been tested on multiple datasets, there may be a need for further evaluation on an even broader range of datasets to ensure the method's generalizability.

The paper does not provide a rigorous optimization analysis of SED, which could help understand why it leads to more factual outputs.

Author Rebuttal

Dear Reviewer,

Thank you so much for taking the time to provide your feedback. Your comments and suggestions are invaluable to us, especially regarding our methodologies. We appreciate this opportunity to address your concerns.

"SED may introduce additional computational overhead during inference, which could be a concern for real-time applications."

Thank you for raising this concern. We have benchmarked our method against standard decoding and DoLa by measuring the latency (ms/token) across different configurations on different sizes of models. Our findings indicate that our approach increases computational time by less than 10%.

| Model | DoLa | SED (topk=5) | SED (topk=20) | SED (topk=50) |
| --- | --- | --- | --- | --- |
| LLaMA-2-7B | 29.93 | 30.41 | 31.15 | 32.70 |
| LLaMA-2-13B | 39.57 | 39.61 | 41.14 | 43.30 |
| LLaMA-2-70B | 136.42 | 138.33 | 140.24 | 143.12 |

This minimal increase is due to:

  • Optimized Operations: Most operations, including the calculation of cosine similarity, are accelerated using PyTorch, which is crucial for maintaining low computational overhead.
  • Vectorized Operations: By extensively utilizing vector operations, we reduce reliance on for-loops, thereby enhancing computation speed.

Although there is an increase in computational cost with a larger top-k, our parameter analysis in Figure 5 shows that a large top-k, such as 50, is not necessary. Thus our approach maintains an acceptable computational overhead, making it feasible for real-time applications.

"While the empirical results are positive, there is not enough theoretical analysis provided to support the method's effectiveness."

We are grateful for your advice and plan to enhance our analysis to better understand why SED improves the factual accuracy of a Large Language Model (LLM). The principal insight is that pre-trained LLMs exhibit variations in token distributions across different layers, particularly when comparing the output layer with the earlier layers. We have discovered that contrasting the early layers with the final layer can yield a more factual distribution over specific tokens.

We provide a demonstration in the attached PDF to further reveal how the SED mechanism benefits from this approach.

  1. By contrasting the final layer with each of the early layers, we estimate an inner distribution:

    $\bar{t}^{(n)} = \frac{1}{w^{(n)}} (\bar{t}^{(n)}_1, \bar{t}^{(n)}_2, ..., \bar{t}^{(n)}_i, ..., \bar{t}^{(n)}_d)$

    This is more precise and can be demonstrated by comparing Figure 1(a) and Figure 2(a) in the attached PDF. Our analysis in Section 2.4, Question 1, also delves deeper into this aspect. For most early layers, the estimated inner distribution tends to assign a higher probability to the correct tokens. However, DoLa's estimates are imprecise, leading to a heavy reliance on the selection of candidate layers. As shown in Figures 2(a) and 2(b), DoLa's choice to contrast the final layer with the zeroth layer results in both incorrect and correct tokens having the same probability.

  2. For different layer estimates, when we ensemble different layers, we do not simply average them. Instead, our SED method determines weights by calculating the cosine similarity, identifying layers with imprecise estimates. Thus, their influence is diminished, as illustrated in Figures 1(a) and 1(b), where the weights are reduced for layers that misestimate the inner distribution.

  3. Our approach integrates the inner distribution with the original distribution, as opposed to DoLa's method, which simply replaces the original distribution with the inner one. This integration is discussed in Section 2.4, Question 2.

"While SED has been tested on multiple datasets, there may be a need for further evaluation on an even broader range of datasets to ensure the method's generalizability."

Thank you for this valuable suggestion. We have conducted additional experiments on more realistic open-ended generation datasets: HotpotQA, Natural Questions (NQ), and TriviaQA (Trivia). We adopt additional evaluation metrics: Exact Match (EM) and F1.

| Model | HotpotQA EM | HotpotQA F1 | NQ EM | NQ F1 | Trivia EM | Trivia F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 2 7B chat | 19.6 | 20.1 | 21.8 | 20.4 | 44.4 | 44.3 |
| + DoLa | 20.4 | 21.3 | 23.5 | 21.5 | 45.2 | 45.3 |
| + SED (ours) | 20.9 | 21.5 | 24.4 | 22.2 | 47.6 | 46.3 |
| Llama 2 13B chat | 23.8 | 21.7 | 33.1 | 28.9 | 63.0 | 60.9 |
| + DoLa | 24.5 | 23.2 | 33.1 | 28.9 | 63.2 | 61.5 |
| + SED (ours) | 25.0 | 24.5 | 34.6 | 31.6 | 63.3 | 62.2 |

The results show that our method improves the performance in more realistic open-ended generation tasks.

Sincerely,

Authors

Comment

Dear Reviewer,

Thank you for your thoughtful comments and the time you've invested in reviewing our manuscript. In response to your suggestions, we would like to provide some context and additional explanations regarding the error bars or other measures of statistical significance in our initial submission.

The reason we did not include error bars in our initial manuscript is that we followed the general settings from the recent studies on factuality decoding, such as DoLa [1], ITI [2], AD [3], and ICD [4] methods. These papers typically do not report error bars either. This is mainly because current factuality decoding approaches are more focused on the paradigm of greedy decoding, where outputs are deterministically selected based on maximum likelihood. Consequently, the output variability is inherently limited by the deterministic nature of the decoding method. Therefore, to maintain a fair comparison with these established methods, we also chose not to include error bars in our analysis.

Considering your advice on statistical significance, we have explored the potential for variability due to different data subsets by employing the bootstrap method, which involves multiple resamplings of the same dataset. In our revised analysis, we calculated the 95% confidence intervals (95%CI) using the bootstrap method with 1,000 bootstrap samples on Factor Dataset. Each sample was generated by randomly resampling the data, with replacement, to simulate the effects of data variability. We have reported the results of different methods using this approach in the table below:

| | 95% CI | | 95% CI |
| --- | --- | --- | --- |
| llama2-7B-base | [56.44, 59.89] | llama2-7B-chat | [54.90, 58.48] |
| + DoLa | [61.39, 64.76] | + DoLa | [54.81, 58.32] |
| + SED | [65.56, 68.98] | + SED | [63.46, 66.77] |
| llama2-13B-base | [61.88, 65.43] | llama2-13B-chat | [60.35, 63.79] |
| + DoLa | [55.21, 58.89] | + DoLa | [56.21, 59.75] |
| + SED | [69.17, 72.41] | + SED | [65.40, 68.74] |

Our findings show:
  1. Superiority of Our Method (SED): Across all configurations, whether using the LLaMA 7B or 13B models, our approach SED consistently achieves higher lower bounds and upper bounds in the confidence intervals compared to the base models and those augmented with DoLa only. This consistent outperformance suggests that the SED significantly improves the model's accuracy.

  2. Non-Overlapping Confidence Intervals: In most cases, the confidence intervals of our SED-enhanced method do not overlap with those of the other methods. This lack of overlap is statistically significant as it indicates that the improvements observed with our method are not due to random variations within the data, but are a result of the SED augmentation.

The clear separation and higher confidence intervals associated with our method suggest that the differences in performance are statistically significant.
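For reference, a minimal sketch of the percentile-bootstrap computation described above, assuming per-example 0/1 correctness scores (the exact resampling details we used may differ slightly):

```python
import numpy as np

def bootstrap_ci(correct, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample per-example correctness with replacement
    n_boot times and take the (alpha/2, 1 - alpha/2) percentiles of accuracy."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    n = len(correct)
    accs = [rng.choice(correct, size=n, replace=True).mean() for _ in range(n_boot)]
    return np.percentile(accs, 100 * alpha / 2), np.percentile(accs, 100 * (1 - alpha / 2))
```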

Lastly, we sincerely appreciate the time and effort you have dedicated to reviewing and discussing our paper. Your suggestions have been immensely valuable to our research, and we plan to incorporate all the discussions and new results into our revised paper. Thank you so much. If you have any further questions or concerns, whether about the methodology, presentation, or additional experimental results, do not hesitate to raise them. We are committed to addressing your concerns and meeting your expectations. Once again, we deeply appreciate your feedback and look forward to your suggestions.

Sincerely,

Authors

[1] Yung-Sung Chuang et al., "Dola: Decoding by contrasting layers improves factuality in large language models," 2024.

[2] Kenneth Li et al., "Inference-time intervention: Eliciting truthful answers from a language model," 2023.

[3] Shiqi Chen et al., "In-context sharpness as alerts: An inner representation perspective for hallucination mitigation," 2024.

[4] Yue Zhang et al., "Alleviating hallucinations of large language models through induced hallucinations," 2023.

Comment

Dear Area Chair,

Thank you so much for your time and efforts in facilitating the review and discussion process. We really appreciate it.

Dear Reviewer,

We sincerely appreciate your time and efforts in reviewing and discussing our paper. Your insights have been incredibly constructive and have significantly enhanced our work. In our rebuttal, we provided detailed explanations addressing your concerns, including:

  1. Providing results on the computational cost to demonstrate that our method is acceptable for real-time applications, along with the reasons for its efficiency.
  2. Providing more theoretical analysis and demonstrations to further elucidate how our method works and why it is effective.
  3. Following your valuable suggestions, we have included more experimental results on more realistic open-ended generation scenarios.
  4. Following your valuable suggestions, we have also included a discussion of error bars and statistical significance to ensure the reliability of our method.

Thank you so much for your suggestions. We will definitely incorporate all the discussions and new results in our revision. Should you have any questions, whether about further explanations, additional results, or any other aspects that you find unclear, do not hesitate to raise them. We are committed to resolving any issues to your satisfaction. Thank you so much!

Sincerely,

Authors

Comment

Dear Reviewer Mgbd,

We hope our rebuttal addresses your concerns effectively. As we are approaching the end of the author-reviewer discussion period, we would like to see if you have any remaining questions or comments you'd like us to clarify/discuss. We understand that you are very busy, so we would really appreciate it if you could take a minute to review our rebuttal, as suggested by the Area Chair.

In our rebuttal, we provided the detailed explanations addressing your concerns, including:

  1. Providing results on the computational cost to demonstrate that our method is acceptable for real-time applications, along with the reasons for its efficiency.

  2. Providing more theoretical analysis and demonstrations to further elucidate how our method works and why it is effective.

  3. Following your valuable suggestions, we have included more experimental results on more realistic open-ended generation scenarios.

  4. Following your valuable suggestions, we have also included a discussion of error bars and statistical significance to ensure the reliability of our method.

Your insights would be valuable for our submission, and we would greatly appreciate any further feedback you may provide before the end of the discussion. Thank you so much for your time and efforts! We really appreciate it!

Sincerely,

Authors

Comment

Hello Reviewer Mgbd,

Please take a moment to read and acknowledge the author's response to your review.

Thanks, Area Chair

Author Rebuttal

Dear Reviewers,

We express our gratitude towards all reviewers for their time reviewing our submission and providing constructive feedback. Along with our rebuttal, we are including a PDF file containing figures that further support our analysis of why the SED method is effective. This additional material aims to provide a deeper understanding of the mechanisms underlying SED's efficacy.

Sincerely, Authors

Final Decision

SUMMARY

The paper introduces a new decoding method for generating text from LLMs. The main idea is to modify the final output logits, which are used for generation, by adding in an additional term that seeks to optimize early-exit logits.

REASONS TO ACCEPT

The reviewers agree that the paper provides an original contribution, and that the new decoding method is particularly interesting because it doesn't require a separate knowledge base or additional finetuning to improve an LM's access of its internal knowledge at generating-time. The reviewers also note the impact of the problem being tackled (enhancing factuality in LM generations) and that the proposed method can be used in conjunction with other factuality methods, making it even more useful.

REASONS TO REJECT

Two of the reviewers point out that the proposed method ought to be backed by theoretical analysis of why it works. One reviewer requested further benchmarking on computational costs and scalability of the method. Multiple reviewers mentioned not understanding aspects of the method. They show their method improves answering ability on TruthfulQA and CoT.

CONCLUSION

As a note to the authors, the need for significant modifications is a valid reason to reject a paper, and it is well within Xg4n's purview as a reviewer to suggest that a submission that needs significant modifications to pass the bar for publication at NeurIPS ought to undergo an additional round of reviews. However, in this case, the changes made by the authors are sufficiently close to the original paper that I believe this should not be grounds for rejection. If the modifications and clarifications requested by reviewers are incorporated into the paper, I think it warrants acceptance to NeurIPS.