Resolving Knowledge Conflicts in Large Language Models
We propose a three-step protocol for resolving LLM knowledge conflicts, and evaluate and improve LLMs' ability to fulfill these goals.
Abstract
Reviews and Discussion
The paper investigates the knowledge conflicts in large language models, an interesting and hot topic in the LLM research community. Specifically, it conducts the investigation in a fine-grained manner, i.e., identifying knowledge conflicts, pinpointing conflicting information segments, and providing distinct answers or viewpoints in conflicting scenarios. Based on the experimental analyses, it further proposes instruction-based approaches to improve LLM's ability to handle knowledge conflicts.
Reasons to Accept
- The topic of resolving knowledge conflicts in large language models is interesting and promising.
- Breaking down the evaluation in a fine-grained manner could reveal some detailed factors of knowledge conflict handling in LLMs.
- The evaluation is comprehensive, including the dataset and experimental analyses.
Reasons to Reject
- The title (or goal) of this paper is resolving knowledge conflicts in large language models, but it puts relatively little effort into resolving methods.
- The datasets cannot simulate the complexity of real-world knowledge conflicts, and existing works could be consulted to improve the dataset construction, e.g., [1-3].
[1] Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts
[2] Tug-of-War Between Knowledge: Exploring and Resolving Knowledge Conflicts in Retrieval-Augmented Language Models
[3] Untangle the KNOT: Interweaving Conflicting Knowledge and Reasoning Skills in Large Language Models
Questions to Authors
No detailed questions.
Thank you so much for your feedback and comments!!
title/goal: In this work, we propose to resolve knowledge conflicts by identifying knowledge conflicts, pinpointing conflicting segments, and generating distinct responses. Neither parametric nor non-parametric knowledge is guaranteed to be true, and LLMs should grant users the agency to make informed decisions. To this end, we introduce a framework for knowledge conflict simulation and quantitative evaluation. We further propose new instruction-based approaches and run analyses across different knowledge domains and conflict generation methods. The entire study works towards resolving knowledge conflicts by identifying conflicts, pinpointing differences, and providing both-sided answers to aid users, rather than judging which side is correct.
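To make the three objectives concrete, here is a minimal Python sketch of how they can be posed as instructions to a model; the prompt wording and the helper name `build_task_prompts` are illustrative assumptions, not the exact prompts used in the paper.

```python
def build_task_prompts(context: str, question: str) -> dict:
    """Compose one illustrative instruction per task for a retrieved context."""
    return {
        # Task 1: identify whether the context conflicts with parametric knowledge.
        "identify": (
            f"Context: {context}\nQuestion: {question}\n"
            "Does this context conflict with your own knowledge? Answer yes or no."
        ),
        # Task 2: pinpoint the specific conflicting segments.
        "pinpoint": (
            f"Context: {context}\nQuestion: {question}\n"
            "Quote the exact sentences in the context that conflict with your knowledge."
        ),
        # Task 3: provide distinct answers grounded in each knowledge source.
        "distinct": (
            f"Context: {context}\nQuestion: {question}\n"
            "Give two clearly labeled answers: one based only on the context, "
            "and one based only on your own knowledge."
        ),
    }
```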
complexity of real-world knowledge conflict: Thanks for pointing this out; we also discussed it in the limitations section. In the real world, knowledge conflicts arise from misinformation, varying perspectives, time-sensitive information, or knowledge updates. Our dataset, employing multiple conflict generation methods and settings, can cover these scenarios to a large extent, since the majority of them are entity-based changes. We also ran an additional experiment on varying perspectives, which go beyond entity-level edits. Experimental results align with our conclusion that LLMs can perform well above random in identifying the existence of knowledge conflicts. Currently, there is still a lack of real-world knowledge conflict datasets, and all existing works simulate knowledge conflicts synthetically, including the works suggested by the reviewer:
[1] substitutes memory answers with a same-type entity for PopQA and flips the memory answer for StrategyQA, and then instructs ChatGPT to generate counter-memory from scratch.
[2] also employs ChatGPT to create counterfactual answers and conflicting evidence.
[3] uses word-level edits as well, by randomly selecting an entity from the ego network and replacing it with a different entity.
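These works share an entity-level substitution pattern; as a minimal illustration (hypothetical function and variable names, not code from any of these papers):

```python
import random

def make_conflicting_passage(passage: str, answer_entity: str,
                             same_type_entities: list[str]) -> str:
    """Swap the memorized answer entity for another entity of the same type,
    turning the passage into conflicting (counter-memory) evidence."""
    candidates = [e for e in same_type_entities if e != answer_entity]
    substitute = random.choice(candidates)
    return passage.replace(answer_entity, substitute)

# e.g., "Paris" (the memory answer) replaced by another capital city.
conflicting = make_conflicting_passage(
    "The capital of France is Paris.", "Paris", ["Paris", "Rome", "Madrid"])
```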
Though these works also employ synthetic conflicts, they are valuable references, and we will cite and discuss them in the final version. Because different models have different parametric knowledge, and it is hard to fully and accurately elicit the parametric knowledge of each LLM, real-world knowledge conflict datasets are hard to construct, and we are eager to see future work on this.
Dear reviewer,
We would like to once again thank you for your valuable feedback, and hope you can get a chance to see our responses to your concerns.
We also ran an additional experiment on varying perspectives, a type of knowledge conflict beyond word-level edits, using the PrimeValue dataset [1] to further address your concern about the complexity of real-world knowledge conflicts. The dataset includes controversial situations such as “giving money to your son”, “giving blood”, and “electing Barack Obama”. We elicit the parametric viewpoints of GPT-4-turbo and employ GPT-3.5-turbo to create opposite viewpoints following [2], yielding an evaluation dataset of 350 positive and 350 negative samples. CoT prompting results in a precision of 0.98, a recall of 0.78, and an F1 score of 0.87 on Task 1. This aligns with our conclusion that LLMs can perform well above random in identifying the existence of knowledge conflicts, though under this setting the results are limited by the assumption that the LLM holds only one firm viewpoint on each situation. Please see details below.
|  | n | TP | TN | FP | FN | Acc | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 698 | 61 | 348 | 1 | 288 | .59 | .98 | .18 | .30 |
| Few-shot | 700 | 173 | 347 | 3 | 177 | .74 | .98 | .49 | .66 |
| CoT | 700 | 272 | 345 | 5 | 78 | .88 | .98 | .78 | .87 |
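For clarity, the metrics above follow the standard confusion-matrix definitions; a quick sketch reproducing the CoT row:

```python
# Values taken from the CoT row of the table above.
tp, tn, fp, fn = 272, 345, 5, 78
n = tp + tn + fp + fn                                # 700
acc = (tp + tn) / n                                  # ~0.88
precision = tp / (tp + fp)                           # ~0.98
recall = tp / (tp + fn)                              # ~0.78
f1 = 2 * precision * recall / (precision + recall)   # ~0.87
print(n, round(acc, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```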
[1] Taylor Sorensen et al. Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties. AAAI 2024.
[2] Jian Xie et al. Adaptive Chameleon or Stubborn Sloth: Unraveling the Behavior of Large Language Models in Knowledge Conflicts. ICLR 2024.
Thank you for the responses. Locating the conflicts is certainly a step toward resolving them. However, with many similar prior works, I think more effort (or more pages) should be put into the resolving methods. As for the knowledge conflict simulation, the unique points of the proposed methods could be highlighted, and the common ideas should reference previous works. Thanks again for the detailed clarification and further experiments; I have updated my score accordingly to encourage the authors to make this a better work.
This paper studies the topic of knowledge conflict in large language models (LLMs). It first presents an evaluation benchmark to test models' ability to identify and handle conflicts between parametric knowledge and given content, which consists of three tasks and corresponding datasets. The paper also proposes methods to improve models' performance on each task.
Reasons to Accept
- Knowledge conflict is an important and interesting topic for LLMs, especially in retrieval-augmented generation.
- The benchmark for knowledge conflict is a valuable contribution that could be helpful for future research.
- The paper includes extensive experiments that provide robust evidence for the proposed methods.
Reasons to Reject
- The proposed approaches are similar to the methods of DisentQA (Neeman et al., 2022).
- The conflict is constructed by replacing entities, which is artificial and may not cover most real-world situations of knowledge conflict.
- The title is overstated, as the work does not actually "resolve knowledge conflict."
Questions to Authors
- Can you explain the major differences between your methods and those in DisentQA?
- Can you identify and address more types of conflicts beyond entity conflicts?
Thanks for your valuable feedback!!
DisentQA: As we mentioned in the related work, DisentQA is only relevant to the Task 3 setting, and it focuses on training data augmentation methods while we focus on instruction-based methods. We also introduce the entity shuffling approach to address their limitation that their method can only be employed when answers are named entities, and we break down experimental results by knowledge domain, which is not covered in their work. In addition, we generate synthetic knowledge datasets using our proposed framework instead of relying on the NQ dataset, since NQ has been reasonably addressed in existing works [1-2] and we want to refrain from assuming existing datasets as the parametric knowledge. We are the first to propose the desiderata and concrete objectives towards resolving knowledge conflicts, which is the main contribution of this work.
Conflict is artificial: We also discussed this in the limitations section. Our dataset can cover real-world scenarios to a large extent, since the majority of them are entity-based changes. Currently, there is a lack of real-world knowledge conflict datasets, and all existing works simulate knowledge conflicts synthetically [3-5], including DisentQA as suggested by the reviewer.
title: We use “resolving knowledge conflicts” instead of “knowledge conflicts resolved” since this work is about resolving knowledge conflicts by identifying conflicts, pinpointing differences, and providing both-sided answers to aid users, rather than judging which side is correct.
beyond entity conflicts: We conduct an additional experiment on varying perspectives utilizing the PrimeValue dataset [6], which includes controversial situations such as “giving money to your son” and “giving blood”. Experimental results align with our conclusion that LLMs can perform well above random in identifying the existence of knowledge conflicts, though under this setting, the results are limited to the assumption that the LLM only has one unique firm viewpoint on each situation.
|  | n | TP | TN | FP | FN | Acc | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 698 | 61 | 348 | 1 | 288 | .59 | .98 | .18 | .30 |
| Few-shot | 700 | 173 | 347 | 3 | 177 | .74 | .98 | .49 | .66 |
| CoT | 700 | 272 | 345 | 5 | 78 | .88 | .98 | .78 | .87 |
[1] Poolingformer. [2] No Answer is Better Than Wrong Answer. [3] Adaptive Chameleon or Stubborn Sloth. [4] DisentQA. [5] Entity-Based Knowledge Conflicts in Question Answering. [6] Value Kaleidoscope.
Thanks for your response. I have updated my score accordingly.
This paper provides an evaluation framework to assess LLMs' ability to acknowledge and identify knowledge conflicts, i.e., conflicts between knowledge from external sources and the parametric (internal) knowledge of the LLMs themselves. The authors propose that in such conflict scenarios, the LLM should acknowledge the existence of the conflict and generate multiple answers based on both perspectives. The paper introduces three tasks to evaluate these abilities of existing LLMs and proposes several novel strategies to improve performance on the three tasks. Finally, it delves deeper into analyzing performance on these tasks depending on factors such as the domain and other hyperparameters.
Quality: this paper is high quality in my opinion. It extends previous works in probing LLMs' behaviors under knowledge conflicts and carefully designs experimental materials and a series of tasks from a variety of domains. The experimentation and analysis are comprehensive and thorough, giving a full picture and detailed understanding of this problem space and of the performance of various methods (including a novel proposed one). It further proposes novel prompting methods which in most cases improve performance on these tasks. Finally, it proposes a reasonable desired behavior for LLMs in such scenarios.
Clarity: very clear. I especially appreciate the extensive information in the appendix.
Originality: medium
Significance: medium high
Reasons to Accept
Please see details above.
This paper provides a principled way to thoroughly analyze the behavior of LLMs by zooming in on a specific problem space. It also proposes a desired behavior of the LLMs and proposes a method to improve that behavior.
Reasons to Reject
n/a
Thank you so much for your positive feedback!! We hope our work can serve as a starting point towards rethinking and resolving knowledge conflicts. Instead of evaluating LLMs' preference between parametric and non-parametric knowledge under different scenarios, more attention should be paid to seeking the desirable behaviors of LLMs when knowledge conflicts arise, which, under our hypothesis, are identifying and pinpointing knowledge conflicts and generating distinct responses. We provide evaluation protocols, results and findings, as well as new instruction-based approaches towards this end.
This work studies the problem of knowledge conflict: discrepancies that arise between the internal parametric knowledge of LLMs and non-parametric information provided in the prompt context. The authors first describe the desired behavior of an LLM when such a conflict arises. They then introduce a simulated experimental environment to elicit conflicts and evaluate LLM behaviors. They find that it is hard for current LLMs to generate "robust" responses in scenarios with conflicts.
Reasons to Accept
- The studied problem is very important. The knowledge conflict between the internal parametric knowledge of LLMs and non-parametric in-context information is very common. It is also related to knowledge updating and forgetting via in-context prompting.
- The organization of the paper is very clear. The authors clearly demonstrate the design logic of the framework and evaluation.
- The evaluations are very comprehensive.
- The authors provide some interesting findings and also give simple and effective methods to overcome the shortcomings of current LLMs.
Reasons to Reject
This paper does not have any obvious shortcomings.
Questions to Authors
I am curious about the results of fine-tuning in Section 6. Could the authors provide an analysis on why FT does not perform very well and suggest some possible approaches to mitigate this issue?
Thank you so much for your insightful and thorough comments!! We are super glad to hear that you also value this topic and find our findings interesting. Regarding your question on the fine-tuning results, fine-tuning does lead to a substantial improvement over prompting methods in our experiments (the F1 score improves from 0.802 to 0.953 on Task 1 and from 0.621 to 0.745 on Task 2, as shown in Table 7). Though fine-tuning the model within the same knowledge domain may result in better performance, we employ training and testing data from different domains to avoid overfitting. Beyond instruction fine-tuning, we could also leverage the idea of alignment to further improve performance. For instance, we could use conflicting segments as positive exemplars and non-conflicting segments as negative exemplars to teach the model to pinpoint conflicting segments.
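As a rough sketch of that alignment idea (hypothetical field and function names following common preference-tuning data formats, not something implemented in the paper):

```python
def build_preference_example(context: str, question: str,
                             conflicting_segment: str,
                             nonconflicting_segment: str) -> dict:
    """Pair a pinpointing prompt with a chosen (conflicting) segment and a
    rejected (non-conflicting) segment for a preference-style objective."""
    prompt = (f"Context: {context}\nQuestion: {question}\n"
              "Quote the segment of the context that conflicts with your knowledge.")
    return {
        "prompt": prompt,
        "chosen": conflicting_segment,       # positive exemplar
        "rejected": nonconflicting_segment,  # negative exemplar
    }
```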
Thanks for your explanations. I will keep my score.
The authors construct a benchmark to simulate knowledge conflict that could be encountered by LLMs, and evaluate how LLMs behave in these scenarios. Through detailed analysis the authors show how LLMs act on conflicts both within the parameters and from the context. In addition, they propose some solutions accordingly.
The reviewers all acknowledge the motivation and experimental design. There has been some criticism, mostly of the design of the simulated dataset and the choice of entity conflicts, and of the differences from existing work. The author rebuttals mostly address these concerns.