PaperHub
Overall rating: 6.0 / 10 (Poster; 4 reviewers; lowest 3, highest 8, standard deviation 1.9)
Individual ratings: 6, 7, 8, 3
Confidence: 3.8
Correctness: 3.0
Contribution: 2.8
Presentation: 2.8
NeurIPS 2024

Perception of Knowledge Boundary for Large Language Models through Semi-open-ended Question Answering

Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

We explore the knowledge boundary of LLMs by investigating their responses to semi-open-ended questions, using an auxiliary model to uncover ambiguous answers and highlighting LLMs’ challenges in recognizing their knowledge limits.

Abstract

Keywords

large language model, knowledge boundary, question answering

Reviews and Discussion

Review (Rating: 6)

This paper investigates the knowledge boundaries of Large Language Models (LLMs) using semi-open-ended questions, revealing limitations in LLMs' understanding. The authors introduce a novel method involving an auxiliary model to discover low-probability ambiguous answers, constructing a dataset of 953 questions across 32 domains. Their findings show that GPT-4 frequently produces unqualified answers and struggles with accurate self-evaluation, highlighting the need for improved detection of LLM knowledge boundaries.

Strengths

  1. This work introduces a novel approach to evaluate LLMs using semi-open-ended questions, revealing limitations in existing methods focused on close-ended questions. The proposed method of identifying low-probability ambiguous answers is innovative and well-motivated.
  2. Experiments show significant improvements in understanding LLM knowledge boundaries, particularly where models like GPT-4 may hallucinate, enhancing evaluation reliability.
  3. The paper is well-written and easy to follow.

Weaknesses

  1. The evaluation scope of this paper is limited, as it only assesses the performance of GPT-4 Turbo, which may be insufficient to represent the overall performance of LLMs. Have the authors analyzed other mainstream large language models such as Claude or LLaMA? It would be better to provide the performance of other large language models on this dataset.

  2. The practical implications of the findings are not fully demonstrated. It is recommended to provide specific examples of applications to illustrate how these findings can improve the reliability of LLMs in practical use. Further explanation is needed on how improving training data or algorithms can mitigate these issues and enhance the practical utility of the models.

  3. The paper mentions that GPT-4 performs poorly on semi-open-ended questions, but it may not have thoroughly analyzed the different types of errors (such as factual errors, logical errors, etc.) and their causes.

Questions

  1. Have the authors analyzed the performance of large language models other than GPT-4 on semi-open-ended datasets?
  2. Can you provide a more detailed analysis of the types of errors made by GPT-4 on semi-open-ended questions?
  3. How do you ensure the reliability and consistency of human annotations in verifying ambiguous answers?

Limitations

Yes.

Author Response

Thank you for your thoughtful review and valuable questions! We have incorporated all the suggested experiments, and the results meet expectations. We are confident that we have addressed all of your concerns as outlined below. Based on our new experimental findings and explanations, we would appreciate it if you would reconsider the final evaluation of our work.

Q1: It is better to evaluate more LLMs.

Following your advice, we analyze the performance of the Claude model (claude-3-sonnet-20240229) on semi-open-ended questions and find it also performs poorly and generates many unsatisfying answers.

| Model | Unqualified Answer % (↓) | Inaccurate Evaluation % (↓) | Hallucinated Question % (↓) |
| --- | --- | --- | --- |
| GPT-4 | 40.15 | 28.47 | 82.90 |
| Claude | 37.15 | 33.73 | 88.00 |

The above table shows the evaluation results of two powerful LLMs. Results show that similar to GPT-4, Claude also performs poorly on the semi-open-ended questions.

Due to the time limit, we randomly sample 100 questions from our dataset and evaluate the performance of Claude. As even the most advanced LLMs perform poorly on this task, we need human annotation to fact-check each generated answer, which is time-consuming. Specifically, it takes an annotator 2 minutes to assess and double-check the truthfulness of each answer. Our dataset contains 1k questions, each of which corresponds to 13 tail-answers on average for verification. Even if we hired more than 5 annotators, it would take us 72 hours (more than 7 days if they work 10 hours per day).

Q2: Provide more application examples and practical implications of the findings.

Perceiving LLMs' knowledge boundaries is important for understanding and alleviating hallucination [1, 2]. Ambiguous answers to semi-open-ended questions are highly likely to lie beyond the knowledge boundaries of LLMs (see Sec 4.4). Discovering ambiguous answers benefits many applications, including:

  1. It helps detect the knowledge scope of LLMs more faithfully. Many close-ended hallucination evaluation benchmarks face the danger of data contamination [3, 4]. Semi-open-ended questions are easy to design and correspond to a large number of undocumented answers;
  2. Flagging ambiguous answers with higher uncertainty enhances the LLM outputs [5, 6];
  3. Identifying ambiguous answers helps achieve selective retrieval that augments LLM with indispensable external information while reducing the distraction of irrelevant data [7, 8, 9].
  4. It helps align LLMs for more honest generation by teaching the LLM to admit its knowledge limits on knowledge it is unfamiliar with (ambiguous answers) [10, 11, 12].

We also find that, in real life, semi-open-ended questions are quite common, indicating that the potential impact of our work is large. To estimate the proportion of semi-open-ended questions, we randomly sample 1k questions from an open-source corpus and find that 33.6% of them are semi-open-ended.

Q3: Detailed analysis of types of errors

Following your suggestion, we categorize different types of errors made by our target LLM and analyze their causes according to evaluation results.

| Error Type | Factual Inconsistency | Factual Fabrication |
| --- | --- | --- |
| Ratio | 91.45% | 8.55% |

Following [1], we distinguish different types of hallucinations:

  1. Factual inconsistency takes up 91.45% of errors for semi-open-ended questions. It happens when the answer can be grounded in real-world information, but mismatches certain requirements in the question.
  2. Factual fabrication accounts for 8.55% of errors; it occurs when the answer is unverifiable from public sources.

Besides, we find that 86.15% of the evaluated answers meet some of the requirements in the question while failing to satisfy the rest. This may be because some conditions in the question overshadow others, leading to unqualified answers [13].

Our focus is the detection of LLMs' knowledge limitations. Logical errors are usually observed in reasoning tasks, which are not the primary focus of this work. We will study logical errors in QA tasks in future work.

Q4: How to ensure the reliability and consistency of human annotations?

We hired 11 annotators with Master's degrees, and provided clear evaluation guidelines (see Line 563) and feedback to human annotators during the evaluation process, thereby ensuring the reliability of the annotations across the entire dataset. Finally, we cross-check the annotation results.

References

[1] A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

[2] Knowing What LLMs DO NOT Know: A Simple Yet Effective Self-Detection Method

[3] An Open Source Data Contamination Report for Large Language Models

[4] Investigating Data Contamination in Modern Benchmarks for Large Language Models

[5] How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering

[6] How to Approach Ambiguous Queries in Conversational Search: A Survey of Techniques, Approaches, Tools, and Challenges

[7] Self-Knowledge Guided Retrieval Augmentation for Large Language Models

[8] When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

[9] REPOFORMER: Selective retrieval for repository-level code completion

[10] Alignment for Honesty

[11] Learn to Refuse: Making Large Language Models More Controllable and Reliable through Knowledge Scope Limitation and Refusal Mechanism

[12] Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback

[13] Knowledge Overshadowing Causes Amalgamated Hallucination in Large Language Models

Comment

Dear Reviewer frNN,

We wish to express our sincere gratitude for your invaluable feedback! We kindly request that you review our responses to your observations and consider revising your assessments accordingly.

We believe that our explanations and additional experiments have thoroughly addressed your queries and concerns. Should you have any additional questions or require further clarification, please feel free to contact us. Your final evaluation and potential score update would be greatly appreciated and would significantly contribute to the refinement of our work.

Thank you for your dedication and thoughtful reviews. We look forward to your final ratings.

Best regards,

Paper 10289 Authors

Comment

The responses have addressed my concerns. I have raised my score to 6.

Comment

Thank you for increasing your rating from 3 to 6!

Thank you again for your valuable time, insightful suggestions, and encouragement! We appreciate your recognition that our method is "innovative and well-motivated", "experiments show significant improvements", and our paper is "well-written and easy to follow". It is our great honor to receive your support during the discussion phase.

Best regards,

Submission 10289 Authors

Review (Rating: 7)

This paper focuses on detecting the "knowledge boundary" of current large language models (LLMs), which would be helpful in handling the well-known hallucination problem in LLMs. The authors explore a new question answering setting (i.e., semi-open-ended questions). They employ an LLM-based approach to construct semi-open-ended questions and obtain answers from a target LLM, and use an open-source LLM, whose parameters/variables are flexible to edit, to detect the black-box LLM's drawbacks. The proposed method of calculating the nearest semantic representation to select related answers seems interesting. Finally, the paper not only constructs a dataset to find the knowledge boundary for GPT-4, but also discovers that 82.90% of GPT-4's answers are unsatisfactory. It also finds that 40.15% of the hard (ambiguous) answers it generates are unqualified.

Strengths

  1. The proposed method sounds novel for handling the knowledge boundary discovery task. The method utilizes an open-source Large Language Model (LLM) to aid a black-box LLM, GPT-4, in identifying its knowledge limitations. They pinpoint words with low probabilities according to the model variables of the open-source LLM. Subtle modifications to the output-layer representations facilitate the extraction of the most analogous answers for LLMs.

  2. The proposed semi-open-ended question task sounds important to explore. The task is more challenging than current QA tasks, since the number of candidate answers and the answer space (and possibly the correct answers) are not fixed and deterministic.

  3. The paper writing is quite clear and the paper organization is easy to follow.

  4. This paper also falls into a popular and important direction. It tries to mitigate hallucination in LLMs in another way: rather than detecting hallucination cases one by one, this paper aims to discover the knowledge boundary of a given LLM (GPT-4).

  5. The experiments are solid enough. They include multiple base models (LLaMA at different sizes) and multiple evaluation metrics. The overall performance evaluation also includes case studies.

  6. As GPT-4 is quite powerful, detecting the knowledge boundary of GPT-4 and finding its shortcomings is not an easy task. The outcome of this paper is quite attractive, since it discovers that 82.90% of GPT-4's answers are unsatisfactory using a simple LLaMA-2 model.

Weaknesses

  1. The proposed model works well at detecting the knowledge boundary in the setting of semi-open-ended questions. Even if that setting makes sense, (1) I am not sure how many questions in real applications are "semi-open-ended questions", and (2) how effective is the proposed method on normal QA tasks (e.g., multiple-choice tests)? Could you please provide some insights on the proposed model's potential strengths on normal QA tasks?

  2. The model names used in the paper should be consistent. For example, the authors use both "LLaMA-2-13b" (in line 252) and "LLaMA-2-13B" (Table 1) in the paper.

  3. Typos: In line 252, it should be "we use two LLaMA-2-13b models"; the authors omitted the word "models". In Table 1, the title of the second column should be "Auxiliary Model" instead of "Auxiliary Model Size", since LLaMA-2-13B is a name rather than a size.

  4. What is the meaning of the underlined results in Table 3? The authors should make sure that every mark or notation in the paper is well described.

Questions

  1. Though the proposed model shows its effectiveness in the semi-open-ended question setting, I am wondering about potential future work on normal QA tasks (e.g., multiple-choice tests). Could you please provide some insights on the proposed model's potential strengths on normal QA tasks?

Limitations

The authors' limitation section indeed addressed some concerns about this paper, which is fine with me.

Author Response

Thanks for your thoughtful and encouraging feedback! We have carefully incorporated your suggestions and provided explanations as follows, aiming to enhance the quality and robustness of our research.

Q1: How many questions in the real applications belong to "semi-open-ended questions"?

Following your suggestion, we find that in real life, semi-open-ended questions are quite common. We extract all question statements from an open-source general-domain corpus, OpenWebText [1], and randomly sample 1k questions. We find that approximately 33.6% of the questions are semi-open-ended, indicating that the research problem we are addressing has strong practical significance. We distinguish semi-open-ended questions by querying ChatGPT, as sketched below.
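A minimal sketch of how such a screening query could be assembled, under stated assumptions: the template wording and function name are illustrative and are not the exact prompt used in the paper.

```python
# Hedged sketch of the ChatGPT-based screening step described above.
# The prompt wording below is an assumption; the paper does not quote its exact prompt here.
CLASSIFY_TEMPLATE = (
    "A semi-open-ended question has many possible correct answers that are hard to "
    "enumerate fully, yet is still constrained by explicit conditions in the question.\n"
    "Question: {question}\n"
    "Is this question semi-open-ended? Answer 'yes' or 'no' only."
)

def build_screening_prompt(question: str) -> str:
    """Format one sampled question for the semi-open-ended screening query."""
    return CLASSIFY_TEMPLATE.format(question=question)
```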

Q2: Insights on the proposed model's potential strengths on normal QA tasks.

On normal QA tasks, our method can help discover highly delusive wrong answers by reducing the probability of the ground truth answer. By comparing them with the ground truth, we may explain the factual hallucination problems at a more granular level. Besides, they can be used to construct a more challenging benchmark for normal QA tasks by including more delusive wrong answers.

Our approach of modifying the LLM representations to guide answer generation may provide insight for different kinds of normal QA tasks:

  1. It may help alleviate the hallucinations in knowledge-extensive QA tasks via representation engineering.
  2. Editing LLM representations considering existing answers can reduce the probability of semantically related words, helping to generate more diverse answers for open-ended QA tasks.

Besides, ambiguous answers found by our approach benefit QA systems in many ways, including:

  1. Flagging ambiguous answers with higher uncertainty enhances the LLM-based QA systems [2, 3];
  2. Identifying ambiguous answers helps achieve selective retrieval that augments LLM-based QA systems with indispensable external information while reducing the distraction of irrelevant data [4, 5, 6].

We will discuss these potential research directions in our revised paper and explore them in our future works.

Q3: Inconsistent model name & grammatical errors.

Thank you for your careful reading! We will unify the naming of model names and resolve all grammatical errors in the revised version.

Q4: Meaning of the underlined results in Table 3.

The underlined results are either incorrect or unverifiable according to the ground truth, belonging to "Unqualified answers" in our categorization. We will explain the underlined results in more detail in both the caption of Table 3 and the main text to make it easier to understand.

References

[1] Aaron Gokaslan and Vanya Cohen. 2019. OpenWebText Corpus.

[2] How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering

[3] How to Approach Ambiguous Queries in Conversational Search: A Survey of Techniques, Approaches, Tools, and Challenges

[4] Self-Knowledge Guided Retrieval Augmentation for Large Language Models

[5] When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

[6] REPOFORMER: Selective retrieval for repository-level code completion

Comment

The authors have provided detailed responses to the reviews and questions, which make sense to me. I will keep my positive score.

Review (Rating: 8)

In this submission, the authors aim to explore the detection of large language models' (LLMs) knowledge boundary, which is a borderline that tells us what LLMs really know. The detection of the knowledge boundary would play a crucial role in helping researchers deal with hallucination. Different from the widely-used QA setting, the paper focuses on another QA setting (semi-open-ended questions), which may be ignored by other research but seems common in real life. To detect the knowledge boundary, the authors use a small-scale open-source model (LLaMA) to help detect GPT-4's boundary. This paper has some interesting findings; for example, GPT-4 underperforms the auxiliary model, LLaMA-2-13B, in 50% of cases.

Strengths

  1. The authors explore a new question answering setting (i.e., semi-open-ended questions). This setting is unexplored in QA tasks around LLMs, and it seems quite interesting and useful for LLMs.

  2. Detecting the knowledge boundary is a promising and crucial direction to achieve better usage of LLMs and deal with the notorious hallucination problem.

  3. The proposed idea makes sense and is novel enough: it uses an open-source LLM to assist the black-box LLM (GPT-4) in detecting GPT-4's knowledge boundary by accessing the probability distribution and finding the low-probability words. The minor change to the output hidden representations helps obtain the most similar answers for LLMs.

  4. The paper writing is quite clear to follow with detailed graphical descriptions (the overview figure).

  5. The proposed method seems effective in detecting GPT-4's knowledge boundary. It finds that in some (50%) cases, GPT-4 underperforms the auxiliary model, LLaMA-2-13B.

Weaknesses

  1. The setting of "semi-open-ended questions" is quite interesting and seems useful and common in real life. However, is the term "semi-open-ended questions" widely used in the QA research area? Although I am not so familiar with QA and its terminology, I suggest the authors conduct a full survey of the related terms and explain more about the meaning of "semi-open-ended questions" in the next version.

  2. The proposed method works well on "semi-open-ended questions". It would be better to give a detailed statistic (e.g., a questionnaire), an empirical study, or related analyses about "the proportion of semi-open-ended questions out of the whole QA scenario". This would show the practical use of the proposed method in real life.

  3. Could you please illustrate the full version of the cases in Table 6 and Table 7 (Some cases are omitted due to the space limit)?

Questions

  1. As mentioned in Weaknesses, could you please give a full version of the cases in Tables 6 and 7?

Limitations

Yes, the limitation section addressed by the authors seems fine with me.

Author Response

We are immensely grateful for your insightful and positive comments. We have addressed each point with careful consideration to ensure that our findings are presented with greater precision and rigor.

Q1: Explanation regarding the meaning of "semi-open-ended questions"

Thank you for your insightful comment and suggestion. Based on our research, we have not found evidence of the term "semi-open-ended questions" being widely used in the QA research area. These questions, though having multiple potential correct answers, are challenging to fully enumerate. For example, when asked to "Tell me about some exercise habits that are easy to overlook but are good for your health," there are many correct answers, yet the question remains constrained by the context of 'exercise habits', 'easy to overlook' and 'good for your health'.

In the next version, we will provide a more detailed explanation of the term "semi-open-ended questions" including diverse examples to facilitate a comprehensive understanding of our study.

Q2: The proportion of semi-open-ended questions out of the whole QA scenario

Following your suggestion, we find that in real life, semi-open-ended questions are quite common. We extract all question statements from an open-source general-domain corpus OpenWebText [1] and randomly sample 1k questions. We conduct statistics to find that approximately 33.6% of the questions are semi-open-ended, indicating that the research problem we are addressing has strong practical significance. We distinguish semi-open-ended questions by querying ChatGPT.

Q3: The full version of the cases in Table 6 and Table 7

Following your suggestion, we demonstrate the full cases in Table 6 and Table 7 of our paper as follows. We will provide the full version in our next version.

Case 1

Semi-open-ended Question: Tell me a list of foods that are rich in Vitamin A but low in fat.

GPT-4 Response: 1. Carrots 2. Spinach 3. Sweet Potato 4. Kale 5. Butternut Squash 6. Red Bell Peppers 7. Apricots 8. Broccoli 9. Romaine Lettuce 10. Cantaloupe 11. Yellow Corn 12. Tangerines 13. Tomatoes 14. Papaya 15. Peaches 16. Mangoes 17. Watermelon 18. Green Peas 19. Kiwi 20. Honeydew Melon 21. Asparagus 22. Black-Eyed Peas 23. Plums 24. Pineapple 25. Blueberries 26. Grapefruit 27. Oranges 28. Green beans 29. Brussel sprouts 30. Guava 31. Dandelion Greens 32. Swiss Chard 33. Mustard Greens 34. Arctic Char 35. Prunes 36. Nectarines 37. Rose Hips 38. Parsley 39. Watercress 40. Radish 41. Red Cabbage 42. Beet Greens 43. Endives 44. Arugula 45. Pomegranates 46. Raspberries 47. Red Leaf Lettuce 48. Pumpkin 49. Bok Choy 50. Turnip Greens 51. Red Grapes 52. Passionfruit 53. Collard Greens 54. Zucchini 55. Strawberries 56. Clementines 57. Lima Beans 58. Cucumbers 59. Yellow Bell Peppers 60. Pawpaw 61. Winter squash 62. Cantaloupes 63. Tuna 64. Persimmon 65. Pink Grapefruit 66. Honeydew melon 67. Guavas 68. Papayas 69. Passion fruit 70. Kiwi fruit 71. Blackberries 72. Cherries 73. Goji berries 74. Raspberry 75. Fenugreek leaves 76. Okra 77. Chayote 78. Eggplant.

Auxiliary Model Response: 1. Bell peppers 2. Liver 3. Collards 4. Carrots 5. Spinach 6. Yogurt 7. Salmon 8. Whole grains 9. Fortified cereals 10. Clams 11. Milk 12. Shrimp 13. Whole eggs 14. Meat such as beef liver 15. Plantains 16. Fresh herbs like basil, parsley and dill 17. Cheese (low-fat varieties) 18. Vegetable oils such as soybean oil or corn oil

Case 2

Semi-open-ended Question: Tell me a list of world capitals where more than two languages are considered official.

GPT-4 Response: 1. London, United Kingdom 2. Montreal, Canada 3. Brussels, Belgium 4. Ottawa, Canada 5. Bern, Switzerland 6. Johannesburg, South Africa 7. Suva, Fiji 8. Biel/Bienne, Switzerland 9. Singapore, Singapore 10. Luxembourg City, Luxembourg 11. New Delhi, India 12. Addis Ababa, Ethiopia 13. Nairobi, Kenya 14. Helsinki, Finland 15. Vienna, Austria 16. Maputo, Mozambique 17. Oslo, Norway 18. Reykjavik, Iceland 19. Valletta, Malta 20. Nicosia, Cyprus 21. Dublin, Ireland 22. Vaduz, Liechtenstein 23. Tirana, Albania 24. Andorra la Vella, Andorra 25. Baku, Azerbaijan 26. Manama, Bahrain 27. Dhaka, Bangladesh 28. Saraevo, Bosnia and Herzegovina 29. Sofia, Bulgaria 30. Copenhagen, Denmark 31. Tallinn, Estonia 32. Tbilisi, Georgia 33. Baghdad, Iraq 34. Jerusalem, Israel 35. Astana, Kazakhstan 36. Beirut, Lebanon 37. Tripoli, Libya 38. Kuala Lumpur, Malaysia 39. Abuja, Nigeria 40. Islamabad, Pakistan 41. Doha, Qatar 42. Bucharest, Romania 43. Riyadh, Saudi Arabia 44. Madrid, Spain 45. Lome, Togo 46. Ankara, Turkey 47. Kiev, Ukraine 48. Dubai, United Arab Emirates 49. Canberra, Australia 50. Praia, Cape Verde 51. N'Djamena, Chad 52. Zagreb, Croatia 53. Prague, Czech Republic 54. Beijing, China 55. San Jose, Costa Rica 56. Santo Domingo, Dominican Republic 57. Quito, Ecuador 58. Cairo, Egypt 59. Port Vila, Vanuatu 60. Windhoek, Namibia 61. Kathmandu, Nepal 62. Wellington, New Zealand 63. Asuncion, Paraguay 64. Lisbon, Portugal 65. Moscow, Russia 66. Dakar, Senegal 67. Ljubljana, Slovenia

Auxiliary Model Response: 1. Brussels, Belgium 2. Ottawa, Canada 3. Panama City, Panama 4. Lima, Peru 5. Toronto, Canada 6. Paris, France 7. Hong Kong, China 8. Moskva, Russia 9. Rome, Italy 10. Malabo, Equatorial Guinea 11. Stockholm, Sweden 12. Athens, Greece 13. Tokyo, Japan 14. Jakarta, Indonesia 15. Mexico City, Mexico 16. Bangkok, Thailand 17. São Paulo, Brazil 18. Buenos Aires, Argentina 19. Caracas, Venezuela 20. Bogota, Colombia

References

[1] Aaron Gokaslan and Vanya Cohen. 2019. OpenWebText Corpus.

Comment

Thanks for your responses. I have read the authors' rebuttal, which has solved my concerns.

After carefully considering the others' review comments and the author's rebuttal, I will keep my score.

Review (Rating: 3)

The paper presents a study of generating answers to questions from the tail of an LLM's distribution (GPT-4). They begin by constructing a dataset by generating questions with multiple answers from GPT-4. For each question, they continue decoding multiple answers and define the first 75% of generated answers as "common-sense answers" and the last 25% as "ambiguous answers". Their objective is then to develop methods for generating such "ambiguous answers" from GPT-4 more efficiently, without relying on prompting many times (as they did to construct their dataset) or on decoding parameters (which are not fully accessible for many API services).

Their approach relies on using an auxiliary language model (LLaMA) to generate low-probability answers by reducing the probability of existing answers. They evaluate their method for generating low-probability answers using their constructed dataset, evaluating its ability to recover "ambiguous answers" as defined above.

They conclude their work by performing an analysis of these tail answers ("ambiguous" answers in their dataset and answers generated using their method), using retrieval+GPT-4 to verify each of these tail answers for correctness. They find that roughly half of these answers were correct according to their system.

Strengths

This work develops a method for generating a large, diverse answer candidate set, which may be useful for a number of other tasks.

The authors perform several ablation experiments, demonstrating the efficacy of different components of their method and sensitivity to hyperparameters.

Weaknesses

Methods for generating a diverse set of answer candidates are evaluated against their ability to recover the set of tail answers from GPT-4. Analysis, however, demonstrated that roughly half of these tail answers are incorrect. Evaluations are therefore designed around generating answers that match GPT-4's tail distribution, including generating the same set of incorrect answers.

The evaluation dataset is quite small and results only demonstrate minor improvement without significance testing.

Many components of the work lack explanation. The authors reference "human annotation" for verifying answers; however, it is not clear how this is done, or what the exact instructions were. The checklist notes that full annotation instructions and compensation details were provided in the supplementary materials, but I did not find these. The retrieval-based system used for validating answers also lacks description, only noting that they used Microsoft Copilot to perform retrieval.

Questions

See the last point in weaknesses above.

Also, it's somewhat unclear to me how the LLaMA model is used as an auxiliary model for GPT-4. Is it auxiliary because GPT-4 can be used to validate the generated answers?

Limitations

See final note under weaknesses

Author Response

Thank you for your thoughtful review and valuable questions! We have incorporated the suggested experiment and provided important clarifications. We hope that we have addressed your concerns and resolved possible misunderstandings. Based on our clarifications, we would greatly appreciate it if you would reconsider the final evaluation of our work.

Q1: Clarification on the possible misunderstanding of our task

We kindly point out that the statement "Methods ... to recover the set of tail-answers from GPT-4 … roughly half of these tail-answers are incorrect" in your review is a misunderstanding. It should be clarified that "Methods ... are evaluated based on their ability to identify shortcomings (new ambiguous answers) in the target LLM."

Our task is not to "recover the set of tail-answers from GPT-4". Instead, our motivation is to identify LLMs' shortcomings by discovering more, different ambiguous answers on which LLMs tend to make mistakes (beyond the knowledge boundary of GPT-4); please refer to Lines 11, 39 and 61 of our paper for details. In the next version, we will describe our task more clearly to avoid such misunderstandings.

Q2: Unclear about how the auxiliary model is used

Intuitively, the auxiliary model aims to generate hard (ambiguous) answers that reveal GPT-4's shortcomings, i.e., answers on which GPT-4 easily makes mistakes (as discussed in Lines 71 and 181). That is, we employ the auxiliary model (LLaMA) to uncover many new ambiguous answers that GPT-4 struggles to produce. Our findings show that 50% of the new ambiguous answers found by the auxiliary model reflect the shortcomings of GPT-4 (see Sec 4.5). Specifically, we use the pseudo-inverse of the model embedding to estimate the nearest semantic representation of the existing answers (from GPT-4), then reduce their generation probability to produce new answers with lower probabilities (see Sec 3.3), as sketched below.
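A minimal PyTorch sketch of this representation-editing idea, under stated assumptions: the function name, the uniform vocabulary-space target, and the scaling factor `alpha` are illustrative choices, not the paper's exact formulation.

```python
import torch

def suppress_existing_answers(hidden, lm_head_weight, existing_token_ids, alpha=1.0):
    """Lower the probability of tokens that appear in already-generated answers.

    hidden:          (hidden_dim,) final-layer hidden state of the auxiliary LM
    lm_head_weight:  (vocab_size, hidden_dim) unembedding / LM-head weight
    """
    vocab_size = lm_head_weight.shape[0]
    # Vocabulary-space target that puts uniform mass on tokens from existing answers.
    target = torch.zeros(vocab_size, dtype=lm_head_weight.dtype)
    target[existing_token_ids] = 1.0 / len(existing_token_ids)
    # The Moore-Penrose pseudo-inverse maps this target back to hidden space,
    # giving an estimate of the nearest semantic representation of the existing answers.
    direction = torch.linalg.pinv(lm_head_weight) @ target   # (hidden_dim,)
    # Subtracting the direction pushes decoding away from those answers,
    # surfacing lower-probability (ambiguous) candidates.
    return hidden - alpha * direction
```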

Q3: Improvement is minor

Following your suggestion, we conducted a bootstrap significance test to compare our method with the baselines. We found that the improvement in the Average Overlap Ratio (AOR) is statistically significant under the t-test with p<0.05. This indicates that our method effectively reduces answer repetition without compromising overall performance.
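For reference, a paired bootstrap of this kind can be sketched as follows; this illustrates the general procedure on per-question metric values and is not the authors' exact test script.

```python
import numpy as np

def paired_bootstrap_pvalue(improvement, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap p-value for per-question improvements.

    improvement: array where each entry is (baseline - ours) for a metric that is
    better when lower (e.g., an overlap ratio), so positive means our method helps.
    """
    rng = np.random.default_rng(seed)
    improvement = np.asarray(improvement, dtype=float)
    n = len(improvement)
    # Resample questions with replacement and recompute the mean improvement.
    resample_means = improvement[rng.integers(0, n, size=(n_resamples, n))].mean(axis=1)
    # p-value: fraction of resamples in which the improvement vanishes or reverses.
    return float((resample_means <= 0).mean())
```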

The core empirical contribution of our work extends well beyond the results presented in Table 1 of our paper. A more significant contribution is identifying GPT-4's shortcomings (please see Line 85): Our findings reveal that GPT-4 yields unsatisfactory results in 82.9% of the questions. Furthermore, about 50% of the new ambiguous answers identified by our method fall outside the knowledge boundary of GPT-4 (see Sec 4.5).

Our performance has been acknowledged by all the other reviewers. Reviewer frNN praises our work, stating, "Experiments show significant improvements in understanding LLM knowledge boundaries", Reviewer hUqe comments, "The experiments are solid enough", and Reviewer bcni observes, "The performance of the proposed methods seems effective in detecting GPT-4’s knowledge boundary."

Q4: The dataset is quite small

Our dataset is comparable in size to many other hallucination evaluation datasets within the research community, as shown in the following table. It covers multiple domains and is highly effective at identifying the knowledge boundary of GPT-4. It successfully identifies shortcomings in GPT-4's responses for 82.9% of the questions, using only vanilla prompts.

| Dataset | TruthfulQA (Lin et al., 2022) | HaluQA (Cheng et al., 2023) | FreshQA (Vu et al., 2023) | FELM (Chen et al., 2023d) | ChineseFactEval (Wang et al., 2023a) |
| --- | --- | --- | --- | --- | --- |
| Size | 817 | 450 | 600 | 817 | 125 |

Constructing and expanding the dataset is expensive and time-consuming. Specifically, it takes a human annotator about 2 minutes to assess and double-check the truthfulness of each answer. Our dataset contains approximately 1k questions, each with an average of 13 tail-answers to be verified. Considering the requirement for annotators to read and assess the credibility of the retrieved information (following our annotation principles in Appendix A), the total workload is 435 hours. As we paid 8 dollars per person per hour, it cost us 3483 dollars to construct the dataset. Even if we hired more than 5 annotators, it would still take more than 72 hours (more than 7 days if they work 10 hours per day), exceeding the time limit of the rebuttal period.

Q5: Evaluation and human annotation guidance need a clearer explanation

We kindly point out that we have introduced the human annotation procedure in Appendix A and provided full guidelines in the file named "human guide.docx" in the supplementary material, as well as a reminder in Appendix G. Specifically, in human evaluation, we ask the annotators to read the judgments generated by the retrieval-based evaluation system (Microsoft Copilot), assess the authority of the retrieved information, and evaluate the degree of certainty of the tone of the Copilot judgments. Then, annotators categorize answers into correct, incorrect, and unverifiable (see Line 563 of our paper and the fourth note in "human guide.docx" in the supplementary material).

In the retrieval-based evaluation, we instruct a RAG system with well-designed instructions (please see details in Sec 3.4) to verify the truthfulness of each tail answer. Specifically, we concatenate each candidate answer with the question and prompt Microsoft Copilot to search online for related information, generate a summary, and make judgments (see Appendix G for the full instructions), as sketched below. In our revised paper, we will explain these components in the experimental setting to avoid such confusion.
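A minimal sketch of the prompt construction described above; the template wording is an assumption for illustration, while the actual instructions are given in the paper's Appendix G.

```python
# Hedged sketch of assembling a fact-checking query for the retrieval-based judge.
# The template text is illustrative, not the exact instructions from Appendix G.
VERIFY_TEMPLATE = (
    "Question: {question}\n"
    "Candidate answer: {answer}\n"
    "Search the web for evidence about this candidate answer, summarize the most "
    "relevant sources, and judge whether the answer satisfies every requirement of "
    "the question. Reply with one of: correct, incorrect, unverifiable."
)

def build_verification_prompt(question: str, answer: str) -> str:
    """Concatenate one candidate answer with its question for the RAG-based verifier."""
    return VERIFY_TEMPLATE.format(question=question, answer=answer)
```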

Comment

Dear Reviewer e3tW,

We wish to express our sincere gratitude for your invaluable feedback! We kindly request that you review our responses to your observations and consider revising your assessments accordingly.

We believe that our explanations and additional experiments have thoroughly addressed your queries and concerns. Should you have any additional questions or require further clarification, please feel free to contact us. Your final evaluation and potential score update would be greatly appreciated and would significantly contribute to the refinement of our work.

Thank you for your dedication and thoughtful reviews. We look forward to your final ratings.

Best regards,

Paper 10289 Authors

Comment

As the response phase draws to a close, we are happy to clarify anything if you have any remaining concerns or further advice. Your expertise and constructive criticism are invaluable to us, and we are keen to utilize the remaining time effectively to address any of your remaining or new questions.

We have received responses from Reviewers frNN and hUqe. Reviewer frNN recognizes the additional experiments and explanations and thinks they satisfy the reviewer's concerns; the reviewer raised the rating from 3 to 6. Reviewer hUqe is also satisfied with our rebuttal and keeps the positive rating (7).

Thank you once again for dedicating your valuable time to our paper!

Best regards,

Submission 10289 Authors

Comment

Dear Reviewer e3tW,

We are sincerely grateful for your thoughtful feedback and suggestions! We have taken your observations to heart and addressed all the concerns you raised. All other reviewers have responded positively to our rebuttals. Specifically, we are thankful to Reviewer frNN for reading our rebuttal and for increasing the rating from 3 to 6, which indicates our new experiments and explanations are quite satisfactory. Reviewer hUqe has also expressed satisfaction with our rebuttal and has decided to maintain the positive rating of 7. Furthermore, Reviewer bcni has noted: “After carefully considering the others' review comments and the author's rebuttal, I will keep my score” (8).

As the discussion period draws to a close, we sincerely invite you to review our responses and reconsider your assessments. Should you have any unresolved concerns, please do not hesitate to contact us.

Thank you once again for your dedication! We look forward to your final ratings!

Best regards,

Paper 10289 Authors

Final Decision

This paper studies the problem of identifying "knowledge boundaries" in semi-open-ended question answering settings. This problem has been studied for factual queries where the model either knows or doesn't know the answer and classic calibration methods can be used. However, in the more open-ended setting, a model may know some answers and be unaware of others. They construct a dataset of these cases by sampling from the LLM. They then augment the LLM's answers to these questions with an auxiliary model. This process uses an open-source LLM to generate additional low-probability candidates by subtracting off a matrix computed via a pseudoinverse from existing answers. They check the correctness of these answers using RAG. Finally, they verify whether these ambiguous answers are within or beyond the LLM's knowledge boundary: does the LLM generate correct answers, and does it successfully identify via self-evaluation which answers are correct or not?

Results show that the authors' approach discovers more ambiguous answers than baselines. Moreover, GPT-4 in general does so-so at the task of recognizing its knowledge boundary: it frequently generates bad answers, is not able to check them, and does not often stop from generating bad answers when continuously prompted.

Strengths

  • Interesting and timely problem

  • Generating the additional answers by subtracting off a pseudoinverse is a nice method for this

  • The overall experimental design is well-conceived and the results for GPT-4 are interesting

  • Most reviewers' concerns are addressed

Weaknesses

  • Applicability of the setting: I like the open-endedness, but I agree with bcni and hUqe that this set of questions is a bit narrow. It would be nice if there were a clearer picture here of how this could broaden to other settings.

  • The use of human annotation in 4.1 is confusing and not explained in depth.

  • Focus on LLM self-eval: this is a bit narrow and doesn't look at other possible techniques like those based on probabilistic calibration. It's not clear if this gives a precise sense of where a model's knowledge boundary is.