PaperHub
Overall rating: 4.3 / 10 · Rejected · 4 reviewers
Individual ratings: 6, 3, 3, 5 (min 3, max 6, std 1.3)
Confidence: 4.0 · Correctness: 2.8 · Contribution: 2.0 · Presentation: 2.5
ICLR 2025

Too Big to Fool: Resisting Deception in Language Models

Submitted: 2024-09-27 · Updated: 2025-02-05
TL;DR

This paper introduces a powerful evaluation method showing that larger language models are more resilient to misleading prompts and better at using truthful in-context hints.

Abstract

Keywords
Large Language Models · Evaluation · Misinformation · In-Context Learning · World Models · Reasoning

Reviews and Discussion

Review
Rating: 6

The paper investigates how LLM inference is influenced by information that is provided in-context when this information contradicts or supports the LLM's internal world model. The authors hypothesize that larger LLMs will be more robust to misleading in-context information. They conduct experiments by altering prompts from existing benchmarks to include hints or misleading information and measure how this affects the performance of models of varying sizes. The results support the authors' hypothesis. They also conduct a number of control experiments to rule out alternative explanations, such as that larger models ignore in-context information or simply memorize correct results from their training data.

In the context of multiple-choice question answering benchmarks, the authors convincingly demonstrate their claim with thorough control experiments. However, I feel that their results are limited by only considering this setting and that it is unclear how well the results would generalize to more realistic applications of LMs. For this reason, I weakly lean towards rejecting the paper but am willing to change my score if this concern is addressed.

Strengths

  • The paper includes well thought-out control experiments that rule out alternative explanations for the main results. I found the results on memorization particularly interesting.
  • Experiments are conducted on multiple families of LMs and on a large number of question-answering benchmarks.
  • The paper is well-written and easy to follow.

Weaknesses

Limited experimental setting.

In my opinion, the paper's main weakness is that the experimental setting is limited and not very realistic. Therefore it is unclear how universal their results are. The paper only considers multiple-choice question answering benchmarks. Prompts are made to be "deceptive" by simply adding text that claims a wrong answer is correct, which is not very realistic.

The results could be made more convincing by considering a broader experimental setting. For example, the paper could conduct similar experiments on math or coding benchmarks. This is my main concern and if it were addressed I would raise my score.

It is questionable if the evaluation method is a novel contribution.

In the introduction, the paper claims that their methodology of applying small changes to task definitions is a novel contribution. This is questionable, as several previous works, including some of the papers cited on lines 99 and 100, evaluate the robustness of model answers under perturbations. To me it seems like a stretch by the authors to claim that they have introduced a new methodology by adding specific sentences to prompts from existing benchmarks.

Questions

  • Have you considered different settings from multiple-choice question answering? If not, why not?

  • How do your results on memorization relate to the existing literature on memorization in LLMs and data contamination in evaluations? It would be good if this was mentioned in the background section.

Comment

We truly appreciate your insightful feedback on our work. We are pleased to learn that you found our control experiments well thought-out and that the results on memorization were particularly interesting. Your recognition of our efforts to conduct experiments across multiple families of language models and a large number of question-answering benchmarks is greatly appreciated. We are also glad that you found the paper well-written and easy to follow.

We aim to address the concerns raised and appreciate the opportunity to elaborate further on these aspects:


The experimental setting is limited to multiple-choice questions

Our decision to focus initially on multiple-choice question-answering benchmarks was rooted in the specific objectives of our study. Multiple-choice formats provide a controlled environment where we can systematically measure performance using clear, objective metrics like accuracy, accuracy drop, and relative accuracy drop.
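
For concreteness, here is a minimal sketch of how metrics of this kind are typically computed; the function names and exact definitions below are assumptions and may differ from the formulas used in the paper.

```python
# Minimal sketch of the metrics named above; the paper's exact definitions
# may differ slightly, so treat these as illustrative.

def accuracy(predictions, gold):
    """Fraction of questions answered correctly."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def accuracy_drop(acc_clean, acc_deceptive):
    """Absolute accuracy lost once a misleading hint is added to the prompt."""
    return acc_clean - acc_deceptive

def relative_accuracy_drop(acc_clean, acc_deceptive):
    """Drop normalized by clean accuracy, so models with different baseline
    performance can be compared on the same scale."""
    return (acc_clean - acc_deceptive) / acc_clean

# Example: a model at 90% clean accuracy that falls to 72% under deception
print(accuracy_drop(0.90, 0.72))           # ~0.18
print(relative_accuracy_drop(0.90, 0.72))  # ~0.20
```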

Evaluating open-ended tasks, on the other hand, introduces significant challenges due to the subjective nature of potential answers and the difficulty in establishing objective evaluation metrics. Common generative metrics like BLEU or ROUGE focus on surface-level n-gram overlaps, which may not accurately reflect the correctness or appropriateness of a response, especially when assessing a model's resilience to deception. For instance, a model might generate a syntactically correct but factually incorrect answer, and these metrics would not adequately penalize this. Human evaluation is another approach, but it is resource-intensive, and we couldn't pursue it.

With this in mind, we agree that expanding our experimental setting to encompass a broader range of tasks would strengthen our findings. In response to your suggestion, we will extend our experiments to include some generative open-ended benchmarks in the revision of our paper.


Prompts are made to be "deceptive" by simply adding text that claims a wrong answer is correct, which is not very realistic

We recognize that the deceptive prompts in our study are intentionally simplified to ensure scalability and maintain a controlled experimental environment across large datasets. Our primary objective was to empirically determine whether larger models are more resilient to misinformation. To the best of our knowledge, our study is the first to empirically demonstrate this emergent property.

Crafting customized misleading hints for each prompt would have been impractical due to the extensive manual effort required. Additionally, generating such content using sophisticated LLMs would have necessitated thorough validation for quality and authenticity as true misinformation, which would have required significant additional manual effort. By standardizing the way we introduced deception, we were able to effectively isolate and analyze how models integrate in-context information with their internal knowledge at scale, specifically across large datasets.
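
To make the standardized deception concrete, the sketch below shows one hypothetical way a misleading hint can be appended to a multiple-choice prompt; the template sentence and prompt layout are assumptions rather than the paper's exact prompt.

```python
import random

# Hypothetical sketch of the standardized deception: a single template
# sentence asserting that a randomly chosen wrong option is correct.
# The exact wording and formatting used in the paper are assumptions here.

def add_deceptive_hint(question, choices, correct_label, rng=random):
    wrong_labels = [label for label in choices if label != correct_label]
    misleading = rng.choice(wrong_labels)
    lines = [question]
    lines += [f"{label}. {text}" for label, text in choices.items()]
    lines.append(f"Hint: The correct answer is {misleading}.")
    lines.append("Answer:")
    return "\n".join(lines)

print(add_deceptive_hint(
    "Which of these materials conducts electricity?",
    {"A": "Rubber", "B": "Copper", "C": "Wood", "D": "Glass"},
    correct_label="B",
))
```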


It is questionable if the evaluation method is a novel contribution

Our methodology introduces specific innovations, such as prompt unification and the design of targeted interventions, which enable more precise testing of model behaviour. These novel aspects contribute valuable insights by refining the evaluation process and offering a clearer understanding of model performance under various conditions. However, we acknowledge that the methodology as a whole may not warrant being emphasized as a primary contribution. Therefore, we have removed this claim from the introduction to present a more balanced representation of our work.



In conclusion, we truly value your review. We hope that the revisions and clarifications will improve your overall opinion of the work. If we have managed to resolve your principal concerns and questions, we would be grateful if you would consider raising your score. On the other hand, if there are remaining issues or questions on your mind, we are more than willing to address them.

PS. We will upload the latest version, which includes open-ended generative benchmarks and an expanded literature review on memorization.

Comment

Thank you for the rebuttal.

Unfortunately, I feel that my main concern has not been alleviated, as the promised additional experiments are not included in the current paper. I appreciate that a more open-ended setting introduces challenges regarding the evaluation of responses, but this does not change the fact that the paper's contribution is currently limited. Regarding concerns about the accuracy of metrics like BLEU and ROUGE, the authors could use open-ended generation tasks with more objective evaluation criteria, such as math or coding benchmarks.

Further, I disagree with the claim in the main response that this work is the first to confirm that larger language models are less affected by misinformation. This is also shown in Table 4 of Xu et al. (2023) [1].

In my view, the authors tackle an important topic, and if more general experiments are added this work could be an important contribution to the study of deception in LLMs. However, in its current state, I do not feel comfortable with raising my score.

[1] https://arxiv.org/abs/2312.09085v5

Comment

Thank you very much for highlighting your remaining concerns. We sincerely appreciate your engagement and your perspective on the potential impact of our research.

Regarding the promised additional experiments, we have been working on incorporating these open-ended generation tasks, and we have now included them in the latest version, specifically in the new Appendix G.

Concerning our claim about misinformation robustness, thank you for pointing out the findings in Xu et al. (2023). We will include this reference in the background section to ensure our work is accurately situated within the context of prior research.


We hope these updates address your concerns and look forward to any further feedback you may have. If you feel that our revisions have satisfactorily addressed your comments, we would greatly appreciate your consideration of improving our score.

Comment

Thank you for your response. The additional content in the newest update has addressed my concerns to some degree and I have raised my score. I am only raising it to a 6 for the following reasons:

  • The new results for open-ended tasks were only added for the Directive Instruction experiment, where the authors instruct the model to output either double the correct answer or 0. No open-ended generation experiments were conducted for the experiments from 4.1 or 4.3. I would also suggest that the authors include the prompt template for this experiment, though this does not affect my score.
  • Considering comments by other reviewers, I agree that there is too much vagueness in the discussion of world models and memorization. For memorization, I think the results mainly concern contamination of the training data, in the sense of benchmark data that was exactly present in the training data. However, the authors propose that they build on Hartmann et al.'s definition, which "extends beyond exact recall to include abstract facts present in a small subset of documents". Firstly, I do not understand what exactly the authors mean by this, as Hartmann et al. describe multiple types of memorization for different types of information. Regardless, I do not think the experiments support claims about any type of memorization except exact data contamination (memorization of verbatim text according to Hartmann et al.).
  • Regarding world models, the results provide evidence for the greater robustness of larger models in the sense that the models maintain their accuracy in spite of conflicting information. However, I do not see how this is evidence for a "compact, coherent, and interpretable" world model as opposed to sophisticated pattern matching. The experiments are conducted on instruct models which are trained to be helpful assistants. It may well be that they are trained to answer correctly even on questions or instructions that contain some conflicting or false information. In this case, it seems plausible that the higher robustness of larger models is simply the product of better pattern matching, "without forming a coherent or interpretable understanding of the data generation process".

I appreciate the authors' efforts in clarifying their claims and providing additional experiments. I believe the paper still lacks some experimental thoroughness and conceptual clarity. However, in light of the provided results, as well as importance and novelty of the topic, I am weakly endorsing it.

Comment

Thank you very much for your thoughtful feedback and for already increasing the score once. We genuinely appreciate the time you've taken to engage deeply with our paper.

To address your concerns, we have revised the paper in the latest submission. We respond to each point below:


The new results for open-ended tasks.

The results in the previous version were specifically focused on the Deception experiments, supporting Section 4.1. In our latest revision, we added the "Directive Instruction" experiments, which support Section 4.2. Thus, the open-ended generation experiments are conducted for both Deception and Directive Instruction tasks, each corresponding to Sections 4.1 and 4.2 respectively.

It is important to note that conducting the Context Removal experiment (to support Section 4.3) is uninformative, as removing the question results in the performance of all models collapsing to zero (as observed in our experiments). This outcome is expected because, unlike the scenario described in Section 4.3, the prompt does not contain any answer choices from which the model can infer task-relevant information. Consequently, the model's predictions revert to a random baseline, yielding performance close to zero under the exact match metric.
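
As a rough illustration of the exact-match evaluation referred to here, the sketch below shows why free-form generations rarely reproduce the gold answer string once the question is removed; the normalization step is an assumption and may differ from the evaluation script actually used.

```python
import re

# Rough sketch of an exact-match check; the normalization is an assumption
# and may differ from the actual evaluation code.

def normalize(text):
    return re.sub(r"\s+", " ", text.strip().lower())

def exact_match(prediction, gold):
    return int(normalize(prediction) == normalize(gold))

# With the question removed from the prompt, a free-form generation almost
# never reproduces the gold answer string, so accuracy collapses toward zero.
print(exact_match(" 42 ", "42"))            # 1
print(exact_match("I am not sure.", "42"))  # 0
```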


For memorization [...]

We acknowledge that our original framing was unclear, and we appreciate the opportunity to clarify this important aspect of our work. Specifically, we now state that our experiments focus primarily on assessing verbatim memorization to determine its impact on the models' resilience to misleading prompts. As previously stated in the paper, our findings are about data contamination (verbatim memorization). We realize that our wording in Section 2 did not accurately convey this definition, and accordingly, we have adjusted the corresponding paragraph.


Regarding world models [...]

Your insights directly reflect the open debate we highlighted in Appendix G, and we want to clarify our position further. In particular, your interpretation aligns closely with the school of thought that supports the sophisticated pattern matching hypothesis. Our paper aims to contribute to this debate by providing empirical evidence that supports the formation of internal world models in LLMs. We designed our experiments to test whether the observed robustness could be solely attributed to pattern matching or if it indicates a deeper level of understanding.

While these models are trained to be helpful and accurate, if their robustness were solely due to better pattern matching, we would expect them to reproduce misleading information from the prompt, as they would tend to prioritize recent and contextually relevant data. However, our experiments show that larger models prioritize their internal knowledge over misleading cues, integrating and reconciling new information with their existing knowledge to select the answer. To further explore whether the observed robustness could be attributed to sophisticated pattern matching, we introduced additional controls in our experiments (Sections 4.2 and 4.3) to minimize the impact of potential artifacts.



Once again, we appreciate your valuable comments, which have helped us improve the clarity and precision of our paper. We hope these efforts merit further positive consideration.

Review
Rating: 3

The paper explores the resilience of LLMs to deceptive in-context information, focusing on how models of different sizes respond to misleading prompts. The study uses multiple-choice benchmarks to compare model behavior, concluding that larger models show greater robustness against such deceptive inputs. The authors assert that this resilience is not merely due to ignoring in-context information or memorization but stems from the models' ability to integrate information with their internal "world models."

Strengths

  • The study addresses a pertinent issue in the deployment of LLMs—robustness against misleading or deceptive information. With the increasing use of LLMs across various applications, understanding their behavior in such scenarios is essential for ensuring reliability and safety, highlighting the relevance and timeliness of the study.
  • The authors present a clear and structured evaluation framework. By leveraging controlled experiments and multiple-choice benchmarks, they systematically analyze model behavior under different conditions, such as deception, guidance, and directive instructions. This rigorous approach enables consistent comparisons across models of varying scales and architectures, drawing attention to the core issue of robustness against deception.

Weaknesses

  • Insufficient Exploration of Underlying Mechanisms: The assertion that larger models possess more robust "world models" is a plausible hypothesis but remains inadequately supported by the experiments presented. While the control tests (e.g., context removal, directive instructions) attempt to eliminate explanations like memorization, they do not sufficiently clarify how models integrate conflicting information. To substantiate claims about the mechanisms underlying resilience, more detailed analyses, such as probing internal representations or employing causal interventions, are needed.

  • Limited Scope and Generalizability: Although the methodology is sound, the study's reliance on multiple-choice benchmarks constrains its scope. Controlled settings like these do not adequately reflect the complexity of real-world scenarios, where deceptive prompts are often subtler and more nuanced. This limitation undermines the generalizability of the findings, restricting their applicability beyond the specific benchmark tasks used in the study. Addressing such limitations by incorporating open-ended tasks or more diverse datasets would improve the practical relevance of the results. The exclusive focus on instruction-tuned models introduces potential biases, as the findings may not generalize to other model types, such as generative LLMs. A more comprehensive evaluation that includes models not optimized for following instructions would provide a better understanding of how scale impacts resilience across different architectures and training paradigms.

  • The authors examine robustness to deceptive information in relation to parameter scaling, but only present two models per family. While this approach is understandable, a more comprehensive study would be valuable to strengthen the findings.

Questions

  • At Line 414, the authors suggest that if a model has memorized associations between answers and questions, it might still achieve high accuracy even without the question. However, this claim is not backed by any citations or experimental evidence. Furthermore, the discussion around memorization in Lines 426-429 lacks clarity. The authors argue that if memorization were the primary factor, the "contaminated" model would maintain high accuracy even without the question, while the DCLM-7B model would perform at chance level. However, there is no verification of whether the Llama model is contaminated, making this comparison problematic. Additionally, terms like "reasoning" and "memorization" are not well-defined, which undermines the validity of the claims.

  • While the study demonstrates that larger LLMs are more resilient to deceptive in-context information, it fails to adequately explain the underlying reasons. The claim that larger models can "effectively reconcile conflicting information” is not convincingly proven, as the experiments do not provide deeper insights into the mechanisms that facilitate this behavior.

Comment

Thank you for your positive feedback on our paper. We are pleased that you found our study relevant in addressing LLM robustness against misleading information, a crucial issue as these models are increasingly deployed in critical applications. We appreciate your recognition of our structured evaluation framework and our insights.

We have carefully considered the concerns raised and would like to address them as follows:


The assertion that larger models have more robust "world models" is inadequately supported by the experiments.

We appreciate your insightful feedback and your interest in exploring the underlying mechanisms of how larger models integrate conflicting information. Your suggestion to employ methods such as probing internal representations or causal interventions is indeed valuable and represents an exciting direction for future research.

Our study primarily aims to identify what factors contribute to the improved resilience to misleading prompts in larger models by systematically ruling out alternative explanations. While our experiments demonstrate that larger models are less susceptible to deception and can effectively reconcile conflicting information, we acknowledge that we do not explore how this resilience occurs at the level of internal computational mechanisms.

In Appendix A, we have provided an initial qualitative analysis that offers insights into how reasoning patterns can be affected across different model sizes. We are also preparing additional results of this nature to further strengthen our findings and support the publication of this work.


The study's reliance on multiple-choice benchmarks constrains its scope.

Our decision to focus initially on multiple-choice question-answering benchmarks was rooted in the specific objectives of our study. Multiple-choice formats provide a controlled environment where we can systematically measure performance using clear, objective metrics like accuracy, accuracy drop, and relative accuracy drop.

Evaluating open-ended tasks, on the other hand, introduces significant challenges due to the subjective nature of potential answers and the difficulty in establishing objective evaluation metrics. Common generative metrics like BLEU or ROUGE focus on surface-level n-gram overlaps, which may not accurately reflect the correctness or appropriateness of a response, especially when assessing a model's resilience to deception. For instance, a model might generate a syntactically correct but factually incorrect answer, and these metrics would not adequately penalize this. Human evaluation is another approach, but it is resource-intensive, and we couldn't pursue it.
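
To illustrate the concern about surface-level overlap, here is a simplified ROUGE-1-style unigram F1 (not the actual BLEU/ROUGE implementations): a factually wrong answer that copies most of the reference wording can outscore a correct but differently phrased one.

```python
from collections import Counter

# Simplified ROUGE-1-style unigram F1, just to illustrate the point above:
# high surface overlap does not imply factual correctness.

def unigram_f1(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "The boiling point of water at sea level is 100 degrees Celsius."
correct   = "Water boils at 100 degrees Celsius when at sea level."
wrong     = "The boiling point of water at sea level is 80 degrees Celsius."

print(unigram_f1(correct, reference))  # lower overlap, but factually right
print(unigram_f1(wrong, reference))    # near-perfect overlap, factually wrong
```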

With this in mind, we agree that expanding our experimental setting to encompass a broader range of tasks would strengthen our findings. In response to your suggestion, we will extend our experiments to include some generative open-ended benchmarks in the revision of our paper.


Controlled settings like these do not adequately reflect the complexity of real-world scenarios.

We recognize that the deceptive prompts in our study are intentionally simplified to ensure scalability and maintain a controlled experimental environment across large datasets. Our primary objective was to empirically determine whether larger models are more resilient to misinformation. To the best of our knowledge, our study is the first to empirically demonstrate this emergent property.

Crafting customized misleading hints for each prompt would have been impractical due to the extensive manual effort required. Additionally, generating such content using sophisticated LLMs would have necessitated thorough validation for quality and authenticity as true misinformation, which would have required significant additional manual effort. By standardizing the way we introduced deception, we were able to effectively isolate and analyze how models integrate in-context information with their internal knowledge at scale, specifically across large datasets.

Comment

The authors examine robustness to deceptive information in relation to parameter scaling, but only present two models per family.

Our initial analysis was constrained by factors beyond our control. For Mistral, only two model sizes were publicly released, limiting our capacity to include more in our evaluation. Regarding Gemma, although a larger 27B model exists, we encountered significant issues where the model's performance did not align with reported results, likely due to bugs in the release. To maintain the integrity and reproducibility of our findings, we focused on the stable, reliable versions.

In the case of Llama, a larger 405B model was available. While we were unable to include it before the submission deadline due to computational resource limitations, we are committed to adding it in the final version of the paper. Similarly, for Phi, we initially excluded Phi-small to maintain consistency across families, as all others had two sizes available. We agree that including Phi-small could enhance the comprehensiveness of our study and are prepared to include it if it would address your concerns.



In closing, we deeply value your insightful feedback. We trust that the clarifications provided align with the conference's high standards. If our responses have successfully addressed your key concerns, we would greatly appreciate your consideration in revisiting the score assigned to our submission. Should any questions or uncertainties remain, we warmly welcome the opportunity to address them further.

Comment

Thank you for the detailed response and clarifications provided in your rebuttal. While I appreciate the efforts made to address the raised concerns, the key issues highlighted in my initial review remain unresolved.

Specifically:

  • The scope of the study is still limited, relying on controlled benchmarks that do not adequately reflect real-world complexities. Although the authors have acknowledged this limitation and proposed extending the study to open-ended tasks, these additions are yet to be implemented. Additionally, while I recognize the challenges and limitations of text-generation metrics, these scores are standard in the evaluation of open-ended tasks and should still be included. Their inclusion would provide a baseline comparison and enhance the study’s relevance, even if supplemented by qualitative or human evaluations in future iterations.
  • The study's generalizability is constrained by the limited number of models analyzed within each family. I acknowledge that including larger model sizes can be computationally expensive and that, for certain models like Mistral, only two sizes were available. However, the central hypothesis of the work is grounded in differences in behavior across scales, making a more granular exploration of model sizes necessary to substantiate the claims and increase the robustness of the findings.
  • As I suggested in my original review, some of these limitations could be balanced by a stronger understanding of the internal mechanisms that regulate deception, showing how these mechanisms also depend on model size. While I acknowledge that the authors have explored this direction in the appendix, it remains underexplored and does not sufficiently address the gap in understanding the causal relationship between scale and robustness.

Given these persisting limitations, I will retain my initial scores. I appreciate the authors’ willingness to incorporate feedback and encourage them to expand the scope, methodology, and depth of their analysis in future work.

That said, I believe this study addresses an important topic—enhancing our understanding of how deception in language models manifests and evolves across scales. This is a critical area of research, and I encourage the authors to continue working on this topic, addressing the noted limitations, as it holds great potential to advance the field.

Comment

We sincerely appreciate the time and effort you are putting into evaluating our work. We would like to address your remaining concerns as follows:

1. Scope of the Study and Evaluation Metrics: Regarding the benchmarks, we have added a new Appendix G to our latest revision. This appendix includes evaluations on two well-established open-ended generative benchmarks. We acknowledge that this addition is still limited, but we are planning to increase the scope further in the final version.

2. Limited Number of Models Analyzed: Given our constraints, we were unable to expand this aspect during the rebuttal period. We agree that a more extensive analysis would enhance the robustness of our findings, and we plan to address this.

3. Understanding Internal Mechanisms Regulating Deception: To address your third concern, we have extended our qualitative analyses in Appendix A. This section examines how misleading hints affect the generative behavior of LLMs. While this analysis is limited, it represents our best effort within the rebuttal period. We plan to further deepen this.



We hope that these revisions adequately address the key issues you have raised. Given the substantial updates made in response to your feedback, we kindly request that you consider reassessing your evaluation of our paper.

Comment

Thank you for your detailed rebuttal and the significant effort you have made to address the concerns raised in my initial review. I recognize and appreciate the updates provided. These additions demonstrate a clear commitment to improving the work based on reviewer feedback.

However, while these efforts are commendable, the key issues I outlined in my original review remain unresolved to a substantial degree, and in particular:

  • Open-Ended Benchmarks: The inclusion of GSM8K and MATH in Appendix G is a positive step, but the reliance on "exact match" as the primary metric is overly rigid and fails to capture partial correctness or semantic equivalence. Additionally, the less pronounced performance drop in larger models warrants further investigation, potentially with a larger delta in model sizes to better isolate trends.

  • Depth of Analysis Across Model Families: The central hypothesis regarding scale-dependent behavior requires a more granular analysis of model sizes within each family to convincingly substantiate the claims. The inability to include additional models is understandable given the constraints, but it significantly impacts the strength of the conclusions drawn.

Given these persisting limitations, my initial assessment remains unchanged. I continue to believe that the study addresses a highly relevant and important topic with potential for significant impact. However, the current submission does not sufficiently meet the bar for acceptance in its present form.

I encourage the authors to refine this work further, ensuring that these limitations are addressed in the main part of the paper in future iterations to fully unlock its potential impact on the field.

Review
Rating: 3

The paper has conducted thorough experimentation on LLMs deception behaviour. Authors also have compared the deception and comparison with smaller and larger language models.

Strengths

  1. The paper provides a comparison of deception between smaller and larger language models.

  2. The idea of providing misleading information to the LLMs and testing their behaviour is interesting.

  3. The paper presents a wide range of ideas and possibilities around deception and provides an analysis that is quite interesting.

  4. The datasets cover a wide range of domains, such as maths, science, and other question types.

Weaknesses

  1. The authors have used multiple-choice questions for the deception analysis; they could have used question-answering datasets where both the input and the output are free text rather than multiple choice.

  2. The experimentation and prompting do not seem well suited to the goal. Prompting an incorrect answer as a hint and checking whether the model still selects the correct option is not the right way of prompting. Instead, you could mislead the reasoning process, e.g. "a refrigerator does not conduct electricity or does not contain iron", rather than directly stating "The correct answer is A." Remember that LLMs arrive at answers by reasoning; if you really want to test deception, you have to misinform the process or the thinking rather than just assert a wrong answer.

  3. Everyone knows larger models are deceived less because they understand better. Moreover, if an LLM has the ability to think, or if any prompting technique is used, susceptibility to deception automatically decreases. O1 is less easily deceived than other models because it has the ability to think.

  4. Section 4.2 is not really an important section in my opinion. I do not think any LLM will ignore the text. They are machines and always take every input given by humans into account to produce a better response.

  5. In the experimentation, GPT models, which are really important models, are absent. The authors should also evaluate O1; if O1 is not affordable, they should at least evaluate GPT-4o or GPT-3.5.

Questions

  1. Why is a question-answering dataset that is not multiple choice not used?

  2. Why are the latest models, such as the GPT models, not evaluated? They are the most popular, and most readers would like to see their performance.

  3. You prompt the incorrect answer as a hint rather than targeting the thinking process. Why not prompt with incorrect thoughts or an incorrect way of solving the problem?

Comment

Thank you for your feedback on our paper. We appreciate the recognition of our comparative analysis of deception between smaller and larger language models, and the interest in our approach of providing misleading information to test model behavior. Additionally, we value the acknowledgment of the diverse range of ideas explored in our analysis, as well as the breadth of the datasets used.

While we appreciate the time you've invested in reviewing our paper, we feel it is necessary to address several fundamental misunderstandings in your assessment.


The authors have used multiple-choice questions for the deception analysis; they could have used question-answering datasets where both the input and the output are free text rather than multiple choice.

Our decision to focus initially on multiple-choice question-answering benchmarks was rooted in the specific objectives of our study. Multiple-choice formats provide a controlled environment where we can systematically measure performance using clear, objective metrics like accuracy, accuracy drop, and relative accuracy drop.

Evaluating open-ended tasks, on the other hand, introduces significant challenges due to the subjective nature of potential answers and the difficulty in establishing objective evaluation metrics. Common generative metrics like BLEU or ROUGE focus on surface-level n-gram overlaps, which may not accurately reflect the correctness or appropriateness of a response, especially when assessing a model's resilience to deception. For instance, a model might generate a syntactically correct but factually incorrect answer, and these metrics would not adequately penalize this. Human evaluation is another approach, but it is resource-intensive, and we couldn't pursue it.

With this in mind, we agree that expanding our experimental setting to encompass a broader range of tasks would strengthen our findings. In response to your suggestion, we will extend our experiments to include some generative open-ended benchmarks in the revision of our paper.


Prompting an incorrect answer as a hint and checking whether the model still selects the correct option is not the right way of prompting

We recognize that the deceptive prompts in our study are intentionally simplified to ensure scalability and maintain a controlled experimental environment across large datasets. Our primary objective was to empirically determine whether larger models are more resilient to misinformation. To the best of our knowledge, our study is the first to empirically demonstrate this emergent property.

Crafting customized misleading hints for each prompt would have been impractical due to the extensive manual effort required. Additionally, generating such content using sophisticated LLMs would have necessitated thorough validation for quality and authenticity as true misinformation, which would have required significant additional manual effort. By standardizing the way we introduced deception, we were able to effectively isolate and analyze how models integrate in-context information with their internal knowledge at scale, specifically across large datasets.


Everyone knows larger models are deceived less because they understand better

We respectfully note that the assertion that larger models "have the ability to think" is a topic of active debate within the community. While state-of-the-art LLMs exhibit impressive performance across various tasks, attributing human-like understanding or thinking to them is not universally accepted. Some researchers argue that LLMs primarily learn statistical correlations from training data without developing a deeper understanding of the data-generating processes. From this perspective, LLMs are essentially sophisticated pattern matchers. Conversely, other scholars suggest that by compressing extensive training data, LLMs develop more compact and coherent internal representations of the data's generative processes. However, even within this perspective, there is recognition that the models' understanding is limited and does not equate to genuine "thinking" as humans experience it. Our work lies within this school of thought, aiming to explore how such internal representations contribute to resilience against misinformation.

While it is expected that larger language models might be less affected by misinformation, our study is, to the best of our knowledge, the first to empirically demonstrate this property. Previous works have explored LLM vulnerabilities and scaling effects but were constrained by the performance limitations of earlier releases of language models, preventing empirical validation of the relationship between model size and resilience to misinformation.

We hope this clarifies our position and addresses your concerns. We believe our findings offer a meaningful contribution to the ongoing discourse on the capabilities and limitations of LLMs.

Comment

Section 4.2 is not really an important section in my opinion. I do not think any LLM will ignore the text. They are machines and always take every input given by humans into account to produce a better response.

The statement that LLMs "are machines and always take every input given by the humans for the better response" doesn't necessarily imply that all models handle in-context information in the same way. It is true that language models process all input provided to them, but our focus in Section 4.2 is not on whether the models process the input text at a surface level, but rather on how they weigh and integrate legitimate in-context information when generating responses.

This insight is crucial for understanding the underlying reason for the emergent property of resilience to deception, as we rule out alternative explanations for the observations in the paper. Other reviewers have also noted the value of this analysis, which suggests that this aspect of our work provides meaningful contributions to the field.

We hope this clarifies the importance of Section 4.2 and addresses your concerns.


GPT models, which are really important models, are absent from the experimentation. The authors should also evaluate O1.

We appreciate your suggestion and understand the value of evaluating widely used models like GPT. However, as mentioned in our paper, our objective was to understand how varying capacities within the same model family handle misleading in-context information. To achieve this, we require models with known parameter counts and training methodologies.

Proprietary models like GPT-3.5 and GPT-4 do not publicly disclose these critical details, making it difficult to include them in a study aimed at isolating the effects of model size. As Figures 2 and 3 in the paper demonstrate, performance metrics are depicted with respect to model parameter count. Without access to detailed information for proprietary models, including them would not provide meaningful insights.



In conclusion, we hope the clarifications and revisions have resolved your concerns. If your major questions and concerns have been addressed, we would appreciate it if you could support our work by increasing your score. If there are further questions or concerns, please let us know.

Review
Rating: 5

This paper investigates whether larger LLMs are more robust to misleading cues in prompts. Experiments across various models and benchmarks show that larger models exhibit smaller relative accuracy drops, demonstrating better resilience to deception in prompts. To eliminate the alternative hypothesis that the larger model ignores the appended misleading cues, additional tests were conducted: one added direct cues to the correct answer, and another instructed the model to avoid the correct answer. Both confirmed that resilience is not due to cue ignorance. Furthermore, in question-removal experiments on MMLU, the non-trivial accuracy (>>25%) for both overfitted and non-MMLU-trained models reveals that LLMs possess an advanced data processing capability (i.e., filling in missing details) beyond memorization. This possibly explains why LLMs can be resilient to misleading cues.

Strengths

  1. The paper provides valuable insights into the scaling effects on model resilience to misleading prompts, complementing existing LLM scaling studies.

  2. The paper is well-organized and clearly written.

  3. The main and additional experiments thoroughly validate the central hypothesis.

Weaknesses

  1. Marginal Contribution: Compared with previous works investigating LLM vulnerability to prompt manipulation, as explained in Section 2, it seems the most significant difference lies in the sole focus on parameter scaling effects, specifically 2B-70B models. It is expected that larger models, benefiting from refined training, demonstrate improved knowledge retention, instruction following, and in-context learning, following the emergent abilities analysis [1]. While the scaling investigation on LLM resilience to prompt deception is thorough, the insights presented do not significantly advance the current understanding, especially considering similar works (e.g., [2]) have conducted experiments across multi-sized models.

  2. Potential Overclaim Regarding "World Models": Throughout the paper, the authors repeatedly use the term "internal world model (of an LLM)", but as the author may notice in related work discussions, the "World Models" concept is only vaguely defined (and never formally defined in this paper), and whether LLM possesses "World Model" is still under debate, not to mention its "robustness". It is confusing to prove the robustness of loosely-defined concepts. The experiments and findings would remain valid without drawing connections to "world models," making this connection appear unnecessary and potentially misleading.

  3. Results for main experiments are not thoroughly analyzed: Although the study evaluates models and datasets extensively, it lacks in-depth dataset-wise and model-wise analysis. For example, it seems the dataset collection being evaluated is unbalanced -- some datasets have fewer answer choices or much fewer instances than others. This requires further justification for inclusion and dedicated analysis of experiment results. Also, when reporting average performance in Figures 2 and 3, it is not clear how the authors balance the results across datasets considering the imbalanced datasets, and the decreasing/increasing trend on some datasets (thin lines) looks not that significant or inconsistent with the average trend. In Appendix B and C, the inconsistency is better demonstrated (e.g., in MathQA) and there are no related discussions. Additionally, the grouping of models with differing architectures (e.g., Mistral-7B-Instruct vs. Mixtral-8×22B-Instruct) complicates the analysis of scaling effects. The distinct behavior of Gemma-series models is noted but not adequately explored.

  4. Confusing Analysis on Memorization Experiments: The authors interpret the above-chance-level performance as "LLMs can leverage their world models to fill in missing information" as the main conclusion for this section. However, this is not convincing, because above-chance performance does not necessarily imply human-like reasoning (i.e., "fill in missing information first and then answer"), especially given potential selection biases in multiple-choice questions [2]. The weak chance-level baseline further undermines this claim. Second, the discussion admits the challenge of isolating memorization but nonetheless asserts the presence of a robust inference capability, which is unconvincing. Moreover, the focus on "smaller models" (7B-level) in this section is inadequate given the paper's emphasis on scaling. The definition of "memorization effects" also appears to be overly narrow, failing to consider more nuanced forms of knowledge recall, such as paraphrased content or related factual knowledge from training data.

References:

[1] Wei, Jason, et al. "Emergent Abilities of Large Language Models." TMLR 2022.

[2] Alzahrani, Norah, et al. "When benchmarks are targets: Revealing the sensitivity of large language model leaderboards." ACL 2024.

Questions

  1. How do you obtain averaged relative accuracy drops across datasets? In particular, how do you deal with imbalanced datasets (e.g., GPQA only has 448 samples while MathQA has 37.2K samples)

  2. What is the interpretation of inconsistent accuracy drop change trends in individual datasets (e.g., MathQA)?

  3. How do you draw the conclusion that the model's ability to infer missing details "is not simply a byproduct of memorization" based on experiment results in Section 4.3? How does this result extrapolate to larger models as the scaling effect is the main evaluation dimension of this paper?

Comment

Thank you for your thoughtful and constructive feedback on our paper. We are delighted that you find our work provides valuable insights into the scaling effects, complementing existing studies on LLMs. We also appreciate your positive remarks regarding the clarity of our paper, as well as your recognition of how our main and additional experiments thoroughly validate our central hypothesis.

We have carefully considered the concerns raised and would like to address them as follows:


Potential Overclaim Regarding “World Models”

You raise an important point regarding the definition of the "world model" in the context of LLMs. We appreciate this observation, as the concept indeed carries some ambiguity that could lead to different interpretations. To address this, we aim to clarify how we define the world model in LLMs in our work, while also connecting it to the broader literature for a clearer understanding.

To understand world models, let's take a moment to look at two main hypotheses about what LLMs have actually learned:

Some researchers [1, 2] posit that LLMs primarily learn a vast collection of statistical correlations from their training data without forming a coherent and interpretable understanding of the data-generating processes. In this view, LLMs are sophisticated pattern-matchers that are highly effective at predicting the next word based on learned associations but lack deeper comprehension.

In contrast, other studies [3, …, 7] suggest that LLMs, through the process of compressing extensive training data, develop more compact, coherent, and interpretable models of the generative processes underlying the data, essentially forming an internal "world model". Such a model enables the agent to assess the probability of different elements and concepts, allowing it to determine what is more likely, plausible, or less probable within a given context [8].

For instance, Li et al. [3] demonstrated that LLMs can learn linear representations of spatial and temporal concepts, indicating that they encode structured knowledge about space and time within their internal representations. Another work [4] showed that transformers trained on next-token prediction for the game Othello develop explicit internal representations of the game state. Subsequently, [5] revealed that these representations are linear and interpretable, suggesting that the models internally capture the game's rules and state transitions.

Our work is grounded in the latter hypothesis: LLMs build internal world models that go beyond surface statistics. We recognize that providing a clear definition of "world model" and situating our work within the existing discourse is essential for clarity and precision. In light of your feedback, we incorporated this clarification into the revised submission in the new Appendix F and Section 2.

We agree that the experiments and findings stand on their own merit, independent of the term "world model". Nevertheless, we believe that discussing this concept is valuable because it aligns our work with ongoing efforts in the research community to explore and formalize how LLMs internally represent and process knowledge. Some of the authors of this work are researchers in RL, which has motivated our interest in contributing to the broader discourse on formalizing these internal structures in LLMs. If you feel this connection is unnecessary or misleading, we are open to removing it.


References

[1] Emily M Bender and Alexander Koller. Climbing towards NLU: On meaning, form, and understanding in the age of data. ACL 2020.

[2] Yonatan Bisk et al. Experience grounds language. ACL 2020.

[3] Wes Gurnee, Max Tegmark. Language Models Represent Space and Time. ICLR 2024.

[4] Kenneth Li et al. Emergent world representations: Exploring a sequence model trained on a synthetic task. ICLR 2022.

[5] Neel Nanda et al. Emergent linear representations in world models of self-supervised sequence models. BlackboxNLP @ EMNLP 2023.

[6] Belinda Z Li et al. Implicit representations of meaning in neural language models. ACL 2021.

[7] Roma Patel and Ellie Pavlick. Mapping language models to grounded conceptual spaces. ICLR 2021.

[8] Yann LeCun. A Path Towards Autonomous Machine Intelligence. 2022.

Comment

Marginal Contribution

We agree that it is generally anticipated, based on emergent abilities analysis, that larger language models might be less affected by misinformation. However, while this expectation exists, to the best of our knowledge, our study is the first to empirically demonstrate this emergent property.

Previous works, including those cited in Section 2 (which includes the paper [2] you cited), have explored LLM vulnerabilities and scaling effects. However, these studies faced performance limitations of earlier models that prevented them from empirically validating the specific relationship between model size and resilience to misinformation.

Specifically, earlier releases of LLMs did not perform well enough on the benchmarks we utilized to observe meaningful differences in resilience to deception. The overall low performance made it difficult to detect significant relative drops or improvements attributable to deceptive prompts.

Interestingly, as our work progressed, the release of newer models made these differences between various capacities much clearer. The most recent models have achieved sufficient proficiency on standard benchmarks, enabling us to conduct a detailed and controlled empirical analysis of how model scaling affects resilience to deceptive in-context information.


Results for main experiments are not thoroughly analyzed

We agree that conducting an in-depth dataset-wise and model-wise analysis would enhance the validity and interpretability of our findings. Our primary objective was to provide a broad evaluation across multiple models and datasets to identify general trends. To address the imbalance in dataset sizes and the number of answer choices, we averaged the results uniformly across all benchmarks. Specifically, each bold point in Figures 2 and 3 represents the mean of the Relative Accuracy Drop across benchmarks, treating all datasets equally regardless of their size or number of answer choices. We acknowledge that this approach may have led to some overlooked nuances.
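
Concretely, the uniform averaging described here corresponds to a macro-average over benchmarks. The sketch below assumes each benchmark contributes a single relative-accuracy-drop value; the drop values shown are made up for illustration.

```python
# Sketch of the uniform (macro) averaging described above, assuming each
# benchmark contributes one relative-accuracy-drop value regardless of its size.
# The drop values below are made up for illustration.

def macro_average(per_benchmark_drops):
    """Every benchmark gets equal weight: a 448-sample benchmark counts as
    much as a 37.2K-sample one."""
    return sum(per_benchmark_drops.values()) / len(per_benchmark_drops)

drops = {"MMLU": 0.12, "GPQA": 0.30, "MathQA": 0.05}
print(macro_average(drops))  # ~0.157
```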

We also recognize that certain datasets, such as MathQA, exhibit trends that differ from the average, and that grouping models with differing architectures can complicate the analysis of scaling effects. Due to time constraints during the rebuttal period, we were unable to include these detailed analyses. However, we are committed to thoroughly exploring these aspects in the camera-ready version of the paper, including discussions on dataset-specific trends and a more nuanced examination of model architectures, particularly focusing on the distinct behavior of the Gemma-series models.

We hope that our acknowledgment of these areas for improvement, along with our commitment to addressing them in the final version, will be taken into consideration during your re-evaluation.


Confusing Analysis on Memorization Experiments

First, regarding the interpretation of above-chance performance, we agree that our original statement suggesting that "LLMs can leverage their world models to fill in missing information" was overstated. We have changed this phrase in the revised submission.

Second, concerning the assertion of robust inference capabilities despite the challenge of isolating memorization effects, we acknowledge that our initial discussion may have been unconvincing. We have tempered our claims and now emphasize that, while we cannot entirely dismiss memorization, our findings suggest that factors beyond direct memorization contribute to the models' performance.

Third, regarding the focus on smaller models in the memorization experiments, we would like to clarify our approach. We overfitted Llama-8B to simulate data contamination that could occur in larger models, as mentioned in the paper, while using DCLM-7B as a control to compare against this contamination. This setup allowed us to explore the alternative hypothesis that the observed resilience of larger models to deceptive prompts might be attributed to memorization from data contamination.

Lastly, we recognize that our initial definition of "memorization effects" was overly narrow. We have not yet fully addressed this but plan to expand our definition and provide a more detailed discussion in future revisions.



To wrap up, we appreciate your insightful observations and have diligently applied them to improve our work. We hope that the revisions and elaborations meet the high standards expected by the conference. If we have adequately responded to your key concerns and questions, we kindly ask you to consider raising the score you have assigned to our submission. However, if you still have any additional queries or concerns, we invite you to share them with us.

Comment

Sorry for the late reply and I want to thank the authors for their detailed responses. My replies below:

  1. Regarding "World Models": The author presented a nice introduction to this ongoing debate and included some interesting papers to read. I really appreciate the authors' efforts on this part and definitely encourage the author to add them to your future versions of papers. (Minor: Li et al., should be [4] rather than [3])

  2. Regarding "Marginal Contributions": If I understand correctly, the main difference between this work and many previous works cited in the paper is that this paper evaluates some "newer models". While this might be true, however, I am not sure whether this warrants a contribution sufficient to meet the high standard of the venue.

The most recent models have achieved sufficient proficiency on standard benchmarks, enabling us to conduct a detailed and controlled empirical analysis of how model scaling affects resilience to deceptive in-context information.

I am not sure what the authors mean -- how does the good performance on standard benchmarks motivate the evaluation of the model performance on deceptive in-context information? I find it hard to understand the flow here. My interpretation is that the authors suggest recent models might be "overfitting" to standard benchmarks and lacking robustness, leading to over-optimization issues. However, this hypothesis is not novel—there are already many studies on robustness, out-of-domain generalization, and data contamination. 

3. Regarding thorough experiment analysis: I’m glad the authors found my comments helpful and appear to be working on incorporating additional analyses, though these are not yet reflected in the current revision. I understand the time constraints, but for a benchmark/evaluation-focused paper, in my view, fine-grained analyses and careful evaluations are critical to making it stand out, especially given the existing work in this field. Without such analyses, the conclusions may appear too broad and shallow to offer meaningful insights. It remains unclear what the specific plan is for addressing these concerns, and I cannot base my review on content that has not yet been presented.

  4. Regarding Memorization Effects: I appreciate the extra writing effort the authors have put in, but, similar to my last comment, there is still a lot of uncertainty and vagueness here, and I cannot raise my score based on content that I have not seen.

Overall, I want to thank the authors for the great work they have conducted. This is a great initial step and I can see the potential of this work. However, the current version falls short of the standards I expect for this venue. If there were an option to assign a score like 5.5, I would raise my score slightly, but unfortunately no such option exists.

Comment

Thank you for elaborating on your concerns, as well as for acknowledging the potential of our work. We also appreciate you recognizing our efforts in presenting the ongoing debate around World Models. In the current version, we have added the discussion and references to Appendix G. We hope our clarifications help address some of the concerns you raised.


Regarding "Marginal Contributions": If I understand correctly, the main difference between this work and many previous works cited in the paper is that this paper evaluates some "newer models" [...] how does the good performance on standard benchmarks motivate the evaluation of the model performance on deceptive in-context information?

First, we want to clarify the contribution of our work in the context of prior studies. A critical requirement of our analysis is that models perform sufficiently above random chance before deceptive prompts are introduced. When a model's original accuracy is close to the random baseline, it does not provide a reliable denominator for the Relative Accuracy Drop once deceptive information is added: small fluctuations in a near-random baseline produce large swings in the metric, making it difficult to draw definitive conclusions about a model's resilience to deceptive prompts.

For instance, as you previously mentioned, in our current results we observe that on datasets like MathQA, even the most recent models exhibit trends that deviate from the general pattern we report; at the same time, the majority of models achieve accuracy levels close to the random baseline of 20% on this dataset. This limitation was even more severe in previous studies, where earlier language models did not perform well enough on other benchmarks either, preventing a meaningful analysis of resilience to misinformation.
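To make this concrete, the sketch below uses purely hypothetical accuracy numbers (not results from the paper) and assumes the Relative Accuracy Drop is the drop in accuracy under a deceptive prompt divided by the original accuracy; if the paper's exact definition differs slightly, the instability argument is unchanged.

```python
# Hypothetical illustration: why a baseline near random chance makes the
# Relative Accuracy Drop unreliable. The definition below is an assumption.
def relative_accuracy_drop(acc_original: float, acc_deceptive: float) -> float:
    return (acc_original - acc_deceptive) / acc_original

# Strong baseline: a 10-point absolute drop yields a moderate relative drop.
print(relative_accuracy_drop(0.80, 0.70))            # 0.125

# Baseline near the 20% floor of 5-way multiple choice: the same 10-point
# drop looks far more dramatic ...
print(relative_accuracy_drop(0.25, 0.15))            # 0.40

# ... and ordinary sampling noise in the baseline (0.25 vs 0.22) shifts the
# metric substantially, even though nothing about the model changed.
print(round(relative_accuracy_drop(0.22, 0.15), 3))  # 0.318
```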

It is likely that the property of increased resilience to deceptive prompts did not meaningfully emerge in older generations of models. Such emergent behaviors may require a certain level of model capacity and sophistication that earlier models simply did not have. As a result, prior research may not have observed this phenomenon because the models available at the time were not advanced enough to exhibit it. Our study leverages newer, larger models where this property begins to meaningfully emerge, allowing us to empirically demonstrate the relationship between model size and resilience to misinformation.


Regarding memorization effect: [...] I think there is still a lot of uncertainty and vagueness there.

We have expanded the background section to provide a more comprehensive definition of memorization and to situate our contributions within the existing body of research. If this is still not convincing, or if you have further questions or suggestions, we would greatly appreciate your feedback.


If you find that our responses satisfactorily address some of your concerns, we kindly ask you to consider raising your score to support the contributions of our work.

Comment

Dear Reviewers,

We sincerely appreciate the time and effort each reviewer has invested in evaluating our work. We are pleased that multiple reviewers found our paper well-written and that our experiments rigorously validated our hypothesis across different settings (KPhv, jNvk, XY5x). Reviewer XY5x emphasized the relevance of our study in addressing LLM robustness against misleading information, crucial for ensuring reliability in real-world applications, while TvM2 highlighted the novel idea of testing deception in LLMs. Additionally, the structured approach, comprehensive analysis, and coverage across diverse benchmarks were well-received (KPhv, TvM2, XY5x, jNvk).

Your insightful concerns have been helpful in improving the clarity, rigor, and scope of our paper. Below, we address the key points raised across the reviews.

Our study is, to the best of our knowledge, the first to empirically confirm that larger language models are less affected by misinformation, a property generally anticipated based on emergent abilities analysis. Previous works have investigated vulnerabilities in LLMs; however, due to the performance limitations of earlier versions, these studies were unable to empirically validate the specific relationship between model size and resilience to misinformation.

To achieve this, we narrowed our scope so that we could cover as many benchmarks as possible, focusing on the potential relationship between parameter count and LLM resilience to deception. By standardizing prompt alterations, we effectively isolated and analyzed how models integrate in-context information with internal knowledge at scale.
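As a concrete, hypothetical illustration of such a standardized alteration, the sketch below appends a misleading hint pointing at an incorrect option. The hint wording and the example question are illustrative assumptions, not the paper's actual templates.

```python
# Sketch of a standardized prompt alteration: the deceptive variant appends
# a hint claiming an incorrect option is correct. Wording is illustrative.
def build_prompts(question: str, choices: list[str], correct_idx: int):
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    base = question + "\n" + "\n".join(
        f"({letter}) {choice}" for letter, choice in zip(letters, choices))
    wrong_idx = (correct_idx + 1) % len(choices)  # any incorrect option
    deceptive = base + f"\nHint: the correct answer is ({letters[wrong_idx]})."
    return base, deceptive  # (original prompt, deceptive prompt)

original, deceptive = build_prompts(
    "Which planet is closest to the Sun?",
    ["Venus", "Mercury", "Earth", "Mars"],
    correct_idx=1,
)
```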

Additionally, our emphasis on multiple-choice formats, as communicated to reviewers and noted in the paper's conclusion, provided a controlled environment in which performance can be measured systematically with objective metrics such as accuracy and Relative Accuracy Drop. To address concerns about broader applicability, we will include open-ended generative benchmarks (in the final rebuttal revision, early next week) to further validate our findings.

A revision of our paper has been uploaded, addressing most of the comments and queries raised by the reviewers. A further version, which will include open-ended generative benchmarks and an expanded literature review on memorization, will follow. The revised paper incorporates the following updates, all of which are highlighted in orange for the reviewers' ease of reference:

  • World Model Definition: Added a distinct section (Appendix F) and a paragraph in Section 2 elaborating on the definition of the world model in the context of LLMs, grounding it within the current literature.

  • Textual Modifications: Made minor textual changes throughout the paper, such as providing a more accurate statement regarding the LLM's capabilities when context is removed in Section 4.3. Moreover, we no longer present the evaluation strategy as a novel contribution in Section 1.

Thank you once again for your critical insights and support in evaluating our paper. We look forward to your continued feedback.

Comment

Dear Reviewers,

As promised previously, we have now uploaded a new revision (our third submission) that incorporates the following updates:

1. Open-ended Generative Benchmarks: We have added a new Appendix G, which includes an evaluation on two well-established open-ended generative benchmarks: MATH and GSM8K.

2. Literature Review on Memorization: Section 2 has been expanded to include a discussion on memorization, situating our contributions more effectively within the existing body of research. Specifically, we explore memorization in LLMs, building on [1].

3. Extended Qualitative Analyses: In Appendix A, we have examined how nuanced and realistic misleading hints affect the generative behavior of LLMs, providing insights into how reasoning patterns change across different model sizes.

These updates have been integrated alongside our responses to individual reviewer comments. The new material is clearly marked in orange to facilitate ease of review.


We hope that this revised submission meets your expectations and further strengthens the contributions of our work. We would greatly appreciate a reassessment of our submission in light of these improvements.



[1] Valentin Hartmann, Anshuman Suri, Vincent Bindschaedler, David Evans, Shruti Tople, and Robert West. SoK: Memorization in General-Purpose Large Language Models, 2023.

Comment

Dear Reviewers,

We have submitted the final revision of our paper (the fourth submission), addressing the latest concerns:

1. Directive Instruction Experiments in Appendix G: We added these experiments on the open-ended tasks to support Section 4.2.

2. Minor Textual Modifications in Section 2: We made small adjustments to improve clarity (addressing a concern raised by Reviewer jNvk).

Thank you for your valuable feedback!

AC Meta-Review

The paper presents a comparison of how LMs/LLMs handle deception across a wide range of domains, but the reviewers felt that the contribution was marginal and that some of the claims lacked sufficient supporting analysis.

Additional Comments from Reviewer Discussion

NA

Final Decision

Reject