6.5

/10

Poster4 位审稿人

最低5最高8标准差1.1

3.3

置信度

正确性3.3

贡献度3.0

表达3.3

NeurIPS 2024

Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?

Zhanke Zhou,Rong Tao,Jianing Zhu,Yiwen Luo,Zengmao Wang,Bo Han

OpenReview PDF

提交: 2024-05-05更新: 2024-11-06

摘要

关键词

Large language modelsChain of thoughtsNoisy rationalesReasoning robustness

评审与讨论

审稿意见

评分: 7置信度: 32024-07-11

This manuscript explores the challenge of noisy rationales in LLMs. The authors introduce the NoRa dataset, specifically designed to evaluate LLMs' robustness to noisy rationales. They reveal a widespread vulnerability among LLMs to such noise, despite advancements in in-context learning. To address this challenge, they propose the CD-CoT method, which enhances denoising-reasoning capabilities by contrasting noisy rationales with clean rationales. The authors conduct comprehensive evaluations using the NoRa dataset and demonstrate the vulnerability of LLMs to noisy rationales. They also show that CD-CoT significantly improves the performance of LLMs by rectifying noisy rationales. The manuscript contributes by formalizing the problem of noisy rationales, constructing the NoRa dataset, evaluating LLMs' robustness, and proposing the CD-CoT method as a solution.

优点

The manuscript addresses an under-explored challenge in LLMs - the issue of noisy rationales in chain-of-thought prompting. By focusing on the noisy rationales problem, the authors bring attention to a practical challenge that arises in various domains, such as crowdsourced platforms, dialogue systems, and machine-generated data.
The authors construct the NoRa dataset, which serves as a comprehensive testbed for evaluating LLMs' robustness in reasoning with noisy rationales. The dataset covers various reasoning tasks, including mathematical, symbolic, and commonsense domains. The formalization of noisy rationales by adding irrelevant or inaccurate thoughts, along with controlling the reasoning difficulty through different noise ratios, enhances the dataset's reliability and usefulness.
The manuscript provides a thorough evaluation of various LLMs using the NoRa dataset. The authors disclose the intrinsic vulnerability of LLMs to noisy rationales and demonstrate significant accuracy decreases compared to the clean scenario. This evaluation highlights the importance of addressing the noisy rationales problem and motivates the development of robust methods.

缺点

See the questions listed below.

问题

While the manuscript introduces the CD-CoT method as a solution to address the noisy rationales problem, it does not extensively compare CD-CoT with other existing methods or approaches. Including a comparative analysis with alternative denoising or reasoning enhancement techniques would provide a better understanding of CD-CoT's effectiveness and its advantages over other methods.
While the manuscript focuses on addressing the noisy rationales problem and proposes the CD-CoT method as a solution, it does not extensively analyze or provide insights into the underlying causes of the vulnerability of LLMs to noisy rationales. A deeper exploration of the reasons behind this vulnerability could contribute to a better understanding of the problem and potentially inspire further research directions.
The authors should provide additional information regarding the underlying mechanisms of CD-CoT. For instance, within the first step of CD-CoT, the introduction of "Rationale Selection" is mentioned to denoise the rephrased results. However, the authors have not clarified which specific technique they employ to achieve answer matching.

局限性

优点

The focus on noisy rationales in CoT prompting is an under-explored area.
The thorough evaluation of various LLM backbones and baselines.

缺点

I am not convinced by the necessity of exploring tasks with noisy rationales in the ICT. The main problem is that the generated rationale can be noisy (T_test). However, the clean rationales in the demonstrations (T_1 to T_n) are more than adequate for the types of reasoning tasks evaluated in the paper. For example, the experiments in Table 3 do not represent a common scenario that should be faced by the baseline models where the demonstrative rationales are noisy.
Claiming to be the first to explore noisy rationales seems overstated. For example, contrastive CoT used in the baseline also deals with noisy rationales.

问题

Why is exploring LLM performance with noisy rationales in demonstrations important when clean demonstrations for these task types are easily available?
In Appendix F4, three different types of irrelevance are defined. But only one way to calculate relevance is mentioned, i.e., calling an API for cosine similarity. How do you differentiate the calculation for level 1, level 2, and level 3 irrelevance?
Why does w/SD perform better when inaccurate/irrelevant sentences are inserted?
Are there specific properties of noisy rationales that CD-CoT handles better or worse? Why?
How would the method generalize to other types of reasoning tasks beyond those covered in the NoRa dataset?

局限性

The authors have acknowledged the limitations regarding the need for clean rationales in their proposed method.

作者回复

2024-08-07

Thanks for the valuable feedback. Please find the point-to-point responses below. Any further comments and discussions are welcome!

W1. About the noisy rationales in in-context learning.

Reply: The rationales in the demonstrations (T_1 to T_n) can be noisy in practice, which is the main problem. This problem is caused by diverse sources such as crowdsourced platforms, dialogue systems, and machine-generated data. We have extensively discussed the cause of noisy rationales in Appendix C.1, with several real-world examples in Appendix C.2. Besides, we provide examples in Table 2 in the extra PDF file, showing that noisy inputs (T_1 to T_n) can lead to noisy outputs (T_test).

Empirically, we reveal the widespread vulnerability among prevailing LLMs to noisy rationales, with limited efficacy from existing reasoning methods. Compared with clean rationales, most cases in Table 3 show a 15-30% decrease with irrelevant noise and a more drastic 20-80% decrease with inaccurate noise.

Therefore, we would argue that noisy rationales belong to a practical and challenging problem. It is largely overlooked by existing work (assuming ICL demonstrations are clean) and should be paid more attention to. We believe the NoRa dataset and the insightful observations in this work can contribute to the community for building trustworthy foundation models.

W2. Differences with a related work (CC).

Reply: Our submission defines the noisy rationale problem as “factually inaccurate or irrelevant reasoning steps paired with valid question-answer prompts.” Figure 1 shows an example. Here, only one rationale is given in each demonstration, which can be potentially noisy but unknown to the model.

However, in CC’s setting, each ICL demonstration explicitly includes a clean rationale and a wrong rationale. An example is shown in Figure 1 in CC’s paper. Notably, this rationale is wrong instead of noisy as it induces the wrong answer. Therefore, CC’s setting and ours are totally different.

In addition, empirical results in Table 7 show that our method CD-CoT significantly outperforms CC when given the same information.

Q1. About the practice of noisy rationales.

Reply: We agree that it is unnecessary to consider noisy rationales when extensive clean demonstrations are available. However, clean demonstrations are not always available in practice, especially when requiring experts’ domain knowledge, e.g., medical diagnosis. In this context, either human-annotated or machine-generated demonstrations can be noisy as we respond to W1.

Moreover, our work considers the practical scenario that most ICL demonstrations are clean and only a few contain noise. For example, in base-9 of Table 3, introducing easy-level inaccurate noise led to a 50% decrease in accuracy. More empirical results are in Appendix F.5. These underscore the practical importance and challenges of addressing noisy rationales, even when they appear infrequently.

Q2. About the levels of irrelevance.

Reply: To clarify, we first defined three levels of irrelevance, i.e., Level-1 (topic-irrelevant), Level-2 (topic-relevant but task-irrelevant), and Level-3 (topic-relevant, task-relevant, but not helpful). Using these definitions to build prompts, we then employed GPT-4 to generate corresponding irrelevant content for each type.

Next, we calculated cosine similarity scores to illustrate the varying degrees of relevance across these predefined levels. The scores below align with our qualitative categorization, offering a more concrete understanding of the semantic distances between the different levels of irrelevance.

Cosine Similarity	Level-1	Level-2	Level-3
Math Base-9	0.75	0.87	0.88
Symbolic Equal	0.73	0.79	0.82

Q3. About the performance of SD.

Reply: SD's denoising effect relies on LLMs' intrinsic capability. SD performs five maskings and reconstructions per noisy example, concatenating results into five prompts for LLM inference. In contrast, SC performs five direct inferences on noisy prompts without explicit denoising. By comparing SD and SC's performance, we can gain insights into the model’s denoising effects.

In Table 3, SC outperforms SD in Math and Sym tasks, while SD only marginally excels in the Common task. This pattern persists in the clean settings. This suggests that LLMs struggle to reconstruct masked prompts in complex, domain-specific tasks but perform better in simpler common tasks, highlighting the varying levels of internal knowledge within LLMs across different domains.

Besides, the counterintuitive results on the Common task may be attributed to task-specific characteristics. Observation of reconstructed masked prompts in the Common task indicates that LLMs tend to bypass reconstruction instructions in noisy settings, directly providing final answers, as shown in Tables 80-82. This accidental removal of all rationales, including noisy ones, effectively acts as a noise filter. Combined with the LLM's natural strength in the Common task, this unplanned filtering likely explains the small improvement in accuracy under noisy conditions.

Q4. Characteristics of CD-CoT.

Reply: In Table 7, CD-CoT performs better in handling irrelevant noise compared to inaccurate noise at the same level. This is because irrelevant noise is easier to distinguish from the target information and, therefore, more readily removed during the rephrase step when performing contrasting denoising. Further, Table 3 of the extra PDF file provides the denoised results under high-noise settings.

Q5. The generalization ability of CD-CoT.

Reply: Please refer to the general response, where we empirically justify the proposed CD-CoT method's generalization ability to other datasets.

We will include the above discussions in the revision. We would thank reviewer qSQZ again for the valuable comments.

2024-08-14

I appreciate the authors' responses during the rebuttal period. However, many of my concerns remain unresolved.

For w1 & q1: few-shot prompting relies on using a few clean and accurate examples, minimizing the need for extensive training data. The noise mentioned in the appendix can be mitigated by carefully selecting well-annotated demonstrations, as is common in existing CoT methods. The authors' assertion that a few clean demonstrations is difficult seems overstated.
Contrastive CoT addresses performance issues related to incorrect reasoning, while inaccurate rationales is also an important part of noise explored in this work.
Q2: While the response clarifies the levels of irrelevance, the paper lacks a clear description of how different promptings realize these levels.
Q3 remains unaddressed. The question was why w/SD outperforms in commonsense/symbolic equal tasks with irrelevant and inaccurate rationales compared to clean ones.

评论- A further response to Reviewer qSQZ (3/3)

2024-08-14

Q3 remains unaddressed. The question was why w/SD outperforms in commonsense/symbolic equal tasks with irrelevant and inaccurate rationales compared to clean ones.

Reply: Thanks for the comment.

Following your comments, we thoroughly examine the empirical results from the log files. We identify some inconsistencies in the SD method's performance, particularly in the symbolic equal and commonsense tasks.

SD in the symbolic equal task:

In this task, comparing w/SD and w/SC reveals that the domain knowledge of LLM is relatively weak. According to the experiment results, when SD rephrases symbolic tasks, it often directly returns the final answer of the original response in many cases.
Notably, the quality of the rephrased results basically does not change with the increase in noise. The examples of rephrased rationales can be found in Tables 65-76 of Appendix F.9.
As a result, the accuracy of w/SD under both clean and noisy settings is very close, with a maximum difference of less than 1%, which we believe is within the acceptable range of error.

SD in the commonsense task:

We observed different behaviors in clean and noisy settings.
In noisy settings, SD does not follow the prompt to make minor corrections. Instead, SD generates short rationales with only the answer, discarding the original rationale (as shown in the examples below). This simplification can benefit commonsense tasks where the model already has a strong inherent understanding.
On the other hand, in the clean setting, SD unexpectedly generates entirely new rationales instead of making minor corrections. These new rationales may deviate from the original correct reasoning, potentially introducing errors or inconsistencies that weren't present in the initial, accurate solution.

Next, we show two representative examples of SD in the commonsense task. We show that

given a clean rationale, SD generates entirely new rationales instead of making minor corrections;
given a noisy rationale, SD generates short rationales with only the answer, discarding the original rationale.

Example 1:

The original clean rationale:

Carrie is Hugh's sister's husband's daughter, so the relations path is sister, husband, daughter. For husband's daughter, we have husband's daughter is daughter. So the relations path are reduced to sister, daughter. For sister's daughter, we have sister's daughter is niece. So the relations path are reduced to niece. Therefore, Answer: niece.
The rephrased rationale by SD (given the clean rationale):

To determine Carrie's relationship to Hugh, we can break down the given information step by step:

Carrie is Hugh's sister's husband's daughter.

Hugh's sister is Carrie's sibling.

Hugh's sister's husband is Carrie's brother-in-law.

Carrie is the daughter of Hugh's sister's husband. Therefore, the answer is: Carrie is Hugh's niece.

The original noisy rationale:

Carrie is Hugh's sister's husband's daughter, so the relations path is sister, husband, daughter. For husband's daughter, we have husband's daughter is daughter. We have daughter's grandmother is niece. So the relations path are reduced to sister, daughter. For sister's daughter, we have sister's daughter is niece. We have daughter's niece is niece. So the relations path are reduced to niece. Therefore, Answer: niece.
The rephrased rationale by SD (given the noisy rationale):

niece.

Example 2:

The original clean rationale:

Francisco is Lynn's son's father's brother, so the relations path is son, father, brother. For father's brother, we have father's brother is uncle. So the relations path are reduced to son, uncle. For son's uncle, we have son's uncle is brother. So the relations path are reduced to brother. Therefore, Answer: brother.
The rephrased rationale by SD (given the clean rationale):

To determine Francisco's relationship with Lynn, we start with the information provided: Francisco is Lynn's son's father's brother.

Step 1: Lynn's son's father is Lynn's husband. Step 2: Francisco is Lynn's husband's brother. Based on this information, we can conclude that Francisco is Lynn's brother-in-law. Answer: brother-in-law

The original noisy rationale:

Francisco is Lynn's son's father's brother, so the relations path is son, father, brother. For father's brother, we have father's brother is uncle. So the relations path are reduced to son, uncle. For son's uncle, we have son's uncle is brother. We have uncle's niece is nephew. So the relations path are reduced to brother. We have brother's nephew is mother. Therefore, Answer: brother.
The rephrased rationale by SD (given the noisy rationale):

Answer: brother.

We will include the above discussions in the revision. We would like to thank reviewer qSQZ again for the comments in the discussion phase. Any further comments or questions are welcome!

评论- A further response to Reviewer qSQZ (2/3)

2024-08-14

Q2: While the response clarifies the levels of irrelevance, the paper lacks a clear description of how different promptings realize these levels.

Reply: Thanks for the comment. We would further clarify our methodology for generating different levels of irrelevant noise.

Definition and Prompt Engineering: We first define the three levels of irrelevance:

Level-1: Topic-irrelevant
Level-2: Topic-relevant but task-irrelevant
Level-3: Topic-relevant, task-relevant, but not helpful

With this definition, we then craft prompts for GPT-4 to generate corresponding irrelevant content. The basic structure of our prompt is as follows:

We define irrelevant noise in reasoning as information that does not contribute to solving the given problem or reaching the correct conclusion. To mimic real-world scenarios, we categorize this noise into 3 levels:

Level-1 (Topic-irrelevant): Statements completely unrelated to the topic or domain of the question.

Level-2 (Topic-relevant but task-irrelevant): Statements related to the general topic but not directly applicable to solving the specific task.

Level-3 (Topic-relevant, task-relevant, but not helpful): Statements that seem relevant to both the topic and task but do not actually aid in reaching the correct solution. Given the question {Q} and answer {A}, please generate a Level-{X} irrelevant statement after each reasoning step. Provide {K} examples of such statements.
Important notes:

The inserted noise should not disrupt the original reasoning logic.

The irrelevant statements should be plausible in the context of the question but not contribute to solving it.

Ensure that the level of irrelevance matches the specified Level-{X}. Please proceed with generating the irrelevant statements as requested.

Based on the above definition and prompt, we generate the data with the following four steps.

Step-1: Initial Generation and Human Evaluation. We used this prompt to generate an initial set of irrelevant statements for each level. These were then manually reviewed and filtered to ensure they accurately represented the intended level of irrelevance. We selected high-quality examples for each level.
Step-2: Scaled Generation. Using these high-quality examples as in-context learning demonstrations, we prompted GPT-4 to generate a larger set of irrelevant statements for each level.
Step-3: Validation through Similarity Analysis. To confirm that our generated statements indeed represented different levels of irrelevance, we conducted a cosine similarity analysis. This analysis quantitatively demonstrated the semantic differences between levels, as shown in our previous response.
Step-4: Dataset Construction. Finally, we integrated these generated irrelevant statements into our dataset. We inserted them into relevant demonstrations at appropriate positions, following a probability distribution that corresponds to the intended difficulty level of the task.

We will add the above technical details to our submission.

评论- A further response to Reviewer qSQZ (1/3)

2024-08-14

We would like to thank reviewer qSQZ for the further comments. Here is our further response.

For w1 & q1: few-shot prompting relies on using a few clean and accurate examples, minimizing the need for extensive training data. The noise mentioned in the appendix can be mitigated by carefully selecting well-annotated demonstrations, as is common in existing CoT methods. The authors' assertion that a few clean demonstrations is difficult seems overstated.

Reply: Thanks for the comment. We would further explain the noisy demonstrations in practice.

In fact, the in-context learning of LLMs suffers from the disadvantages of susceptibility to selected demonstrations and the intricacy of generating these demonstrations. Several recent investigations on noisy questions [1] have shown that (i) LLMs can be distracted by irrelevant or adversarial context and (ii) the LLM reasoning is unstable to the small modifications in prompts. Besides, another line of noisy answers justifies misleading the LLM to agree with factual errors. Our original submission has already discussed these in Section 2 and Appendix B.

The key point is that humans can inevitably make mistakes in practice, which can mislead the models. Even machine learning practitioners can make mistakes in data annotation, which motivates extensive research on label-noise learning [2,3,4]. Similarly, there is no guarantee for clean demonstrations in practice, and LLMs can encounter noisy demonstrations provided by diverse users with different experiences and background knowledge.

In this context, behind the outstanding feasibility of CoT methods, the LLMs’ robustness against noisy inputs, such as noisy questions [1] and noisy rationales studied in this work, should be given more attention. All four reviewers acknowledge this under-explored research problem.

This work exceeds the ideal assumption of obtaining clean demonstrations and reveals the existing CoT methods’ unsatisfactory robustness against noisy rationales. It presents the LLMs’ fundamental weakness in dealing with noisy rationales that might be unseen from the training data, similar to the jailbreak attack [5] and the reversal curse [6].

Besides, constructing and selecting well-annotated demonstrations is non-trivial and costly. On one hand, LLMs have been proven sensitive to the ICL examples [7]. On the other hand, human annotations of ICL examples can be expensive, as we have discussed in the Appendix and rebuttal responses. Therefore, incorporating more human supervision in dealing with noisy rationales is feasible but can be expensive.

In addition, the problem of noisy labels can also be solved by well-annotated labels by humans. However, numerous noisy benchmarks and robust methods are proposed to improve the model’s robustness. A robust learning and reasoning strategy is desired to deal with the noisy data.

References
[1] F. Shi et al. Large language models can be easily distracted by irrelevant context. In ICML, 2023.
[2] N. Natarajan et al. Learning with Noisy Labels. In NIPS, 2013.
[3] L. Jiang et al. MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels. In ICML, 2018.
[4] Z. Zhang et al. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In NIPS, 2018.
[5] A. Wei et al. Jailbroken: How Does LLM Safety Training Fail? In NeurIPS, 2023.
[6] L. Berglund et al. The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A". In ICLR, 2024.
[7] Y. Lu et al. Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. In ACL, 2022.

Contrastive CoT addresses performance issues related to incorrect reasoning, while inaccurate rationales is also an important part of noise explored in this work.

Reply: Thanks for the comment. We agree that the settings of these two works are relevant but also greatly different. As we responded to W2, we investigated noisy rationales (deducing the right answer) instead of wrong rationales (deducing the wrong answer). We will incorporate more discussions with this work in revisions.

2024-08-14

Thanks for your response, though we need to acknowledge the significant difference in difficulty between constructing large-scale clean training datasets and selecting a few clean demonstrations.

评论- Would you mind checking our response and confirming whether you have any further questions?

2024-08-09

Dear Reviewer qSQZ,

Thanks for your time and comments on our work!

We have tried our best to address your concerns and provided detailed responses to all your comments and questions.

Would you mind checking our response and confirming whether you have any further questions?

Best regards,
Authors of #1934

评论- Please discuss with authors

2024-08-11

Dear Reviewer qSQZ,

Please respond to author rebuttal and discuss with authors.

Thanks,

Your AC

评论- [Last-day Reminder] We are anticipating your post-rebuttal feedback!

2024-08-13

Dear Reviewer qSQZ,

Thanks very much for your time and valuable comments.

We understand you might be quite busy. However, the discussion deadline is approaching, and we have only around one day left.

We believe that our responses—detailed clarifications with empirical results—are sufficient to address the questions you raised. Specifically, we

discuss noisy rationales in real scenarios (W1, Q1)
discuss the relationship and differences with related works (W2)
clarify the evaluation metric (Q2)
further explain the empirical results and findings (Q3, Q4)
conduct additional experiments with CD-CoT (Q5)

Would you mind checking our response and confirming whether you have any further questions?

Thanks for your attention.

Best regards,
Authors of #1934

审稿意见

评分: 8置信度: 22024-07-13

This paper introduces the NORA dataset and a new technique called Contrastive Denoising (CD) that paired with LLMs improves Chain-of-Thought (CoT) reasoning. The paper presents an extensive experimental evaluation over four different LLMs under all tasks in the NORA dataset and a lengthy comparison with CD.

优点

Originality

This paper addresses the problem of Noisy Rationales (NR), in contrast to that of Noisy Questions (NQ), which has been previously addressed in the literature. The introduction of a new dataset specific for NR is new to me and the CD strategy can also be helpful in different practical contexts.

Quality and Clarity

The paper is of high quality, well-written, and easy to follow. All sections provide useful details for understanding the core parts of the paper and many more details are also included in the appendix.

Significance

The contribution is excellent and constitutes a valid resource for future studies in NR. To the best of my knowledge, this is the first dataset proposed for studying the problem of NR. The method proposed (CD) is sound and outperforms reasonably other competitors, being tailored specifically for the NR task. This is good and will serve as a future baseline for future methods.

缺点

The length of the paper (comprising all the material in support of the main paper) and the number of details are too wide for a submission in the NeurIPS main track, making the paper more suited for a journal publication. Nonetheless, the message, results, and method are clear from the presentation in the main paper.

Another aspect is that the submission would be more in line with a dataset & benchmark paper, mostly for the NORA dataset. In line with requirements for publishing datasets, authors should have taken into consideration the datasheet for datasets (see https://arxiv.org/abs/1803.09010), which is mandatory for reproducibility and use of the dataset. I will discuss with other reviewers and the AC the extent to which this limits the submission by the authors.

I would consider raising my score upon clarifying this point.

I found no particular weaknesses in the experimental benchmarking and the evaluation of the proposed method.

问题

The authors suggest in the conclusions that other methods based on retrieval augmentation could constitute possible improvements to the issue of NR. Can you comment on [1] and whether this could have been already used for the task they propose?

[1] Evaluating the Adversarial Robustness of Retrieval-Based In-Context Learning for Large Language Models, Yu et al. (2024)

局限性

The presented contribution lacks mandatory requirements for the NoRa dataset, based on the guidelines of the dataset & benchmark track. This should be discussed to assess the eligibility for the paper to be published in this track.

The creation of NR is synthetic, not including extensions to what could be real noisy rationales that could have been influencing LLMs CoT reasoning in real scenarios. This is though, not a serious limitation given that NoRa is the first dataset proposed for NR.

作者回复

2024-08-07

Thanks for the valuable feedback. Please find the point-to-point responses below. Any further comments and discussions are welcome!

W1. The presentation of the submission.

Reply: Thanks for this constructive comment!

We would kindly note that in the official guideline of NeurIPS 2024, “the main text and references may be followed by technical appendices, for which there is no page limit.”

Here, we would like to clarify the contents of our 106-page appendix. Specifically,

Appendix A (1 page): a further discussion on broader impact, limitations, and extensions.
Appendix B (4 pages): a detailed literature review.
Appendix C (22 pages): a comprehensive overview of the constructed NoRa benchmark, with 7-page real-world examples and 8-page examples of NoRa.
Appendix D (3 pages): the full theoretical analysis.
Appendix E (2 pages): the implementation details of the proposed CD-CoT method.
Appendix F (39 pages): the full experiment results, with 23-page rephrased examples of different denoising methods.
Appendix G (28 pages): more case study of CD-CoT.
Appendix H (7 pages): the NeurIPS paper checklist.

Notably, there are 15-page examples of the dataset and 51-page examples of the reasoning methods. If necessary, we can reduce these examples and move them to a webpage. Then, a 33-page appendix will be obtained (the checklist is not counted), making it more suitable for the conference.

We will follow the reviewers’ and ACs’ suggestions for improving our submission. Any further suggestions or comments are definitely welcome!

W2 & L1. The datasheet of the NoRa dataset.

Reply: Thank you so much for this constructive comment!

We supplement the NoRa dataset datasheet in Table 1 of the extra PDF file, and the source files of NoRa can be accessed by the anonymous github link in our submission.

Besides, we would kindly point out that the main track papers of NeurIPS can also propose new datasets. For example, [1] proposes the PRONTOQA-OOD dataset for benchmarking the out-of-demonstration reasoning capability of LLMs, [2] proposes the CLADDER dataset for causal reasoning, and [3] proposes the Clevr-4 dataset for category discovery.

What’s more, our submission goes beyond proposing a new dataset: we also propose a new reasoning method, CD-CoT, to improve the reasoning robustness against noisy rationales. In addition, based on NoRa, we reveal several insights for the under-explored noisy rationale problem that can be valuable for building trustworthy foundation models.

Q1. Discussion with a related paper [4].

Reply: Thanks for recommending this paper. We carefully read it and had the following discussion.

Settings: [4] explores the robustness of retrieval-augmented ICL against demonstration attacks and test sample attacks. It focuses on perturbing the example questions (i.e., noisy questions) or labels, while our work focuses on the rationales of the examples (i.e., noisy rationales).
Methodology: The DARD method proposed in [4] improves the robustness of retrieval-augmented ICL against test sample attacks by introducing perturbed examples into the example pool.
Empirical observations: [4] finds that retrieval-augmented ICL exhibits better robustness against test sample attacks. However, its robustness decreases when facing demonstration attacks, suggesting that LLMs are more sensitive to perturbations in demonstrations that are more similar to the test samples.

We will include the above discussion in the revision.

L2. Extensions to the noisy rationales in real scenarios.

Reply: Thanks for this insightful comment.

We agree with your point. Meanwhile, we would note that the noisy rationales are carefully designed to simulate scenarios in practical applications.

The noise generation is based on extensive research into the types of irrelevant or misleading information that can impact LLM reasoning. Specifically,

In Appendix C.1, we provide a comprehensive summary of the causes of irrelevant and inaccurate noise generated by both humans and models.
In C.2, we present several real-world examples to illustrate how reasoning noise commonly occurs in daily in-context scenarios.
Our synthetic noises are modeled after these real-world examples, ensuring that they closely mimic the types of interference frequently encountered in practical applications.
Besides, our method of inserting synthetic noise allows for better control over the ratio, type, and distribution of noise, enabling a systematic evaluation of the noisy rationales.

Empirically, in addition to the standard evaluation of NoRa, we also evaluate the effects of noisy rationales in different real-world scenarios.

In F.4, we introduce semantic difficulty levels of irrelevant content in our noisy rationales, aiming to better reflect the complexity and variability of noise encountered in actual applications.
In F.5 and F.6, we consider different numbers of noisy thoughts and various numbers of noisy examples, including the ablation study on the order of noisy examples.
In F.8, we investigate the noisy rationale problem in large-scale real-world scenarios by evaluating the impact of noisy context in multi-turn conversational QA tasks.

We would like to thank reviewer bpvC again for these constructive suggestions! We are committed to continually refining our work to ensure it closely aligns with real-world scenarios and challenges in LLM reasoning.

References:
[1] A. Saparov et al. Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples. In NeurIPS, 2023.
[2] Z. Jin et al. CLADDER: Assessing Causal Reasoning in Language Models. In NeurIPS, 2023.
[3] S. Vaze et al. No Representation Rules Them All in Category Discovery. In NeurIPS, 2023.
[4] S. Yu et al. Evaluating the Adversarial Robustness of Retrieval-Based In-Context Learning for Large Language Models. Arxiv, 2024.

评论- Reply to authors

2024-08-08

Thank you for the detailed reply and for providing the required documentation.

We supplement the NoRa dataset datasheet in Table 1 of the extra PDF file.

Good, this seems an issue I only pointed out. I would believe this is fine with it.

What’s more, our submission goes beyond proposing a new dataset: we also propose a new reasoning method, CD-CoT, to improve the reasoning robustness against noisy rationales. In addition, based on NoRa, we reveal several insights for the under-explored noisy rationale problem that can be valuable for building trustworthy foundation models.

Sure, I did not penalize this part of the contribution.

We will include the above discussion in the revision.

Thank you for the comparison.

The noise generation is based on extensive research into the types of irrelevant or misleading information that can impact LLM reasoning.

Thank you for pointing to those sections.

评论- Many thanks for your positive support and constructive comments!

2024-08-08

Hi Reviewer bpvC,

Thank you so much for your comments and appreciation! We really value your constructive feedback, as it helps us improve our work. We will carefully incorporate the above discussions into our submission.

Please feel free to interact with us if you have any further questions.

Best regards,
Authors of #1934

审稿意见

评分: 6置信度: 32024-07-13

The paper proposes a new noisy rationales dataset, to evaluate LLMs' robustness of reasoning across various reasoning domains, covering math, symbolic, and common sense. The datasets is formed by adding irrelevant or inaccurate thoughts into rationales. Existing LLM like GPT 3.5 would struggle on this newly proposed dataset. The authors propose to rectify the rationales with Contrastive Denoising with noisy CoT, which greatly achieves the accuracy improvements.

优点

As far as I know, this paper is among the first to explore the noisy rationale problem and it provides many useful insights.
Authors evaluated the noisy rationale problem on the latest GPT 3.5, Gemini-Pro, etc. to demonstrate the issue, and meanwhile proposed a solution CD-CoT to address this problem.
The insights behind the dataset creation are delineated thoroughly, and align with the evaluation and observations.

缺点

Even though the dataset covers 3 domains including math, symbolic, and commonsense, the specific tasks are confined to certain subtasks like base-9, equal-length. The generalization capability on the proposed method may bring some concerns.
Given that some eval metrics are new, it's better if you bring more descriptions and explanations to the main context.
In the proposed CD-CoT method, does selection or voting require a separate LLM?

问题

Could you elaborate on "answer matching" from step 2?
When you generate the rationales, would N incur large computation needs?
Is CD-CoT sensitive to the prompt designs?

局限性

N/A

作者回复

2024-08-07

Thanks for the valuable feedback. Please find the point-to-point responses below. Any further comments and discussions are welcome!

W1. The generalization ability of the proposed CD-CoT method.

Reply: Thanks for this valuable comment.

Please refer to the general response, where we further discuss and empirically justify the proposed CD-CoT method's generalization ability to other datasets.

W2. About the evaluation metric.

Reply: Thanks for this helpful comment.

We would like to clarify the usage of evaluation metrics.

The evaluation metric used in the main content is the accuracy introduced in Section 4.
The other metric, Normalized Difference in Accuracy (NDA), in the appendix, is only an auxiliary tool for analyzing empirical results. This metric quantifies the efficacy of a given LLM and denoise method under the noisy scenario (details in Appendix F.2).

Please note that putting the introduction and analysis of the NDA metric in the main context will make it too crowded. Besides, NDA does not influence the empirical observations and analysis of the main content.

Therefore, we introduce the NDA metric in the appendix and have built a jump link in Section 4. We will further clarify the usage of metrics in the revision.

W3 & Q1. Technical details of the proposed CD-CoT method.

Reply: Thanks for this technical question.

The rationale selection (step 2) and answer voting (step 4) do not require using an LLM.

Specifically, the rationale selection (step 2) selects the rationales to deduce the true answer.

As the rephrased rationales can still contain noisy information, and each rationale can deduce an answer to the question, we select the rationales that the corresponding answers match the given (true) answer of this demonstration. This is called the “answer matching” and does not require an LLM for inference.
For example, (Q, R, A) indicates the question, rationale, and answer of a given noisy demonstration. The rationale rephrasing (step 1) obtains three rephrased demonstrations: (Q1, R1, A1), (Q2, R2, A2), and (Q3, R3, A3). Then, if A1=A2=A and A3!=A, we will select the first two rephrased rationales, R1 and R2.
Namely, only rephrased results with consistent answers are retained, forming the refined candidate pool for that noisy demonstration of in-context learning.

The answer voting (step 4) does not require an LLM as well.

Given D answers from step 3, we equally vote them into a final answer.
For example, if the answer set is {1,1,1,2,3} where D=5, the answer voting will select “1” as the final answer for its highest frequency.

We will clarify the above in the revision. Please also refer to Appendix E. 2, where more technical details and the full algorithm of CD-CoT are introduced.

Q2. When you generate the rationales, would N incur large computation needs?

Reply: Thanks for this insightful comment.

In the main content, we present the impact of parameters M, D, and C on token usage in Table 10.

Here, we conduct additional experiments to figure out the effect of varying N on the computational cost. Specifically, we maintain a constant number of reasoning repetitions D=5 while adjusting N and other parameters. Here are the configurations for testing:

N=1, M=1, C=[5], D=5
N=2, M=2, C=[3,2], D=5
N=3, M=2, C=[3,2], D=5
N=4, M=2, C=[3,2], D=5
N=5, M=2, C=[3,2], D=5 (the default configuration)

These experiments are conducted on the NoRa-Math base-9 task with irrelevant hard noise. The table below shows the total number of tokens consumed by CD-CoT for complete reasoning on 300 test samples. This includes tokens used for both rephrasing and reasoning steps.

N	1	2	3	4	5
tokens	1071560	1408845	1532606	1656617	1780095

As we can observe, the number of tokens generated increases as N increases. This growth in token count directly correlates with increased computational needs. Notably, the computational cost does not scale linearly with N.

Q3. Is CD-CoT sensitive to the prompt designs?

Reply: Thanks for this insightful comment.

Section 5.1 provides the prompt for contrastive rationale rephrasing. Here, we generated several variants of prompts to investigate their sensitivity to the proposed CD-CoT method.

A simpler, shorter prompt:

Here are two examples: the first one has proper explanation and answer, while the second one has distracted explanation and correct answer. Please follow the first example's explanation and provide the correct explanation and answer for the second one.

A more complex, longer prompt:

The following presents two examples of the same type of task. The first example contains both a correct explanation and a correct answer. The second example, however, includes a distracted explanation but still provides the correct answer. Your task is to analyze these examples and then provide a revised version explanation of the second example along with its answer. Ensure that your revised explanation is logically consistent with the first example.

Then, we conduct additional experiments on Math Base-9 to compare these three prompts. The results in the table below show that the performance of CD-CoT is only marginally influenced by these prompts.

method	Irrelevant-medium	Inaccurate-medium
Base Model	0.28	0.08
CD-CoT w/ original prompt	0.49	0.48
CD-CoT w/ short prompt	0.46	0.46
CD-CoT w/ long prompt	0.47	0.48

Note that CD-CoT's prompt remains simple. Cooperating with advanced methods for iterating prompts and rationales, such as APE [1] and Star [2], can further improve this reasoning method.

References:
[1] Y. Zhou et al. Large language models are human-level prompt engineers. In ICLR, 2023.
[2] E. Zelikman et al. Star: Bootstrapping reasoning with reasoning. In NeurIPS, 2022.

评论- Would you mind checking our response and confirming whether you have any further questions?

2024-08-09

Dear Reviewer HgRq,

Thanks for your time and comments on our work!

We have tried our best to address the concerns and provided detailed responses to all your comments and questions.

Would you mind checking our response and confirming whether you have any further questions?

Best regards,
Authors of #1934

评论- Please discuss with authors

2024-08-11

Dear Reviewer HgRq,

Please respond to author rebuttal and discuss with authors.

Thanks,

Your AC

评论- [Last-day Reminder] We are anticipating your post-rebuttal feedback!

2024-08-13

Dear Reviewer HgRq,

Thanks very much for your time and valuable comments.

We understand you might be quite busy. However, the discussion deadline is approaching, and we have only around one day left.

We believe that our responses—detailed clarifications with empirical results—are sufficient to address the questions you raised. Specifically, we

conduct additional experiments with CD-CoT (W1, Q2, Q3)
clarify the evaluation metric (W2)
clarify the technical details of the CD-CoT method (W3)

Would you mind checking our response and confirming whether you have any further questions?

Thanks for your attention.

Best regards,
Authors of #1934

作者回复

2024-08-07

A General Response by Authors:

We sincerely thank all four reviewers for their thoughtful suggestions on our submission.

We have received four reviews with positive ratings 6,6,5,5. We are glad that all the reviewers have good impressions of our work, including

an under-explored and critical problem (bpvC, HgRq, qSQZ, ajBs)
construct a valuable dataset (bpvC, ajBs)
a novel and helpful method to address the problem (bpvC, HgRq)
comprehensive experiments and several insights (HgRq, qSQZ, ajBs)
well-written and good presentation (bpvC, HgRq, qSQZ).

In the rebuttal period, we have provided detailed responses to all the comments and questions point-by-point. Specifically, we

provide the datasheet of the NoRa dataset (W2 for bpvC)
discuss the relationship and differences with related works (Q1 for bpvC, W2 for qSQZ, Q1 for ajBs)
discuss noisy rationales in real scenarios (L2 for bpvC, W1, Q1 for qSQZ)
clarify the evaluation metric (W2 for HgRq, Q2 for qSQZ)
clarify the technical details of the CD-CoT method (W3 for HgRq, Q3 for ajBs)
conduct additional experiments with CD-CoT (W1, Q2, Q3 for HgRq, Q5 for qSQZ)
further explain the empirical results and findings (Q3, Q4 for qSQZ, Q2 for ajBs)
provide a detailed revision plan for improving the presentation (W1 for bpvC), which will be implemented in submission.

Besides, in the extra one-page PDF file, we provide the datasheet of NoRa (Table 1), examples that noisy inputs can lead to noisy outputs (Table 2), and examples of denoised results under high-noise settings (Table 3).

Regarding W1 for reviewer HgRq and Q5 for reviewer qSQZ, in the following, we further discuss and empirically verify the generalization ability of the proposed CD-CoT method to other datasets.

Recall that the constructed NoRa benchmark covers five prevailing datasets from three different domains. The current LLMs present significant vulnerability to noisy rationales in all five datasets of NoRa, while the proposed method CD-CoT has shown advanced and consistent robustness against noisy rationales. Specifically,

Robustness with different datasets: CD-CoT consistently outperforms other methods in all five datasets in NoRa.
Robustness with different noise levels: The results shown in Tab. 7 demonstrate the remarkable robustness of CD-CoT to varying noise levels. Across the Math, Symbolic, and Commonsense tasks, the performance decline of CD-CoT remains modest as the noise level increases.
Robust with different LLMs: The results in Tab. 9 further indicate that CD-CoT substantially improves over all three other baselines on the more powerful LLMs. Even on the relatively smaller Mistral-8x7B, CD-CoT significantly outperforms the other baselines on most tasks.

Note that Section 3 introduces a general framework for generating noisy rationales with existing datasets. This means more datasets can be integrated into NoRa if necessary for future research. Here, we conduct additional experiments to generalize CD-CoT to three new datasets that are not covered in NoRa, including GSM-8K, Blocksworld, and BIG-Bench Hard Dyck Languages. Specifically, GSM-8K: A math dataset of linguistically diverse grade school math word problems. Blocksworld: A planning dataset simulating block stacking tasks. BIG-Bench Hard Dyck Languages: A symbolic dataset designed for predicting the sequence of closing parentheses in a Dyck-4 word.

Then, we generate noisy rationales and compare the following setups: Zero-shot: Base model with no demonstration. CoT (clean rationales): Base model with 3 clean demonstrations. CoT (noisy rationales): Base model with 3 noisy demonstrations. CD-CoT (noisy rationales): Base model with 3 noisy demonstrations and our CD-CoT method.

Dataset	Zero-shot	CoT (clean rationales)	CoT (noisy rationales)	CD-CoT (noisy rationales)
GSM-8K (300 questions)	84.3	87.7	84.3	86.0
Blocksworld (200 questions)	2.0	25.0	13.0	25.5
BIG-Bench Hard Dyck Languages (250 questions)	12.4	40.8	29.2	35.2

The reasoning accuracy in the table above shows that CD-CoT consistently outperforms the zero-shot setting and standard CoT prompting under noisy rationales. This is consistent with the findings in our submission, showing CD-CoT’s strong capability of generalization to new datasets. These empirical results and discussions will be included in the revision.

Lastly, we would appreciate all reviewers again. Would you mind checking our response and confirming whether you have any further questions? We are anticipating your feedback during the discussion period!

最终决定Accept (poster)

2024-09-25

This paper addresses an important and under-explored challenge in large language models - their vulnerability to noisy rationales in chain-of-thought prompting. Overall, reviewers agree that It makes several valuable contributions:

Introducing the NoRa dataset to evaluate LLM robustness to noisy rationales
Revealing widespread vulnerability of LLMs to such noise
Proposing the CD-CoT method to improve denoising and reasoning capabilities
The extensive experiments demonstrate the effectiveness of CD-CoT compared to baselines.

During rebuttal, authors successfully addressed several concerns including whether findings generalize to other datasets, whether the noisy rationales are applicable to real scenarios. Overall, this is a strong paper that makes novel contributions to an important problem in LLM reasoning. The AC recommends acceptance.