PaperHub
Overall rating: 5.5 / 10
Poster · 4 reviewers
Ratings: 5, 5, 6, 6 (min 5, max 6, std 0.5)
Confidence: 3.8
Correctness: 2.8
Contribution: 2.3
Presentation: 3.0
TL;DR

MR-Ben is a comprehensive process-oriented evaluation benchmark comprising 5,975 questions that cover various subjects, ranging from natural science to coding, challenging LLMs to score the reasoning steps of candidate solutions.

Abstract

Keywords
Large Language Models;Reasoning;System-2;Slow Thinking;Resource and Evaluation;Analysis and Interpretability

Reviews and Discussion

Review (Rating: 5)

The paper introduces MR.BEAN, a benchmark designed to evaluate the meta-reasoning capabilities of large language models (LLMs). This benchmark focuses on the models' ability to detect and correct errors in reasoning steps, addressing the limitations of existing outcome-based benchmarks.

Strengths

  • The paper is well-articulated, providing clear examples that effectively demonstrate the dataset's composition and intended utility.
  • This work contributes a new benchmark for identifying and correcting errors in reasoning steps rather than just the final outcomes.
  • The benchmark covers diverse topics and is paired with reasonable metrics.
  • A comprehensive set of experiments is conducted that reveals performance disparities among various LLMs.

Weaknesses

  • The ACC_reason metric shows limitations in consistency. Its dependency on the judgments of different LLMs or human evaluators could lead to variability in scoring.
  • The weights assigned to different metrics might need recalibration when new models are tested or when the validation set is updated.
  • A detailed report on each metric's individual performance and impact is lacking.

Questions

  • Since the MR metric is newly proposed, would it be reasonable to incorporate human evaluations to support its credibility and relevance? Comparing the model outputs against human judgments on at least a sampled subset could provide empirical evidence of how well the MR metric aligns with human reasoning and offer insights into the absolute performance of models.

Limitations

n/a

Author Response

We truly appreciate your kind review and insightful questions. We are more than happy to address them as follows:

W1: The ACC_reason metric’s dependency on the judgments of different LLMs or human evaluators could lead to variability in scoring

We would like to argue that due to the careful design of our evaluation mechanism, the automatic scoring of error reasons is both robust and economically feasible:

  • Multiple annotators: During the annotation stage, we collected multiple annotations for the first error reasons and potential error rectification from different annotators who agreed on the solution's correctness and the first error step.

  • Proxy Model Evaluation: Based on the ground truth annotations collected from various perspectives, the proxy language model (e.g., GPT-4-Turbo) then examines the error reasons provided by evaluating models. Given the question/solution pair and information regarding the first error step, error reasons, and rectification, the potential flaws of the error reasons provided by the evaluating models will be easy to diagnose under contrast.

  • ACC_reason robustness: Below is the scoring of error reasons sampled from our evaluation results. For the same set of error reasons collected in each subject, three different models made predictions on correctness/incorrectness. Their predictions are clearly consistent across all subjects. Since the MR-Score is a weighted metric, the resulting variability in the final score is less than 1 percent in total.

| Model | Coding | Physics | Biology | Math | Medicine | Chemistry | Logic |
|---|---|---|---|---|---|---|---|
| gpt-4-turbo | 83/55 | 137/15 | 164/11 | 305/46 | 194/25 | 166/27 | 192/16 |
| deepseek_coder | 100/38 | 145/7 | 167/8 | 321/30 | 200/19 | 172/21 | 193/15 |
| Qwen2-72B | 99/39 | 142/10 | 167/8 | 312/39 | 195/24 | 172/21 | 200/8 |
  • Agreement Rate: As mentioned in lines 207-209, the agreement rate between manual annotations and the GPT-4 predictions over 100 samples randomly collected from all subjects is 92%. This high agreement rate also supports the reliability of our evaluation and therefore avoids manual annotation of roughly 138,000 responses (a benchmark size of 6,000 times 23 evaluated models).

W2: The weights assigned to different metrics might need recalibration.

We would like to clarify that, for consistency consideration, the weights assigned to different sub-metrics are not supposed to be recalibrated in the future, even when new models are tested. This is because:

The discriminative ability of the final MR-Score is not sensitive to the weights: We performed a comprehensive grid search over the 23 models we evaluated. The results show that, even with this large model coverage, the variance of MR-Scores across models (which reflects the differentiability of the MR-Score) changes little across different weight combinations. We therefore considered both the difficulty levels of the three subtasks and their progressive nature, and selected the weighting schema that assigns increasing weights to solution correctness prediction, first error step determination, and error reason explanation.

The current weighting ratio strikes a good balance between interpretability and differentiation: Traditional reasoning accuracy assigns similar scores to SOTA LLMs. For example, GPT-4-Turbo, Deepseek-v2-236B, and Mistral-Large achieve 86.4%, 78.5%, and 81.2% respectively on MMLU but score 43.2%, 29.4%, and 21.3% in our benchmark. The widened performance gap demonstrates our benchmark's stronger differentiability.
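To make the weighting discussion above concrete, here is a minimal sketch of how a weighted MR-Score could be assembled from the three sub-metrics. The weight values are illustrative placeholders that only respect the increasing-weight ordering described above; they are not the paper's actual coefficients.

```python
def mr_score(mcc: float, acc_step: float, acc_reason: float,
             weights=(0.2, 0.3, 0.5)) -> float:
    """Weighted combination of the three sub-metrics into one MR-Score.

    The weights are illustrative placeholders that follow the ordering
    described above (correctness < first error step < error reason);
    they are not the paper's actual values. MCC lives in [-1, 1], so this
    sketch clips negative values to 0 before weighting (an assumption).
    """
    w_mcc, w_step, w_reason = weights
    return w_mcc * max(mcc, 0.0) + w_step * acc_step + w_reason * acc_reason


# Example: a model that predicts correctness decently but explains errors poorly.
print(mr_score(mcc=0.55, acc_step=0.30, acc_reason=0.15))  # -> 0.275
```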

W3: Missing detailed report on each metric's individual performance.

We apologize for not including detailed sub-task performance tables due to limited space. We agree that this information is beneficial for interpreting model behaviors. Since the rebuttal is limited to 6,000 words, we have moved the sub-task performance tables into the PDF posted under the global reply section of this rebuttal. Please kindly refer to the tables there for more information.

Q1: Incorporate human evaluations to support the credibility and relevance of the MR metric.

As detailed in lines 207-209, we have reported a 92% human-model agreement rate on the error reason scoring. Below are the exact details of our setup: We randomly collected 100 data instances, across all subjects, where the evaluating model correctly identified the solution correctness and the first error step. We then manually examined whether the proxy scoring model (e.g., GPT-4-Turbo-2024-04-09) correctly scored the error reasons of the evaluating models. Below is the per-subject breakdown of the ratio of cases in which the authors agree with the proxy scoring model:

| Coding | Physics | Biology | Medicine | Chemistry | Logic | Math |
|---|---|---|---|---|---|---|
| 7/8 | 12/13 | 21/21 | 12/12 | 15/17 | 15/16 | 10/13 |

The annotation time varies significantly across subjects: some problems, such as coding and chemistry, can take more than 10 minutes to evaluate, while subjects like biology are easier to evaluate. We sincerely hope the author-model agreement rates and the agreement table among the three different models above can address your concerns.

Comment

Thanks for your rebuttal, and I encourage you to include the further clarification and results in your work.

Comment

Dear reviewer JiJF: Thanks for your kind encouragement. We will make sure to include the clarifications prompted by your comments in our work. If you believe your concerns have been addressed by our explanation, we would be very appreciative if you could kindly consider updating your rating. We wish you a wonderful day ahead.

Sincerely,

The Authors

Review (Rating: 5)

This paper introduces a new benchmark for evaluating the reasoning capabilities of large language models (LLMs). Current methods primarily focus on final outcomes and do not sufficiently capture the intricacies of the reasoning process. To address this issue, this paper proposes MR.BEAN, a process-based benchmark that demands meta-reasoning skills from LLMs, requiring models to identify and analyze potential errors in automatically generated reasoning steps.

The authors conducted an extensive analysis of various LLMs using MR.BEAN, revealing significant limitations and previously unidentified weaknesses in their reasoning abilities. They found that while many LLMs can generate correct answers, they struggle to pinpoint and correct errors in the reasoning process. The paper also discusses the potential for improving reasoning abilities through techniques like the use of high-quality synthetic data.

Strengths

  1. The paper introduces a novel benchmark, MR.BEAN, which focuses on meta reasoning—a higher-order thinking skill. This design pushes beyond traditional outcome-based evaluations to assess the reasoning process itself. And MR.BEAN covers a wide range of subjects, including physics, chemistry, logic, coding, and more.
  2. The benchmark's questions and error analyses are curated and annotated by human experts, ensuring a high level of quality and relevance in the evaluation process.
  3. The paper's evaluation of various LLMs reveals previously unidentified weaknesses in their reasoning abilities, providing valuable insights for researchers and developers. The benchmark's application to a diverse array of models, from small to large, open-source and closed-source, provides a broad comparative analysis that can inform future development in AI reasoning.

Weaknesses

  1. The paper may not provide a thorough comparison between the model's automatic annotations and human annotations. Without such validation, it is challenging to assess the reliability and accuracy of the model-generated annotations.
  2. The concept of meta-reasoning typically involves instructing models on how to reason, which may include decision-making during the reasoning process. The paper primarily analyzes the reasoning steps after they have been generated, which might not fully align with the proactive aspect of meta-reasoning.
  3. The paper could provide more transparency regarding the annotation process, such as the total number of annotators involved, the total time spent on annotations, and the average number of samples annotated per person per day. This information is crucial for understanding the scalability and efficiency of the annotation process.
  4. The paper may not fully address how the dataset can be used to validate an LLM's reasoning capabilities on new, unseen examples. Understanding the generalizability of the findings is essential for assessing the true breadth of an LLM's reasoning skills.
  5. The paper could benefit from including specific test cases using GPT-4 to demonstrate the benchmark's effectiveness in identifying strengths and weaknesses in state-of-the-art models. This would provide a clearer picture of how GPT-4 performs against the benchmark and highlight the benchmark's utility.

Questions

See weaknesses.

Limitations

See weaknesses.

Author Response

Thank you for your kind review and insightful comments. We are committed to addressing your concerns and providing clarifications.

W1: Missing a thorough comparison between the model's automatic annotations and human annotations.

(Since all the instances in MR. Bean are annotated manually, we hereby assume that you are referring to the automatic scoring of error reasons generated by the evaluating models)

Thank you for highlighting the importance of manual examination compared to automatic evaluations. We fully agree that manual examination can often provide a higher level of trust and thoroughness. However, we have designed our evaluation mechanism to provide robust and reliable automatic annotations of the model responses.

  • Multiple annotators: During the annotation stage, we collected multiple annotations for the first error reasons and potential error rectification from different annotators who agreed on the solution's correctness and the first error step.

  • Proxy Model Evaluation: Based on the ground truth annotations collected from various perspectives, the proxy language model (e.g., GPT-4-Turbo) then examines the error reasons provided by evaluating models. Given the question/solution pair and information regarding the first error step, error reasons, and rectification, the potential flaws of the error reasons provided by the evaluating models will be easy to diagnose under contrast.

  • Agreement Rate: As mentioned in lines 207-209, the agreement rate between manual annotations and the GPT-4 predictions over 100 samples randomly collected from all subjects is 92%. This high agreement rate supports the reliability of our evaluation and therefore avoids the manual annotation of potentially 138,000 problems (6,000 benchmark sizes times 23 models evaluated).

W2: Not aligned with the proactive aspect of meta-reasoning.

Your observation about the proactive nature of meta-reasoning is insightful. Since meta-reasoning is not yet a well-defined and widely recognized concept, we follow MR-GSM8K and MR-MATH in defining meta-reasoning as an evaluation process that scores reasoning steps. This definition, which tasks evaluating models with reasoning about reasoning, differs from proactive prompting-instruction methods in the following ways:

  • Evaluation Mechanism: Even if a language model adopts a more systematic and structured reasoning paradigm to answer a question, under the proactive definition of meta-reasoning you suggest, its effect is measured only by metrics that examine the final computation results. The quality and correctness of the intermediate reasoning steps are in no way guaranteed or reflected in the final metric numbers. In contrast, our evaluation mechanism dissects the reasoning process into meticulously annotated subtasks, revealing key properties of meta-reasoning ability.

  • Teacher Role: By shifting the model's role from a student generating answers to a teacher scoring solutions, our mechanism forces the model to actively reflect on and critique conditions, assumptions, and logic, examining potential outcomes counterfactually. All of the above are essential for a more robust reasoning model. In some sense, our definition of meta-reasoning is orthogonal and complementary to the proactive reasoning paradigm you suggested.

W3: Annotation details

Based on your suggestion, we provide the annotation details as follows. We are committed to including these details in the revised version of the paper.

  • Annotator Qualification & Training: As mentioned in lines 152-157, our annotators hold a minimum of a bachelor's degree. Each annotator is required to read through the annotation guidelines listed in Appendix G before completing a trial labeling process. The selection of annotators is based on their performance on a balanced small hold-out set of problems for each subject. Every annotator is ensured to be paid above the local minimum wage rate.
  • Initial Annotation: As mentioned in section 3.4, each question was labeled by two different annotators. Inconsistencies in solution correctness or the first error step were identified and reviewed by a quality controller for arbitration.
  • Quality Control: In the final quality control phase, 10% of the problems were randomly sampled and reviewed by meta controllers (authors). The author-annotator agreement rate had to exceed 90% for annotations to be accepted.
  • Annotator Details: Each subject usually comprised 5-6 annotators with two project supervisors for quality control. The annotation process took approximately three weeks, followed by two weeks for quality control and resolving disagreements. Annotators generally handled around 20-30 questions per day, though this varied slightly depending on the difficulty level of the subject matter.

W4: Validate an LLM's reasoning capabilities on new, unseen examples.

We would like to clarify that one of the core novelties and contributions of our meta-reasoning paradigm is that it can be applied to transform any “student answering” benchmark into a “teacher scoring” benchmark. By applying our benchmark on top of well-recognized but performance-saturated benchmarks such as MMLU, LogiQA, and MHPP, we observed a substantial performance drop for SOTA models (e.g., Mistral-Large achieves ~80% accuracy on MMLU but scores 21.3 in our benchmark). This supports our point that the meta-reasoning evaluation pipeline is a challenging mechanism that probes how holistically and comprehensively language models master and apply domain knowledge. We believe this paradigm will significantly benefit the community, whether applied to existing datasets or new compilations.

Review (Rating: 6)

This paper introduces MR.BEAN, a comprehensive benchmark for evaluating meta-reasoning capabilities of large language models (LLMs). Comprising 6,006 questions across various subjects including physics, chemistry, logic, coding, and more, MR.BEAN requires LLMs to analyze and correct errors in automatically generated reasoning steps. The benchmark is meticulously constructed using a three-step annotation process involving answer correctness evaluation, error step identification, and error reason analysis.

The main contributions of this work are:

  1. A novel, large-scale benchmark for meta-reasoning evaluation covering diverse subjects and reasoning types.
  2. A rigorous methodology for creating and annotating meta-reasoning questions, ensuring high-quality data.
  3. Comprehensive evaluation of 15 LLMs on the benchmark, revealing limitations and weaknesses in their reasoning abilities.

Strengths

  1. Comprehensive and well-organized dataset: MR.BEAN covers many subjects (p.3, Table 1) and offers a broad assessment of LLM meta-reasoning capabilities across diverse domains. The paper is structured clearly, with detailed explanations of the dataset creation process, evaluation metrics, and experimental results.
  2. Novel meta-reasoning focus: By requiring LLMs to identify and correct errors in reasoning, MR.BEAN offers a unique perspective on evaluating AI reasoning capabilities, going beyond traditional outcome-based assessments.
  3. Extensive empirical study and analysis: The authors evaluate different LLMs on MR.BEAN and have tested different prompting methods to comprehensively analyze current model capabilities and reveal interesting limitations in their reasoning abilities.

Weaknesses

  1. Validation of meta-reasoning specificity: While the paper describes the evaluation metrics in detail (p.5, l.185-197), it is still not clear that such metrics measure meta-reasoning abilities rather than general language understanding or domain knowledge.
  2. Prompt sensitivity: The paper doesn't adequately address the potential impact of prompt design on the generated solutions and errors. Given that LLMs are known to be sensitive to prompt wording, this could significantly affect the nature and distribution of errors in the dataset.
  3. Lack of annotation details: While the paper mentions a three-stage annotation process (p.4, l.116-135), it lacks specific details on annotator qualifications, training, and inter-annotator agreement rates. This information is crucial for assessing the reliability and consistency of the annotations.

Questions

  1. Can you provide more details on the annotation process, including inter-annotator agreement rates and resolution of disagreements?
  2. What steps were taken to ensure that MR.BEAN specifically measures meta-reasoning abilities and not merely language understanding or domain knowledge?
  3. How sensitive is the dataset to the specific prompts for generating solutions and errors? Did you experiment with different prompt formulations, and if so, how did this affect the resulting dataset?

Limitations

Yes

Author Response

Thank you for your kind review and insightful and to-the-point comments. We are committed to addressing your concerns and providing clarifications.

W1: How does our evaluation mechanism measure meta-reasoning abilities rather than general language understanding or domain knowledge

We believe language understanding and domain knowledge are inseparable and essential components of reasoning. Well-recognized reasoning evaluation benchmarks such as MMLU, LogiQA, and MHPP require substantial language understanding and the application of domain knowledge. However, as outlined in our abstract and introduction, these benchmarks primarily use a result-oriented evaluation method rather than a process-oriented one, which can be misleading: a model may reach the correct final answer through flawed understanding or reasoning.

Our benchmark addresses this limitation by proposing a meta-reasoning framework that transforms evaluating models from the role of students generating results to the role of teachers scoring solution processes. To effectively score these processes, evaluating models must actively reflect on and criticize conditions, assumptions, and logic, and examine potential outcomes counterfactually. The above capabilities are essential for a more robust and trustworthy reasoning mechanism.

Therefore, our meta-reasoning paradigm provides a challenging yet feasible evaluation pipeline that forces models to "reason about reasoning" (i.e., score candidate solutions), which we term meta-reasoning. This paradigm aims to evaluate the reasoning in a more fine-grained approach (measured in a series of subtasks and represented under a unified metric), and to do so it indeed requires advanced language understanding and mastery of domain knowledge as you suggested.

W2: The paper doesn't adequately address the potential impact of the prompt.

We agree that language models can be susceptible to different prompting methods.

  • Response generation prompt design: We fully agree that language models are susceptible to prompt wording, which can sometimes affect the distribution of reasoning errors in an evaluation benchmark. With this in mind, our response generation prompt follows the general best practices of prompt engineering. As illustrated in Figure 10 in the Appendix, the prompt has the following key attributes: (1) a clear task description and background information, (2) persona adoption as an experienced scoring teacher, (3) a divide-and-conquer structure that splits the goal into sub-tasks, (4) allowing “time” for the model to think in a step-by-step manner, (5) clear separation of the different parts of information, and (6) specific requirements on the return format.

The final prompt presented in Figure 10 is the iterated version that accounts for prompt-wording effects and has proven effective, as explained in lines 219-222.

  • Few-Shot In-Context Learning: In section 6.1, we experimented with this method and observed performance fluctuations across models of different sizes. We suspected that lengthy demonstrations might confuse the models. To test whether length affects model performance, we conducted a Pearson correlation analysis and found a coefficient of -0.58 between model performance and question length, and -0.29 for solution length. This negative correlation supports our hypothesis.
  • Self-Reflect Prompting Method: In section 6.2, we adopted a response-examine-refine pattern. This did not significantly boost performance, as models often switched their decisions from correct to incorrect and vice versa.
  • Ground Truth Solution Correctness: We provided hints of ground truth solution correctness to see if models could better identify error steps and reasons. The positive results indicate that correct prior knowledge in the prompt can indeed affect outcomes.

Overall, while prompting methods can influence performance, our benchmark provides a robust evaluation mechanism that remains relatively stable unless influenced by explicit hints.

W3: The paper lacks specific details on annotator qualifications, training, and inter-annotator agreement rates.

Based on your suggestion, we provide the annotation details as follows. We are committed to including these details in the revised version of the paper.

  • Annotator Qualification & Training: As mentioned in lines 152-157, our annotators hold a minimum of a bachelor's degree. Each annotator is required to read through the annotation guidelines listed in Appendix G before completing a trial labeling process. The selection of annotators is based on their performance on a balanced small hold-out set of problems for each subject.
  • Initial Annotation: As mentioned in section 3.4, each question was labeled by two different annotators. Inconsistencies in solution correctness or the first error step were identified and reviewed by a quality controller for arbitration.
  • Disagreement Resolution: We did not adopt majority voting due to the objective nature of our questions. Instead, a senior supervisor reviewed all questions with disagreements to resolve ambiguities (therefore the inter-annotator agreement rate is not applicable here).
  • Quality Control: In the final quality control phase, 10% of the problems were randomly sampled and reviewed by meta-controllers (authors). The author-annotator agreement rate had to exceed 90% for annotations to be accepted.
  • Annotator Details: Each subject usually comprised 5-6 annotators with two project supervisors for quality control. The annotation process took approximately three weeks, followed by two weeks for quality control and resolving disagreements.
Comment

Thanks for your detailed response; I would like to keep my rating positive for this paper.

Comment

Dear reviewer TVR6, thanks for your kind review and insightful questions. We are happy that you find the reply helpful. We wish you all the best : )

Review (Rating: 6)

This paper proposes a meta-reasoning benchmark for evaluating the solutions generated by a large-language model (LLM) to shift the focus more to process-based evaluation of an LLM's reasoning abilities rather than outcome-based evaluation.

Evaluation: On a variety of question-solution pairs, the model is asked to score a candidate solution for correctness and to identify the first erroneous step; if these are correctly identified, the model's stated error reasoning is evaluated by GPT-4. These three are used to compute three metrics respectively: the MCC for binary classification of solution correctness, the ACC_step to judge the number of correctly predicted first-error steps relative to the total number of incorrect solutions, and the ACC_reason to judge the number of correctly predicted first-error steps and error-reasons relative to the total number of incorrect solutions. These three are combined to form an MR-score that is used to compare different models.

Dataset: The question-solution pairs (~6k in all) are formed by sampling questions from MMLU (arithmetic reasoning), LogiQA (logical reasoning), and MHPP (code-based reasoning), with solutions generated via chain-of-thought prompting by GPT-3.5-Turbo-0125, Claude-2, and Mistral-Medium. These question-solution pairs are labeled by human annotators; the annotations of solution correctness, first error step, and error reason are used for evaluation.

Strengths

  • The authors highlight the importance of more carefully evaluating the solutions generated by LLMs for complex problems rather than just comparing their accuracy, which is a notable contribution to the body of work on improving the evaluation of LLMs.
  • The annotation process followed for experimental evaluation is thorough and multi-step for quality assurance
  • The authors raise an interesting question about problem solving dynamics, which can be potentially explored further in future work on LLMs

Weaknesses

These questions/remarks might also have arisen due to my lack of proper understanding, so I am willing to increase my score if these can be clarified:

  • My main issue is with the mixing of reasoning based and accuracy based evaluation into a singular score. What is the reasoning behind combining the three metrics (MCC, ACC_step, and ACC_reason) into a single score (MR-Score)? Does this not affect the motivating insight that models might have a high MCC but a super low ACC_step and an even lower ACC_reason? Is there somewhere the three metrics can be viewed individually also so it's clearer how the models vary along the three-- right now, it is not clear to me by looking at the MR-scores in Table 2 how the 15 models actually fare when it comes to identifying the correct error-reason, especially when the MR-scores for the models are in the same ballpark.

Writing:

  • (pre-existing work): it's not clear to me what the real differentiating aspect of MR. Bean is vis-a-vis MR-GSM8K and MR-Math. MR-GSM8K and MR-Math also consider the first-error reason? So, is it that they are math-reasoning based, and MR. Bean now just combines multiple datasets to essentially follow the same evaluation pipeline? If so, this should be clearly mentioned. They don't just go a step further, that is the furthest step gone to, even considering this work (if that is indeed the case).
  • The abstract makes it seem like the MR-score is a metric designed by the authors in this work, although it's already been proposed "Through our designed metrics" which is somewhat over-selling the work.
  • (minor) typo: A.2 Negative Societal Impacts
  • The connection between task difficulty and reasoning capability can be explored further-- I am assuming ACC_reason on more difficult tasks would be comparably worse than the MCC, which would be the actual assessment of reasoning capabilities. Also, from Fig 2, it seems that Logic based tasks are the most difficult, then wouldn't it be fair to include them in the comparison too or perform the comparison of task difficulty not just based on high-school/college, but also take into consideration how the model actually interprets the task difficulty based on MR-score?

Experiments:

  • what is the main takeaway on self-refine prompting since it seems to be in contrast to existing observations? How does the shift from incorrect to correct predictions depend on the model family and task? Is it seen only across MR-Scores or also in the ACC_reason (which should probably be affected more by model size?)?
  • Is the difficulty of logic-based reasoning also because they contain the longest questions? is there any impact of question length on the model's reasoning ability, if not solution length?

Questions

Some questions for clarification:

  • I might have missed this, so in Table 2: is the meaning of k the number of shots of prompting the model gets? If so, it would help to mention that in the caption since k is not mentioned in the text anywhere else. what should be the take-away on self-correction?
  • Is the difference between the Self-Refine prompting (section 6.2) and Solution Correctness Prior (section 6.3) the absence/presence of the ground truth (resp.)? How do the two compare with each other?

Limitations

  • The proposed benchmark also relies on step-wise solutions to evaluate reasoning which reduces its applicability and novelty as compared to existing work.
Author Response

Thank you for your thoughtful review and insightful comments. We hereby address your concerns below:

W1: Mixing metrics into a singular score

Given the interdependent and progressive nature of the three tasks (MCC - ACC_step - ACC_reason), we can either combine them organically by assigning weights that consider both differentiability and interpretability, or simply report the final metric ACC_reason. Since the three metrics are complementary in revealing how models perform along each dimension, as you also suggested, we argue for the necessity of the MR-Score as follows:

  • Unified Metric: The MR-Score offers a unified and normalized metric that balances the difficulty levels of the three sub-tasks (lines 198-199). We conducted a thorough grid search to determine the weights for the sub-metrics and found that the MR-Score is insensitive to relatively minor (e.g., ~0.1) adjustments to the weightings. We therefore chose the schema that assigns the greatest weight to ACC_reason and the least weight to MCC, as the error reason most explicitly reflects the model's understanding of the rationale behind the question. We believe the current weighting ratio strikes a good balance between interpretability and differentiation: for example, GPT-4-Turbo, Deepseek-v2-236B, and Mistral-Large achieve 86.4%, 78.5%, and 81.2% respectively on MMLU but score 43.2%, 29.4%, and 21.3% in our benchmark.

  • MCC and its correlations: MCC is chosen because it effectively penalizes random or biased behaviors. Although, given the progressive nature of the three tasks, the raw scores of MCC, ACC_step, and ACC_reason are generally in diminishing order, their correlations are not overly high (corr(MCC vs ACC_step) = 0.42, corr(MCC vs ACC_reason) = 0.46). Therefore, all three metrics are necessary components of an accurate and thorough evaluation.

  • Individual metrics: Due to the space limit, we report the sub-task tables in the PDF posted under the global response section of this rebuttal; a minimal computational sketch of the sub-metrics also follows below.
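To make the sub-metric definitions concrete, the sketch below computes MCC, ACC_step, and ACC_reason from per-instance prediction records. The record fields and structure are illustrative assumptions for this sketch, not the paper's actual evaluation schema.

```python
from sklearn.metrics import matthews_corrcoef

def sub_metrics(records):
    """Compute (MCC, ACC_step, ACC_reason) from illustrative per-instance records.

    Each record is assumed to carry (field names are hypothetical):
      gold_correct - bool, whether the candidate solution is actually correct
      pred_correct - bool, the evaluating model's correctness verdict
      step_hit     - bool, first error step located correctly (incorrect solutions)
      reason_hit   - bool, error reason accepted by the proxy scoring model
    """
    y_true = [r["gold_correct"] for r in records]
    y_pred = [r["pred_correct"] for r in records]
    mcc = matthews_corrcoef(y_true, y_pred)

    # ACC_step and ACC_reason are normalized by the number of incorrect solutions.
    incorrect = [r for r in records if not r["gold_correct"]]
    acc_step = sum(r["step_hit"] for r in incorrect) / len(incorrect)
    acc_reason = sum(r["step_hit"] and r["reason_hit"] for r in incorrect) / len(incorrect)
    return mcc, acc_step, acc_reason
```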

W2: Differences among MR.Bean, MR-GSM8K and MR-Math.

Our work builds upon previous efforts and introduces several key improvements.

  • Extensive domain coverage: By recruiting and training a diverse group of domain experts as annotators and applying meticulous supervision and quality control, we extended MR.Bean into a comprehensive benchmark covering coding, logic, medicine, and science in addition to math, providing a broader evaluation spectrum.

  • Larger scale and increased difficulty: While MR-GSM8K and MR-Math focus on primary and high school math competition problems, MR.Bean raises the difficulty to the high-school-to-graduate level. MR-GSM8K and MR-Math are both limited in dataset size, whereas MR.Bean is 199.1% and 1,195% larger, respectively. The larger scale and difficulty diversity both contribute to a more robust evaluation.

  • Rigorous and fine-grained annotations: MR-Math only considered solution correctness and the first error step. MR-GSM8K additionally annotated the first error reason. However, each problem in MR-GSM8K only contains annotations of a single solution and a single error reason. In MR.Bean, each problem is mapped to three solutions sampled from different SOTA models and the error reasons are provided by multiple annotators who agreed on solution correctness and the first error step. Additionally, we annotate the revisions of the first error step. The revisions are ultimately integrated into the error reasons used by proxy LLMs as a reference to score the error reasons generated by evaluating models.

W3: Length and difficulties

The table below shows the zero-shot average MR-Scores of SOTA models per subject, sorted in ascending order of question length (measured in number of words). Other question statistics can be found in Table 1 of the paper. We can indeed observe a negative correlation between performance and question length.

| Model | Math | Chemistry | Biology | Physics | Medicine | Coding | Logic |
|---|---|---|---|---|---|---|---|
| (question length) | 44.3 | 48.1 | 56.3 | 66.6 | 88.7 | 140.1 | 154.8 |
| mistral-large | 21.53 | 24.49 | 21.48 | 24.27 | 16.34 | 21.8 | 15.1 |
| deepseek-chat | 32.18 | 32.52 | 29.97 | 32.44 | 26.54 | 34.18 | 23.58 |
| gpt-4-turbo | 44.28 | 41.71 | 44.77 | 42.54 | 38.89 | 50.99 | 30.98 |

For further quantitative investigation, we conducted a Pearson correlation analysis (an illustrative computation is sketched below):

  • Question Length: We found a Pearson correlation coefficient of -0.58 between question length and model performance, indicating a relatively strong negative correlation.
  • Solution Length: The correlation coefficient is -0.29, showing a lesser but still notable effect.

However, it is important to note that the difficulty of the subject question is a complicated factor and not solely dependent on the question length. For example, the difficulty of logic questions extends beyond their length to their inherent abstractness and the need for commonsense and real-world understanding, since its questions are sourced from the LogiQA dataset, originally collected from the Civil Service Entrance Exam (for case demonstrations, please refer to Appendix E-6).
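As a quick illustration of the correlation analysis, the sketch below computes a Pearson coefficient from the per-subject averages in the table above (question length vs. GPT-4-Turbo's zero-shot MR-Score). This is a toy subject-level calculation for demonstration only; it is not the instance-level analysis that produced the -0.58 figure.

```python
from scipy.stats import pearsonr

# Per-subject averages copied from the table above (Math ... Logic).
question_length = [44.3, 48.1, 56.3, 66.6, 88.7, 140.1, 154.8]
gpt4_turbo_mr   = [44.28, 41.71, 44.77, 42.54, 38.89, 50.99, 30.98]

r, p = pearsonr(question_length, gpt4_turbo_mr)
print(f"Pearson r = {r:.2f} (p = {p:.2f})")  # subject-level trend only
```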

Comment

W4: Takeaways on self-refine prompting

The self-refine prompting experiment was designed to unveil if LLMs are capable of discovering their own reasoning flaws and effectively rectifying them. The result was indeed intriguing and therefore we have decomposed the behavior of models across tasks in Appendix D and visualized it in Figure 9.

We summarize our observation below:

  • Small Models like Gemma-2B are too limited to perform effective self-reflection.
  • Competent Models like GPT4-Turbo are confident in their initial decisions, hardly switching the decisions during self-reflection.
  • Intermediate Models like Llama3-70B exhibit substantial changes during self-reflection, indicating a lack of consistency in their decisions. However, their switches from incorrect to correct decisions happen significantly more often when locating the first error step than when judging solution correctness or explaining the error reason, which boosts the overall MR-Score by a large margin. We believe this lack of consistency does not necessarily indicate a more robust or advanced reasoning ability, despite the improved evaluation results.

Conclusion: Our results support the observation that LLMs generally lack effective self-refinement capabilities [1][2].

Ref:

[1] Large Language Models Cannot Self-Correct Reasoning Yet. 2024

[2] LLMs cannot find reasoning errors, but can correct them given the error location. 2024

W5: Clarifications

  • Clarification of 'k': 'k' as in ‘k-shot’, represents the number of demonstrations in a prompt.
  • Typos: Thank you for pointing out the typo in Appendix A2.

We will correct them in the revised manuscript.

Q: Difference between Self-Refine and Solution Correctness Prior

Yes, the difference is indeed the absence/presence of the ground truth, specifically:

  • Self-Refine: LLMs are asked to follow a three-step reasoning process: they first answer the question directly, then self-critique their own response, and finally generate a refined response based on the original response and the critique (a schematic sketch follows below).
  • Solution-Correctness Prior: The information that the provided solution is incorrect is included as part of the input prompt. LLMs are only asked to identify the first error step and explain the reason for it.
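For clarity, here is a schematic of the response-examine-refine pattern described above. The llm() callable and the prompt wordings are hypothetical placeholders, not the paper's actual prompts.

```python
def self_refine(llm, question, candidate_solution):
    """Schematic response-examine-refine loop; prompts are paraphrased placeholders."""
    initial = llm(
        "Grade this candidate solution step by step, identify the first error step "
        f"(if any), and explain the error.\nQuestion: {question}\nSolution: {candidate_solution}"
    )
    critique = llm(
        "Critique your own grading. Point out anything you may have missed "
        f"or judged incorrectly.\nYour grading: {initial}"
    )
    refined = llm(
        "Produce a final, refined grading that takes your critique into account.\n"
        f"Original grading: {initial}\nCritique: {critique}"
    )
    return refined
```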

Limitation: The proposed benchmark relies on stepwise solutions to evaluate reasoning which reduces its applicability and novelty as compared to existing work.

We would like to clarify that our evaluation mechanism ensures the robustness of the process. LLMs are tasked to determine the solution's correctness, first error step, and error reason. Even if the model made correct predictions on the solution correctness and first error step via a flawed reasoning process, such a process will generally lead to incorrect/incomplete error reasons.

When the proxy scoring language models (e.g., GPT-4-Turbo) are presented with the question/solution pair and the detailed error reasons provided by several annotators from different perspectives, flawed error reasons generated by the evaluating models are easy to diagnose by contrast. This is supported by the high author-model agreement rate (92%, as written in line 208) in the automatic error reason scoring process.

Comment
  • Based on the attached pdf with the individual scores and the authors' explanation, I am convinced the MR score rightly captures process-based evaluation as well (e.g., it shows a higher MR score when the ACC_reason score is higher whereas MCC is much lower). This was my biggest concern.
  • However, based on the authors' response, now it is clearer to me that MR.Bean is indeed an incremental improvement over MR-GSM8K and MR-Math, employing the same metric and the same evaluation protocol on a larger dataset. In my opinion, that does not necessitate a new paper. The authors have also not edited the abstract that says "Through our designed metrics" to reflect that the MR score is a pre-existing metric which they use. Even so, I have already provided a positive score, which I cannot increase.
Comment

Dear Reviewer SioJ:

We are happy to see that your biggest concern has been addressed. Since we are not allowed to edit the submitted paper during the rebuttal period, we will make sure to include that in the next iteration of our paper. We are truly grateful for your kind review and detailed comments!

Wish you a great day ahead : )

Sincerely,

Authors

Author Response

Dear PCs, SACs, ACs and reviewers:

We sincerely appreciate your thoughtful review and insightful comments. We have tried our best to address your concerns one by one in the corresponding rebuttal sections. If our answers satisfy your queries, we would be grateful if you could consider revising your final rating to a higher score.

PDF for Subtasks Performance Table

Attached is a PDF with the breakdown performance table for all models on all four metrics (MR-Score, MCC, ACC_step, and ACC_reason), which we could not include in the individual rebuttal sections due to the character limit. This table should bring some insight into our design choices for the MR-Score and the subtask metrics:

  1. Metric Robustness: Due to the progressive nature of the definitions of our subtasks (i.e., the success of each subsequent task depends on the previous ones), we see a diminishing trend in the scores of MCC, ACC_step, and ACC_reason. However, thanks to the design of our evaluation mechanism and metrics, the score rankings of different models stay in a relatively stable order across metrics. In other words, we have not observed any model that excels at determining solution correctness (high MCC) but is unable to explain the rationale behind it (low ACC_reason).

  2. Task Difficulties: As shown in the breakdown table, the ACC_reason metric is more discriminative than the MCC metric for competent models, and vice versa for less competent ones. This aligns with our intuition that more difficult questions are more discriminative for strong candidates, while weaker models are simply incapable of solving them at all. This phenomenon could partly explain why the MR-Score is generally not very sensitive to minor changes in the weightings assigned to the subtasks, since the differentiability of the subtask metrics tends to reconcile across scenarios.

  3. Differentiability and Interpretability: The weights of the MR-Score were ultimately decided by considering both discriminative ability and interpretability. To best differentiate models, we conducted a thorough grid search to investigate the impact of the weightings (a minimal sketch of this procedure follows below). Since the grid search returned several near-optimal weightings, we deliberately selected the one that assigns higher weights to the more difficult tasks. We believe the current weighting ratio strikes a good balance between interpretability and differentiation: for example, GPT-4-Turbo, Deepseek-v2-236B, and Mistral-Large achieve 86.4%, 78.5%, and 81.2% respectively on MMLU but score 43.2%, 29.4%, and 21.3% in our benchmark.
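For illustration, here is a minimal sketch of the kind of grid search described above: enumerate normalized, increasing weightings, score each model, and compare the across-model variance of the resulting MR-Scores. The per-model sub-metric triples and the grid granularity are made-up placeholders, not the paper's actual numbers.

```python
import itertools
import statistics

# Made-up per-model sub-metric triples (MCC, ACC_step, ACC_reason) for illustration.
models = {
    "model_a": (0.60, 0.40, 0.30),
    "model_b": (0.45, 0.25, 0.15),
    "model_c": (0.20, 0.10, 0.05),
}

def mr_score(mcc, acc_step, acc_reason, w):
    return w[0] * max(mcc, 0.0) + w[1] * acc_step + w[2] * acc_reason

candidates = []
for w in itertools.product([i / 10 for i in range(11)], repeat=3):
    # Keep only weightings that sum to 1 and increase across the three subtasks.
    if abs(sum(w) - 1.0) > 1e-9 or not (w[0] <= w[1] <= w[2]):
        continue
    scores = [mr_score(*triple, w) for triple in models.values()]
    candidates.append((statistics.pvariance(scores), w))

candidates.sort(reverse=True)
print(candidates[:3])  # weightings yielding the highest across-model variance
```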

Hope this table will help clarify some of your concerns regarding model performance and metrics design.

Wishing you all the best,

Sincerely,

The Authors

Final Decision

This paper proposes Mr.Bean, a comprehensive benchmark that evaluates LLMs' capabilities of identifying incorrect reasoning steps rather than focusing on the accuracy of the final results. Compared to previous works sharing similar ideas, Mr.Bean's major contribution is that it significantly broadens the coverage and increases the scale. The experiments are comprehensive and reveal interesting limitations of many strong open-source models.

The response has addressed most of the reviewers' concerns. A major concern, raised by SioJ and shared by me, is the lack of acknowledgement of previous works. The overall sentiment leans positive, and I recommend that the paper be accepted. The authors will want to revise the framing in the abstract and intro to appropriately credit MR-GSM8K and MR-Math.

At least one review is discounted due to quality and lack of participation in discussion.