COLD: Causal reasOning in cLosed Daily activities
We present a new perspective on causal reasoning in NLP by using a closed system (defined by real-world commonsense activities) and propose a framework with underlying causal graphs. The causal queries help validate causal reasoning in LLMs.
Abstract
Reviews and Discussion
This paper proposes the COLD (Causal reasOning in cLosed Daily activities) framework, aiming to bridge the gap between open-ended causal reasoning and symbolic representation-based question answering.
The framework leverages human understanding of daily real-world activities to reason about the causal nature of events. The authors create a large set of causal queries and evaluate multiple Large Language Models (LLMs) on these queries.
The findings show that causal reasoning is challenging for LLMs, even for activities considered trivial for humans. The authors also explore the causal reasoning abilities of LLMs using the backdoor criterion.
The key contributions of this work are the development of the COLD framework, the creation of a substantial number of causal queries, and the evaluation of LLMs' performance on causal reasoning tasks. The findings highlight the need for further analysis using real-world events to properly validate LLMs' understanding of causality.
Strengths
- The COLD framework effectively bridges the gap between open-ended causal reasoning and symbolic representation-based question answering, utilizing human understanding of daily activities as a solid foundation.
- The paper addresses the crucial issue of evaluating LLMs' causal reasoning capabilities, emphasizing the significance of investigating and validating their intellectual capabilities.
- The paper is well-written, with clear explanations, logical flow, and concise language, ensuring effective communication of key points.
- The evaluation of multiple LLMs on a large set of causal queries reveals limitations in their causal reasoning abilities, while the exploration using the backdoor criterion provides valuable insights into causal strength between events.
Weaknesses
Generally, I believe this paper makes good contributions. However, there are some minor issues that need to be addressed:
- Since the queries are mostly automatically generated, it is necessary to support them with human annotations or expert evaluations in order to confirm the reliability of the generated queries.
- The observational graphs are created through human annotations, which limits their capacity to cover a wide range of concepts. It would be better to discuss automated approaches for constructing such graphs in order to facilitate large-scale causal benchmarking.
- Can the synthesized queries be used for fine-tuning? I am interested in whether splitting the observation graphs into different sets and training them on queries synthesized from the training graphs would significantly improve performance. This could greatly enhance the comprehensiveness of the paper.
- Additionally, another set of baselines focusing on zero-shot commonsense question answering should be evaluated as well. It would be interesting to see whether transformations from commonsense knowledge bases can benefit causal reasoning tasks. I recommend checking these papers for reference.
- Ma, K., Ilievski, F., Francis, J., Bisk, Y., Nyberg, E., & Oltramari, A. (2021, May). Knowledge-driven data construction for zero-shot evaluation in commonsense question answering. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, No. 15, pp. 13507-13515).
- Wang, W., Fang, T., Ding, W., Xu, B., Liu, X., Song, Y., & Bosselut, A. (2023, December). CAR: Conceptualization-Augmented Reasoner for Zero-Shot Commonsense Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 13520-13545).
- Kim, Y. J., Kwak, B. W., Kim, Y., Amplayo, R. K., Hwang, S. W., & Yeo, J. (2022, July). Modularized Transfer Learning with Multiple Knowledge Graphs for Zero-shot Commonsense Reasoning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 2244-2257).
- Lastly, there are some grammatical typos that need to be corrected. For example, "Language" in the caption of Table 3 should not be capitalized.
Questions
Please refer to the weakness section for my questions.
Limitations
The authors have dedicated a section explicitly discussing the limitations of their proposed COLD framework and offering potential solutions. The in-depth discussion contributes significantly to the paper, enhancing its overall quality.
We thank you for your detailed and insightful review, pointing towards some suitable directions to make our work more impactful. We would like to address the raised concerns below:
- We want to mention that the human/expert annotations were done to construct the underlying causal graph, capturing the causal relationships between the events. Previously, works like Jin et al. have released causal queries (https://huggingface.co/datasets/causalnlp/corr2cause, https://huggingface.co/datasets/causalnlp/CLadder) generated from an underlying causal graph; such queries require an understanding of causal inference theory to be annotated purely by humans. In our work, by contrast, the commonsense nature of the proposed framework is the primary strength for bridging the gap between current causal reasoning benchmarks/datasets in NLP. Please also refer to our detailed comment on human evaluation in the response to Reviewer iBzV, where we explain the challenges of human evaluation.
- The advantage of using a human crowd-sourced dataset is the marginalization we obtain when the same activity is written by different humans. We would like to highlight that the real-world grounding, which is the primary strength of this work, comes from the observational graphs that are derived directly from humans (capturing the commonsense knowledge acquired by humans). When creating a benchmark or evaluation criteria, it becomes imperative to consider human knowledge. Using the constructed graphs, the proposed design scheme helps sample an enormous number of causal queries, which essentially facilitates large-scale causal benchmarking.
- We thank you for pointing this out. We agree that it would be interesting to explore fine-tuning over the created samples to observe the model’s behavior. We highlight the same in the future directions of our work in lines 354-357: “In the future, it would be interesting to sample trajectories from the observational distribution to create a training dataset and check if causal reasoning ability can be acquired by language modeling objectives (including other variants like presented in Lampinen et al. (2023)).” Some prior works like Corr2Cause (Jin et al., 2024) have considered training/fine-tuning over the symbolic dataset and have observed no generalization, claiming that LLMs fail to robustly acquire causal reasoning skills in out-of-distribution settings.
- We thank you for pointing out these interesting resources. In fact, we found the strategy we formulated for zero-shot evaluation to be similar to the one used in the first resource (Ma et al.): the formulation in Equation 1 of Ma et al.'s work [https://doi.org/10.1609/aaai.v35i15.17593] is similar to ours, the difference being that multiple tokens are scored in their setup compared to simple option-id prediction in our formulation (an illustrative sketch of option-id scoring is given after this list). The remaining two resources (Wang et al. and Kim et al.) are somewhat different, relying on commonsense knowledge bases (knowledge graphs), and would require more work to evaluate in our causal reasoning setting. We agree that another set of baselines focusing on zero-shot commonsense question answering will be a good direction to explore in the future.
- We thank you for pointing out the typos; we will fix them in the camera-ready version of the paper.
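For readers unfamiliar with option-id prediction, the following is a minimal, hypothetical sketch (not the authors' released code) of how such zero-shot MCQ scoring can be implemented with a Hugging Face causal LM: the model scores the option identifiers ("1" vs. "2") as the next token after the prompt, and the higher-probability identifier is taken as the prediction. The model name and prompt wording are illustrative assumptions.

```python
# Hypothetical sketch of zero-shot MCQ evaluation by option-id prediction;
# the model name and prompt template are placeholders, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM; placeholder choice
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name)
lm.eval()

def predict_option(premise: str, choice1: str, choice2: str, question: str) -> int:
    prompt = (
        f"Premise: {premise}\n"
        f"Question: Which of the following is the {question}?\n"
        f"1. {choice1}\n2. {choice2}\nAnswer:"
    )
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_token_logits = lm(ids).logits[0, -1]  # distribution over the next token
    probs = torch.softmax(next_token_logits, dim=-1)
    # compare the probability mass placed on the two option identifiers
    p1 = probs[tok(" 1", add_special_tokens=False).input_ids[-1]]
    p2 = probs[tok(" 2", add_special_tokens=False).input_ids[-1]]
    return 1 if p1 > p2 else 2

print(predict_option("go to store and buy cake mix",
                     "come home with the ingredients", "go to kitchen", "effect"))
```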
Dear Reviewer LejL,
Thanks again for helping review this paper! Since we are approaching the end of the author-reviewer discussion period, would you please check this author response regarding your concerns? We really appreciate it!
Best, AC
This paper proposes causal reasoning in closed daily activities, which combines two lines of causal reasoning work: open-ended causal reasoning via causal commonsense reasoning, and symbolic representation-based question answering with theoretically backed-up analysis. By creating a dataset containing about 8 million queries, it can estimate the causal reasoning performance of pretrained language models. Moreover, the paper introduces theories of causal inference (such as backdoor adjustment) to conduct in-depth analyses of causal reasoning in pretrained language models.
Strengths
- The number of constructed dataset samples is large.
- The uses of ATE and backdoor adjustment are novel.
- The analyses of causal reasoning in pretrained language models are thorough and in-depth.
Weaknesses
- The paper is not well presented; readers might be confused when reading it sequentially.
- Considering the dataset itself, there are no differences between COLD and COPA. COPA may even have better quality.
- The dataset is limited to six scenarios, and many of them are just paraphrases of the original ones.
- Lack of human performance as a reference.
Questions
- In lines 78-80, you give an event pair that could be erroneously treated as causally related by humans. What is the difference between this event pair and the correct cause-effect pairs in your dataset? Most of the causal relations in your dataset are about causation of necessity.
- Even though your dataset is in closed daily activities, the final version of the queries does not reflect this.
- Why is the definite setting (COLD) better than the plausible setting (COPA)? I think causality is essentially a question of plausibility; there is no absolute causal relationship between two events.
- How would humans perform on this dataset? Can humans obtain the answer by just looking at the temporal relationship between events? How many queries can be answered correctly from a statistical perspective (not from the perspective of LMs)?
Limitations
None
Thank you for providing your insightful comments. Please find the response to the clarifications below.
- Given the wider scope of this work, we agree that the paper might have become too dense, affecting the presentation quality. We would love to hear more feedback from you to incorporate in the final version of our paper to improve the presentation quality, focusing on the reading sequence for better comprehension.
- We would like to respectfully disagree regarding the difference between COPA and the proposed dataset. COPA (with only 1000 samples) is a more open-ended dataset, and its causal queries can be of any generic domain. However, the dataset proposed in our work has an enormous number of causal queries (8.3M in total), covering all the aspects of a particular activity (i.e., not open-ended), which essentially facilitates a more rigorous analysis (coming close to simulating the mini-Turing test) and is hard to beat by memorization alone. Thanks for pointing this out; the table shown in the paper was meant to help readers see the similarities between the datasets, and we will add a more detailed comparison stating the differences between the datasets in the camera-ready version of the paper.
- We have already highlighted the limited set of activities in the limitations section, and we agree that it would be good to explore more real-world activities in the future. However, the number of causal queries that can be sampled is enormous, and they are not simply paraphrases of each other. Every scenario is different and has a different set of texts describing the events in that particular activity. The paraphrased versions refer to the same action/event described by different humans with varying granularity, written independently by the crowdsource workers. For example, when describing an event like “preheating the oven”, some people will write “preheat oven to 350 degrees”, whereas others will write it in detail, “preheat the oven by switching on the oven and changing the temperature setting to the required temperature”. We agree that using the term “paraphrased” for such cases may not be appropriate; we thank you for pointing this out, and we will update the details accordingly in the paper. These different descriptions of the same event enhance the dataset quality by a significant margin, providing a much more robust evaluation of LLMs. Other benchmarks keep only one version, which often limits rigorous evaluation.
- Please refer to our detailed comment on human evaluation in the response to Reviewer iBzV.
Response to Questions/Clarifications:
Q1) With the examples provided in lines 78-80, we wanted to highlight the importance of the counterfactual reasoning required to reason causally between events, where one not only needs to consider the temporal order of events but also needs to think counterfactually, imagining an alternate universe in which the event does not occur and asking whether the other event would still occur. In the example, the event “waiting at the luggage belt” may not occur even if the person has “boarded the flight” but “not checked in the luggage”; hence boarding the airplane is not a direct cause of “waiting at the luggage belt”. This marks the third rung of the causal ladder proposed by Pearl et al., which is also considered a core feature of intelligence (Penn and Povinelli, 2007; Pearl and Mackenzie, 2018) [lines 23-24].
Q2) The final version of the dataset assumes the activity to happen (this is also provided as context in the evaluation prompts). Moreover, the marginalization of pre- and post-activity events makes it a closed system for rigorous analysis, where all the unobserved variables (Figure 1, represented by U, i.e., the intention to perform a task) take care of separating out the causal effect from events that may occur in the rest of the world. This setup also provides adherence to SUTVA, which is not possible in a dataset like COPA. Some prior works (ROCK [https://arxiv.org/abs/2202.00436]) have also clearly highlighted this as a major issue in natural-language-based approaches. We believe this work mitigates those issues implicitly by design, making it a much more reliable framework for claims regarding causal reasoning abilities.
Q3) COLD provides a closed setup that can act as a suitable testbench for rigorous analysis, in comparison with COPA, which is an open-ended causal reasoning benchmark with only 1000 samples. Moreover, the presence of an underlying causal graph in COLD helps facilitate an enormous number of causal queries, making use of causal dependency concepts like d-separation. Regarding capturing the causal relationship between two events, COLD provides a medium to cover all three rungs of the causal ladder: 1) association, 2) intervention, and 3) counterfactuals (Pearl et al., 2018). The relationship does not come only from plausibility but also considers whether the occurrence of one event causes another event to occur. For example, “checking in luggage” will directly cause events like “collect luggage from luggage belt” to occur.
Q4) It is to be noted that temporal precedence is generally assumed essential for defining causation, and it is one of the most important clues used to distinguish causal from other types of associations [Mill, 1898; Hill, 1965; Pearl and Verma, 1995]. For our framework, we also consider the topologically sorted order obtained from the observational graphs and use the temporal order to define the causal query triplets, i.e., the cause events will always precede the effect events (an illustrative sketch of this triplet construction is given after this response). We also perform the statistical analysis suggested in the question, without using any LMs. The first two rows of Table 4 show the results, with a detailed explanation of the statistical computation provided in the appendix. Please refer to our detailed comment on human evaluation in the response to Reviewer iBzV.
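To make the triplet construction concrete, here is a small, hypothetical sketch (our assumption of the general recipe, not the released generation code): given a causal DAG over an activity's events, a premise is paired with a true descendant as the correct effect and with a causally disconnected event as the distractor. The toy graph and event names are made up for illustration.

```python
# Hypothetical sketch of sampling an "effect"-type causal query triplet from a
# causal DAG; the toy activity graph and event names are illustrative only.
import random
import networkx as nx

# toy causal graph for a "baking a cake" activity
G = nx.DiGraph([
    ("buy cake mix", "come home with ingredients"),
    ("come home with ingredients", "mix the batter"),
    ("preheat oven", "put pan in oven"),
    ("mix the batter", "put pan in oven"),
])

def sample_effect_query(graph: nx.DiGraph, rng=random) -> dict:
    premise = rng.choice([n for n in graph if nx.descendants(graph, n)])
    correct = rng.choice(sorted(nx.descendants(graph, premise)))  # a true effect
    # distractor: an event that is neither a cause nor an effect of the premise
    unrelated = [n for n in graph
                 if n != premise
                 and n not in nx.descendants(graph, premise)
                 and n not in nx.ancestors(graph, premise)]
    distractor = rng.choice(unrelated)
    return {"premise": premise, "question": "effect",
            "choices": [correct, distractor], "label": 0}

print(sample_effect_query(G))
```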
Thanks for your detailed response. Some main concerns still remain, so I intend to keep my rating unchanged:
- For the causation of necessity, "boarding a plane" is not a cause of "waiting at the luggage belt", while for the causation of sufficiency, they have a causal relationship. So if you do not provide the event "not checking in luggage", you cannot say there is no causal relationship. When compared to "boarding a plane", "checking in luggage" indeed has a larger probability of being a cause of "waiting at the luggage belt". If I provide an event such as "forgetting to collect the luggage", "flight is canceled", or "falling asleep and missing the flight", the causal relationship between "checking in luggage" and "waiting at the luggage belt" does not exist. I insist that causality is about probability if we cannot capture all confounders.
- About the human evaluation: I still think human evaluation is important, even if language models do not perform well in temporal settings. If humans can easily infer the correct answer from the temporal order or the necessary relationship between events, then that can be a shortcut to be exploited by language models. By the way, I do not see any human evaluation results in your response to Reviewer iBzV.
Dear Reviewer SJyu,
Thank you for involving in the discussion.
We agree that the event "checking in luggage" indeed has a larger probability (a larger value of the causal estimand/strength) of becoming a cause of "waiting at the luggage belt”. The underlying assumption made in the framework (also highlighted in Figure 1 of the paper) is that the occurrence of all events is confounded (caused indirectly/directly) by the event U (which includes the “intention to perform a task”). Given the nature of the instructions provided to the crowd-sourced workers, saying “write the steps in a telegraphic style to perform the activity (flying in an airplane, in this case)”, the assumption becomes valid, and events such as "forgetting to collect the luggage", "flight is canceled", or "falling asleep and missing the flight" will not occur in that case. Hence our framework provides a closed system rather than an open system where all these events can take place, which also comes with the advantage of SUTVA being valid (missing from previous works in NLP and CCR). We thank you for pointing this out; we agree that providing these explanations in detail will improve the presentation quality of our work, and we will make suitable changes in the updated version of the paper.
We would like to reiterate that the condition of temporal precedence is necessary but not sufficient, and it only provides a weak signal that helps in considering the causal relationships between the events. Temporal precedence only provides a cue, i.e., the cause events will always precede the effect events. We also created another version of the dataset to match the findings from Do et al. (2011) on human annotation of causal relationships. We mention it briefly in the main paper, with more details in the appendix.
[Lines 257-259] ”We also experimented with another version of the dataset, where incorrect choice may correspond to temporally plausible but causally implausible events. The results drop significantly in this case, details and results are provided in App. F.1.”
[Lines 929-937] “Some of the initial studies (Do et al., 2011) highlight the difficulty in choosing between the causal effect events and temporal events (that occur in close proximity to the premise event), i.e., temporal relationships are sometimes considered as a causal relationship by human annotators. We also create another version of created causal triplets where the wrong choices are replaced by temporally near nodes (nodes that are at a one-hop distance from the premise node). We call these ‘causally hard triplets.’ Note the temporal nodes are obtained from the observational graphs Go. Table 6 shows the performance comparison with causal triplets and causal-temporal triplets versions of the same queries. We observe a significant performance drop on the causal-temporal triplets version for most models, highlighting the increased confusion.”
Quang Do, Yee Seng Chan, and Dan Roth. 2011. Minimally supervised event causality identification. In Proceedings of EMNLP, pages 294–303.
We are sorry if the description regarding the lack of human performance was not clearly highlighted in the response to Reviewer iBzV. We repeat it below:
Lack of Human Performance: We would like to mention that validating human performance is challenging due to the nature of the causal reasoning task. Counterfactual reasoning requires the human/algorithm to assume a similar alternate world/universe in which only a particular event happens or does not happen, in order to approximate the causal strength between the events. As highlighted by Pearl et al., these imaginations can be expressed in statements containing an “if” clause in which the “if” portion is untrue or unrealized, which is known as a counterfactual. The “if” portion of a counterfactual is called the hypothetical condition, or more often, the antecedent, making it challenging (cognitively heavy) to conduct a human evaluation. We agree that human performance may be lower than perfect and that adding human performance to these tables would make the comparisons more informative. However, the true dependency of the events comes from the underlying causal graph, making the generated causal queries accurate. Previously, works like Jin et al. have released causal queries (https://huggingface.co/datasets/causalnlp/corr2cause, https://huggingface.co/datasets/causalnlp/CLadder) generated from an underlying causal graph; such queries require an understanding of causal inference theory to be annotated purely by humans.
Reviewer iBzV had similar concerns, acknowledged the difficulty of performing human evaluation, and consequently increased their score!
We are grateful for your invaluable comments, and considering them will definitely improve the presentation quality of our work. Please let us know if you have any further questions or require additional clarification.
Thanks for your detailed response; the causation concern is resolved to some extent.
Since your proposed dataset is in closed daily activities without accidents, why is it challenging to conduct and validate human evaluation?
The importance of human evaluation is two-fold:
- Human evaluation can help to assess the quality of the dataset. If this task is challenging for humans, then I do not think AI can handle it properly, and the quality of this dataset is questionable. If it is challenging to validate human performance, why is this dataset suitable to evaluate the performance of AI models?
- Human evaluation can help to demonstrate the advantages of your paper, if humans cannot obtain good results by temporal order or necessary relationship between events, then it proves your dataset does not give a shortcut for AI models. Just utilizing AI models to validate this is improper since we do not know the capabilities of these black-box models.
I understand that there may not be time for human evaluation now, but validating the performance of humans is not challenging since you have a golden answer for each question.
In conclusion, I still keep the rating the same as Reviewer iBzV.
Dear Reviewer SJyu,
Thank you again for spending extra time engaging in the discussion. We are pleased that our response was able to resolve the previous points raised in the weaknesses as well as the clarifications regarding causation.
Regarding the human evaluation, we agree that there are multiple advantages to benchmarking human performance on the dataset; however, simple annotations without proper counterfactual reasoning cannot be considered a baseline for human performance. In the causal literature, some previous works like Jin et al. have released causal queries (https://huggingface.co/datasets/causalnlp/corr2cause, https://huggingface.co/datasets/causalnlp/CLadder; we request you to take a look at the created causal queries) that are symbolic in nature and do require prior knowledge of causal inference theory for humans to answer. If benchmarked on these datasets, a human without knowledge of causal terminology and a background in probabilistic graphical models would not perform well. Hence, benchmarking humans on such datasets not only remains challenging but also makes it difficult to set up a stable baseline that could be claimed as human-level performance for a general audience. Similarly, for our dataset (though grounded in the real world), counterfactual-based reasoning (the third rung of the causal ladder) is cognitively heavy and requires some amount of reasoning to make a decision, and conveying this to the annotators remains one of the major challenges.
Initially, we did consider taking a random sample and asking a set of lab students to annotate the created causal queries. However, we later reasoned that reporting performance on a small sample (note that the dataset can contain an enormous number of causal queries) and generalizing it as human performance may not be scientifically sound and could introduce biased estimates/benchmarking into the paper. We believe it would be better to open-source the dataset so that human performance can be computed openly rather than on a small sample, which would give a better estimate, as requested in your comment. We hope you understand that the primary challenge is not the unavailability of ground truth and evaluation criteria, but rather 1) the size of the dataset and 2) the over-generalized claim that would be made in the paper by asking a few humans to annotate a small sample (which may not be truly representative of the entire population).
Please let us know if you have any other suggestions for conducting the study for better benchmarking. We believe the challenges understood by Reviewer iBzV were a little different. As acknowledged by that reviewer in the response after the rebuttal, human evaluation is challenging and would be interesting to conduct in future work.
We are pleased to see your involvement in the discussion, we will be happy to answer any other questions/clarification you have regarding the risks and challenges of human evaluation.
Thanks for your detailed response; my concerns still remain:
- The symbolic forms of Corr2Cause and CLadder do make it hard (though not impossible) to conduct human evaluation, whereas your dataset is in natural language and in a closed scenario. Furthermore, human annotators can create ESDs, so why are they unable to conduct the human evaluation?
- I agree that conducting human evaluations on the whole dataset is unrealistic and unnecessary; just like many other works [1], randomly sampling a small portion of the dataset for evaluation is sufficient. Although there might be some issues with the results of human evaluation, this should not be a reason for you not to do it. "Actions speak louder than words."
Finally, I am still positive about this paper, but human evaluation is indispensable.
[1] e-CARE: a new dataset for exploring explainable causal reasoning
Dear Reviewer SJyu,
Thank you again for your response.
We are happy that you understand that conducting human evaluations on the whole dataset is unrealistic and unnecessary.
We agree that a small random sample can be taken from the dataset and that human evaluation can be performed using experts with a background in counterfactual reasoning. We will be happy to add those results in the updated version of the paper.
We hope that adding a human evaluation on a small set of the created causal queries for all the activities present in the created resource will help improve the quality of our work.
We thank you for all your suggestions.
We are happy that you were actively involved in the discussion phase which was fruitful for the presentation quality of our work. We are also pleased that we could resolve all the concerns raised in the first iteration. We think this discussion improved the quality of our work by a significant margin.
Thank you again for your time. Since we are approaching the end of the author-reviewer discussion period, it would be great if you could share some final thoughts/clarification questions if you have any. We would be happy to respond as quickly as possible.
The paper proposes a dataset for evaluating the causal reasoning capabilities of LLMs by grounding evaluation in human understanding of real-world daily activities. The authors address the gap between open-ended causal commonsense reasoning and symbolic question answering by introducing the COLD framework, which generates causal queries based on daily activities. The paper describes the creation of causal graphs through crowd-sourcing. The authors test different LLMs using these causal queries, and show that even simple causal reasoning tasks remain challenging.
Strengths
- Its approach to evaluating causal reasoning in LLMs by grounding the evaluation in real-world activities.
- The use of a large dataset to create extensive causal queries.
- The detailed experimentation with various open-source LLMs and the plan to publicly release the framework and datasets.
Weaknesses
- The data used in the study appears overly simplistic, focusing on basic daily activities, which may not provide a robust test for causal reasoning. If the goal is to evaluate how well models understand causal reasoning, I would stay close to how we perform causal inference in science (i.e., gather data, understand whether we have an identification strategy, compute treatment effects / learn a causal graph).
- Causal commonsense reasoning is not causal in the sense of statistical causality. Almost all uses of causality in science are either: 1. to estimate a treatment effect in the real world, or 2. to discover a causal graph, again in the real world. This is done either from interventional or from observational data. This data doesn’t capture any of this, so I wonder what its usefulness is in relation to the causality literature.
- The insights drawn from the experiments are not clearly articulated. What exactly has been learned from this new extensive dataset and all of the experiments? The ATE experiments are particularly hard to understand. ATE measures the marginal effect of a treatment on an outcome; comparatively, this dataset includes binary questions about causal triplets. In the results the authors show accuracy measures; I don’t understand how to interpret this as the ability to properly estimate the effect, and I wasn’t able to understand how to do so from the authors’ description.
- The examples provided (e.g., Figure 3) often lack clarity, as both options seem plausible effects without additional context or a causal graph.
Questions
- What specific insights have you gained from your experiments with this dataset? Can you elaborate on how these insights contribute to our understanding of the causal reasoning capabilities of LLMs? Could we use it to propose any improvements to the current models?
- Are there plans to incorporate more traditional causal inference use-cases, such as treatment effect estimation or causal graph discovery, into your framework? I mean this in the sense of evaluating the capabilities of LLMs in performing these tasks.
- In the first example in Figure 3, both options appear plausible without additional context. Can you provide a detailed explanation of how the correct choice is determined in such cases, and whether there is a mechanism to ensure that the queries are unambiguous?
Limitations
The authors acknowledge limitations of their work, such as the restricted set of activities and the challenge of creating causal graphs for more complex long-term tasks.
Thank you for pointing out some of the important directions.
- The scope of this paper was limited to commonsense knowledge, and we agree that the data used is simplistic in nature (easy for humans to reason about) and does not deal with events beyond commonsense knowledge. We also highlight in lines 72-75 of the paper that CCR excludes causal events that are beyond the reach of commonsense knowledge, for example: does “planting trees” have a direct impact on the “rainy season”? Does “providing free education” improve the “economic condition of the country/state”? Does “carpooling” directly impact “air pollution”? It is to be noted that the primary goal of this work is to bridge the gap between the real-world causal reasoning performed by humans on a daily basis and the underlying causal graph used for rigorous analysis, as done in symbolic approaches. Performing causal reasoning on a broader scale that requires a randomized controlled trial (as suggested in the weakness) is challenging for humans and is highly dependent on the methodology rather than on reasoning/intelligence.
- Some recent works like ROCK [https://arxiv.org/abs/2202.00436] approach causal commonsense reasoning through the perspective of causal theory. Our work provides a medium to integrate the causal theory perspective (previously used by papers like Corr2Cause and CLadder for symbolic queries that are verbalized in natural language prompts to evaluate LLMs) with causal commonsense reasoning. We believe this work will be the first to consider both and will facilitate research in making LLMs more causal in their reasoning.
- We are sorry for the unclear explanation provided in the paper. We would like to clarify that prior works along similar lines (ROCK [https://arxiv.org/abs/2202.00436], COLA [https://aclanthology.org/2023.acl-long.288/]) have made use of ATE estimations to perform zero-shot evaluation using language models like RoBERTa on datasets like COPA. We followed a similar perspective to estimate the ATE between events, which is further used to perform classification using the obtained Delta estimates: given two options, if the ATE of option 1 is greater than the ATE of option 2, the causal prediction is option 1, and these predictions are used to compute the accuracy over the dataset. Algorithm 2 uses the causal estimands to compare the causal strength between the premise event and the choice events. We consider the causal estimand computed between the premise and the available set of choices and predict the label corresponding to the higher value. For a given causal query from the created causal query triplet dataset, each data entry corresponds to a tuple (premise event, choice 1, choice 2, question, label), denoted (p, c1, c2, q, y). As the task is to predict the more plausible cause/effect of the given premise event, we create two event pairs, (p, c1) and (p, c2), and compute the causal estimand for both pairs using the temporal or the backdoor scheme (described in Algorithm 3). Note that the events are always given to the estimator in (cause, effect) order. Using temporal precedence (highlighted in the remark above), the cause event will always precede the effect event temporally. Hence, for a causal query with the question 'cause', the causal estimands are computed for (c1, p) and (c2, p), and for (p, c1) and (p, c2) when the question is 'effect'. Further, based on the estimated scores, the more plausible cause/effect is predicted (a minimal sketch of this selection rule is given after this list).
- The first example provided in Figure 3 is: Premise: “go to store and buy cake mix”. Question: Which of the following is an effect? Choice 1: “come home with the ingredients”. Choice 2: “go to kitchen”. In this example, using counterfactual reasoning, we can say that “going to the kitchen” is possible without going to the market (if the ingredients are already available); however, when going to the market, the event “come home with the ingredients” will always take place, making it the more plausible effect among the given choices. Similarly, example 2 in the figure is also explainable, as going to the market has no direct relation to heating the oven. Thank you for pointing this out; we will add a more detailed explanation for better clarity of the presented idea.
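As referenced above, the following is a minimal sketch of the option-selection rule, under the assumption of a generic `estimand(cause_event, effect_event)` function standing in for the paper's Algorithm 3; it is illustrative, not the released implementation.

```python
# Minimal sketch of ATE/Delta-based option selection; `estimand` is a
# hypothetical stand-in for the causal-strength estimator (Algorithm 3).
from typing import Callable

def select_option(premise: str, choice1: str, choice2: str,
                  question: str, estimand: Callable[[str, str], float]) -> int:
    if question == "cause":
        # candidate causes must temporally precede the premise (the effect)
        s1, s2 = estimand(choice1, premise), estimand(choice2, premise)
    else:  # question == "effect"
        s1, s2 = estimand(premise, choice1), estimand(premise, choice2)
    return 1 if s1 >= s2 else 2

# toy usage with dummy scores
scores = {("go to store and buy cake mix", "come home with the ingredients"): 0.8,
          ("go to store and buy cake mix", "go to kitchen"): 0.3}
pred = select_option("go to store and buy cake mix",
                     "come home with the ingredients", "go to kitchen",
                     "effect", lambda c, e: scores.get((c, e), 0.0))
print(pred)  # -> 1
```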
Response to Questions/Clarifications:
Q1) The primary insight from the results is that LLMs lack causal commonsense reasoning abilities in general. Moreover, to provide a suitable test of the models’ understanding, we also apply the backdoor criterion and validate whether it improves the performance. Interestingly, we observe that applying the backdoor criterion does help improve performance over simple temporal prediction. As highlighted in Table 4, we see a significant boost in performance when applying the backdoor adjustments. We believe this strengthens the claim that the causal theory perspective can be used to improve the performance of LLMs as causal reasoners.
Q2) We completely agree that this framework can be extended to perform various other evaluations like treatment effect estimation or causal graph discovery, and incorporating these will enhance the evaluation. However, given the density of the paper, we believe these go beyond the scope of this work. We provide a few future directions on similar lines in the main paper. We thank you for pointing out some additional directions, we’ll add these to the camera-ready version of the paper.
Q3) Please refer to the response to the last point of weakness. We hope considering the counterfactual situation will make the causal query justified. We thank you for pointing this out, we will improve the writing to explain these examples in the camera-ready version of the paper.
Dear Reviewer G7bs,
Thanks again for helping review this paper! Since we are approaching the end of the author-reviewer discussion period, would you please check this author response regarding your concerns? We really appreciate it!
Best, AC
This paper presents a new causal reasoning dataset for LLMs. The main motivation of the dataset is to bridge the gap between casual commonsense reasoning datasets and symbolic representation-based causal reasoning datasets. Specifically, the proposed dataset collects crowd-sourcing observations, and constructs related causal graphs and triplets. This paper evaluates multiple open-source LLMs using the collected datasets, benchmarking LLM's ability to predict causal relationships, and to estimate average treatment effect.
Strengths
- The construction of the causal graph and the estimation of causal relationships are strict and sound. The whole dataset provides high-quality causal relationship annotations for real-life events.
- A large number of open-source models are evaluated. The evaluation of ATE estimation also involves multiple estimation methods using LLMs.
Weaknesses
- Human performance is not provided in the empirical comparisons. This makes it hard to understand the LLMs' performance. One can argue that the performances are far from perfect. However, due to the inherent ambiguity of natural language and inherent disagreement in people's opinions, I would assume human performance will also be significantly lower than 100%. Adding human performance to these tables will make the comparisons more informative.
- Many important details are in the appendix. I understand this is a dense paper with lots of content; however, the presentation and organization can be greatly improved. Additionally, the current results section includes a significant part on how to estimate ATE, which should be included in previous sections.
- While the dataset itself is huge, there are only five different events (shown in Table 2). Therefore, it is unclear how representative the model's performance on this dataset will be. This is a significant limitation, especially due to the "Causal Parrots" phenomenon mentioned in the introduction.
Questions
- How sensitive are the eval results w.r.t. the prompts? I'm a bit concerned that since "cause" and "effect" are not common words (especially their formal definition in causal inference), the model's ability may be underestimated with these prompts.
- There are multiple different methods to estimate probability predictions from LLMs. Have you tried other methods, and would that change the empirical results significantly?
Limitations
Limitations are discussed in Sec. 6.
We thank you for your detailed and insightful response. Please find the response to the mentioned weakness/clarifications below:
- Lack of Human Performance: We would like to mention that validating human performance is challenging due to the nature of the causal reasoning task. Counterfactual reasoning requires the human/algorithm to assume a similar alternate world/universe in which only a particular event happens or does not happen, in order to approximate the causal strength between the events. As highlighted by Pearl et al., these imaginations can be expressed in statements containing an “if” clause in which the “if” portion is untrue or unrealized, which is known as a counterfactual. The “if” portion of a counterfactual is called the hypothetical condition, or more often, the antecedent, making it challenging (cognitively heavy) to conduct a human evaluation. We agree that human performance may be lower than perfect and that adding human performance to these tables would make the comparisons more informative. However, the true dependency of the events comes from the underlying causal graph, making the generated causal queries accurate. Previously, works like Jin et al. have released causal queries (https://huggingface.co/datasets/causalnlp/corr2cause, https://huggingface.co/datasets/causalnlp/CLadder) generated from an underlying causal graph; such queries require an understanding of causal inference theory to be annotated purely by humans.
- Presentation and organization of the paper: We agree that the density of content in the paper is a little high, and the presentation can be improved by a significant margin. We agree that it would be better to keep the estimation of ATE as a separate section; however, due to limited space, we had to merge the experiments and the results sections (Section 4, Experiments and Results). We hope you understand that fitting all the details in the main paper is challenging; we would love to have feedback on whether any particular section would be more suitable to move from the appendix into the main paper. We will make the suggested changes in the camera-ready version of the paper.
- We agree that one of the limitations of our work is the limited set of activities; we also highlight this in the limitations section of our paper, with suitable reasons, in lines 349-355. We would like to reiterate that finding general commonsense reasoning activities/tasks that are well understood by humans remains challenging, and it would be good to explore adding more real-world activities in the future. Moreover, the difficulty of creating a causal graph for an activity increases as we move toward more long-term tasks. However, as a general test of causal intelligence, our framework provides a suitable platform to validate reasoning capabilities more rigorously, which is missing from the current literature. We hope that, with the enormous number of causal queries (coming close to simulating the mini-Turing test), the created dataset will serve as a robust testbench for evaluating causal reasoning in LLMs.
Response to Questions/Clarifications:
Q1) In general, LLMs are found to be sensitive to prompts, and coming up with a suitable prompt may affect the performance. Previously, various LLM benchmarking papers have performed evaluation over causal datasets like COPA, CausalBench, etc.; for example, works like GPT-3 (“Language Models are Few-Shot Learners”) and ROCK have benchmarked LLMs using the cloze evaluation or the MCQ evaluation. We consider the latter, which has also been justified for benchmarking commonsense reasoning tasks in general.
Q2) We agree that there are multiple ways of estimating probability predictions. Some of the widely accepted ones include cloze-test evaluation and MCQ-based evaluation. As highlighted in the paper, we use multiple-choice question answering (MCQA) [Robinson and Wingate, 2023]. Robinson and Wingate [2023] highlight the advantages of MCQ-based evaluation over cloze evaluation [Brown et al., 2020] (where the LLMs are expected to generate the entire answer in a cloze test), leading to a significant boost on various tasks, including commonsense-based tasks. We agree that there is a risk of bias in the probability estimates that may result in variance in the prediction results.
To eliminate the bias in the probability predictions, we consider averaging (similar to https://openreview.net/forum?id=shr9PXz7T0), which uses multiple permutations to balance out the prediction evaluation. Algorithm 3 in the Appendix depicts the process of computing an unbiased estimate of the causal estimand. The causal strength is computed between two events e_i and e_j, where e_i is assumed to precede e_j temporally. To make an unbiased estimate based on the provided options, we normalize the obtained probability scores by flipping the options and providing the same query prompt to the Language Model, where one prompt template is shown in Figure 8 (top) and the same prompt with flipped options is shown in Figure 8 (bottom). The overall formulation normalizes the prediction probabilities of the 'Increase' option using the probabilities of the 'Decrease' option. Finally, these normalized scores are computed for multiple trajectories in the backdoor adjustment scheme to compute the causal estimands that estimate the causal strength between the events e_i and e_j.
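A rough sketch of this flipped-option normalization, under our own assumptions about the interface (the exact prompt of Figure 8 and the estimand formula are not reproduced here; `score_options` is a hypothetical stand-in for an LLM call that returns option probabilities for a given option ordering):

```python
# Hedged sketch of flipped-option normalization: the 'Increase' probability is
# averaged over both option orderings so that positional bias cancels out.
# `score_options` is a hypothetical LLM-scoring function, not the paper's code.
from typing import Callable, Dict, List

def normalized_increase_prob(
        event_i: str, event_j: str,
        score_options: Callable[[str, str, List[str]], Dict[str, float]]) -> float:
    p_fwd = score_options(event_i, event_j, ["Increase", "Decrease"])  # original ordering
    p_rev = score_options(event_i, event_j, ["Decrease", "Increase"])  # flipped ordering
    return 0.5 * (p_fwd["Increase"] + p_rev["Increase"])

# toy usage with a dummy scorer; in the backdoor scheme such normalized scores
# would then be averaged over multiple sampled trajectories (Algorithm 3)
dummy = lambda ei, ej, opts: {o: (0.7 if o == "Increase" else 0.3) for o in opts}
print(normalized_increase_prob("check in luggage", "wait at luggage belt", dummy))  # -> 0.7
```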
Dear Reviewer iBzV,
Thanks again for helping review this paper! Since we are approaching the end of the author-reviewer discussion period, would you please check this author response regarding your concerns? We really appreciate it!
Best, AC
Thank you for the response. It addresses some of my concerns around prompt sensitivity and uncertainty estimation, hence I increased my score. I still believe adding human annotation would be valuable for this work, but I also understand its difficulty now.
We thank all the reviewers for their detailed reviews and suggestions. We are pleased that the reviewers found our work novel, helping bridge the gap between open-ended causal reasoning and symbolic representation-based question answering (Reviewer iBzV, Reviewer LejL). We are also happy that the design of the proposed framework for the construction of the causal graph was found to be strict and sound (Reviewer iBzV), laying a solid foundation (Reviewer LejL). Moreover, in the analysis part, the reviewers found the proposed backdoor adjustment and the ATE construction unique and novel (Reviewer LejL, Reviewer SJyu), providing valuable insights into causal strength between events (Reviewer LejL), and found the supporting experiments detailed and thorough (Reviewer G7bs, Reviewer SJyu).
Finally, on the presentation quality, we are happy that a reviewer found the paper “well-written, with clear explanations, logical flow, and concise language, ensuring effective communication of key points” (Reviewer LejL). Some concerns were raised regarding the high density of the paper with lots of content (Reviewer iBzV) and the ATE experiments being hard to comprehend on a first read (Reviewer G7bs). We hope that incorporating the suggestions made by the reviewers will enhance the presentation quality of our work, making it easier for readers to comprehend on a first read.
This paper presents a new framework for studying causal reasoning about daily events by bridging the gap between symbolic causal reasoning and real-world grounded commonsense causal reasoning. Reviewers recognized that creating causal graphs for real-life events is a strict and sound contribution and that the framework supports comprehensive causal queries. The rebuttal and discussions further addressed most of the reviewers' concerns, and the reviewers remain generally positive. Overall, this is a rigorous step in the correct direction for evaluating LLMs on causal reasoning. One major criticism shared by the reviewers still remains: human evaluation was not conducted during the rebuttal even though it was asked for by two reviewers, making it difficult to assess the dataset quality and to calibrate LLM performance, so the authors need to include a human evaluation, albeit small-scale, in the final version.