PaperHub
Overall: 6.0 / 10
Poster · 4 reviewers
Ratings: 6, 5, 8, 5 (min 5, max 8, std. dev. 1.2)
Confidence: 3.5
Soundness: 2.5
Contribution: 2.8
Presentation: 2.5
NeurIPS 2024

Discovery of the Hidden World with Large Language Models

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

A new framework leveraging large language models to extend the scope of causal discovery to unstructured data.

Abstract

Keywords
Causal Discovery · Large Language Models · Causal Representation Learning

Reviews and Discussion

Review
6

This paper presents the Causal representatiOn AssistanT (COAT), which introduces large language models (LLMs) to bridge the gap between unstructured data and causal discovery. LLMs are trained on massive observations of the world and have shown great capability in extracting key information from unstructured data, so it is natural to employ them to propose useful high-level factors and craft their measurements. COAT also uses causal discovery (CD) algorithms to find causal relations among the identified variables and provides feedback to the LLMs to iteratively refine the proposed factors. This mutual benefit enhances both the LLMs and the CD algorithms.

Strengths

  1. Interesting topic: employing LLMs to propose useful high-level representations for causal discovery.
  2. Develops two benchmarks for unstructured causal discovery: AppleGastronome and Neuropathic.
  3. Derives the first metrics that measure the causal representation learning capabilities of various LLMs.

Weaknesses

  1. ‘We will release an anonymous link during the discussion period.’ I will consider raising my score if the code is reasonable.
  2. The contribution of the LLM in COAT is somewhat limited. I assume the LLM is used as a representation tool to learn conceptual-level attributes, including iterative refinement, while causal structure learning can still be considered the downstream task.
  3. COAT will inherit the shortcomings of downstream causal structure learning algorithms.

Questions

  1. How can we ensure that LLM does not introduce erroneous prior knowledge for reasoning?

Limitations

N/A

Author Response

W1 ‘We will release an anonymous link during the discussion period.’ I will consider raising my score if the code is reasonable.

We have sent an anonymized link to the code to the AC in a separate comment, as we are not allowed to include links to external pages in the responses this year.

W2 The contribution of LLM in COAT.

The LLM is extensively involved in the factor proposal phase as a representation assistant:

  • In the factor proposal phase, LLMs are extensively leveraged to identify useful high-level factors from the given samples. The pre-trained knowledge and the reasoning ability of LLMs are heavily involved during this phase. Although COAT reduces the reliance on LLMs, the final results still depend on the capabilities of the LLMs, as shown in our theory and experiments.
  • Moreover, when an external interface to annotate the factors is not available, LLMs will be leveraged to parse the data. Without LLMs, we cannot obtain tabular data ready to be used for further causal structure learning.
  • For the causal feedback module, it is still crucial that LLMs properly distinguish the already verified factors from other potential factors that may affect the partitioning of the provided groups.

Although the causal structure learning module can be separated in the current design of COAT, it plays an important role in providing various forms of feedback, such as the existence of a hidden confounder, when extending COAT to more challenging scenarios.

W3 COAT will inherit the shortcomings of downstream causal structure learning algorithms.

It is common for all rigorous causal discovery methods to have certain assumptions and requirements on the data, while COAT may relax these requirements:

  • Well-specified variables: Classic causal discovery methods work on tabular data whose columns are meaningful variables. COAT extends their scope to unstructured data by utilizing the power of LLMs.
  • Assumptions on the data and population distribution: Assumptions are inevitable for any causal identifiability guarantee. COAT's factor proposal phase is disentangled from causal discovery, so its identifiability can hold even when that of a chosen causal discovery method does not. For instance, the AppleGastronome benchmark does not satisfy the linear non-Gaussian assumption, yet COAT still succeeds on the factor proposal task when combined with LiNGAM (Appendix F).

Q1 How can we ensure that LLM does not introduce erroneous prior knowledge for reasoning?

Thanks for the good question. This question shares exactly the same spirit as the motivation of the paper. All factors proposed by the LLM are checked on data in the factor proposal phase. The LLM can propose irrelevant factors (as witnessed in the META baseline), but they will be filtered out in each COAT iteration (as witnessed in the DATA baseline, the single COAT iteration, and COAT itself).
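To make this factor-checking step concrete, here is a small hypothetical sketch (my own illustration, not the paper's code): a proposed factor is kept only if a permutation test on mutual information rejects its independence from the target Y. Note that this is a simplified, unconditional test over discrete data; COAT itself relies on the standard (conditional) independence tests inside the causal discovery module.

```python
import numpy as np

def mi(x, y):
    """Mutual information (in nats) between two discrete arrays."""
    total = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            if pxy > 0:
                total += pxy * np.log(pxy / (np.mean(x == a) * np.mean(y == b)))
    return total

def keep_factor(w, y, rng, n_perm=200, alpha=0.05):
    """Keep the proposed factor w only if a permutation test rejects
    independence between w and the target y."""
    obs = mi(w, y)
    null = [mi(rng.permutation(w), y) for _ in range(n_perm)]
    p_value = np.mean([v >= obs for v in null])
    return bool(p_value < alpha)
```

A factor perfectly aligned with Y is kept, while a factor carrying no information about Y is filtered out, mirroring how irrelevant LLM proposals are rejected in each iteration.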

Comment

Thanks for your detailed reply. I am sorry to hear that there is no link at this stage. I would like to keep my score as it is and will keep an eye on other reviewers' comments.

Comment

Dear Reviewer PF9K,

We just received the reply from the Program Chairs:

A possible solution might be sharing an anonymized link with the AC that can be passed further on to the reviewer of interest.

We will communicate and ask Area Chair iWif to kindly help pass the link to you. Please let us know if you need any further information about the code. Thank you so much!

Comment

I did not get this message, but I assume you received it, and thus I will pass the link only to Reviewer PF9K.

https://anonymous.4open.science/r/CausalCOAT-D9CD

All the best

Comment

Dear Area Chair iWif,

Thank you for your prompt help with the link. Yes, the message was sent in reply by Program Chairs to our inquiry email.

Best regards,

The Authors of Paper6701

Comment

Thanks for sharing the link. I will increase my score to 6. I recognize the novelty of this paper. Large language models have potential applications in causal inference; in this paper I only see their potential for handling unstructured data, but it is a good start. I hope this work can explore more profound issues in the future.

Comment

Dear Reviewer PF9K,

Thank you for increasing the score and recognizing the novelty of the paper!

We indeed share the same ambition. Through COAT, we aim to elicit the full potential of LLMs, to empower and broaden the use of causal inference methods in more realistic applications, where we usually do not have measured variables, i.e., the data is unstructured [1]. In Appendix B, we present the big picture of how to build causal foundation models based on COAT, which empower various causal or causality-inspired methods such as causal machine learning, treatment effect estimation, and counterfactual reasoning.

We would like to note that reliability is the key requirement for leveraging LLMs to resolve various causality-related problems. Although LLMs have great potential, directly using them to resolve causality tasks is, as we demonstrated in our experiments, unreliable. One has to be cautious when making claims about the causal capabilities of LLMs [2]. In contrast, in this work, we demonstrate a provably sound and reliable approach to fully elicit the potential of LLMs, which marks the significance of COAT. Thus, COAT provides a reliable foundation for various causality tasks and downstream applications, where one can safely exploit the full potential of LLMs.

Thanks again for your valuable suggestions and engagement.

References

[1] Towards Causal Representation Learning, arXiv'21.

[2] Causal Parrots: Large Language Models May Talk Causality But Are Not Causal, TMLR'23.

Comment

Dear Reviewer PF9K,

Thank you for your prompt reply. We need to clarify that there is a link shared in a comment to Area Chair iWif, since NeurIPS does not allow us to share links to external pages with Reviewers:

  1. All the texts you post (rebuttal, discussion and PDF) should not contain any links to external pages. Furthermore, they should not contain any identifying information that may violate the double-blind reviewing policy. If you were asked by the reviewers to provide code, please send an anonymized link to the AC in a separate comment (make sure the code itself and all related files and file names are also completely anonymized).

If Reviewer PF9K would like to know any details of our code, we are more than happy to provide them. We could also simply copy the code here if needed. Please kindly understand that it is because of the NeurIPS regulations, rather than the authors' intention, that we could not provide a link directly in the responses.

Meanwhile, we have sent another email to the Program Chairs to inquire about the case. Please kindly let us know if there is any way to resolve your concern about the link. Since the code has already been packed up and shared via an anonymous GitHub link, we would like to provide any information about it as long as the NeurIPS regulations allow!

Review
5

The paper tackles the problem of discovering relevant features for recovering the underlying causal graph in the absence of (or in lieu of) a human domain expert. The proposed method, COAT, first queries an LLM through a prompt elucidating the task (e.g., discovering relevant features that affect a product review, given a few text reviews); then the proposed variables are fed into another LLM that assigns a value to each of these variables, thus outputting structured/tabular data that can be used for causal discovery. Finally, the tabular data is used in conjunction with a traditional causal discovery algorithm (FCI and LiNGAM in this case) to retrieve a causal graph with respect to a target variable (e.g., review score) using the proposed variables. The process repeats until the proposed variables form a Markov blanket for the target variable w.r.t. the raw unstructured input data (e.g., the text reviews), progressively expanding the Markov blanket in each iteration. Additionally, the LLM can receive feedback at the end of each iteration in the form of samples that the proposed variables cannot sufficiently explain. In particular, the authors propose clustering the samples w.r.t. the latent variables induced by the LLM and picking the samples in the cluster with the largest conditional entropy.
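As a rough sketch of this feedback step (my own hedged illustration, not the authors' implementation), one can cluster samples on the LLM-induced factor values and return the cluster in which the target variable is least explained, i.e., has the largest entropy:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a discrete label array."""
    counts = np.array(list(Counter(labels.tolist()).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def hardest_cluster(cluster_ids, y):
    """Return the id of the cluster whose target values y are least
    explained by the current factors (largest entropy of y)."""
    return max(set(cluster_ids.tolist()),
               key=lambda c: entropy(y[cluster_ids == c]))
```

The samples from the returned cluster would then be shown to the LLM as feedback for proposing new factors.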

An initial theoretical analysis implies that the proposed method is able to identify the Markov blanket for a target variable using the proposed variables, given that enough iterations of COAT are performed.

The authors evaluate COAT empirically over two synthetic datasets and three real-world datasets. They compare COAT against two simple baselines 1) factors being directly proposed by the LLM based on the prompt without further iterations 2) factors being proposed by LLM when queried using both the prompt and some samples of raw observations. The second baseline is essentially COAT without the LLM receiving any feedback after each iteration. The experiments are conducted using 10 different LLMs and primarily one causal discovery algorithm (FCI), with additional experiments on one dataset using LiNGAM. Additionally, the paper proposes two novel metrics for quantitatively assessing the performance of LLMs for feature proposals to be used for causal discovery.

Update: I moved my rating up in the hope that the authors will add the experiments as promised to the final version. We have no way of enforcing it, but hopefully the authors will follow up on their promise.

Strengths

The paper addresses the important problem of causal discovery and employs an effective two-pronged approach involving LLMs and traditional causal discovery algorithms. This approach leverages the strengths of both components: the LLMs' ability to respond to complex prompts and to unstructured data with high-level and possibly noisy information, and the causal discovery algorithms' robustness and strong theoretical guarantees (although these require strong assumptions on the faithfulness of the data and the causal mechanisms). Overall, I believe this is a promising direction wherein the two components complement each other effectively.

The empirical evaluation is sufficient in terms of the large number of LLMs considered and the moderate number of datasets evaluated. The results, based on the chosen metrics, sufficiently demonstrate the effectiveness of the proposed method over the simple baselines.

Finally, the paper is well-written and clearly explains the steps involved in each iteration. The further explanations provided in the appendix also aid in this.

Weaknesses

The theoretical aspects of the proposed algorithm are exaggerated in the introduction. Given the strong assumptions of “sufficiently powerful” LLMs, “sufficiently diverse” examples and further assumptions pertaining to the chosen causal discovery method, the propositions, while appreciated, are rather straightforward. In particular, it would be far more interesting to theoretically analyse the impact of modules involving the LLMs themselves, such as the chosen prompt template, quality of factor annotations and responsiveness of LLMs to feedback regarding causal discovery, even though some of these are evaluated empirically. Also, an analysis on the rate of convergence of COAT would be beneficial.

Secondly, while the modularity of the proposed approach facilitates utilising a cross product of LLMs, causal discovery methods, and feedback mechanisms, it also necessitates extensive ablation studies. The paper would be strengthened by a thorough ablation of the initial prompts and feedback, particularly a discussion and ablation of the chosen prompt template and its effect (or lack thereof) on the proposed factors. A robust template would allow more seamless adoption of the proposed method. Finally, the chosen baselines are far too simple to make any strong claims on the effectiveness of COAT. Comparing against some of the methods covered in the related work section would help bolster this claim.

Questions

The paper addresses an important and timely problem and proposes a simple and intuitive solution, leveraging the strengths of LLMs and traditional causal discovery methods. While the experiments demonstrate the effectiveness of the proposed method over two simple baselines, stronger baselines and more ablations on prompts and factor annotations would strengthen this claim. The theoretical analysis is limited to the well-studied causal discovery aspect of the pipeline, and, given its strong assumptions on the power of the LLMs, the diversity and faithfulness of the raw observational data, and a sufficiently large number of iterations, seems rather unsurprising.

Limitations

See the weakness and questions above.

Comment

We give the responses to the other questions in the follow-up comment due to the character limits.

W1.4 discussions on other suggested aspects.

We further improve the paper with the following additional discussions on the impact of other suggested aspects:

  • Quality of factor annotations. The annotation could introduce an additional "error term" on the true factor values, as one can observe in Figure 4 (a) (b). The key point is to ensure that the distribution of the annotated data does not violate the assumptions of causal discovery methods. For example, the "error terms" should be independent; otherwise, the faithfulness assumption required by FCI will be violated. In practice, LLMs could also use tools such as APIs to acquire data from external resources (we try this on the Neuropathic benchmark in Section 5.2 and in the ENSO case study in Appendix J).
  • Prompt template. Constraints on the prompt include the LLM's instruction-following ability and the length of the context window. Including more instructions, more data samples, or more background knowledge may improve p and C_Ψ, but would also be more challenging for the LLM to handle. In practice, decomposing prompts into multiple simpler sub-tasks could alleviate this issue.

Q1 The paper addresses an important and timely problem and proposes a simple and intuitive solution, leveraging the strengths of LLMs and traditional causal discovery methods. While the experiments demonstrate the effectiveness of the proposed method over two simple baselines, stronger baselines and more ablations on prompts and factor annotations would strengthen this claim. The theoretical analysis is limited to the well-studied causal discovery aspect of the pipeline, and, given its strong assumptions on the power of the LLMs, the diversity and faithfulness of the raw observational data, and a sufficiently large number of iterations, seems rather unsurprising.

Here we make an overall clarification:

  • This paper considers a novel task: reliably utilizing the abilities of LLMs. In particular, it introduces COAT to propose high-level factors behind unstructured data that form a Markov blanket of the target variable.
  • Theoretically, we show that COAT can provably identify a Markov blanket with sufficient iterations. Our theoretical results do not assume a sufficiently strong LLM or diverse data; rather, we propose two new metrics to characterize the capability of LLMs in identifying useful high-level factors. We then establish theoretical guarantees, including the convergence rate, based on the developed metrics. We note that these results go beyond the existing causal discovery literature.
  • For empirical evaluation, we compare the performance of COAT with the state-of-the-art baseline META in the literature, along with two additionally constructed stronger baselines DATA and DATA-CoT. The experimental results demonstrate the superiority of COAT.
  • We also conduct extensive ablation studies to verify the robustness of COAT to different prompt templates, LLMs, and causal discovery algorithms.
Author Response

Thanks for the insightful and constructive comments on our work. We hope our response can sufficiently address your concerns.

W1.1 "strong assumptions" of “sufficiently powerful” LLMs...

Thank you for pointing out these potentially confusing words. We have revised the paper to clarify that "sufficiently powerful" and "sufficiently diverse" are not assumptions. These words appear in Sec 3.3, where they are used to give an intuitive and concrete description of the COAT algorithm.

Our theoretical results in Sec 3.4 do not rely on those assumptions.

  • Proposition 3.1 shows why we expect new factors to satisfy Eq 6. The test on Eq 6 is a conditional independence test. The assumption behind it is the faithfulness condition, i.e., conditional independence reflects d-separation in the causal graph.
  • Proposition 3.3 concretely characterizes the impact of the LLM's ability on factor identification, and it requires p > 0 and C_Ψ > 0 so that Eq 9 yields a finite number.

W1.2 Impacts of modules.

The impact of each module in COAT can be characterized by our theoretical framework. Let Y be the target variable. Before entering a given COAT iteration, k factors (denoted by h_[k](X), as defined on page 6, line 202) have been proposed and verified. In this iteration, the LLM receives feedback constructed with those factors, and we then expect the LLM to propose a new factor w_{k+1} such that Y ⊥ w_{k+1}(X) | h_[k](X) does not hold. In Definition 3.2 (page 6, line 209), we define two measures to quantify the capabilities of LLMs in the COAT framework:

  • Perception score p: the probability that the LLM proposes a new factor. This can be seen as a measure of the LLM's responsiveness to the given prompts and feedback.
  • Capacity score C_Ψ: the decreasing ratio of the conditional mutual information, as described in Eq 8 on page 6. This can be seen as a measure of the quality of the factors proposed by the LLM.

At each iteration, we prompt the LLM to propose factors multiple times so that the two measures can be estimated, as shown in Table 6 (page 25, appendix). With these two metrics, we then theoretically analyze the impact of the LLM-involved modules on the number of COAT rounds, as shown by Eq 9 in Proposition 3.3 (page 6).
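For illustration, the two measures could be estimated empirically as follows (a hedged sketch with assumed estimators, not the paper's exact code):

```python
def perception_score(trials):
    """Estimate the perception score p as the fraction of prompting
    trials in which the LLM proposed at least one new factor.
    trials: list of bools, one per trial."""
    return sum(trials) / len(trials)

def capacity_score(cmi_before, cmi_after):
    """Estimate the capacity score C_Psi as the relative decrease of the
    conditional mutual information I(Y; X | h) after accepting a factor."""
    return 1.0 - cmi_after / cmi_before
```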

W1.3 Convergence rate of COAT.

As suggested by the reviewer, we are happy to further improve Proposition 3.3 with a result on the rate of convergence (the proof is similar): with probability at least 1 − δ, the following inequality holds:

$$\frac{I(Y;X\mid h_{\le t}(X))}{I(Y;X)} \le \left(\frac{1}{1-C_\Psi}\right)^{-tp-z_{\delta}\sqrt{tp(1-p)}}$$

That is, under the setting of Proposition 3.3, with both C_Ψ and p being positive, COAT converges exponentially in the number of rounds.
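For intuition, the bound can be evaluated numerically; the parameter values below are illustrative, not taken from the paper:

```python
import math

def coat_bound(t, p, c_psi, z_delta):
    """High-probability upper bound on I(Y;X | h_t(X)) / I(Y;X)
    after t COAT rounds, as in the inequality above."""
    exponent = -t * p - z_delta * math.sqrt(t * p * (1 - p))
    return (1.0 / (1.0 - c_psi)) ** exponent

# e.g., with p = 0.8 and C_Psi = 0.5, the residual dependence ratio
# shrinks exponentially as the number of rounds t grows.
```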

W2.1 Ablation of the prompt templates.

We conduct an ablation study with a different prompt template following [1]:

  • We put the task description (including the format instructions) at the beginning, in the [System] part, and we put samples in the last [Data] part of the prompt.
  • The markdown syntax is replaced by brackets to represent headings, such as [System], [Data], and [Groups with Y=1] ...
  • 3 COAT iterations are performed, which is aligned with the original experimental setup.

COAT with changed prompt template:

| LLM | MB | NMB | OT | Recall | Precision | F1 |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 4 | 0 | 0 | 0.80 | 1.00 | 0.89 |
| GPT-3.5-Turbo | 4 | 0 | 0 | 0.80 | 1.00 | 0.89 |
| Mistral-Medium | 3 | 0 | 0 | 0.60 | 1.00 | 0.75 |

We observe that COAT is robust to the choice of templates, rejects unexpected factors (zero NMB and OT), and keeps a high precision.

[1] Judging llm-as-a-judge with mt-bench and chatbot arena, NeurIPS'23.

W2.2 Baselines are too simple.

We need to clarify that, to the best of our knowledge, the META baseline is already a strong baseline, as supported by the extensive empirical evidence on LLM-based methods in causality-related tasks [2]. If you happen to know a stronger baseline, please let us know.

Meanwhile, we additionally construct a stronger baseline, DATA-CoT, based on DATA, where the LLM is prompted to "Think step by step to consider factors" and to output these factors in the same format as the other methods.

| LLM | Method | MB | NMB | OT | Recall | Precision | F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | CoT baseline | 4.33 ± 0.58 | 0.83 ± 0.29 | 0.17 ± 0.29 | 0.87 ± 0.12 | 0.81 ± 0.02 | 0.84 ± 0.06 |
| GPT-4 | COAT | 4.00 ± 0.82 | 0.33 ± 0.47 | 0.00 ± 0.00 | 0.80 ± 0.16 | 0.93 ± 0.09 | 0.85 ± 0.11 |
| GPT-3.5 | CoT baseline | 5.00 ± 0.00 | 1.00 ± 0.00 | 1.33 ± 0.58 | 1.00 ± 0.00 | 0.68 ± 0.05 | 0.81 ± 0.04 |
| GPT-3.5 | COAT | 3.67 ± 0.47 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.73 ± 0.09 | 1.00 ± 0.00 | 0.84 ± 0.07 |
| Mistral-Medium | CoT baseline | 4.33 ± 0.58 | 1.00 ± 0.00 | 0.67 ± 0.58 | 0.87 ± 0.12 | 0.73 ± 0.07 | 0.79 ± 0.05 |
| Mistral-Medium | COAT | 4.67 ± 0.47 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.93 ± 0.09 | 1.00 ± 0.00 | 0.96 ± 0.05 |

It shows that LLMs with CoT can:

  • be aware of high-level factors behind data (lower OT than META);
  • still struggle to distinguish the desired factors in Markov Blanket (higher NMB than COAT).

[2] Causal reasoning and large language models: Opening a new frontier for causality, arXiv'23.

Comment

Dear 1cJQ, thank you so much for reading the paper and sharing your opinion. I went through your review, and I think it would be very beneficial for the authors and their paper if you could suggest how to tackle the issue emerging from the following sentences of yours.

"Finally, the chosen baselines are far too simple to make any strong claims on the effectiveness of COAT. Comparing against some of the methods covered in the related work section would help bolster this claim."

In particular, I kindly ask you to share which more appropriate baselines could be taken into account to improve the quality of the evaluation, and which methods covered in the related work are the most appropriate ones.

I think the authors will greatly benefit from your insights.

All the best

Comment

I quite like the additional baselines - DATA and DATA-CoT. These were in the lines of the stronger baselines that I had requested in my review.

Comment

Dear Reviewer 1cJQ,

Thank you for acknowledging our additional baseline. Please kindly let us know if our responses address your remaining concerns, too. We would sincerely appreciate it if you could jointly take our responses into your evaluation of our work.

Best regards,

The Authors of Paper6701

Comment

Yes, I will certainly take your responses into consideration. I will move the rating up. I hope that you will make the other things clearer (such as the proposition, the experimental results, and the writing). With these changes, it can be a good contribution.

Thanks for carefully considering my review.

Comment

Dear Reviewer 1cJQ,

Thank you for agreeing to consider our responses in your evaluation! We have revised our manuscript according to your suggestions and promise that all of them will be reflected in our final manuscript (we would have provided a revised manuscript; however, NeurIPS does not allow it this year).

Regarding your suggestions for clarity, we have revised our manuscript as follows:

  • In Sec 3.3, when elaborating on our methods, we replace "If the observational data contains sufficiently diverse examples, and the LLM is sufficiently powerful" (line 185~187) with a more suitable sentence to build the motivation behind the Eq 6 (as the response to W1.1): "Intuitively, the following Eq 6 establishes the ideal expectation for the LLM to propose the desired high-level factors."
  • At the end of Sec 3.3, we also included additional comments: "In practice, many factors, such as the LLM capabilities, data faithfulness, and prompt templates, could affect the satisfaction of Eq 6. Therefore, in the next section, we will establish a theoretical framework to discuss the influence of the factors above to the satisfaction of Eq 6";
  • In Sec 3.4, after Definition 3.2, we included the motivation and intuitive interpretation of the two metrics (p for LLM responsiveness and C_Ψ for quality of factors) (as the response to W1.2);
  • In Sec 3.4, we also explicitly interpret Proposition 3.3: Intuitively, Proposition 3.3 also characterizes the influence of prompt templates, the LLM responsiveness, and the quality of factors on the performance of COAT via the two proposed measures: ... (as the response to W1.2 and W1.4);
  • In Sec 3.4, after Proposition 3.3, we included the suggested theoretical results of the rate of convergence (as the response to W1.3);
  • In Sec 3.4, after Proposition 3.3, we explicitly emphasize the requirements for COAT to be theoretically guaranteed: p > 0, C_Ψ > 0, and the faithfulness condition (as the response to W1.1);
  • In Sec 3.4, we included our respective discussion about the impact of quality of factor annotations on the faithfulness conditions (as the response to W1.4);
  • In experiments, we included the ablation studies about prompt template and the DATA-CoT baseline (as the response to W2.1 and W2.2);

Please kindly let us know if you feel any additional changes that could further improve the clarity of this work. We would like to thank you again for your suggestions, which greatly improved our manuscript!

Comment

These changes are encouraging to say the least! Thank you for taking my comments carefully into consideration.

Comment

Dear Reviewer 1cJQ,

We are more than happy to hear your acknowledgment of our responses and revisions! We humbly invite you to reconsider the rating if the changes to our manuscript resolved your concerns. Otherwise, please kindly let us know if you feel any additional changes that could further improve the clarity of this work!

Comment

Authors,

I will change the rating appropriately during the closed discussion. Nonetheless, the fixable concerns have been addressed.

Comment

Dear Reviewer 1cJQ,

Thank you for agreeing to change the rating and for your acknowledgment of the resolved concerns. As the whole community would benefit from an open discussion, even if you feel there are any "unfixable" concerns, we are more than happy to clarify them during the remaining discussion period. Thank you!

Comment

Dear 1cJQ and authors, I greatly appreciated your discussion, which is a fundamental component of understanding each other and improving the quality of the work. All the best

Comment

Dear Area Chair iWif and Reviewer 1cJQ,

Thank you so much for moderating the review process of our work, and the communication. We believe it is such open, thoughtful, and responsible communication that advances the development of science throughout human history!

Review
8

This work proposes COAT (Causal representatiOn AssistanT), a novel framework for leveraging LLMs to assist with causal discovery from unstructured data. COAT aims to combine the advantages of LLMs and causal discovery algorithms. To do so, COAT employs LLMs to identify high-level variables and parse unstructured data into structured data. On the other hand, causal discovery algorithms read the parsed data to identify causal relations. To improve the reliability of the results, COAT also constructs feedback from the causal discovery results to iteratively improve the high-level variable identification. The authors conduct extensive case studies ranging from synthetic to realistic data, and find COAT effectively helps with discovering meaningful causal structures that well explain the target variable.

Strengths

  1. This work identifies a crucial and timely problem: how to advance causal tasks, including causal learning and reasoning, with foundation models like LLMs;
  2. COAT is novel, interesting and well-motivated. The authors also provide theoretical discussion to justify its soundness;
  3. COAT is model-agnostic and robust to the choice of LLMs and input data modalities;
  4. The authors construct several benchmarks, present comprehensive case studies, and conduct extensive experiments to verify their claims. The improvements over directly prompting LLMs are significant.

Weaknesses

  1. The authors should provide more comparisons with advanced prompting techniques such as CoT.
  2. More discussions should be provided on the hyperparameters used in COAT, such as the group size in feedback.
  3. Model names are inconsistent. The name in Fig 4(c) is not the same as the other names.
  4. GPT-4 reasoning in Fig 7(c) is unclear in meaning.

Questions

Please refer to "Weaknesses".

Limitations

No concerns regarding limitations.

Author Response

Thanks for your support and constructive comments on our work. We hope our response can sufficiently address your concerns.

W1 More comparisons with advanced prompting techniques such as CoT.

We construct a CoT baseline based on DATA, where the LLM is prompted to "Think step by step to consider factors", and to output the desired factors in the same format as other methods.

| LLM | Method | MB | NMB | OT | Recall | Precision | F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | CoT baseline | 4.33 ± 0.58 | 0.83 ± 0.29 | 0.17 ± 0.29 | 0.87 ± 0.12 | 0.81 ± 0.02 | 0.84 ± 0.06 |
| GPT-4 | COAT | 4.00 ± 0.82 | 0.33 ± 0.47 | 0.00 ± 0.00 | 0.80 ± 0.16 | 0.93 ± 0.09 | 0.85 ± 0.11 |
| GPT-3.5 | CoT baseline | 5.00 ± 0.00 | 1.00 ± 0.00 | 1.33 ± 0.58 | 1.00 ± 0.00 | 0.68 ± 0.05 | 0.81 ± 0.04 |
| GPT-3.5 | COAT | 3.67 ± 0.47 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.73 ± 0.09 | 1.00 ± 0.00 | 0.84 ± 0.07 |
| Mistral-Medium | CoT baseline | 4.33 ± 0.58 | 1.00 ± 0.00 | 0.67 ± 0.58 | 0.87 ± 0.12 | 0.73 ± 0.07 | 0.79 ± 0.05 |
| Mistral-Medium | COAT | 4.67 ± 0.47 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.93 ± 0.09 | 1.00 ± 0.00 | 0.96 ± 0.05 |

It shows that LLMs with CoT can:

  • be aware of high-level factors behind data (lower OT than META);
  • still struggle to distinguish the desired factors in Markov Blanket (higher NMB than COAT).

W2 More discussions should be provided on the hyperparameters used in COAT, such as the group size in feedback.

Thanks for the suggestion. We have revised the manuscript to include a discussion on hyperparameters:

  • Group size in the prompt: In the COAT prompt, several samples are given, grouped by the values of the target variable. The samples in each group are randomly selected and kept at a fixed number. Empirically, we keep it at 3 throughout all experiments (sometimes smaller if there are not enough samples). In practice, it is mainly constrained by the LLM's context length.
  • the number of clusters: When constructing feedback, we first use clustering to separate the dataset and then find the cluster where the target variable is not explained well by current factors (This is a heuristic for the problem in line 191). Empirically, we set the number of clusters to be one plus the number of current factors.
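As a concrete illustration of the cluster-selection heuristic, the following sketch picks the cluster whose target labels are least explained, using label entropy as a proxy (hedged: `label_entropy` and `hardest_cluster` are hypothetical helper names, and any clustering algorithm with one plus the number of current factors clusters could produce the input):

```python
from collections import Counter
import math

def label_entropy(labels):
    # Entropy of Y within a cluster: high entropy = Y poorly explained there.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def hardest_cluster(clusters):
    # Return the cluster where the current factors explain Y worst.
    return max(clusters, key=lambda c: label_entropy([y for _, y in c]))

# Each cluster is a list of (sample_id, y) pairs.
clusters = [
    [(0, 1), (1, 1), (2, 1)],           # well explained: all labels agree
    [(3, 1), (4, -1), (5, 1), (6, -1)], # poorly explained: mixed labels
]
picked = hardest_cluster(clusters)
print([sid for sid, _ in picked])  # → [3, 4, 5, 6], sent back as feedback
```

The samples in the selected cluster would then be passed to the LLM as feedback for the next proposal round.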

Furthermore, we conduct ablation studies of COAT with GPT-4 using different hyperparameters:

| Method | Cluster size | Group size | MB | NMB | OT | Recall | Precision | F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| META | - | - | 2.67 ± 0.94 | 0.67 ± 0.47 | 2.33 ± 0.47 | 0.53 ± 0.19 | 0.46 ± 0.08 | 0.49 ± 0.13 |
| DATA | - | 3 | 3.00 ± 0.00 | 0.33 ± 0.47 | 0.00 ± 0.00 | 0.60 ± 0.00 | 0.92 ± 0.12 | 0.72 ± 0.04 |
| COAT | len(factor) + 1 | 3 | 4.00 ± 0.82 | 0.33 ± 0.47 | 0.00 ± 0.00 | 0.80 ± 0.16 | 0.93 ± 0.09 | 0.85 ± 0.11 |
| COAT | len(factor) + 1 | 1 | 4.67 ± 0.58 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.93 ± 0.12 | 1.00 ± 0.00 | 0.96 ± 0.06 |
| COAT | 2 | 3 | 3.67 ± 1.53 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.73 ± 0.31 | 1.00 ± 0.00 | 0.82 ± 0.22 |

One can observe that COAT is not sensitive to these hyperparameters and performs robustly better than the baselines under different hyperparameter setups.

W3 Model names are inconsistent. The name in Fig 4(c) is not the same as other names.

We have fixed them in the revised manuscript.

W4 GPT-4 reasoning in Fig 7(c) is unclear in meaning.

Fig 7(c) shows the result based on directly prompting LLM to reason for the causal relations among given factors. We add explanations in the figure caption now.

Comment

The authors have adequately addressed my concerns. I would like to thank the authors, and I hope the comparison to CoT could be included in future revisions.

Comment

Dear Reviewer szEm,

We are happy to learn that our rebuttal addressed your concerns. Please feel assured that we will integrate all the promised revisions in the future version of this work (we have already done so for most of the required revisions)!

Review
5

This paper combines the power of LLMs with that of causal discovery by proposing a Causal representatiOn AssistanT (COAT) approach. Specifically, it considers datasets with textual descriptions, and tries to identify the Markov blanket with respect to a target variable (such as customer ratings and medical diagnosis). The key contribution is discovery of the causal factors through a pipeline that uses both LLMs and a causal discovery algorithm.

优点

I find this an interesting and practical paper that combines the advantages of LLMs – such as the vast amount of knowledge that they encode – and that of causal discovery approaches. The ideas around combination are generally simple but novel, and I believe the approach could potentially be valuable in a suite of applications, although the extent of the value is unclear from the paper.

缺点

A major limitation of the work is the empirical evaluation, even though it comes across on the surface as being extensive. I sympathize with the authors about benchmarks for causal discovery, but it seems they have used GPT-4 to generate the textual description of the data, and then used LLMs in their COAT procedure. This is clearly a synthetic dataset that can be problematic. Even the “realistic” benchmarks do not come across as sufficiently realistic, based on my understanding and the lack of details in the main paper.

I don’t understand why key aspects of the evaluation were moved to the appendix. I find it impossible to fully evaluate the work based solely on the contents of the main paper. I understand the need to make space and to move things to the appendix, but it’s never suitable to move key aspects such as the description of the benchmarks and the key results that show value of the work. This has impacted my assessment of this work and I have had to decrease my score because of the authors’ choices around appendix content.

A related weakness is the lack of any attempt at describing limitations, of which there are clearly many.

Questions

Could the authors share more about the scope of the work? Are there some other restrictions on the problem setting, besides needing a discrete label y? My assessment is that there is a gap between the scope mentioned in the problem setting and what is described in the experiments, which seems more restricted. Perhaps the authors can clarify.

Identifiability is mentioned loosely on page 2, with some technical references, but seems to have been used in an imprecise way here. The connection here seems tenuous at best.

There is a comment on pg. 3: “Note that the target variable Y serves as a guider, and no specific relation between x and Y is assumed.” What is a guider? And what do you mean no relation between x and Y is assumed? I thought the entire point is to do causal discovery: x is a function of z, and Y is a function of z. This line seems incorrect.

Section 3.2 would be much easier to follow with some illustrative examples. The content in Fig. 2 is too abstract to be really useful. I think the authors missed a trick here.

The meaning of C and p are unclear to me from what is described on pg. 6. How does one assess the significance of Proposition 3.3?

The details about benchmarks are incredibly important, and it should be easy for anyone to understand at least a high-level sense of a benchmark – basic things like the number of data points, for instance. Please fix Section 4 accordingly.

What is OT in Section 4.2? Is it the same as OT in the next section? Define MB, NMB, OT somewhere. I don’t see them mentioned anywhere clearly, although I understand from context that MB means Markov blanket.

Are Table 3 and Fig. 5 in the Appendix? If so, then mention that.

I’m not convinced that the “realistic” benchmarks are realistic. It’s too bad I can’t gauge this from the main text.

Please add a detailed limitations section. Mention all the limitations around evaluation in particular, as well as the significant risks of relying on LLMs for causal discovery.

Minor comments: line 181: “an potential” should be “a potential”; line 208: “Ability” should be “ability”; lines 211 and 212: seems there is a grammatical error here; line 216: what are “shared notations”?

Limitations

The authors have not suitably described limitations. I consider this a major weakness; please see my previous comments.

Comment

We give the responses to the other questions in the follow-up comment due to the character limits.

Q4 Illustrative examples for Sec 3.2 and Fig. 2.

We revised Sec 3.2 and Fig. 2 with more illustrative examples:

  • Factor proposal. After observing customers' comments on apples with different ratings, the LLM may propose sweetness as a possible factor, together with a criterion to assign values: sweet is 1, sour is -1, and 0 if not mentioned in the text. The LLM can also propose other factors, like the color of the apple.
  • Factor parsing. Then, another LLM goes through all comments to assign values for each factor. We thus obtain tabular data with rows for comments and columns for the sweetness and color factors.
  • Verification. We use the tabular data to check these new factors. We find that color is conditionally independent of the rating given the existing factors in the current representation (line 202, empty now), while sweetness is not. So we add sweetness to the current representation.
  • Feedback. Following Sec 3.3, we find a subset of comments where the current representation cannot explain Y well. We pass these comments back to step 1 for the next iteration.

The revised Fig.2 is included in the pdf.
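The four steps above can be sketched as a minimal loop body. This is a hedged illustration only: `llm_propose_factor` and `llm_annotate` are hypothetical helpers standing in for the actual LLM prompts, and the conditional-independence test is replaced by a trivial covariance check rather than the proper tests COAT uses.

```python
def llm_propose_factor(samples):
    # Stub: the real system prompts an LLM with samples grouped by rating.
    return "sweetness"

def llm_annotate(comment, factor):
    # Stub: a real annotator LLM assigns values per the factor's criterion
    # (here: 1 if sweet, -1 if sour, 0 if not mentioned).
    return 1 if "sweet" in comment else (-1 if "sour" in comment else 0)

def is_conditionally_independent(values, y, given):
    # Stub CI test via a trivial covariance check; COAT would run a proper
    # conditional independence test conditioning on the `given` factors.
    n = len(y)
    mv, my = sum(values) / n, sum(y) / n
    cov = sum((v - mv) * (t - my) for v, t in zip(values, y)) / n
    return abs(cov) < 1e-6

comments = ["very sweet apple", "too sour for me", "nice color", "sweet and juicy"]
ratings = [1, -1, 0, 1]

factors = {}  # current causal representation
name = llm_propose_factor(comments)                  # 1. factor proposal
values = [llm_annotate(c, name) for c in comments]   # 2. factor parsing
if not is_conditionally_independent(values, ratings, factors):  # 3. verification
    factors[name] = values  # keep factors dependent on Y
# 4. feedback: samples poorly explained by `factors` would seed the next round

print(sorted(factors))  # → ['sweetness']
```

Sweetness covaries with the rating in this toy data, so the verification step keeps it; a factor like color would be rejected by the same check.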

Q7 Definition of MB, NMB, OT.

Formal definitions are now given in section 4 as follows:

  • MB counts the desired factors that form the Markov blanket of Y.
  • NMB counts the undesired factors that are relevant to the data but not in the Markov blanket.
  • OT counts the unexpected factors irrelevant to the data.
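The Recall/Precision/F1 columns reported in the tables can be derived from these counts. A hedged reconstruction (`factor_metrics` is a hypothetical helper; the ground-truth Markov blanket is assumed to have 5 factors, per the AppleGastronome construction):

```python
def factor_metrics(mb, nmb, ot, true_mb_size=5):
    # Score a run's proposed factors against the ground-truth Markov blanket.
    recall = mb / true_mb_size        # fraction of true MB factors recovered
    proposed = mb + nmb + ot          # everything the method proposed
    precision = mb / proposed if proposed else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1

# e.g., a run that recovered 4 of the 5 MB factors with no spurious ones
r, p, f1 = factor_metrics(mb=4, nmb=0, ot=0)
print(round(r, 2), round(p, 2), round(f1, 2))  # → 0.8 1.0 0.89
```

This matches the pattern in the tables: COAT's lower NMB and OT translate into higher precision at a comparable recall.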

Q8 Are Table 3 and Fig. 5 in the Appendix? If so, then mention that.

Table 3 is in the appendix, which is now mentioned in the main paper. Fig.5 is on page 8.

Q9 I’m not convinced that the “realistic” benchmarks are realistic. It’s too bad I can’t gauge this from the main text.

Thanks for the important suggestion. These details from the appendix will also be included in the main paper, as we responded in the comment about the weaknesses.

Q11 Minor comments.

Thanks for pointing out the typos. They are fixed now.

"Shared notations" means: the notations (like p and C_Ψ) used here are defined in Definition 3.2.

Comment

I thank the authors for their detailed response and hope they will revise their exposition in various sections of the manuscript and move some content to the main file. I will remain open to re-assessing the work based on the rebuttal and further discussion.

Comment

Dear Reviewer AG9s,

Thank you for considering re-assessing our work! We have revised our manuscript according to your suggestions and promise that all of your suggestions will be reflected in our final manuscript.

Regarding your suggestions for exposition, we have revised our manuscript as follows:

  • On page 2, lines 37 and 38, we replaced ", which is the central concept of causality research [3,4,5]" with a more clear sentence to show our purpose: "which plays an important role in classic causal discovery literature [3,4,5]" (as the response to Q2).
  • In Sec 3.1, when stating the problem definition, we included a clearer description of the scope of this work and respective pointers about other assumptions needed (as the response to Q1).
  • In line 115, we replace the last sentence containing "guider" and "relation" with a clearer one: "Note that the target variable Y serves as a guider, and no prior assumption between x and Y is assumed" (as the response to Q3).
  • In Sec 3.2, we included the illustrative examples as stated in the initial response. We also improved Fig.2 with the uploaded pdf (as the response to Q4).
  • In Sec 3.4, after Definition 3.2, we included the motivation and intuitive interpretation of the two metrics (p for LLM responsiveness and C_Ψ for the quality of factors) (as the response to Q5);
  • In Sec 3.4, we also explicitly interpreted Proposition 3.3: Intuitively, Proposition 3.3 also characterizes the influence of prompt templates, the LLM responsiveness, and the quality of factors on the performance of COAT via the two proposed measures: ... (as the response to Q5);
  • In Sec 4, before Sec 4.1, we included one sentence to state the experiment's purpose: "We evaluate whether COAT can propose and identify a set of high-level factors belonging to the Markov Blanket of the target variable Y." (as the response to W2).
  • In Sec 4, after the 'Benchmark Construction' paragraph, we included one sentence to state the expected result of an ideal method: "A good method is expected to propose the five high-level factors (up to semantic meanings) and exclude the "disturbing" factor." (as the response to W2).
  • In Sec 4, at the end of Sec 4.1, we included the definitions of MB, NMB, and OT. (as the response to Q7).
  • In line 285, we replace "Table 3 and Fig. 5" with "Table 3 (in Appendix E.1) and Fig. 5 (on page 8)" to make it more convenient. (as the response to Q8).
  • In Appendix B, we separated the discussion around line 629 into an individual 'Limitation' section. We also added discussions about faithfulness, selection bias, and sample size. (as the response to W3, Q10, and L1)
  • Minor typos are fixed. (as the response to Q11).

Regarding your suggestions for moving some content to the main file, we have revised our manuscript as follows:

  • In Sec 4.1, at the beginning of the 'Benchmark Construction' paragraph, we also included one sentence to show the concrete details: "We prepare different high-level factors: 3 parents of Y, 1 child of Y, and 1 spouse of Y. These factors form a Markov blanket of Y. In addition, we also prepare one "disturbing" factor related to Y but not a part of this blanket" (as the response to W2 and Q6).
  • In Sec 4.1, at the end of the 'Benchmark Construction' paragraph, we also moved the sample size of the Apple Gastronome to the main file: "we generated 200 samples for LLMs' analysis and annotation." (as the response to Q6).
  • In Sec 5.1, at the end of the 'Benchmark Construction' paragraph, we also moved the sample size of the Neuropathic to the main file: "We generated 100 samples for LLMs' analysis; since the number of possible factors is finite, we generate 1000 tabular data for CI tests." (as the response to Q6).
  • At the beginning of Sec 4.2, we included the summary of key findings from the full experiment results (as given in Table 5 in Appendix E.4) in the main file (as the response to W2).
  • At the beginning of Sec 5.1, we moved short introductions about the three realistic datasets to the main file. (as the response to W1, Q6, and Q9).

Please kindly let us know if you feel any additional revisions could further help improve the clarity and the exposition of our work. We sincerely appreciate and are looking forward to your re-evaluation combining our rebuttal and the discussion. Thank you again for your time and constructive suggestions!

Author Response

Thanks for the detailed and insightful comments on our work. We hope our response can sufficiently address your concerns.

W1. Reality of benchmarks

The choice of synthetic and realistic benchmarks is because of the evaluation purpose. Since we usually do not have access to the ground truth causal graph of the realistic benchmarks, we need to synthesize ones to effectively evaluate the performances of COAT.

Meanwhile, we do evaluate COAT in realistic data, for which we have revised our manuscript to make it clearer:

  • Brain Tumor: This is an open-sourced dataset containing MRI images for brain tumor classification.
  • Stock News: This dataset contains the close price of a company from 2006 to 2009 and its 804 news summary (from the New York Times).
  • ENSO: This dataset contains high-dimensional information about Earth’s atmosphere with fine-grained time and space coverage from the 19th century to the early 21st century. This is a popular real-world dataset in climate science.

W2 Evaluation details.

Thank you for the suggestion! We have revised the paper to include more details of the evaluation in Sec 4.1, and please kindly let us know if you feel any additional revision could further improve the clarity:

  • Purpose: We evaluate whether COAT can propose and identify a set of high-level factors belonging to the Markov Blanket of the target variable Y.
  • Construction: We prepare different high-level factors: 3 parents of Y, 1 child of Y, and 1 spouse of Y. These factors form a Markov blanket of Y. In addition, we also prepare one "disturbing" factor related to Y but not a part of this blanket.
  • Expectation: A good method is expected to propose the 5 high-level factors (up to semantic meanings) and exclude the "disturbing" factor.

The key findings from the experiments are summarized below:

  • COAT is more resistant to the "disturbing" factor, which is supported by the lower NMB column (number of factors out of the Markov blanket) in both table 1 (page 8) and table 5 (page 24).
  • COAT filters out irrelevant factors from LLMs' prior knowledge that are not reflected by the data, which is supported by the lower OT column (number of other irrelevant factors).
  • COAT robustly encourages the LLM to find more of the expected factors through feedback, which is supported by the higher MB column (number of factors in the Markov blanket).

W3 Lack of limitation discussion.

We indeed provided a discussion on limitations in the future work section on line 629, page 17. We revised the section title to Limitations and Future Directions to make it clearer.

Q1 Scope of the work.

The problem setting in Sec 3.1 establishes the objective of this work that reliably leverages an LLM to identify the underlying high-level factors in the Markov Blanket of a given target variable.

The inputs are merely the target variable and the unstructured data, which can be either text or images. After identifying a set of candidate high-level factors, the values of those factors can be either obtained by LLM annotations or by some external tools.

Meanwhile, identifiable causal discovery also requires certain assumptions about the data:

  • Faithfulness. The empirical distribution of the data reflects the actual data-generating process.
  • No selection bias. Otherwise, the faithfulness condition would be violated.
  • Sufficient sample size. Our method involves statistical tests, so a larger sample size is better.

Q2 Use of identifiability mentioned on page 2.

The references here are classic books and survey papers about causal discovery methods with rigorous theoretical guarantees of identifiability. We use them to show the important role of identifiability, which is lacking in the current literature on using LLMs for causality-related tasks. The discussion in lines 33 to 38 gives rise to our main research question: How can LLMs reliably assist in revealing the causal mechanisms behind the real world?

Q3 Description of problem setting.

The meanings of the sentence are:

  • Why Y serves as a guider: Only a subset of the hidden high-level factors belongs to a Markov blanket of Y and thus correlates with Y. Therefore, one may use Y to guide the identification of the desired high-level factors.
  • The relation actually refers to causal relations. In a Markov blanket, there are three types of causal relations: parents, children, and spouses. For the second type, some elements in z could be functions of Y.

We have revised the sentence to "Note that the target variable Y serves as a guider, and no prior assumption between x and Y is assumed" to avoid any potential misunderstandings.

Q5 The meaning of C and p, and the significance of Proposition 3.3?

We define the two concepts to formalize the LLMs' ability to propose useful factors:

  • Perception Score p: the probability that the LLM proposes such a new factor. This can be seen as a measure of the LLM's responsiveness to feedback.
  • Capacity Score C_Ψ: the decreasing ratio of the conditional mutual information, as described in Eq 8. This can be seen as a measure of the quality of the proposed factors.

The significance of Proposition 3.3:

  • It shows that COAT can provably find a set of high-level factors in the Markov blanket of Y within sufficiently many rounds if p > 0 and C_Ψ > 0.
  • It also characterizes the influence of the LLM's ability (p and C_Ψ) on the efficiency of COAT via Eq 9.

Q6 The details about benchmarks.

Thanks for the suggestion. These details (initially in appendix G) are added to the main paper now. A related discussion can be found in our response to W1.

Q10 & L1 Limitations.

Now the limitation part (initially a part of appendix B) is a standalone section. We also added discussions about faithfulness, selection bias, and sample size, as described in our response. Note that COAT relies on the LLM for factor proposal, whose risk can be controlled under the faithfulness condition, as implied by our theoretical results.

Author Response

Dear Reviewers,

Thank you for your time and constructive comments on our work. To summarize, all reviewers agree that the paper's proposal to reliably advance rigorous causal discovery methods with the advantages of foundation models like LLMs is novel and valuable (AG9s, szEm, 1cJQ, PF9K). The method is sufficiently evaluated on the constructed benchmarks (szEm, 1cJQ, PF9K). The soundness is justified by the provided theoretical results (szEm, 1cJQ). The paper also develops novel metrics to measure the LLMs' ability to propose desired factors (1cJQ, PF9K). Comprehensive case studies are presented on three realistic datasets (szEm, 1cJQ).

We believe all of the reviewers' concerns can be addressed. In the following, we brief our responses to the main concerns and suggestions raised in the review:

  • The guarantee on factor proposal (AG9s, 1cJQ, PF9K)
    • In this paper, we show COAT can provably identify a Markov blanket of a given target variable. In section 3.4, we propose two new metrics (p and C_Ψ) to characterize the capability of LLMs in identifying useful high-level factors and also establish the corresponding guarantee.
    • We further discuss the intuition behind p and C_Ψ, and also give the convergence rate of COAT in the response to Reviewer 1cJQ.
  • Baselines and ablation study (szEm, 1cJQ)
    • We construct an additional stronger baseline with CoT prompting to enhance the evaluation. We present the key results and interpretation in the response to Reviewer szEm and 1cJQ.
    • We provide empirical evidence to show COAT is not sensitive to hyperparameters like group size or cluster size in the response to Reviewer szEm.
    • We also provide empirical evidence to show COAT is not sensitive to the choice of prompt templates in the response to Reviewer 1cJQ.

We also provided an anonymous link to our sample code for reproducing the results in our paper to the Area Chair, according to the NeurIPS requirements.

Please let us know if there are any other concerns, and we are happy to discuss them. We would appreciate it if you could take our responses into consideration when making the final evaluation of our work.

Sincerely,

Authors.

Final Decision

The discussion on the manuscript was intense and of high quality between authors and reviewers. I have to recognize that the high engagement of reviewers (more than 30 posts were published) and authors significantly contributed to mutual understanding of the presented topic, and this made me more confident in proposing acceptance for this paper.

Indeed, all reviewers agreed that the outcome of the rebuttal and subsequent discussions were satisfying. The authors managed to address and solve all the main issues raised by reviewers, they all agree on this and some of them consequently raised their score.

As some reviewers also noted, I suggest the authors implement all the fixes discussed and promised during the rebuttal and discussion period (revising their exposition in various sections of the manuscript and moving some content to the main file). In particular, all the revisions promised by the authors across the many posts must be honored.