LLCP: Learning Latent Causal Processes for Reasoning-based Video Question Answer
Abstract
Reviews and Discussion
This work introduces a weakly-supervised approach for reasoning about accidental events, leveraging causal representation learning combined with experimental tests on the derived causal variables. These causal variables, when integrated with CLIP, are aligned with natural language, forming an explainable AI (XAI) system. The proposed method undergoes rigorous analysis on simulated data and demonstrates performance comparable to recent supervised techniques when tested on two real-world datasets.
Strengths
- The paper is well-structured, appealing to a broad readership.
- It effectively applies causal learning to a real-world scenario, using a robust implementation with statistical tests on TMGM-derived causal variables.
- The technique innovatively connects causal variables and natural language through CLIP, potentially reshaping standards in explainable AI (XAI).
- Evaluation is thorough, including simulations and comparisons with supervised methods.
- Impressively, the weakly-supervised approach matches the performance of supervised methods.
Weaknesses
The authors primarily use a generative model for potential causes and treat language as a secondary filter. This approach, however, neglects language cues. Is it possible to have language involved in the generative process? While the experiments show good improvements, could the authors clarify the motivation for the unsupervised generative approach over the supervised ones?
Questions
See the weakness.
Dear Reviewer U5ds, we would like to express our deep gratitude for the valuable feedback and helpful suggestions provided on our paper, as well as for the time you devoted to reviewing it. Below, we have addressed each of your comments and suggestions.
W1: The authors primarily use a generative model for potential causes and treat language as a secondary filter. This approach, however, neglects language cues. Is it possible to have language involved in the generative process?
A1: We thank you for the thoughtful and constructive reviews, which enhance the depth of our paper. Language cues can definitely be involved in the generative models. However, this requires the language signal to be aligned with the video events, i.e., it requires annotations; otherwise, random text will not benefit the training of the generative models. In this paper, we explore a more challenging and realistic scenario in which no textual annotations are used for model training. Below, we introduce the rationality and motivation of this scenario.
Rationality Intuitively, humans can effectively gain reasoning ability without cross-modality supervision, relying only on induction from observations. Theoretically, the latent variables and causal relations can be identified in an unsupervised way. Both points support the rationality of our task and explain how causal relations can be learned without textual supervision.
Motivation Here we show that real-world applications can benefit from this setting. First, regarding the difficulty of data collection, unannotated videos are far more abundant than annotated textual pairs, and annotating the data with natural language incurs extra cost. Second, the model is easily misled by shortcuts in the textual annotations. For example, if for all questions in the training set beginning with "What is the reason" the answer is similar, such as "White car speeding", the model easily learns this shortcut and responds "White car speeding" to any question beginning with "What is the reason". This issue, caused by the training data, is called language bias and is widely discussed in the field of question answering [R1, R2].
Please find the discussion in Appendix A3.3.
[R1] Multifactor Sequential Disentanglement via Structured Koopman Autoencoders. Nimrod Berman and Ilan Naiman and Omri Azencot. The Eleventh International Conference on Learning Representations,2023.
[R2] Klindt D, Schott L, Sharma Y, et al. Towards nonlinear disentanglement in natural data with temporal sparse coding[J]. arXiv preprint arXiv:2007.10930, 2020.
Experiments In light of your suggestions, we further conducted an experiment that involved the annotated language signals in the training process of generative models, i.e., using our LLCP as a plug-in module to identify the latent variables and learn the causal relations.
To learn the latent causal representation, we add two extra modules to the HCRN baseline: an MLP Decoder and a Prior Aligner. The MLP Decoder uses HCRN's feature output to generate the current observations, while the Prior Aligner assists in calculating the KL divergence, which is essential for modeling the transition functions and ensuring conditional independence with the prior. LLCRP, following HCRN's approach, first constructs latent causal representations using an encoder network identical to HCRN's. Then, similar to HCRN, the output latent variables are learned with supervision from the language side. To plug in LLCP, we also apply a decoder to generate the observations from the latent space. To ensure the independence of latent variables, LLCRP imposes a KL divergence constraint between the posterior distribution of these variables and a predefined prior distribution. Finally, we also employ a reconstruction loss between the ground-truth and generated observations.
As shown in the Table below, we report the performance of LLCRP on SUTD-TrafficQA and compare it with the original HCRN method. We can observe that learning the causal representation and process helps improve the accuracy of reasoning-based VideoQA tasks. On average, LLCRP achieves an improvement of almost 5% over the baseline HCRN.
| Model | B | A | I | C | F | R | Avg |
|---|---|---|---|---|---|---|---|
| HCRN | 34.17 | 50.29 | 33.40 | 40.73 | 44.58 | 50.09 | 36.26 |
| LLCRP | 38.95 | 44.98 | 32.43 | 48.64 | 41.70 | 47.16 | 41.22 |
W2: While the experiments show good improvements, could the authors clarify the motivation for the unsupervised generative approach over the supervised ones?
A2: We highly appreciate the suggestion, which helped us clarify the motivation of our paper. As mentioned above, the motivation for an unsupervised generative approach lies in the lower annotation cost and the greater robustness to language bias. Please refer to A1 for details.
Dear Reviewer U5ds,
Thank you for dedicating time to review and provide feedback on our submission. We hope our response and revised work effectively address your concerns. If there are additional matters you'd like us to consider, we eagerly await the opportunity to respond.
Best regards,
Authors of submission 910
This paper focuses on the task of reasoning-based video QA. The challenge is the lack of causal relations in cross-modal matching pipelines. This work employs a temporal multivariate generative model to understand the latent causal patterns, which can be used to answer cause-based and effect-based questions about videos. The experiments are conducted on two simulation datasets and two real-world VideoQA datasets.
Strengths
- The proposed model achieves performance comparable to supervised methods while using no paired textual annotations. The proposed framework can answer both cause-based (i.e., accident attribution) and effect-based (i.e., counterfactual prediction) reasoning questions. It shows the potential of self-supervision.
- The motivation of exploring latent causal relations for video QA makes sense and is interesting.
Weaknesses
- The presentation is not satisfying enough. It is hard for me to figure out the connection between the motivation and the implementation details, especially how the method can guarantee that the learned relations are causal rather than temporal, and the terminology of the variables. Please see "Questions" for detailed comments.
- Another main concern is whether the model is causality-based or just captures temporal relations. Beyond the concept of causal relations, I did not see how the implementation reflects the tools of causal inference or causal reasoning. Therefore, I wonder whether the causal understanding ability is over-claimed. Experimental results can verify the ability empirically, but theoretical explanations or guarantees are missing.
- The ablation studies in Table 4 mainly demonstrate the contributions of sub-networks rather than the role of historical states and environment. What if we replace the historical states and environment with wrong ones? What is the performance then? That would show that the performance drop is due to the lack of visual information rather than fewer parameters.
After rebuttal: the authors' responses addressed my concerns, including causal theory, ablation studies, and clarification on terminologies. I did not carefully check the details of Granger Causality as it is outside my expertise, but I assume that the usage of Granger Causality is correct.
Questions
- What does the red node in Figure 1 (b) mean? Why is it red in the test pipeline but not highlighted in the training pipeline?
- The abstract mentions that the proposed LLCP employs a temporal multivariate generative model to understand the causal patterns. I am not aware of how the temporal model can discover causal relations. How can we guarantee that the learned patterns are causal relations rather than temporal correlations using the so-called temporal model? In the method part, I do not see how the learned patterns are causal rather than correlational.
- I did not find strict definitions of historical state, neighborhoods, and environment variables. Could the authors provide precise and accurate definitions, or give examples of these three variables? For example, are they features of a single frame, an object, or a set of tracked frames? How does the object tracking model obtain these three variables in the temporal and spatial dimensions? Figure 4 seems to provide an example, but it appears rather late; I did not find the examples while reading Sec. 3.1.
- In Eq. (2), is x_{t,i} an image (or region of interest) or a feature vector?
- According to Figure 4, the question seems to be an input on the text side rather than the video side. In this case, how can we determine the object of interest when extracting the visual information and making the prediction of visual states? Is it reasonable to extract question-independent visual features to answer the question?
- What do the arrows in Figure 5 mean? Are they drawn manually or automatically estimated by the model?
W9(Q5): According to Figure 4, the question seems to be an input on the text side rather than the video side. In this case, how can we determine the object of interest when extracting the visual information and making the prediction of visual states? Is it reasonable to extract question-independent visual features to answer the question?
A9: We sincerely appreciate this question, which sheds new light on how to further improve our method. In light of your suggestion, we included a justification of the rationality and motivation for training without textual annotations in Appendix A3.3. Please note that the usages of textual information in training and inference are different. We respectfully believe that inference needs the question to identify the focus of interest, while the learning process can benefit from annotations but does not require them. We also extended our method to leverage the QA annotations. The detailed method, implementation, and experiments can be found in Appendix A5.
Training Without Textual Pairs It would be ideal if we always had enough annotated question-answer pairs as supervision, since they would help the learning of the reasoning model. However, annotated question-answer pairs are not easy to obtain, especially for tasks requiring domain knowledge such as traffic accident attribution. For humans, learning to reason can be achieved effectively without cross-modality supervision, relying solely on induction from observations within shared feature spaces. This concept aligns with causality theories [R2, R3], suggesting the feasibility of uncovering latent variables and causal relationships without supervision. Additionally, models trained with textual annotations are susceptible to language biases, where they learn shortcuts based on linguistic patterns rather than understanding the content. For instance, if all questions in the training set beginning with "What is the reason" have similar answers, the model might generalize inaccurately, associating any question of this format with a specific answer, such as "White car speeding", regardless of context. This phenomenon, known as language bias, has been extensively discussed in question-answering research [R4, R5].
Use of Text Queries in the Inference Phase Despite avoiding textual pairs in training, text queries play a crucial role during the inference phase. Our model, trained exclusively on visual data, can identify the visual causal relations among variables. During inference, textual queries are employed to first identify the local causal mechanisms we are interested in; we then leverage the causal relations to answer the specific question. For example, if two accidents occur in the same video, our model will find causal mechanisms with two root causes, and the query question then helps identify the root cause we are more interested in.
Extend the Method to Leverage QA Pairs To learn the latent causal representation, we add two extra modules to the HCRN baseline: an MLP Decoder and a Prior Aligner. The MLP Decoder uses HCRN's feature output to generate the current observations, while the Prior Aligner assists in calculating the KL divergence, which is essential for modeling the transition functions and ensuring conditional independence with the prior. LLCRP, following HCRN's approach, first constructs latent causal representations using an encoder network identical to HCRN's. Then, similar to HCRN, the output latent variables are learned with supervision from the language side. To plug in LLCP, we also apply a decoder to generate the observations from the latent space. To ensure the independence of latent variables, LLCRP imposes a KL divergence constraint between the posterior distribution of these variables and a predefined prior distribution. Finally, we also employ a reconstruction loss between the ground-truth and generated observations.
As shown in the Table below, we report the performance of LLCRP on SUTD-TrafficQA and compare it with the original HCRN method. We can observe that learning the causal representation and process helps improve the accuracy of reasoning-based VideoQA tasks. On average, LLCRP achieves an improvement of almost 5% over the baseline HCRN.
| Model | B | A | I | C | F | R | Avg |
|---|---|---|---|---|---|---|---|
| HCRN | 34.17 | 50.29 | 33.40 | 40.73 | 44.58 | 50.09 | 36.26 |
| LLCRP | 38.95 | 44.98 | 32.43 | 48.64 | 41.70 | 47.16 | 41.22 |
[R2]Yao W, Sun Y, Ho A, et al. Learning temporally causal latent processes from general temporal data[C]. ICLR, 2021.
[R3]Yao W, Chen G, Zhang K. Temporally disentangled representation learning[J]. NeurIPS, 2022, 35: 26492-26503.
[R4] Desen Yuan. Language bias in visual question answering: A survey and taxonomy. arXiv preprint arXiv:2111.08531, 2021
[R5] Zhiquan Wen, Guanghui Xu, Mingkui Tan, Qingyao Wu, and Qi Wu. Debiased visual question answering from feature and sample perspectives. NeurIPS, 34:3784–3796, 2021
W6(Q3): I did not find strict definitions of historical state, neighborhoods, and environment variables. Could the authors provide precise and accurate definitions, or give examples of these three variables? For example, are they features of a single frame, an object, or a set of tracked frames? How does the object tracking model obtain these three variables in the temporal and spatial dimensions? Figure 4 seems to provide an example, but it appears rather late; I did not find the examples while reading Sec. 3.1.
A6: Thanks a lot for the valuable suggestions, which improved the readability of our paper. The historical state, neighborhoods, and environment variables are all variables in the temporal dynamic system. The variables here can be understood as the agents in traffic events, such as vehicles, pedestrians, or traffic signals. Below are the detailed definitions. For a given variable x_{t,i}, its historical state is the variable with the same identity in the previous frame, i.e., x_{t-1,i}. Similarly, its neighborhoods denote the other agent variables (not environment variables) in the previous frame, such as x_{t-1,j} with j ≠ i. The environment variables denote the environmental context, such as traffic lights, zebra crossings, and traffic signals. We have included these detailed definitions in the revised Section 2.1 (Section 3.1 in the revised paper).
W7(Q4): In Eq. (2), is x_{t,i} an image (or region of interest) or a feature vector?
A7: Thanks for this question. x_{t,i} is a feature vector, obtained by ROI pooling on the CLIP feature tensor over the detected bounding box of a certain object, such as a vehicle.
W8(Q6): What do the arrows in Figure 5 mean? Are they drawn manually or automatically estimated by the model?
A8: The arrows denote the learned causal relations; an arrow from A to B denotes that A causes B. We manually drew the arrows on the picture according to the learned causal structure.
W2: Another main concern is whether the model is causality-based or just captures temporal relations. Beyond the concept of causal relations, I did not see how the implementation reflects the tools of causal inference or causal reasoning. Therefore, I wonder whether the causal understanding ability is over-claimed. Experimental results can verify the ability empirically, but theoretical explanations or guarantees are missing.
A2: We are grateful for the suggestions, which help clarify our theoretical soundness. We would like to respectfully highlight that the proposed LLCP framework is a causal framework consisting of two stages: causal discovery (learning the causal structures) and causal inference (using the causal structures). In light of your suggestion, we have included an identification theorem showing that our proposed model can learn the causal relations, with a proposition and proof in Appendix A2.
Learn the causal structures We assume that the data follow Granger Causality, which implies that the past values of one variable can be used to predict the future values of another, suggesting a directional relationship between these variables in a time-series context. With this assumption, we provide a proof showing that the learned causal relations are identifiable. Please refer to Appendix A2 and the revised Section 2.1 for details.
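To make the Granger-causality assumption concrete, below is a minimal, self-contained sketch (our illustration, not the paper's code) of the underlying idea on two scalar time series: x_j is said to Granger-cause x_i if adding x_j's past values reduces the error of predicting x_i beyond what x_i's own past achieves. The function names, the linear autoregressive model, the crude threshold, and the toy data are all illustrative assumptions.

```python
import numpy as np

def ar_residual_var(target, predictors, lag=2):
    """Fit a least-squares autoregressive model and return the residual variance."""
    T = len(target)
    lagged = [p[lag - k - 1:T - k - 1] for p in predictors for k in range(lag)]
    X = np.column_stack([np.ones(T - lag)] + lagged)   # intercept + lagged predictors
    y = target[lag:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ beta) ** 2)

def granger_causes(x_j, x_i, lag=2, tol=0.9):
    """x_j Granger-causes x_i if adding x_j's past shrinks x_i's prediction error."""
    restricted = ar_residual_var(x_i, [x_i], lag)        # x_i's own past only
    full = ar_residual_var(x_i, [x_i, x_j], lag)         # plus x_j's past
    return full < tol * restricted                       # crude threshold; a formal test uses an F-statistic

# Toy example: x_j drives x_i with a one-step delay.
rng = np.random.default_rng(0)
x_j = rng.normal(size=500)
x_i = 0.8 * np.roll(x_j, 1) + 0.1 * rng.normal(size=500)
print(granger_causes(x_j, x_i))   # expected: True
print(granger_causes(x_i, x_j))   # expected: False
```

The directionality comes from time order: only past values are allowed as predictors, so an improvement in predicting x_i from x_j's past cannot be explained by a symmetric correlation alone.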
Leverage the causal structures In the causal inference stage, we conduct root cause analysis and counterfactual prediction for cause-based and effect-based reasoning tasks, respectively. The former leverages the modularity of the causal mechanisms to locate the root cause by matching the observed agents with the local generative process. Counterfactual prediction is a classical causal inference tool that provides personalized predictions under counterfactual conditions. Please kindly refer to Chapter 7, Section 1 of Judea Pearl's book "Causality" [R1] for more details.
[R1] Pearl J. Causality[M]. Cambridge university press, 2009.
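As a concrete (and hedged) reading of how modular mechanisms enable root-cause localization, the sketch below scores each agent by the prediction error of its learned local transition model and flags the agent whose change is least explainable by its parents. The `transition_model` interface, the tensor shapes, and the thresholding are illustrative assumptions, not the paper's released implementation.

```python
import torch

def root_cause_scores(transition_model, hist, neigh, env, observed_next):
    """Score each agent by how poorly its local causal mechanism explains its next state.

    Assumed shapes (illustrative): hist, neigh, observed_next are [num_agents, feat_dim],
    env is [feat_dim]; transition_model(hist_i, neigh_i, env) returns a predicted feature.
    """
    scores = []
    with torch.no_grad():
        for i in range(hist.shape[0]):
            pred = transition_model(hist[i], neigh[i], env)
            # A large error means the agent's change cannot be explained by its parents.
            scores.append(torch.norm(pred - observed_next[i]).item())
    return scores

def pick_root_cause(scores, threshold=None):
    """Return the index of the least-explainable agent, or None if all are well explained."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    if threshold is not None and scores[best] < threshold:
        return None
    return best
```

Because each agent has its own local mechanism, the mechanisms of the other agents stay intact when one agent deviates, which is what makes this per-agent scoring meaningful.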
W3: The ablation studies in Table 4 mainly demonstrate the contributions of sub-networks rather than the role of historical states and environment. What if we replace the historical states and environment with wrong ones? What is the performance then? That would show that the performance drop is due to the lack of visual information rather than fewer parameters.
A3: We greatly appreciate this suggestion, which improved the completeness of our ablation studies. In light of your suggestions, we further conducted experiments replacing the historical states and environment with random vectors. As shown in Table 3 in the revised version, we summarize the performance of all versions of LLCP, including those using random features for the historical states and environment. First, using random historical and environmental features hurts performance, which demonstrates that both of these modules help learn the causal generation process. Second, we found that the number of parameters also influences performance, since using the random features performs slightly better than removing the modules entirely. Thanks again for your insightful advice!
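For concreteness, this extra ablation can be realized as a small feature swap at the input of the transition model; the sketch below (tensor and mode names are our placeholders) replaces the historical-state or environment features with random vectors of the same shape, so the parameter count is unchanged and only the information content differs.

```python
import torch

def ablate_inputs(hist_feat, env_feat, mode="none"):
    """Optionally replace historical-state / environment features with random vectors.

    mode: "none" | "random_hist" | "random_env" | "random_both".
    Shapes are preserved, so the downstream sub-networks keep exactly the same parameters.
    """
    if mode in ("random_hist", "random_both"):
        hist_feat = torch.randn_like(hist_feat)
    if mode in ("random_env", "random_both"):
        env_feat = torch.randn_like(env_feat)
    return hist_feat, env_feat
```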
W4(Q1): What does the red node in Figure 1 (b) mean? Why is it red during the test pipeline but not highlighted during the training pipeline?
A4: Thank you very much for this question. It helps us further clarify the illustration and improve the readability. The red node in Figure 1 (b) denotes the root cause (the agent causing the accident). We have added this to the caption in the revised version. In the training stage, we learn the causal mechanisms from regular videos in which no accident occurs. In the inference stage, we apply causal inference to identify the root cause and answer the corresponding questions.
W5(Q2): The abstract mentions that the proposed LLCP employs a temporal multivariate generative model to understand the causal patterns. I am not aware of how the temporal model can discover causal relations. How can we guarantee that the learned patterns are causal relations rather than temporal correlations using the so-called temporal model? In the method part, I do not see how the learned patterns are causal rather than correlational.
A5: Thanks for your questions. Please refer to Answer 2 above for the details. We assume that the variables in the temporal dynamics follow Granger Causality, which implies that the past values of one variable can be used to predict the future values of another, suggesting a directional relationship between these variables in a time-series context. Thus, we directly obtain causality (directional) rather than correlation (undirected). We provide a theoretical guarantee in Appendix A2 and Section 2.1 of the revised version.
Dear Reviewer naVu, we sincerely appreciate your suggestions which have helped improve our readability. We have included more justifications and discussions in the revised version, and extended the experiments of ablation studies. Please refer to the revised paper and appendix for details.
W1: The presentation is not satisfying enough. It is hard for me to figure out the connection between motivation and implementation details, especially how the method can guarantee that the learned relation is causal rather than temporal, and the terminologies of variables.
A1: Thanks for your valuable comments which have helped clarify our motivation, assumptions, and technical details. In light of your suggestions, we list the assumptions and discuss how the implementation details can constrain/leverage these assumptions in Appendix A3.1. Besides, we further provide the theoretical guarantee of why the learned model follows the Granger Causality in Section 2.1 and Appendix A2.
Our method's connection between motivation and implementation is deeply rooted in the core assumptions of the LLCP model, which are crucial for understanding causality in the temporal dynamic system.
- Key Motivation Our key motivation is to learn a model for answering reasoning questions without the use of textual annotations. To achieve that, our LLCP consists of two stages: one learns the causal processes with the temporal multivariate generative model, and the other applies the discovered causal processes to conduct inference for answering reasoning-based questions. The first stage relies on the assumption of Granger Causality, and the latter leverages the shared feature spaces and causal inference.
- Granger Causality Principle
  - Assumption: We assume that the past values of one variable can predict the future values of another, indicating a directional relationship in a time-series context and implying no instantaneous effects in the temporal dynamics. The variables are independent given the historical state x_{t-1}, in line with our data generation process.
  - Implementation: To substantiate this, we employ auto-regression based on the concept of Granger Causality. This approach predicates future events on historical data, validating our assumption about predictability. The identification of our proposed model is further supported by a proposition and proof in Appendix A2.
- Shared Feature Spaces in Visual and Textual Domains
  - Assumption: We posit that the feature spaces for the visual and textual domains are shared, facilitating the matching of the same variables across these domains.
  - Implementation: To corroborate this, we integrate the pre-trained CLIP model as a feature extractor for both video objects and textual elements (questions and answers). Its application ensures that both visual and textual data are represented in congruent feature spaces, thus affirming the feasibility of matching variables across these domains (see the sketch after this list).
- Causal Inference
  - Root cause analysis: We conduct root cause analysis by identifying the changed local causal mechanisms. The insight is that if one variable's change cannot be explained by its parents, then it is the root cause of the change.
  - Counterfactual: Counterfactual inference leverages the causal structures to predict the possible outcome under counterfactual conditions.
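The sketch below illustrates the shared-feature-space point referenced above: CLIP embeds an object image and candidate answer texts in one space, so they can be matched by cosine similarity. It uses the public openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers; the image file, the answer strings, and the use of a whole crop (instead of the paper's ROI pooling on CLIP feature tensors) are illustrative simplifications rather than the actual pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder inputs: one cropped object image and two candidate answer texts.
object_crop = Image.open("white_sedan_crop.png")              # hypothetical detector crop
answers = ["the white sedan was speeding", "the traffic light turned red"]

inputs = processor(text=answers, images=object_crop, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Both modalities live in one space, so cosine similarity suffices for matching.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)   # higher score = better-matching answer for this object
```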
Terminologies of variables We suppose the reviewer is curious about the terminology of "variables". This term denotes an object with randomness (a probability distribution). Here, our observed variables are the features of objects, such as vehicles, whose randomness comes from their different characteristics and states, while the latent variables are hidden and contribute to the data generation process.
Dear Reviewer naVu,
Thanks for your time dedicated to carefully reviewing this paper. It would be further highly appreciated if you let us know whether our response and the change in the paper properly address your concerns, despite your busy schedule. Thanks a lot!
With best regards,
Authors of submission 910
Dear Reviewer naVu,
We highly appreciate your time in reviewing this paper and all your efforts, which have improved our readability. Following your suggestions, we have tried our best to add more justifications and discussions in the revised version and to conduct the additional experiments. Due to the limited rebuttal discussion period, we would appreciate any opportunity to respond to your remaining concerns. Please kindly let us know if you have any.
Thanks again for all your efforts on this paper.
Best regards, Authors of submission 910
Thank the authors for the detailed responses, which addressed my concerns including causal theory, ablation studies, and clarification on terminologies. I did not carefully check the details of Granger Causality as it is outside my expertise, but I assume that the usage of Granger Causality is correct. I am happy to increase my rating to 6 and look forward to other reviewers' feedback.
Dear Reviewer naVu,
We would like to express our gratitude for your constructive feedback and approval of our work. Your insights into causal theory, ablation studies, and the need for clearer terminologies have been invaluable. We believe that incorporating your suggestions has significantly enhanced the quality of our submission. Thank you for your guidance and support throughout this process.
Best regards,
Authors of Submission 910
This paper introduces a video reasoning framework called Learning Latent Causal Processes (LLCP). LLCP first makes use of a conditional VAE framework to extract the spatio-temporal dynamics of objects in videos in a self-supervised manner. These extracted dynamics then serve as weak supervision to analyze the relationships between events during inference time, facilitating the identification of root cause of accidents and counterfactual effects. Tested on both synthetic and real-world benchmarks, the proposed method demonstrates advantage over baselines while still underperforming supervised methods.
Strengths
- It is well-motivated; modeling causal relationships between objects/events in video understanding is a desired but missing capability in AI systems.
- The implementation seems straightforward.
- The proposed method demonstrates promising results when compared with some baseline methods and its supervised counterparts on both synthetic and real-world benchmarks.
Weaknesses
- The novelty of the paper is limited. While the paper draws motivation from causal modeling, both the theoretical analysis and the implementation of TMGM using conditional VAEs do not adequately realize the causal processes in Eq. 1.
- It is not clear how the authors enforce the independence constraints, both spatially and temporally, on the noise variables via the three subnetworks f_N, f_H and f_E. I would say the fusion of these three networks encodes spatio-temporal dynamics across objects (agents), but it is unclear whether this has any bearing on causal relations.
- Experiments only compare LLCP with weaker variants of VAEs that offer little to no temporal modeling. In addition, the proposed method largely underperforms supervised methods, as indicated by Table 5 and Table 6.
Questions
Apart from my concerns in the Weaknesses section, I have some other questions:
- Regarding the first task of root cause analysis, is there any reference for the definition of root causes in Sec. 3.2? The provided definition seems to be applicable only to anomaly/outlier detection.
- In Sec. 3.4, what happens if we formulate counterfactual questions in an open-ended format, i.e., "what if"?
- The outlier analysis at inference time relies on features extracted from pretrained models (i.e., CLIP), which can be brittle if the test data fall outside the training distribution. In addition, using a language parser to parse user queries is not ideal either. Have you thought of an alternative solution?
W3: Experiments only compare LLCP with weaker variants of VAEs which offer little to no temporal modeling. In addition, the proposed method largely underperforms supervised methods as indicated by Table 5 and Table 6.
A3: We are grateful for your careful review and constructive suggestions to improve the completeness of our experiments. In light of your suggestion, we have further considered several recent VAE variants, SKD [R1] and SlowVAE [R2], shown in the following table. According to the experimental results, our method still enjoys a significant improvement. Please refer to Table A3 for details.
| Models | Function change (recall) | Function change (F1) | Structure change (recall) | Structure change (F1) |
|---|---|---|---|---|
| LSTM VAE | 52.11 | 27.81 | 37.75 | 20.78 |
| VAE | 51.92 | 27.33 | 31.08 | 19.35 |
| SlowVAE | 54.41 | 29.03 | 38.12 | 21.39 |
| SKD | 58.93 | 34.76 | 35.72 | 19.79 |
| LLCP | 63.34 | 63.01 | 45.04 | 44.19 |
For the comparison with supervised learning methods, we would like to respectfully highlight that our method does not use any textual annotations, which makes the results appear lower. For a fairer comparison, we also compared LLCP with other methods that use no QA supervision, such as CLIP [R3] and OpenFlamingo [R4], and found a significant improvement. Please refer to Appendix A7.2 for more details.
| Model | Accuracy |
|---|---|
| OpenFlamingo-3B (2 examples) | 28.7 |
| OpenFlamingo-3B (4 examples) | 29.1 |
| VidIL | in progress |
| CLIP | 27.7 |
| LLCP | 33.7 |
In addition, when further utilizing the textual annotations, we outperform other supervised learning methods, such as the baseline HCRN, by a large margin (5%). Please kindly refer to Appendix A5 and Table A2 for the implementation details and experimental results.
| Model | B | A | I | C | F | R | Avg |
|---|---|---|---|---|---|---|---|
| HCRN | 34.17 | 50.29 | 33.40 | 40.73 | 44.58 | 50.09 | 36.26 |
| LLCRP | 38.95 | 44.98 | 32.43 | 48.64 | 41.70 | 47.16 | 41.22 |
[R1] Multifactor Sequential Disentanglement via Structured Koopman Autoencoders. Nimrod Berman and Ilan Naiman and Omri Azencot. The Eleventh International Conference on Learning Representations,2023.
[R2] Klindt D, Schott L, Sharma Y, et al. Towards nonlinear disentanglement in natural data with temporal sparse coding[J]. arXiv preprint arXiv:2007.10930, 2020.
W4(Q1): Regarding the first task of root cause analysis, is there any reference for the definition of root causes in Sec. 3.2? The provided definition seems to be applicable only to anomaly/outlier detection.
A4: Thanks for your question, which helped improve the readability. The "root cause" in root cause analysis is defined as the fundamental, underlying, system-related reason that explains the failure. Changing this factor solves the problem at its root, which implies that the change of this factor cannot be explained by its cause variables [R3, R4]. In this paper, we consider the "root cause" from the perspective of causality and apply it to answer cause-based reasoning questions.
[R3]Tariq S Abdelhamid and John G Everett. Identifying root causes of construction accidents. Journal of construction engineering and management, 126(1):52–60, 2000
[R4] OSHA Fact Sheet. The importance of root cause analysis during incident investigation. URL https://www.osha.gov/Publications/OSHA3895.pdf, 2016.
Dear Reviewer DRzn, we highly appreciate your time dedicated to reviewing this paper. Following your comments and suggestions, we have further highlighted the contributions, clarified the model details, extended the experimental results, and provided more justifications. We modified the manuscript and appendix accordingly. Please find our point-to-point response below. Hope it can address all your concerns.
W1: The novelty of the paper is limited. While the paper draws motivation from causal modeling, both the theoretical analysis and the implementation of TMGM using conditional VAEs do not adequately realize the causal processes in Eq. 1.
A1: We appreciate this comment, which helped us clarify both the theoretical and technical details and highlight our contributions. We would like to emphasize that, under the assumptions we make, our TMGM model is able to learn the causal processes. We have included a detailed proof in the revised Appendix A2 and conducted an experiment in Appendix A6.3 to verify that we can learn the causal processes. We also analyze the relation between the motivation and the implementation below, discussing how the implementation details constrain/leverage these assumptions.
- Key Motivation Our key motivation is to learn a model for answering reasoning questions without the use of textual annotations. To achieve that, our LLCP consists of two stages: one learns the causal processes with the temporal multivariate generative model, and the other applies the discovered causal processes to conduct inference for answering reasoning-based questions. The first stage relies on the assumption of Granger Causality, and the latter leverages the shared feature spaces and causal inference.
- Granger Causality in Learning the Causal Processes
  - Assumption: We assume that the past values of one variable can predict the future values of another, indicating a directional relationship in a time-series context and implying no instantaneous effects in the temporal dynamics. The variables are independent given the historical state x_{t-1}, in line with our data generation process.
  - Implementation: To substantiate this, we employ auto-regression based on the concept of Granger Causality. This approach predicates future events on historical data, validating our assumption about predictability. The identification of our proposed model is further supported by a proposition and proof in Appendix A2.
- Causal Inference
  - Root cause analysis: We conduct root cause analysis by identifying the changed local causal mechanisms. The insight is that if one variable's change cannot be explained by its parents, then it is the root cause of the change.
  - Counterfactual: Counterfactual inference leverages the causal structures to predict the possible outcome under counterfactual conditions.
W2: It is not clear how the authors enforce the independence constraints, both spatially and temporally, on the noise variables by the three subnetworks f_N, f_H and f_E. I would say the fusion of these three networks encodes spatio-temporal dynamics across objects (agents) but it is unclear if this has any bearing on causal relations.
A2: Thanks very much for this question. It definitely helped improve the readability of our paper. We answer the questions regarding independence and model fusion below, and we have revised the manuscript accordingly to make these details clearer, as shown in Section 2.1.
Independence. We enforce the independence via the KL divergence loss in Equation (3) of the revised version (Equation (2) in the original). Specifically, we constrain the learned posterior distribution with a prior distribution that factorizes across variables given the history, p(x_t | x_{t-1}) = ∏_i p(x_{t,i} | x_{t-1}), which implies the following conditional independence property: given the historical information, the variables at the current time step are independent. Taking the logarithm, log p(x_t | x_{t-1}) = Σ_i log p(x_{t,i} | x_{t-1}). With the KL loss, we encourage the posterior to also satisfy these conditional independence properties.
Subnetworks. By designing three subnetworks, we decompose the causal process into local mechanisms. This encodes the prior knowledge that the transition functions are different for historical states, neighborhood interactions, and environmental cues. Through this decomposition, we learn each submodule separately, leveraging the modularity property of causal mechanisms, which makes the learning process easier.
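For illustration only (module names, dimensions, and the Gaussian parameterization are our assumptions, not the released code), the sketch below shows how three sub-networks could parameterize a per-variable transition prior whose outputs are fused, and how the KL term then decomposes variable by variable because the prior factorizes given the history.

```python
import torch
import torch.nn as nn

class LocalTransitionPrior(nn.Module):
    """Illustrative fusion of f_H (history), f_N (neighbors), f_E (environment)."""

    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.f_H = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.f_N = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.f_E = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.head = nn.Linear(3 * hidden, 2 * feat_dim)   # per-variable mean and log-variance

    def forward(self, hist_i, neigh_i, env):
        fused = torch.cat([self.f_H(hist_i), self.f_N(neigh_i), self.f_E(env)], dim=-1)
        mean, logvar = self.head(fused).chunk(2, dim=-1)
        return mean, logvar

def kl_diag_gaussians(post_mean, post_logvar, prior_mean, prior_logvar):
    """KL(q || p) between diagonal Gaussians for one variable.

    Summing this term over the variables i equals the KL against the joint prior,
    because the prior is built independently per variable given the history.
    """
    var_ratio = (post_logvar - prior_logvar).exp()
    diff = (post_mean - prior_mean) ** 2 / prior_logvar.exp()
    return 0.5 * (var_ratio + diff - 1.0 - (post_logvar - prior_logvar)).sum(dim=-1)
```

Here neigh_i stands for an already-aggregated neighbor feature; in practice the neighbor branch could pool over all other agents in the previous frame.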
W5(Q2): In Sec. 3.4, what happens if we formulate counterfactual questions in an open-ended format, i.e. “what if”?
A5: Thanks a lot. Similar to conventional VideoQA methods, the difference between the multiple-choice and open-ended formats lies in the predictor module. We use a classifier head and a generative head for the multiple-choice and open-ended formats, respectively. This means we can train or pre-train a generative head to support open-ended questions.
W6(Q3): The outlier analysis at inference times relies on extracted features from pretrained models (i.e., CLIP) which can be brittle if tested data is outside of the trained distribution. In addition, using language parser to parse user queries is not ideal as well. Have you thought of an alternative solution towards this?
A6: We appreciate this question, which improves the potential of our method for real-world applications. We have two potential strategies for handling possible OOD data. First, the CLIP model shows excellent transferability [R5]; we can efficiently conduct prompt learning or adapter tuning to align the pretrained CLIP with the test data distribution. Second, we may follow the conventional supervised methods to learn cross-modality matching and take LLCP as a plug-in module to learn causal representations. As shown in Appendix A5, we find that LLCP can benefit from the annotations to achieve better performance, which sheds light on how to resolve the difficulty caused by the lack of annotations.
[R5] Zhou K, Yang J, Loy C C, et al. Learning to prompt for vision-language models[J]. International Journal of Computer Vision, 2022, 130(9): 2337-2348.
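As one concrete form of the first strategy, a lightweight residual adapter can be tuned on top of frozen CLIP features with a small amount of in-domain data; the sketch below follows a generic adapter pattern and is our assumption for illustration, not a component of LLCP.

```python
import torch
import torch.nn as nn

class ClipFeatureAdapter(nn.Module):
    """Residual MLP adapter over frozen CLIP features for aligning to a new domain."""

    def __init__(self, dim=512, bottleneck=64, alpha=0.2):
        super().__init__()
        self.alpha = alpha                      # mixing weight between adapted and original features
        self.net = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.ReLU(),
            nn.Linear(bottleneck, dim), nn.ReLU(),
        )

    def forward(self, clip_feat):
        return self.alpha * self.net(clip_feat) + (1.0 - self.alpha) * clip_feat

# Only the adapter is optimized; the CLIP backbone stays frozen, so a handful of
# target-domain clips can already shift the feature distribution.
adapter = ClipFeatureAdapter()
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)
```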
Dear Reviewer DRzn
We sincerely appreciate the time and effort you dedicated to reviewing our submission and providing insightful comments. We eagerly await your assessment of our response and revisions, ensuring they effectively address your concerns. Should you have any additional points of consideration, we are eager to address them and welcome the opportunity for further dialogue.
Thank you for your valuable feedback.
Best regards,
Authors of submission 910
Dear Reviewer DRzn
We understand you are busy and appreciate your time. Here we are re-sending a previous message to make sure you see it. To ensure our response has fully addressed your concerns, we would greatly appreciate your feedback on the recent revisions and clarifications we provided, including the added experiments. If there are any remaining concerns or questions, we are eager to address them promptly.
Thanks a lot for all your effort you made in this paper.
Best regards,
Authors of Submission 910
Dear Reviewer DRzn,
We express our sincere gratitude for taking the time to review our manuscript. Your suggestions regarding the justifications and discussions in our article have greatly contributed to improving its quality. We have made detailed revisions to the manuscript and addressed your questions in our response. We hope that our answers have addressed any concerns you had regarding our work. The end of the rebuttal is approaching; your feedback is vital to us, and any response would be greatly appreciated.
Best Regards
Authors of submission 910
The Learning Latent Causal Processes (LLCP) framework introduces a novel approach to Video Question Answering (VideoQA) by focusing on causal reasoning rather than traditional cross-modality matching. LLCP utilizes a multivariate generative model to analyze spatial-temporal dynamics and trains through self-supervised local auto-regression, thus eliminating the need for annotated question-answer pairs. It adeptly handles accident attribution and counterfactual prediction tasks, identifying root causes and forecasting potential outcomes through modifications in variable embeddings.
Strengths
- The idea of learning causality from video is interesting.
- The paper is overall presented clearly.
- The authors made efforts to provide fair comparisons with existing methods.
Weaknesses
- The main concern is that the current setting is too far from realistic settings. The current evaluation setting really looks more like hacking parts of existing datasets. The reviewer encourages the authors to make this work more complete.
a. It is too constrained to only evaluate the proposed method when it has no access to QA labels. If the obtained model really captures the causal relationships in videos, plugging it into existing methods in the supervised training setting would show a much broader application of the proposed method.
b. It is also not realistic to exclude the text query from the training process, since the fusion between visual and textual input is also a crucial design for really solving the VQA problem. For example, if there are two accidents going on in the video, the current framework will have a systematic flaw, as the model is not conditioned on the question text that specifies which accident it is about.
c. Despite the effort of re-training many existing methods, it is not well-justified why it is necessary to discard important features like motion or objects as used in Causal-Vid-QA, which brings all the models into a low-performance regime.
d. There is no proper comparison with methods that do not require QA data, like but not limited to [a, b, c]. The authors should acknowledge and at least provide comparisons with some of these relevant methods to give the audience a correct and comprehensive understanding of the relevant solutions to this setting.
e. Once a and b are done, it is also necessary to provide additional comparisons on broader VideoQA datasets to understand the importance of the causal learning process in videos for broader VideoQA tasks, which would be really beneficial for the community.
[a] 🦩 Flamingo: a Visual Language Model for Few-Shot Learning
[b] Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
[c] Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering
Minor:
- Title, related work: Video Question Answer -> Video Question Answering.
Questions
Please check the weaknesses for details.
W6: Once a and b are done, it is also necessary to provide additional comparison on broader VideoQA datasets to understand the importance of causal learning process in videos for broader videoQA tasks, which is really beneficial for the community.
A6: Thanks for your suggestions about the comparisons on broader VideoQA datasets. We appreciate your kind expectations for the improvement of our method on the broader VideoQA tasks.
Why we did not consider broader VideoQA datasets We would like to respectfully highlight that LLCP is a framework designed for the reasoning-based VideoQA task. It builds on a pre-trained cross-modality matching model and demonstrates that, given a shared visual and textual representation space, we can further learn the reasoning ability without any textual annotations. This implies the assumption that pre-trained cross-modality matching models, such as CLIP or OpenFlamingo, can solve the broader VideoQA tasks (descriptive questions). There are two justifications for this assumption.
- Most current VideoQA models for descriptive tasks are trained as cross-modality matching.
- Existing zero-shot methods (models pre-trained with cross-modality matching and used for inference in a zero-shot manner) perform well on these reasoning-free tasks.
However, as shown in Table A6, these pre-trained models (CLIP and OpenFlamingo) do not perform well on the reasoning-based benchmarks, since those require a reasoning process that is not easy to obtain from large-scale pretraining datasets.
The general procedure using LLCP for reasoning-free tasks In light of your suggestion, we propose a general procedure to apply LLCP to the broader, reasoning-free VideoQA tasks (a minimal sketch is given below). For a given query, we first classify whether the question is reasoning-based (this is easy since reasoning-based questions have clear patterns). Then, we use the pre-trained cross-modality aligner for the reasoning-free tasks and LLCP for the reasoning-based ones. In fact, the results in Table 4 in the revised version (original Table 5) are obtained by this procedure: we compare on both the reasoning-free ("Basic Understanding", "Event Forecasting", "Reverse Forecasting") and reasoning-based ("Counterfactual Inference", "Introspection", and "Attribution") questions and report the re-weighted average over all question types in Table 4 of the revised version.
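A minimal sketch of the routing step mentioned above: classify the question by its surface pattern and dispatch to LLCP only for reasoning-based types. The cue list and the function interfaces are illustrative placeholders, not the exact rules used in the paper.

```python
REASONING_CUES = (
    "what is the reason", "why", "what if", "would the accident",
    "could the accident", "who is responsible",
)

def is_reasoning_question(question: str) -> bool:
    """Reasoning-based questions follow clear lexical patterns (illustrative cue list)."""
    q = question.lower()
    return any(cue in q for cue in REASONING_CUES)

def answer(question, video, llcp_model, clip_matcher):
    """Dispatch: LLCP for cause/effect reasoning questions, the cross-modality matcher otherwise."""
    if is_reasoning_question(question):
        return llcp_model(video, question)
    return clip_matcher(video, question)
```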
W7: Title, related work: Video Question Answer -> Video Question Answering.
A7: Thanks a lot for pointing it out! We have corrected this typo in the revised version. Please kindly see our revised Appendix A1.
W4: Despite the effort of re-training many existing methods, it is not well-justified why it is necessary to discard important features like motion or object as used in Causal-Vid-QA, which brings all the models to a low-performance scheme.
A4: Thanks for the suggestion to further clarify and justify the experimental settings. We double-checked the implementations of the re-trained methods and confirmed that all motion and object features are used in the re-training process. For the methods using motion features, such as HCRN, we apply the clip-level average of CLIP features as the motion features. For the methods using textual object features to build graphs, such as B2A, we also follow the original strategy. The main difference between the results of our re-trained methods and those reported in Causal-Vid-QA is the basic features: we use CLIP features for both the visual and textual domains in the re-training, while the original ones used ResNet101 and 3D-ResNeXt-101 visual features and GloVe/BERT textual features. As shown in Table A7 in the revised version (Table A6 in the original paper), we have included a comparison between the results of the re-trained methods and the reported results. Compared with the methods using GloVe features, our re-trained versions even improve the performance. In fact, the performance is sensitive to the textual features, with the order BERT > CLIP > GloVe, which makes sense since the backbone costs differ.
Comparisons under different features (backbones) are unfair, which motivated us to re-train all the methods with CLIP features. We apply the CLIP features since they satisfy well the assumption that the visual and textual feature spaces are shared. We show that, under the same backbone features, our unsupervised LLCP achieves results comparable to the supervised methods and significantly improves over the baseline CLIP method.
W5: There is no proper comparison with methods that do not require QA data like but not including to [a,b,c]. The authors should also acknowledge and at least provide comparison with some of these relevant methods to really provide the audience a correct and comprehensive understanding of the relevant solution to this setting.
A5: We appreciate the valuable suggestion to include a discussion of LLM-based methods. In light of your suggestion, we have added a comparison between our method and Flamingo [a] on SUTD-TrafficQA and are trying our best to evaluate VidIL [b]. We did not include a comparison with [c] since its code is not publicly released. We applied the open-source alternative OpenFlamingo [R3] instead of Flamingo [a] in our implementation. We provide the experimental comparisons and a brief discussion below; please refer to Appendix A7.2 for the detailed comparison and discussion.
We evaluate all the models on SUTD-TrafficQA in a zero-shot setting. For OpenFlamingo, we use 2/4 textual demo examples (no visual demos, due to the zero-shot setting) to guide the outputs. Here is an illustration of how the prompt is constructed.
```python
token_inputs = []
for i in range(len(questions)):
    # One in-context (text-only) demo, then the <image> token and the actual query.
    token_inputs.append(
        " Question: What might have happened moments ago?"
        " Candidate 1: The blue truck hit the white sedan from the back."
        " Candidate 2: The white sedan crashed into the blue truck."
        " Candidate 3: The blue truck did an emergency brake."
        " Candidate 4: The white sedan lost its control."
        " Answer: The blue truck did an emergency brake."
        "<|endofchunk|>"
        "<image>"
        "Question: {} Candidate 1: {} Candidate 2: {} Candidate 3: {} Candidate 4: {} Answer:".format(
            questions[i],
            ans_candidates[0][i],
            ans_candidates[1][i],
            ans_candidates[2][i],
            ans_candidates[3][i],
        )
    )
```
As shown in the table below, we summarize the average accuracy of OpenFlamingo, CLIP, and LLCP. We found that these LLM-based models do not work as well as expected since no task-specific knowledge is used, which is interesting and may inspire follow-up research. The evaluation of VidIL is currently in progress; we face some issues with leveraging GPT-3 to obtain the responses and are trying our best to reproduce the strategy used in that method, but it still needs some time. The results will be updated once we finish the evaluation. Please kindly note that VidIL is a few-shot method, not zero-shot, but we are glad to add it to the comparison since it can be converted to the zero-shot setting.
| Model | Accuracy |
|---|---|
| OpenFlamingo-3B (2 examples) | 28.7 |
| OpenFlamingo-3B (4 examples) | 29.1 |
| VidIL | in progress |
| CLIP | 27.7 |
| LLCP | 33.7 |
[R3] Awadalla A, Gao I, Gardner J, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models[J]. arXiv preprint arXiv:2308.01390, 2023.
W3: It is also not realistic to exclude the text query from the training process since the fusion between visual and textual input is also the crucial design to really solve the VQA problem. For example, if there are two accidents going on in the video, the current framework will have systematical flaw as the model is not conditioned on the question text which specifies which accident it is about.
A3: Thanks for your insightful comments. The discussion on the use of text queries definitely helps elaborate the underlying insights of our method. We have refined the paper to add a separate subsection A3.3 in the appendix for this discussion.
The textual information in training and inference We discuss the usage of the text query from two points of view: the training process and the inference process. We respectfully think that text queries are extremely useful during the testing phase for pinpointing the question and finding precise answers (such as in the example in your comment). However, in the training phase, the use of text queries, or text-pair supervision, is useful but not indispensable, provided we have aligned feature spaces for the visual and textual domains. In our implementation, LLCP also involves the textual query to identify the question in the inference phase, even though textual pairs are not used as supervision. In the provided example, our model, trained on visual data only, would identify two potential accident candidates. During the inference phase, we utilize the query question to first pinpoint the candidate of interest and then use the selected candidate to accurately respond to the query.
Rationality and motivation for training without textual pairs First, humans can effectively learn reasoning ability without cross-modality supervision by leveraging only induction from observations. Similarly, theorems in the field of causality show that we can identify the latent variables and causal relations in an unsupervised way. These serve as support for the rationality of learning causal relations without textual supervision. We then show that this ability to learn to reason without annotations is well-motivated in real-world applications. First, regarding the difficulty of data collection, unannotated videos are far more abundant than annotated textual pairs, and annotating the data with natural language raises extra costs. Second, the model is easily misled by shortcuts in the textual annotations. For example, if for all questions in the training set beginning with "What is the reason" the answer is similar, such as "White car speeding", then the model easily learns this shortcut and responds "White car speeding" to any question beginning with "What is the reason". This issue, caused by the training data, is called language bias and is widely discussed in the field of question answering [R1, R2].
[R1] Desen Yuan. Language bias in visual question answering: A survey and taxonomy. arXiv preprint arXiv:2111.08531, 2021
[R2] Zhiquan Wen, Guanghui Xu, Mingkui Tan, Qingyao Wu, and Qi Wu. Debiased visual question answering from feature and sample perspectives. Advances in Neural Information Processing Systems, 34:3784–3796, 2021
Dear Reviewer dMTZ, we sincerely appreciate your informative feedback and helpful suggestions that helped enhance the integrity of our evaluations. We have added the new experiments following your suggestions, provided more discussions, and modified the paper and appendix accordingly. Please see our point-to-point response below.
W1: The main concern is that the current setting is too far from realistic settings. The current evaluation setting is really more like hacking parts of existing datasets. The reviewer encouraged the authors to make this work more complete.
A1: We appreciate the suggestion which helped make this paper more complete and closer to realistic applications.
Make more complete experiments In light of this suggestion and the following detailed points, we have included more comprehensive evaluations, and provided more discussions as well as justifications for the experimental designs. Specifically,
- we extended our method by using QA annotations (as a plug-in module) and compared it with the baseline in Appendix A5;
- we added the comparisons and discussions with the zero-shot multi-modal LLM models, such as OpenFlamingo, in Appendix A7.2;
- we included more justifications on rationality and motivation for training without textual annotations in Appendix A3.3;
- we clarified the experimental settings of Causal-VideoQA;
- we analyzed a principled way to use LLCP in broader settings.
Justify the reality of the current setting We appreciate your valuable questions about the experimental settings and welcome further discussion of their realism. We respectfully believe that our experimental setting targets a more difficult and realistic scenario rather than hacking existing datasets. Though textual annotations can be meticulously prepared for training reasoning models to enhance the model's ability, in many applications, such as traffic accident reasoning, it is unrealistic to prepare large-scale, well-organized textual annotations for the reasoning questions, since this is usually time-consuming, labor-intensive, and requires expert knowledge. Therefore, it is realistic and important to address the reasoning QA problem in an unsupervised manner. However, there are no published datasets for this problem, so we employ the existing reasoning QA datasets with textual annotations and compare with the existing supervised methods, which also shows the advantages of our method.
W2: It is too constrained to only evaluate the proposed method when it has no access to QA labels. If the obtained model really captures the causal relationships in videos, plugging it into existing methods in the supervised training setting can show a much broader application of the proposed method.
A2: We are grateful for this suggestion, which makes our evaluations more complete. In light of your suggestions, we plug LLCP into existing methods in the supervised training setting, such as HCRN [1], and provide experimental results on the SUTD-TrafficQA dataset. The resulting method, which plugs in LLCP and learns both the latent causal representation and the latent causal process, is called LLCRP. Please kindly refer to Section A5 in the revised appendix for the detailed model design and experimental evaluations. Here, we provide a brief summary of LLCRP and the experimental comparison.
Implementation Details To learn the latent causal representation, we add two extra modules to the HCRN baseline: an MLP Decoder and a Prior Aligner. The MLP Decoder uses HCRN's feature output to reconstruct the current observations, while the Prior Aligner is used to compute the KL divergence, which is essential for modeling the transition functions and ensuring conditional independence with respect to the prior. Concretely, LLCRP first constructs latent causal representations with an encoder network identical to HCRN's, and these latent variables are learned with supervision from the language side, as in HCRN. To plug in LLCP, we additionally apply a decoder that generates observations from the latent space. To encourage independence of the latent variables, LLCRP imposes a KL divergence constraint between the posterior distribution of these variables and a predefined prior distribution. Finally, we employ a reconstruction loss between the ground-truth and generated observations. A minimal sketch of the resulting training objective is given below.
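The following is a minimal, illustrative PyTorch sketch of this plug-in head, not the exact implementation: the backbone feature `feat`, the observation target `obs`, the dimensions, and the loss weights are hypothetical placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LLCRPHead(nn.Module):
    """Plug-in head sketch: an MLP decoder reconstructs the current observation
    from the backbone's latent feature, and a KL term (the "Prior Aligner")
    aligns the posterior with a standard-normal prior to encourage independent latents."""

    def __init__(self, feat_dim=512, obs_dim=512, hidden_dim=1024):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, feat_dim)      # posterior mean
        self.to_logvar = nn.Linear(feat_dim, feat_dim)  # posterior log-variance
        self.decoder = nn.Sequential(                   # MLP Decoder
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, obs_dim))

    def forward(self, feat, obs):
        mu, logvar = self.to_mu(feat), self.to_logvar(feat)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        recon = self.decoder(z)
        recon_loss = F.mse_loss(recon, obs)                       # reconstruction term
        kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
        return recon_loss, kl_loss

# Usage sketch: combine with the backbone's supervised QA loss.
# recon_loss, kl_loss = head(backbone_feature, observation)
# total_loss = qa_loss + lambda_recon * recon_loss + lambda_kl * kl_loss
```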
Experimental Results As shown in the table below, we report the performance of LLCRP on the SUTD-TrafficQA dataset and compare it with the original HCRN method. We observe that learning the causal representation and process helps improve the accuracy of reasoning-based VideoQA. On average, LLCRP achieves an improvement of almost 5 points over the baseline HCRN.
| Method | B | A | I | C | F | R | Avg |
|---|---|---|---|---|---|---|---|
| HCRN | 34.17 | 50.29 | 33.4 | 40.73 | 44.58 | 50.09 | 36.26 |
| LLCRP | 38.95 | 44.98 | 32.43 | 48.64 | 41.70 | 47.16 | 41.22 |
Dear Reviewer dMTZ
Thank you for your time and comments! We hope we are not bothering you, but we look forward to learning whether our response and revision properly address your concerns, and whether you have any remaining questions, to which we would welcome the opportunity to respond.
With best regards,
Authors of submission 910
Reviewer dMTZ
Thank you for your valuable time dedicated to reviewing our submission and for your insightful suggestions to make experiments more complete. We've tried our best to conduct the experiments and address your concerns in the response and updated submission. Due to the limited rebuttal discussion period, we eagerly await any feedback you may have regarding these changes. If you have further comments, please kindly let us know--we hope for the possible opportunity to respond to them.
Many thanks,
Authors of submission 910
Dear Reviewer dMTZ,
We understand you are busy and appreciate your time. Here we are re-sending a previous message to make sure you see it. We have tried our best to add new experiments and provide more discussion to address your concerns. Due to the limited rebuttal time, we are eager to know whether your concerns have been fully addressed, and we would be glad to respond further to any remaining questions. Any feedback would be greatly appreciated!
Thank you very much for all the effort you have put into reviewing this paper!
Best regards,
Authors of Submission 910
The paper introduces a novel causal framework, LLCP, that advances the field of VideoQA by focusing on self-supervised learning of spatial-temporal dynamics without the need for annotated data. The model is adept at reasoning about video content through accident attribution and counterfactual predictions, leveraging a generative model and natural language processing to generate answers. This approach demonstrates comparable performance to supervised methods, showcasing its potential to reduce reliance on large annotated datasets and enhance the AI's understanding of causality in videos.
Strengths
The paper in question exhibits a commendable level of originality by shifting the focus of Video Question Answering (VideoQA) from pattern recognition to causal reasoning, an approach that has not been extensively explored in this field. The introduction of the Learning Latent Causal Processes (LLCP) framework marks a creative synthesis of self-supervised learning, generative modeling, and natural language processing, tailored to decipher the causal dynamics of video content without relying on annotated question-answer pairs. This methodological innovation reflects the paper's high quality, as it seemingly adheres to rigorous empirical standards and offers a robust validation on both synthetic and real-world datasets. In terms of clarity, the paper articulates its contributions and methodologies with precision, making the novel concepts accessible and understandable, which is indicative of the authors' commitment to effective communication of complex ideas. The significance of this work is multifold, promising to reduce the need for labor-intensive labeled data in VideoQA, enhance the interpretive and interactive capabilities of AI systems with causal reasoning, and potentially influencing a range of applications where understanding the underlying causal relationships in visual data is crucial. Overall, this paper appears to make a substantial and meaningful contribution to the literature, potentially setting a new course for future research in the AI domain, with implications that extend beyond the immediate scope of VideoQA.
Weaknesses
In the paper, potential areas for improvement include enhancing the model's robustness to spurious correlations and label noise, more explicitly demonstrating LLCP's ability to capture causal relations through additional experiments, and benchmarking its causal reasoning against current approaches. The paper could also benefit from a more detailed discussion of the challenges in video reasoning it aims to address and a clearer explanation of its operation independent of established causal frameworks and annotations. Finally, a dedicated section that explicitly outlines the paper's limitations and assumptions would add transparency and guide future research directions. Addressing these points could strengthen the paper's contributions and its value to the VideoQA field.
Questions
Could the authors provide a comprehensive list of the limitations and assumptions inherent in LLCP, and discuss how these might affect the generalizability and applicability of the model? Could you provide additional empirical evidence or case studies that demonstrate LLCP's specific capability to uncover underlying causal relations as opposed to merely correlational patterns? Is there a quantitative evaluation comparing LLCP's causal reasoning capacity with that of existing approaches, and if so, what benchmarks or metrics were used?
Details of Ethics Concerns
none.
Dear Reviewer 6hfp, we greatly appreciate the time you dedicated to reviewing this paper and your positive assessment. We provide a point-by-point response to your comments below and have updated the paper and appendix accordingly.
Q1: Could the authors provide a comprehensive list of the limitations and assumptions inherent in LLCP, and discuss how these might affect the generalizability and applicability of the model?
A1: Thanks for your valuable suggestion, which helped clarify the scope and boundaries of our method. In light of your suggestion, we have added separate sections in the appendix, including Appendix A3.1, A3.2, and A2, to discuss the limitations and assumptions inherent in LLCP. We provide a summary below.
Assumptions From the perspective of causality, the assumptions are as follows:
- All variables follow Granger causality: the past values of one variable can be used to predict the future values of another, implying a directional relationship between variables in a time-series context (a formalization is sketched after this list).
- The feature spaces of the visual and textual domains are shared, which implies that we can match the same variables across the two domains.
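For concreteness, a standard linear vector-autoregressive (VAR) formalization of Granger causality is sketched below; this linear form is only illustrative, since the transition functions learned by LLCP are nonparametric.

```latex
% Linear VAR(p) illustration: x^j Granger-causes x^i iff A_k^{ij} \neq 0 for some lag k,
% i.e., the past of x^j improves the prediction of x^i given the past of all variables.
x^{i}_{t} = \sum_{k=1}^{p} \sum_{j=1}^{d} A^{ij}_{k}\, x^{j}_{t-k} + \varepsilon^{i}_{t},
\qquad \varepsilon^{i}_{t} \ \text{independent of the past} \ \{x_{t-k}\}_{k \ge 1}.
```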
To ensure the validity of the aforementioned assumptions, in our implementation, we provide the following designs:
- Implementationally, we employ auto-regression to capitalize on the concept of Granger causality: the model predicts future events based on historical data, which operationalizes the assumption that one variable's future values can be predicted from another's past values. We provide a proposition and the corresponding proof of the identifiability of the proposed model in Appendix A2.
- To support the assumption of shared feature spaces across the visual and textual domains, we incorporate the pre-trained CLIP model. It acts as a feature extractor for both video objects and textual elements (questions and answers), so that visual and textual data are represented in congruent feature spaces, which makes it feasible to match variables across the two domains (a minimal sketch of this matching step is given after this list).
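As an illustration of this matching step, the sketch below uses the HuggingFace `transformers` CLIP interface to embed an object crop and a few candidate textual descriptions into the shared space and pick the closest text; the checkpoint choice, the crop file, and the candidate phrases are hypothetical placeholders rather than the exact pipeline used in the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint and inputs; only meant to show the shared feature space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

object_crop = Image.open("white_car_crop.png")  # a tracked object region (placeholder)
candidates = ["a white car speeding", "a pedestrian crossing", "a red traffic light"]

with torch.no_grad():
    image_inputs = processor(images=object_crop, return_tensors="pt")
    text_inputs = processor(text=candidates, return_tensors="pt", padding=True)
    img_emb = model.get_image_features(**image_inputs)  # visual feature
    txt_emb = model.get_text_features(**text_inputs)    # textual features

# Cosine similarity in the shared CLIP space; the best match names the visual variable.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = img_emb @ txt_emb.T
print(candidates[scores.argmax().item()])
```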
Limitations Granger causality. One notable limitation of our approach is its reliance on the Granger causality assumption, which is somewhat strong. For simplicity, our current model does not account for more complex scenarios, such as the presence of latent confounders or non-stationary conditions. These factors, while critical in certain contexts, are beyond the scope of our initial assumptions. Despite this, our application of Granger causality serves as the foundation of a framework for understanding and leveraging causal relationships within temporal dynamics for subsequent reasoning tasks. Looking ahead, we aim to explore the identifiability of causal variables under less stringent assumptions, which will enable a more nuanced understanding of causal relationships in diverse and complex environments.
Reliance on pre-trained models. Another aspect of our methodology is its reliance on pre-trained models, notably CLIP and tracking models. The primary motivation for employing these models is to minimize the need for textual annotations. Nonetheless, as elaborated in Appendix A5, it is possible to remove these mild assumptions and further harness the benefits of strong textual annotations, which could enhance the model's robustness and applicability in varied contexts.
Q2: Could you provide additional empirical evidence or case studies that demonstrate LLCP's specific capability to uncover underlying causal relations as opposed to merely correlational patterns? Is there a quantitative evaluation comparing LLCP's causal reasoning capacity with that of existing approaches, and if so, what benchmarks or metrics were used?
A2: Thanks for your suggestion! It definitely helped us highlight the advantages of our LLCP method. In light of your suggestion, we have added an extra experiment comparing our LLCP (which estimates causal relations) with the baseline method HCRN (which learns with the likelihood under supervision) on a simulated fMRI time-series benchmark.
Specifically, to demonstrate that the proposed LLCP can uncover underlying causal relations as opposed to merely correlational patterns, we further evaluate our method on a Granger causality task. To achieve this, we follow the setting in [R1] and evaluate the proposed methods on the simulated fMRI time-series benchmark. To make our method output the Granger-causal structure explicitly, we employ a conditional-VAE architecture and treat the Granger-causal structures as latent variables. For the HCRN model, we obtain the estimated structure from inner products between the features extracted by HCRN. We consider four metrics: accuracy (ACC), balanced accuracy (BA), and the areas under the receiver operating characteristic (AUROC) and precision-recall (AUPRC) curves (the metrics can be computed as in the sketch below). Experimental results are shown in Table A5 and Appendix A6.3; they show that the proposed LLCP can capture the underlying causal relations.
[R1] Ričards Marcinkevičs and Julia E Vogt. Interpretable models for Granger causality using self-explaining neural networks. arXiv preprint arXiv:2101.07600, 2021
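For reference, a minimal scikit-learn sketch of how the four structure-recovery metrics can be computed for an estimated Granger-causal adjacency matrix is given below; the function name, the threshold, and the toy data are illustrative and do not reproduce the exact benchmark protocol of [R1].

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             balanced_accuracy_score, roc_auc_score)

def granger_structure_metrics(true_adj, score_adj, threshold=0.5):
    """Score an estimated Granger-causal structure against the ground truth.

    true_adj:  (d, d) binary matrix, true_adj[i, j] = 1 if series i Granger-causes j.
    score_adj: (d, d) real-valued edge scores (e.g., posterior edge probabilities,
               or normalized feature inner products for a correlational baseline).
    """
    y_true = np.asarray(true_adj).astype(int).ravel()
    y_score = np.asarray(score_adj).ravel()
    y_pred = (y_score >= threshold).astype(int)
    return {
        "ACC": accuracy_score(y_true, y_pred),
        "BA": balanced_accuracy_score(y_true, y_pred),
        "AUROC": roc_auc_score(y_true, y_score),
        "AUPRC": average_precision_score(y_true, y_score),  # average precision as the PR summary
    }

# Toy usage (illustrative only): a random 5-node graph and noisy edge scores.
rng = np.random.default_rng(0)
true_adj = (rng.random((5, 5)) > 0.7).astype(int)
score_adj = np.clip(true_adj + 0.3 * rng.standard_normal((5, 5)), 0.0, 1.0)
print(granger_structure_metrics(true_adj, score_adj))
```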
Dear Reviewer 6hfp
Thank you for investing your time in reviewing and offering feedback on our submission. We eagerly await your assessment of our response and revision to ensure they effectively address your concerns. If there are any additional points you would like us to consider, we look forward to the opportunity to provide further clarification.
Best regards,
Authors of submission 910
Dear Reviewers 6hfp, dMTZ, naVu, DRzn, and U5ds
Thanks for the thoughtful and constructive reviews. We have put a lot of effort into providing the response and revising the paper accordingly. Below let us summarize the new, informative experimental results inspired by your suggestions. As Nov 22nd is the last day for us to respond, we hope for the chance to see and respond to your feedback. Thank you very much!
Updated presentation
- To reviewer 6hfp and DRzn, we have clarified the limitations and assumptions inherent in LLCP (in Appendix A3.1, A3.2 in the revised version).
- To reviewers naVu, DRzn, and U5ds, we have elaborated on the rationality and motivation of our setting in Appendix A3.3 of the revised version.
- To reviewers naVu and DRzn, we have demonstrated that our LLCP framework can learn causal relationships, with a theoretical guarantee, as shown in Appendix A2 and on Page 3.
- To reviewer naVu, we have clarified the definition for different variables in the data generation process (on Page 3 of the revised version).
Newly conducted experiments
- To reviewer 6hfp, we have added an extra experiment to demonstrate LLCP's specific capability to capture underlying causal relations, which are shown in Table A5 in the revised version.
- To reviewer dMTZ, we have added a comparison between our method and OpenFlamingo on SUTD-TrafficQA, as shown in Table A6 in the revised version.
- To reviewers dMTZ, DRzn, and U5ds, we have proposed to plug LLCP into existing methods in the supervised training setting and evaluated it on the SUTD-TrafficQA dataset, as shown in Appendix A5 and Table A2 in the revised version.
- To reviewer naVu, we have refined the ablation studies to demonstrate the contributions of historical states and environment (in Table 3 of the revised version).
- To reviewer DRzn, we have further considered two recent variants of VAEs (in Table A3) and discussed the reasons for their underperformance.
Thanks again for your time dedicated to carefully reviewing this paper. We hope our response and the change in the paper properly address your concerns.
With best regards, Authors of submission 910
The paper on LLCP introduces an innovative approach to Video Question Answering (VideoQA) by focusing on causal reasoning and self-supervised learning. It is evaluated by five reviewers, with ratings ranging from 3 (reject) to 8 (accept). The reviews reflect a diversity of opinions regarding the paper's soundness, presentation, and contribution to the field.
Overall, all the reviewers appreciate the innovation and methodological novelty: LLCP's approach of leveraging causal reasoning in a self-supervised learning context without annotated data is appreciated across the board. This is the main reason the meta-reviewer intends to vote for acceptance.
Why not a higher score
- Causal Modeling Clarity: Reviewers raised concerns about distinguishing causal relationships from temporal correlations, suggesting a need for stronger theoretical backing and clearer causal reasoning demonstrations.
- Realism in Evaluation: The evaluation settings might not fully represent realistic scenarios, with suggestions for broader applications and text query incorporation indicating room for practical improvements.
- Comparative Analysis: The paper lacks extensive comparative analysis with existing models, particularly those not requiring QA data, limiting its contextual placement in the research landscape.
- Assumptions and Limitations: A more detailed discussion of the paper's limitations and assumptions is needed for depth and comprehensive understanding.
- Performance vs. Supervised Methods: The LLCP framework underperforms supervised methods in some aspects, leaving room for improvement in effectiveness.
Why not a lower score
- Innovative VideoQA Approach: The novel focus on causal reasoning in self-supervised learning without annotated data is a significant and recognized contribution.
- Presentation and Structure: The paper's clarity and structured approach contribute to its positive reception.
- Self-Supervised Learning Potential: LLCP's comparable performance to supervised methods demonstrates its potential to reduce reliance on annotated datasets and enhance AI interpretability.
- Empirical Validation: The framework is empirically validated on synthetic and real-world datasets.
- Receptive to Feedback: The authors' constructive engagement with the review process and their rebuttals addressed some of the criticisms raised.
Accept (poster)