PaperHub
Rating: 6.8/10 (Rejected)
5 reviewers; ratings 4, 5, 4, 5, 3 (min 3, max 5, std. dev. 0.7)
Confidence: 3.2
Originality: 2.2; Quality: 2.6; Clarity: 2.6; Significance: 2.6
NeurIPS 2025

Learning General Causal Structures with Hidden Dynamic Process for Climate Analysis

Submitted: 2025-05-07; Updated: 2025-10-29
TL;DR

We develop an estimation approach simultaneously learning both the observational causal structures and latent causal process.

Abstract

Keywords
Climate Analysis; Causal Discovery; Causal Representation Learning

Reviews and Discussion

Official Review
Rating: 4

This work establishes a refined connection between SEMs (Structural Equation Models) and nonlinear ICA (Independent Component Analysis), and applies it to climate analysis. Specifically, it extends ICA-based causal discovery to nonparametric settings with latent confounders. It proves an equivalence between SEM and nonlinear ICA in this setting, so that an ICA model can be learned instead of an SEM. The training objective consists of an ELBO and two penalty terms. Experiments on both synthetic and real-world climate datasets demonstrate the performance of the proposed CaDRe and validate the theoretical design.

Strengths and Weaknesses

This work is solid in general, with some (probably minor) issues to resolve before it can be accepted. On the strengths side, the proof that an ICA equivalent to the SEM exists is nontrivial, and the paper manages to keep its formulations and proofs simple in presentation. Additionally, the performance of the proposed CaDRe is validated on real-world datasets in the climate domain, and the discovered causal relations are consistent with climate knowledge.

However, the paper has some weaknesses whose resolution would improve the current manuscript.

W1. The proposed method could generalize to domains beyond climate data: it seems that CaDRe can be used wherever latent variables may be present. Many time-series datasets from other domains could also serve as testbeds. Please correct me if there are specific constraints that restrict CaDRe to climate data.

W2. On synthetic data, the comparisons are mainly with causal learning models, and the comparison with time-series models is not convincing, as the compared models PCMCI and LPCMCI were proposed no later than 2020. Comparisons with more recent time-series models are needed.

W3. On real-world data, the experiment is conducted on only one dataset, CESM2, whose statistics and scale are unclear from the main paper.

W3. It is not very clear how the latent variables ($z$) are interpreted for climate analysis. In particular, the paper should detail how the visualization is conducted.

W4. The advantage of learning a nonlinear ICA instead of an SEM should also be discussed and verified. In other words, this work should explain why it is worth proving the existence of an equivalent ICA and building it, instead of simply training an SEM.

W5. Efficiency is not considered in this manuscript. The training/inference time is not reported for any method.

Questions

Please refer to the weaknesses.

Limitations

This is purely methodological ML research without apparent potential for negative societal impact.

Final Justification

I appreciate the authors' rebuttal, which has addressed most of my concerns. I have adjusted my rating accordingly.

Paper Formatting Concerns

None

Author Rebuttal

Dear Reviewer 1X1S, we sincerely thank you for your valuable comments, constructive suggestions, and encouraging feedback. Below, we provide point-by-point responses and have updated the main paper and appendix accordingly.

W1: CaDRe could generalize to other time series domains

Yes, you are correct. In light of your suggestions, we have evaluated CaDRe across 3 standard long-term time-series forecasting datasets from diverse domains, including finance (Exchange), public health (ILI), and traffic monitoring.

As shown in the table below, CaDRe demonstrates consistently strong forecasting performance across a wide range of horizons, indicating robust generalization to varied time-series domains.

| Dataset | I/O Len | CaDRe (MSE/MAE) | N-Transformer (MSE/MAE) | Autoformer (MSE/MAE) | MICN (MSE/MAE) | TimesNet (MSE/MAE) |
|---|---|---|---|---|---|---|
| ILI | 18-6 | 1.200/0.691 | 1.491/0.757 | 2.637/1.094 | 4.847/1.570 | 2.406/0.840 |
| | 72-24 | 1.856/0.833 | 2.551/1.039 | 2.653/1.116 | 4.776/1.556 | 2.270/0.988 |
| | 144-48 | 1.796/0.878 | 2.227/1.018 | 2.696/1.139 | 4.917/1.584 | 2.978/1.123 |
| | 216-72 | 2.010/0.984 | 2.595/1.081 | 2.960/1.167 | 4.804/1.584 | 2.696/1.098 |
| ECL | 18-6 | 0.114/0.216 | 0.134/0.242 | 0.136/0.254 | 0.250/0.338 | 0.128/0.236 |
| | 72-24 | 0.121/0.220 | 0.140/0.246 | 0.144/0.257 | 0.258/0.342 | 0.134/0.242 |
| | 144-48 | 0.124/0.225 | 0.155/0.260 | 0.163/0.275 | 0.271/0.353 | 0.149/0.256 |
| | 216-72 | 0.131/0.232 | 0.169/0.274 | 0.175/0.287 | 0.279/0.357 | 0.166/0.271 |
| Traffic | 18-6 | 0.487/0.307 | 0.797/0.347 | 0.554/0.322 | 0.475/0.287 | 0.781/0.337 |
| | 72-24 | 0.452/0.303 | 0.625/0.319 | 0.508/0.318 | 0.454/0.276 | 0.608/0.307 |
| | 144-48 | 0.412/0.282 | 0.574/0.314 | 0.497/0.319 | 0.450/0.275 | 0.553/0.296 |
| | 216-72 | 0.400/0.278 | 0.593/0.325 | 0.524/0.330 | 0.473/0.287 | 0.564/0.303 |

W2: Comparison of more recent time series causal discovery models in synthetic data

Thanks for pointing this out. We additionally compare several recent time-series causal discovery methods, including TCDF [1] (2019), IDOL (2024), and TDRL (2022), on the same simulated datasets used in our main experiments.

| Method | SHD↓ | Precision↑ | Recall↑ | F1↑ |
|---|---|---|---|---|
| CaDRe | 0.185±0.021 | 0.803±0.037 | 0.830±0.012 | 0.815±0.025 |
| TCDF | 0.429±0.035 | 0.442±0.041 | 0.384±0.046 | 0.411±0.014 |
| IDOL | 0.297±0.029 | 0.624±0.053 | 0.598±0.040 | 0.610±0.021 |
| TDRL | 0.348±0.031 | 0.505±0.037 | 0.462±0.014 | 0.483±0.056 |

Implementation details: We follow the official implementations and default settings for all methods. For TCDF, we set the significance threshold to 0.9 and treat identified causes as parents. IDOL's latent graph is mapped to the observed space and yields instantaneous graphs. TDRL's latent graph is mapped to the observed space and yields time-lagged graphs only.
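As a hedged illustration of how edge-level metrics like those in the table above are typically computed from binary adjacency matrices, the sketch below counts a reversed edge once in SHD and, as an assumption (the response does not state its normalization), divides SHD by the number of node pairs $d(d-1)/2$. The function name `edge_metrics` is hypothetical.

```python
import numpy as np


def edge_metrics(B_true, B_est):
    """Edge metrics between binary adjacency matrices ([i, j] = 1 means i -> j)."""
    B_true = np.asarray(B_true, dtype=bool)
    B_est = np.asarray(B_est, dtype=bool)
    d = B_true.shape[0]
    off_diag = ~np.eye(d, dtype=bool)          # ignore self-loops
    B_true, B_est = B_true & off_diag, B_est & off_diag

    tp = int(np.sum(B_true & B_est))           # edges present with the correct orientation
    precision = tp / max(int(B_est.sum()), 1)
    recall = tp / max(int(B_true.sum()), 1)
    f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

    # SHD: count each unordered pair {i, j} whose edge status differs,
    # so a reversed edge costs 1 rather than 2.
    diff = B_true != B_est
    shd = int(np.triu(diff | diff.T, k=1).sum())
    n_pairs = d * (d - 1) // 2                 # assumed normalizer giving SHD in [0, 1]
    return {"shd": shd, "shd_norm": shd / n_pairs,
            "precision": precision, "recall": recall, "f1": f1}


# Example: true graph x0 -> x1 -> x2; the estimate reverses x0 -> x1 and adds x0 -> x2.
B_true = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
B_est = np.array([[0, 0, 1], [1, 0, 1], [0, 0, 0]])
print(edge_metrics(B_true, B_est))
```

For this example, SHD is 2 (one reversal and one extra edge), precision is 1/3, recall is 1/2, and F1 is 0.4.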

[1] Nauta, et al. "Causal discovery with attention-based convolutional neural networks." Machine Learning and Knowledge Extraction 1.1 (2019): 19.

W3 (1): CESM2 statistics are unclear from the main paper

Thank you for pointing this out. The CESM2 Pacific SST dataset consists of monthly SST data from a 500-year pre-2020 control run, with 6000 time steps over ocean-only regions. It retains a native resolution of 186 × 151, totaling 28086 spatial points—24749 valid SST observations and 3337 land points excluded from analysis. For efficiency, we use a downsampled 6 × 14 grid (84 points). This is described in Main Paper, Line 296 and Appendix, Lines 1347–1355, and has been further highlighted in the revision.
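The response does not state how the native 186 × 151 grid is reduced to 6 × 14. One plausible reading, sketched below purely as an assumption, is block averaging over ocean cells, with land points stored as NaN and ignored. The helper name `coarsen_sst` and the block boundaries produced by `np.array_split` are illustrative choices, not the authors' code.

```python
import numpy as np


def coarsen_sst(field, out_shape):
    """Average a (rows, cols) field into out_shape blocks, ignoring NaN (land) cells.

    Hypothetical helper: np.array_split allows blocks to differ in size by one
    row or column when the grid does not divide evenly (151 / 14 does not).
    """
    rows = np.array_split(np.arange(field.shape[0]), out_shape[0])
    cols = np.array_split(np.arange(field.shape[1]), out_shape[1])
    out = np.full(out_shape, np.nan)
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            block = field[np.ix_(r, c)]
            if np.any(~np.isnan(block)):       # all-land blocks stay NaN
                out[i, j] = np.nanmean(block)
    return out


# Example on a synthetic 186 x 151 field with a rectangular "land" patch.
rng = np.random.default_rng(0)
sst = rng.normal(20.0, 2.0, size=(186, 151))
sst[:40, :30] = np.nan
coarse = coarsen_sst(sst, (6, 14))
print(coarse.shape, int(np.sum(~np.isnan(coarse))))
```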

W3 (2): On real-world data, the experiment is conducted on only one dataset, CESM2

Thank you for the suggestion. In addition to CESM2, we include two real-world datasets in our revised experiments:

  • Weather (see Appendix, Table A10), a standard benchmark with hourly meteorological observations;
  • ERSST, the NOAA Global Temperature Anomaly Dataset (1880–2025), consisting of 2052 monthly steps and 16,020 spatial grid points per step. We reduce it to 100 dimensions via spatial averaging for forecasting.

Results on CESM2, Weather, and ERSST are summarized below. CaDRe consistently achieves competitive or best performance in both MSE and MAE across prediction lengths.

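| Dataset | Length | CaDRe MSE | CaDRe MAE | iTransformer MSE | iTransformer MAE | Autoformer MSE | Autoformer MAE | TimesNet MSE | TimesNet MAE | MICN MSE | MICN MAE | CARD MSE | CARD MAE | FITS MSE | FITS MAE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CESM2 | 96 | 0.410 | 0.483 | 0.422 | 0.491 | 0.959 | 0.735 | 0.415 | 0.486 | 0.417 | 0.486 | 0.409 | 0.484 | 0.439 | 0.508 |
| CESM2 | 192 | 0.412 | 0.487 | 0.425 | 0.495 | 1.574 | 0.972 | 0.417 | 0.497 | 1.559 | 0.984 | 0.422 | 0.493 | 0.447 | 0.515 |
| CESM2 | 336 | 0.413 | 0.485 | 0.426 | 0.494 | 1.845 | 1.078 | 0.423 | 0.499 | 2.091 | 1.173 | 0.421 | 0.497 | 0.482 | 0.536 |
| Weather | 96 | 0.157 | 0.203 | 0.168 | 0.214 | 0.225 | 0.259 | 0.180 | 0.231 | 0.199 | 0.256 | 0.423 | 0.497 | 0.172 | 0.221 |
| Weather | 192 | 0.207 | 0.248 | 0.193 | 0.241 | 0.354 | 0.348 | 0.212 | 0.265 | 0.238 | 0.298 | 0.482 | 0.544 | 0.216 | 0.260 |
| Weather | 336 | 0.270 | 0.314 | 0.426 | 0.494 | 0.354 | 0.348 | 0.423 | 0.499 | 0.316 | 0.496 | 0.525 | 0.596 | 0.386 | 0.439 |
| ERSST | 96 | 0.145 | 0.268 | 0.247 | 0.264 | 0.953 | 0.272 | 0.432 | 0.508 | 0.726 | 0.765 | 0.197 | 0.273 | 0.539 | 0.297 |
| ERSST | 192 | 0.208 | 0.307 | 0.251 | 0.535 | 1.024 | 0.908 | 0.452 | 0.585 | 1.263 | 0.892 | 0.233 | 0.375 | 0.226 | 0.752 |
| ERSST | 336 | 0.305 | 0.361 | 0.305 | 0.659 | 1.387 | 1.353 | 0.581 | 0.607 | 1.173 | 1.172 | 0.487 | 0.484 | 0.439 | 0.535 |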

These additions strengthen the generalizability of our method on diverse real-world climate datasets. All experiments follow consistent preprocessing and evaluation protocols as detailed in the main paper.

W3 (3): It is not very clear how the latent variables (z) are interpreted for climate analysis

Thank you for this important question. Latent factors are, by definition, unobserved, so in principle we cannot give them a direct interpretation. This poses a central challenge in the field of CRL. However, once we are sure that latent factors exist and understand how they relate to the measured variables, we can begin to interpret them and even devise ways to measure them. For example, given domain knowledge of latent factors from climate experts and scientists, we can match them to meaningful quantities, e.g., precipitation, solar radiation, or components of a climate foundation model's representation. This mirrors historical processes in science, such as the discovery of viruses, which were first hypothesized based on indirect evidence and later confirmed and measured directly.

We have included this perspective in our main paper and regard it as our future work.

W3 (4): There should be details of how visualization is conducted

We clarify the visualization procedure as follows:

  • Wind System Visualization: The wind data consist of two components: the zonal (east-west) component $u$ and the meridional (north-south) component $v$. For each spatial location $a$, we draw an arrow originating at $(\lambda_a, \phi_a)$, with its direction and length determined by the vector $(u_a, v_a)$: the direction indicates wind flow, and the length is proportional to the magnitude of the vector.

  • Causal Graph Visualization: If an edge from region $a$ to region $b$ is present in the estimated causal adjacency matrix $B$, i.e., the weight $B_{a,b}$ is nonzero, we draw an arrow $a \rightarrow b$ between their positions in the dataset.
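A minimal sketch of these two visualization rules, assuming per-location longitude/latitude arrays `lon` and `lat`, `u` as the east-west and `v` as the north-south wind component (the usual meteorological convention), and the convention stated above that a nonzero $B_{a,b}$ denotes an edge $a \rightarrow b$. The helper names and the `scale` factor are hypothetical; the code only computes arrow start points and offsets.

```python
import numpy as np


def wind_arrows(lon, lat, u, v, scale=1.0):
    """Start points and (dx, dy) offsets for one wind arrow per location a.

    Direction follows (u_a, v_a); length is scale * sqrt(u_a**2 + v_a**2).
    `scale` is an assumed display factor.
    """
    starts = np.column_stack([lon, lat])
    offsets = scale * np.column_stack([u, v])
    return starts, offsets


def causal_arrows(B, lon, lat, tol=0.0):
    """Edge list and arrow segments a -> b for every entry with |B[a, b]| > tol."""
    src, dst = np.nonzero(np.abs(B) > tol)
    keep = src != dst                          # skip self-loops
    src, dst = src[keep], dst[keep]
    starts = np.column_stack([lon[src], lat[src]])
    ends = np.column_stack([lon[dst], lat[dst]])
    return list(zip(src.tolist(), dst.tolist())), starts, ends - starts
```

The returned start points and offsets can be passed to Matplotlib's `Axes.quiver` with `angles="xy", scale_units="xy", scale=1`, which draws each arrow from $(x, y)$ to $(x + dx, y + dy)$ in data coordinates.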

W4: Why train an ICA instead of SEM

Thank you for this valuable question. The key distinction lies in the nonparametric setting of our SEM. In the parametric additive-noise setting, where the SEM takes the form $X = f(X) + E$, the noise term can be isolated as $E = X - f(X)$, allowing a direct reconstruction-based loss $|X - f(X)|$ as used in prior work [2,3]. However, in the nonparametric setting $X = f(X, E)$, the noise $E$ cannot be separated from $X$ via subtraction. This makes direct optimization of the SEM via a reconstruction loss ill-posed for causal discovery. In contrast, nonlinear ICA offers a principled alternative by recovering the latent sources (here, $E$) under identifiability guarantees, which in turn allows recovery of the causal graph. We have clarified this point in the revised manuscript.
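A toy numerical illustration of this argument (not from the paper). With additive noise, the residual $X_2 - f(X_1)$ is exactly the noise. For a nonparametric mechanism such as $X_2 = X_1 E$, chosen here only as an example, any residual $X_2 - g(X_1)$ has conditional variance $X_1^2 \operatorname{Var}(E)$, so it stays dependent on the parent and cannot serve as the recovered noise. The sketch measures that dependence through the correlation between the residual's magnitude and the parent.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.uniform(0.5, 2.0, n)   # parent variable
e = rng.normal(0.0, 1.0, n)     # exogenous noise, independent of x1

# Additive-noise SEM: x2 = f(x1) + e, so subtracting f(x1) returns e exactly.
x2_add = np.sin(x1) + e
res_add = x2_add - np.sin(x1)

# Nonparametric SEM x2 = f(x1, e), here the illustrative choice f(x1, e) = x1 * e.
# The MSE-optimal subtraction removes E[x2 | x1] = x1 * E[e] = 0 and leaves x1 * e.
x2_mul = x1 * e
res_mul = x2_mul - 0.0


def abs_corr(res, parent):
    """Correlation between |residual| and the parent; near 0 if the residual scale ignores it."""
    return np.corrcoef(np.abs(res), parent)[0, 1]


print(round(abs_corr(res_add, x1), 3))   # close to 0
print(round(abs_corr(res_mul, x1), 3))   # clearly positive (about 0.4)
```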

[2] Zheng, Xun, et al. "Dags with no tears: Continuous optimization for structure learning." Advances in neural information processing systems 31 (2018).

[3] Lachapelle, Sébastien, et al. "Gradient-based neural dag learning." arXiv preprint arXiv:1906.02226 (2019).

W5: Efficiency is not considered in this manuscript.

Thank you for highlighting this important point. We agree that computational efficiency is an important consideration. In our revised manuscript, we have included measurements of training time, memory usage, and inference latency to provide a more complete assessment of efficiency.

  • Climate Forecasting Model – Training and Inference Efficiency
    (CESM2, input dim = 82, sequence length = 96, batch size = 1; measured on NVIDIA A100-SXM4-80GB with CUDA 12.6, averaged over 100 runs):
MetricCaDReAutoformerTimesNetCARDMICNFITSTDRLiTransformer
TrainingTime(s)6134871297215131826392318709
Memory(GB)1.2341.8435.5172.0181.0450.8861.0271.093
InferenceLatency(ms)1.095±0.2038.414±3.38611.867±2.8952.620±0.9324.315±1.3671.953±0.8740.974±0.1260.919±0.185

These results are shown in Appendix, Fig. A8, and are now explicitly referenced in the main text.

  • Causal Discovery – Inference Time Comparison (ms)
    (Using official open-source implementations from Tigramite and Causal-Learn, which are training-free):
| Method | CaDRe | FCI | CD-NOD | PCMCI | LPCMCI | PC |
|---|---|---|---|---|---|---|
| Latency (ms) | 1.10±0.20 | 999.09±16.27 | 2242.88±27.14 | 3391.35±76.64 | 3508.50±123.36 | 2230.72±27.94 |

These results show that CaDRe is highly efficient in both training and inference, while maintaining strong performance in causal representation learning and discovery.
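A sketch of how latency figures of the form mean ± std over 100 runs are commonly obtained. The warm-up count and the helper name `measure_latency_ms` are assumptions, not details stated in the response.

```python
import statistics
import time


def measure_latency_ms(fn, n_runs=100, n_warmup=10):
    """Mean and sample standard deviation of fn() wall-clock latency, in milliseconds.

    Warm-up calls (an assumed count) exclude one-off costs such as lazy
    initialization or caching. n_runs must be at least 2 for the std.
    """
    for _ in range(n_warmup):
        fn()
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.mean(samples), statistics.stdev(samples)


mean_ms, std_ms = measure_latency_ms(lambda: sum(range(10_000)), n_runs=20)
print(f"{mean_ms:.3f} ± {std_ms:.3f} ms")
```

For GPU models, the clock should only be read after pending kernels finish (in PyTorch, by calling `torch.cuda.synchronize()` inside `fn`), since CUDA kernel launches are asynchronous.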

Comment

Dear Reviewer 1X1S,

We are grateful for your time on our paper, your constructive comments, and your recognition of the significance and novelty of our work. Could you please have a look at our response and let us know whether your concerns have been addressed, regarding

  • W1: Generalization of CaDRe to other domains and additional experiments
  • W2: Comparison with more recent time series causal discovery models on synthetic data
  • W3 (1,2): CESM2 statistics and additional experiments on climate across two new datasets
  • W3 (3,4): Physical interpretability of latent variables and details on how the visualization is conducted
  • W4: The rationale for training an ICA model instead of directly training a SEM.
  • W5: Clarification of CaDRe’s efficiency in terms of training time, memory cost, and inference time.

Additionally, for W4, we provide further empirical evidence and identifiability analysis to complement our initial response:


W4: Why train an ICA instead of SEM

  • Experimental Verifications: We tested direct SEM training by replacing the $s_t$ prior estimation in CaDRe with a likelihood-based objective (under a Gaussian assumption, it reduces to the reconstruction loss in [1]):

    $$\max \mathbb{E}_{x_t \sim P_{x_t}} \sum_{j=1}^{d} \log p_j\left(x_{t,j} \mid x_{t,\pi_j}, z_t\right),$$

    where $\pi_j$ denotes the set of parents of node $x_j$ in the observational causal graph. This equation is a variant of Eq. (4) in [2] that accounts for latent confounders; we denote it **CaDRe_SEM**. All other terms (e.g., the DAG constraint and Jacobian-based edge support) follow our settings, with identical datasets, implementations, and hyperparameters.

    | Method | $d_x$ | SHD↓ | Precision↑ | Recall↑ | F1↑ |
    |---|---|---|---|---|---|
    | **CaDRe_ICA** | 3 | 0.00 | 1.00 | 1.00 | 1.00 |
    | | 6 | 0.18 | 0.80 | 0.83 | 0.81 |
    | | 8 | 0.29 | 0.76 | 0.78 | 0.77 |
    | | 10 | 0.43 | 0.63 | 0.65 | 0.64 |
    | **CaDRe_SEM** | 3 | 0.12 | 0.86 | 0.82 | 0.84 |
    | | 6 | 0.40 | 0.64 | 0.60 | 0.62 |
    | | 8 | 0.51 | 0.50 | 0.42 | 0.46 |
    | | 10 | 0.56 | 0.49 | 0.41 | 0.44 |

    As shown, **CaDRe_SEM** exhibits a sharp performance decline, with substantially higher SHD and lower precision/recall, confirming that direct SEM training degrades in the nonparametric latent setting.
  • ICA Establishes Identifiability: Indeed, causal models to be identified here can be written as either an SEM or ICA. We reformulate the SEMs into a constrained form of nonlinear ICA primarily to establish identifiability results in a natural way, since no existing theory guarantees identifiability for nonparametric SEMs with latent variables.
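A minimal sketch of the CaDRe_SEM likelihood above, assuming linear-Gaussian conditionals $p_j$ purely for brevity (the response does not specify their parameterization). A binary parent mask `A`, with `A[i, j] = 1` meaning $x_i$ is a parent of $x_j$, restricts each node's mean to its parents $\pi_j$ plus the latents $z_t$. With fixed noise scales, maximizing this objective is equivalent to minimizing a squared reconstruction error, which is the reduction to [1] mentioned above. The function name and weight arrays are hypothetical.

```python
import numpy as np


def sem_log_likelihood(X, Z, A, W_x, W_z, sigma):
    """Sample mean of sum_j log N(x_j | mean_j, sigma_j^2).

    X: (n, d) observations; Z: (n, k) latents; A: (d, d) binary parent mask;
    W_x: (d, d) and W_z: (k, d) linear weights; sigma: (d,) noise scales.
    """
    mean = X @ (A * W_x) + Z @ W_z             # column j uses only x_{pi_j} and z_t
    resid = X - mean
    log_p = -0.5 * (resid / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    return log_p.sum(axis=1).mean()


rng = np.random.default_rng(0)
n, d, k = 50, 3, 2
X, Z = rng.normal(size=(n, d)), rng.normal(size=(n, k))
A = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])   # chain x0 -> x1 -> x2
W_x, W_z = rng.normal(size=(d, d)), rng.normal(size=(k, d))
print(sem_log_likelihood(X, Z, A, W_x, W_z, np.ones(d)))
```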

[1] Zheng, Xun, et al. "Dags with no tears: Continuous optimization for structure learning." Advances in neural information processing systems 31 (2018).

[2] Lachapelle, Sébastien, et al. "Gradient-based neural dag learning." arXiv preprint arXiv:1906.02226 (2019).


We hope these address your concerns. As the discussion phase ends in 2 days, we would greatly appreciate it if you could let us know if you have any remaining questions or suggestions.

Your further feedback would be highly appreciated.

Best,

The Authors of Submission 9138

Comment

I appreciate the authors' rebuttal, which has addressed most of my concerns. I will adjust my rating accordingly.

Comment

Dear Reviewer 1X1S,

Thank you very much for taking the time to review our work. We are sincerely grateful for your constructive comments and are pleased to hear that our responses addressed your concerns.

With best regards,

The Authors of Submission 9138

Official Review
Rating: 5

This paper addresses causal discovery over observed variables while identifying the latent variables. The authors assume a generative model grounded in climate-analysis intuitions, where an observed variable is causally affected not only by its lagged or instantaneous counterparts but also by some of the latent variables. The authors also assume that some of the latents $z_t$ are not necessarily causal parents but may modulate the causal mechanism behind the observed variables $x_t$ through a random term $s_t$. The authors demonstrate pointwise identification of the latent variables and “edge-by-edge” identification of the causal graph over the observed variables. Experiments show improvements in structure recovery over baselines.

Strengths and Weaknesses

Strengths

  • The authors made commendable efforts to communicate their work clearly, although some parts could have been better written.
  • The theoretical part discusses the main important aspects of the problem.
  • The proposed model was compared against a fairly large number of baselines in both causal discovery and time-series forecasting.
  • I took note of the quality of the appendix complementing the main paper (although some elements should have been stated in the core paper for better clarity—see below).

Weaknesses

  • For “$s_{t,i} = g_{si}(z_t, \epsilon_{x_{t,i}})$” in Eq. 1, the authors mention “noise conditioned on $z_t$”; however, it should not be considered “noise,” since it depends on $z_t$. $s_t$ plays the role of an endogenous mediator.
  • $s_t$ is said to be “designed to capture inherent climatic variability, such as perturbations introduced by human activities on CO₂”: I think more attention should be paid to explaining the idea behind $s_t$. I am not an expert in climate analysis, but it is unclear how $s_t$, if interpreted, for example, as perturbations induced by human activity on CO₂, could be “driven” by latent causes such as solar radiation, as stated in the text.
  • The authors mention causal representation learning, and in Table 1 the model CaDRe claims to recover a “latent causal graph.” However, there is no discussion of the identification of the causal graph among the latent variables (lagged and instantaneous edges). Unless I am mistaken, there is no clear discussion of how such a graph is identified or learned, only the pointwise identifiability of the latents in Thm 1. This should be clarified. I also think the paper should state that identifying $\mathrm{pa}_\ell(x_{i,t})$, although it concerns latent variables, is not a classical CRL task.
  • “We denote $J_d(\hat z_{t+1})$ as the Jacobian matrix of the function $r$”: the notation is confusing. The Jacobian of $r$ would better be denoted $J_r$, but $r$ in line 235 has two arguments $(\hat z_{t+1}, \hat z_t)$. Can you please clarify the notation and improve the presentation of the paragraph “Prior Estimation of $z_t$ and $s_t$”?

Notation and Typo issues

  • “the stochasticility” should be “stochasticity.”
  • Using $M$ for a functional in “A4 (Differentiability)” (line 137) is poor notation.
  • “As depicted in Eq. (A19),” in line 151: the equation is missing (I know it is in the separate appendix, but its omission disrupts the flow).
  • Line 139: “where $h$ is differentiable.” should read “where $h_z$ is differentiable.”
  • The paragraph “Prior Estimation of $z_t$ and $s_t$.” is poorly written and difficult to read.
  • Line 242: “we using” should be “we are using.”
  • Line 257: “$D(A)$” overlaps with $D$, used earlier for the KL divergence.
  • Equations 10–11: Please specify the summands in $\Sigma$; it is very unclear.
  • Line 267: Metric names should be at least mentioned, and definitions could be deferred to the appendix.

Questions

  • Theorem 3 (lines 202–203): Definition 5 is missing from the paper (it appears only in the appendix), which makes the theorem statement on the identification of the observational causal graph unclear. Moreover, the definition in the appendix refers to $s_t$, which is confusing because $s_t$ are the stochastic mediators, and the authors do not claim to identify $s_t$ in the main text. Looking at the proof, Definition 5 appears to be an intermediate step near the end of the observational causal graph identification. Can you please clarify the result related to Def. 5 and how it relates to identifying the observed causal graph?
  • In Table 2: Can you further elaborate on the meaning of “the best-converged result per seed is selected to avoid local minima”?
  • In Figure 5, “$d_x = 6$ and $d_z = 3$” are very low dimensions. How do the exact metrics reported in Figure 5 evolve as a function of $d_x$, including for other baselines?
  • Figure 6: “CaDRe matches observational causal graphs to physical wind patterns.” What about other baselines that could provide the same instantaneous causal graph? How do they qualitatively match the physical wind patterns?

Limitations

  • Authors did not discuss how the developed approach may or may not generalize to other domains beyond climate analysis.
  • The distributional assumptions like A2 and A3 are very strong.

Final Justification

The authors addressed the main points I mentioned in the Weaknesses and Questions sections. The only point I am not convinced of is the response related to the choice of the best-converged result per seed. I think this is a concern general to many papers where more effort should be put into dealing with the impact of initialization on model performance, especially in this field. This does not, however, impede my ability to raise my score.

Paper Formatting Concerns

N/A

Author Rebuttal

Dear Reviewer NFts, thank you for your insightful feedback. Your comments helped improve the rigor of our work, particularly in notation, terminology, comparisons, and quantitative analysis. We sincerely appreciate your time and effort.

We provide point-by-point responses to your comments below and have updated the manuscript accordingly.

W1: $s_{t,i}$ should be considered an endogenous mediator

Thank you for this valuable suggestion. Describing $s_{t,i}$ as an endogenous mediator of $z_t$ is more accurate than the term nonstationary noise, which we used following prior work [1]. We have revised the manuscript accordingly.

[1] Huang et al., "Causal discovery and forecasting in nonstationary environments with state-space models."

W2: The scientific idea behind modeling $s_t$

$s_t$ is not limited to human-induced effects but broadly captures uncertainties such as environmental variability and measurement noise. These can be modulated by latent dynamics; e.g., the variance of $s_t$ can be time-varying [2]. We have clarified this general interpretation in the revised manuscript.

[2] Lashgari, et al. "Evaluation of simulated responses to climate forcings: a flexible statistical framework using confirmatory factor analysis and structural equation modelling–Part 1: Theory."

W3(1): No discussion of the identification of the latent causal graphs

Thanks a lot for raising this point! The latent graph is identifiable up to permutation under the assumptions of sparse latent dynamics and sufficient variability, as discussed in Main Paper, Line 159 and Appendix, Section A.3. In light of your suggestion, we have highlighted it in the main text.

W3(2): Identifying $pa_{O}(x_{t,i})$ is not a classical task of CRL

You are totally right! CRL is a specific task within the broader domain of causal discovery (CD), concerned with learning causally related latent representations. To clarify, identifying $pa_{O}(x_{t,i})$ is CD with latent variables. We have distinguished these two tasks throughout the revised manuscript.

W4(1): Notation on the Jacobian matrix; $J_r$ should have two arguments

We appreciate your efforts in helping us improve readability. We clarify that the notation $J_r(\hat{z}_t)$ denotes the Jacobian of $r$ w.r.t. $\hat{z}_t$, rather than a function taking only $\hat{z}_t$ as input.

W4(2): Enhance the presentation quality of “Prior Estimation of $z_t$ and $s_t$”

We now provide a clearer version in brief:

""

To estimate priors that preserve causal structure, we minimize the KL divergence between the approximate posterior and a learned prior. Since the true prior is unknown, we use a normalizing flow conditioned on selected inputs. Input selection is guided by learned inverse transition functions $r_i$, where $\hat{\epsilon}^z_{t,i} = r_i(\hat{z}_{t-1}, \hat{z}_t)$ identifies the latent variables that causally influence $z_{t,i}$. Formally, the Jacobian of the transformation $\kappa: \{\hat{z}_{t-1}, \hat{z}_t\} \rightarrow \{\hat{z}_{t-1}, \hat{\epsilon}^z_t\}$ encodes these dependencies:

$$J_\kappa(\hat{z}_{t-1}, \hat{z}_t) = \begin{pmatrix} \mathbf{I} & 0 \\ J_r(\hat{z}_{t-1}) & J_r(\hat{z}_t) \end{pmatrix},$$

where $J_r(\hat{z}_t)$ captures instantaneous structure and $J_r(\hat{z}_{t-1})$ captures time-lagged effects.

A similar process is applied to the observed variables via $w_i(\hat{z}_t, \hat{s}_t)$. Finally, we enforce conditional independence across components by minimizing the KL divergences from the estimated $\hat{\epsilon}^z_t$ and $\hat{\epsilon}^x_t$ to a standard Gaussian. This enables principled prior learning and reveals the causal relations among $z_t$.

""

The updates for W4 have been incorporated into our revised manuscript.
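Because $\kappa$ passes $\hat{z}_{t-1}$ through unchanged, $J_\kappa$ is block lower-triangular, so $\log|\det J_\kappa| = \log|\det J_r(\hat{z}_t)|$, the change-of-variables term of the flow prior, while the zero patterns of $J_r(\hat{z}_t)$ and $J_r(\hat{z}_{t-1})$ give the instantaneous and time-lagged parents. The sketch below checks both facts numerically for a hypothetical inverse transition `r` built from an instantaneous matrix `L` and a lag matrix `A`; central finite differences stand in for the automatic differentiation a real implementation would use.

```python
import numpy as np


def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of a vector function f at x (column k = df/dx_k)."""
    x = np.asarray(x, dtype=float)
    cols = []
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        cols.append((f(x + step) - f(x - step)) / (2 * eps))
    return np.stack(cols, axis=1)


def kappa_jacobian(r, z_prev, z_t):
    """Jacobian of kappa: (z_prev, z_t) -> (z_prev, r(z_prev, z_t)) and its lower blocks."""
    n = z_prev.size
    J_lag = numerical_jacobian(lambda a: r(a, z_t), z_prev)    # J_r(z_{t-1})
    J_inst = numerical_jacobian(lambda b: r(z_prev, b), z_t)   # J_r(z_t)
    J_kappa = np.block([[np.eye(n), np.zeros((n, n))],
                        [J_lag, J_inst]])
    return J_kappa, J_lag, J_inst


# Hypothetical inverse transition eps_t = z_t - tanh(L z_t + A z_{t-1}),
# with L strictly lower-triangular (instantaneous edges) and A the lagged edges.
L = np.array([[0.0, 0.0, 0.0], [0.8, 0.0, 0.0], [0.0, -0.5, 0.0]])
A = np.array([[0.3, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.6]])


def r(z_prev, z_t):
    return z_t - np.tanh(L @ z_t + A @ z_prev)


z_prev = np.array([0.1, -0.4, 0.7])
z_t = np.array([0.5, 0.2, -0.3])
J_kappa, J_lag, J_inst = kappa_jacobian(r, z_prev, z_t)

# log|det J_kappa| equals log|det J_r(z_t)| because of the block-triangular form.
print(np.isclose(np.linalg.slogdet(J_kappa)[1], np.linalg.slogdet(J_inst)[1]))  # True
# Off-diagonal support of J_r(z_t) recovers L; support of J_r(z_{t-1}) recovers A.
print(np.array_equal(np.abs(J_inst - np.eye(3)) > 1e-6, L != 0))  # True
print(np.array_equal(np.abs(J_lag) > 1e-6, A != 0))               # True
```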

Notation and Typo issues

Thank you for your detailed and attentive review. We have corrected all noted issues, including typos (e.g., “stochasticility”), unclear notation (e.g., $\bar{M}$, $D(A)$), missing references (e.g., Eq. A19), and phrasing errors (e.g., Lines 139, 242). We have also revised the paragraph on prior estimation and clarified Eq. (10)–(11) and the metric definitions for better readability.

Q1: The motivation of Def 5 and how it relates to identifying the observed causal graph

Thank you for this question. To motivate Def. 5, consider the linear case $x_t = B x_t + s_t$, where $x_t = (I - B)^{-1} s_t = M s_t$. If $s_t$ is identified component-wise without permutation (as required in Def. 5), then $M$ is also identified, enabling recovery of $B = I - M^{-1}$. Any permutation in $s_t$ would corrupt $M$ and render $B$ unidentifiable.

Our nonlinear setting follows the same principle: $s_t$ is identified component-wise via nonlinear ICA without permutation (conditioned on $z_t$), and the Jacobian takes the place of $M$ and $B$ to recover the causal graph over $x_t$ (see Main Paper, Lines 204–210).
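A short numerical check of the linear argument. From $x_t = B x_t + s_t$ we get $x_t = (I - B)^{-1} s_t$, so the mixing matrix is $M = (I - B)^{-1}$ and $B = I - M^{-1}$. The example matrix `B` is hypothetical; the code shows that the formula recovers $B$ when the sources keep their order, and yields a matrix with self-loops when two sources are swapped and the swap is left uncorrected.

```python
import numpy as np

# Linear SEM x = B x + s, with B[i, j] != 0 meaning x_j -> x_i (row i is the child).
B = np.array([[0.0, 0.0, 0.0],
              [1.5, 0.0, 0.0],
              [0.0, -0.7, 0.0]])
I = np.eye(3)
M = np.linalg.inv(I - B)             # mixing matrix: x = M s

# Sources recovered in the correct order give back M, hence B.
B_rec = I - np.linalg.inv(M)

# Swapping s_0 and s_1 turns the estimated mixing matrix into M @ P.T.
P = np.eye(3)[[1, 0, 2]]
B_perm = I - np.linalg.inv(M @ P.T)

print(np.allclose(B_rec, B))         # True
print(np.round(np.diag(B_perm), 3))  # nonzero diagonal entries, i.e. spurious self-loops
```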

Q2: Elaborate on the meaning of "best-converged result"

Thank you for the question. Due to non-convexity in structure learning [3], for the same dataset, we run multiple trials with different initializations and report the one with the lowest total loss, referred to as the best-converged result.

[3] Ng, Ignavier, Biwei Huang, and Kun Zhang. "Structure learning with continuous optimization: A sober look and beyond."
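A toy sketch of the selection rule described above: run the same optimization from several random initializations and keep the run with the lowest final loss. The one-dimensional loss, learning rate, and step count are illustrative assumptions; in CaDRe the loss would be the full training objective.

```python
import random


def loss(w):
    """Toy non-convex loss: global minimum near w = -1.30, local minimum near w = 1.13."""
    return w**4 - 3 * w**2 + w


def grad(w):
    return 4 * w**3 - 6 * w + 1


def train(seed, steps=500, lr=0.01):
    """Plain gradient descent from a seed-dependent random initialization."""
    rng = random.Random(seed)
    w = rng.uniform(-2.5, 2.5)
    for _ in range(steps):
        w -= lr * grad(w)
    return w, loss(w)


def best_converged(seeds):
    """Train once per seed and keep the run with the lowest final loss."""
    runs = [train(seed) for seed in seeds]
    return min(runs, key=lambda run: run[1])


w_best, loss_best = best_converged(range(10))
print(round(w_best, 3), round(loss_best, 3))
```

Runs that start to the right of the local maximum (near $w \approx 0.17$) settle in the worse basin; selecting by final loss hides them, which is why reporting the spread over all restarts is also informative.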

Q3: Baseline comparison as the dimension $d_x$ varies

Thank you for your thoughtful question. To address this, we report detailed results illustrating how the metrics in Figure 5 evolve with increasing dimensionality $d_x \in \{3, 6, 8, 10\}$, while fixing the number of latent variables at $d_z = 3$ and the number of samples at $n = 10000$.

The table below presents the SHD, Precision, Recall, and F1 score for CaDRe and four baselines across varying values of $d_x$:

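| Method | $d_x$ | SHD↓ | Precision↑ | Recall↑ | F1↑ |
|---|---|---|---|---|---|
| CaDRe | 3 | 0.000 | 1.000 | 1.000 | 1.000 |
| | 6 | 0.185 | 0.803 | 0.830 | 0.815 |
| | 8 | 0.295 | 0.761 | 0.789 | 0.778 |
| | 10 | 0.432 | 0.638 | 0.656 | 0.643 |
| FCI | 3 | 0.186 | 0.801 | 0.760 | 0.780 |
| | 6 | 0.384 | 0.476 | 0.394 | 0.431 |
| | 8 | 0.447 | 0.398 | 0.321 | 0.356 |
| | 10 | 0.492 | 0.355 | 0.284 | 0.315 |
| CDNOD | 3 | 0.163 | 0.821 | 0.782 | 0.801 |
| | 6 | 0.452 | 0.432 | 0.419 | 0.425 |
| | 8 | 0.509 | 0.365 | 0.312 | 0.336 |
| | 10 | 0.546 | 0.328 | 0.276 | 0.300 |
| PCMCI | 3 | 0.139 | 0.843 | 0.803 | 0.822 |
| | 6 | 0.431 | 0.488 | 0.386 | 0.430 |
| | 8 | 0.501 | 0.397 | 0.308 | 0.347 |
| | 10 | 0.548 | 0.365 | 0.284 | 0.319 |
| LPCMCI | 3 | 0.116 | 0.864 | 0.823 | 0.843 |
| | 6 | 0.337 | 0.637 | 0.621 | 0.629 |
| | 8 | 0.441 | 0.535 | 0.486 | 0.509 |
| | 10 | 0.487 | 0.482 | 0.432 | 0.456 |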

We have included a line plot in the revised version.

Q4 (1): Can other baselines provide the same instantaneous causal graph?

Thank you for the insightful question. While baselines such as FCI and PCMCI can estimate instantaneous graphs, their outputs tend to be noisier and less aligned with known physical structures. For completeness, we have added their visualized comparisons in the appendix.

Q4 (2): Qualitatively match the physical wind patterns

To assess alignment with physical wind patterns, we propose two metrics, Wind-SHD (WSHD) and Wind-TPR (WTPR): WSHD measures the normalized SHD between the estimated graph $B$ and the wind-induced reference graph $B_{\text{ref}}$, while WTPR computes the recall of edges in $B$ w.r.t. $B_{\text{ref}}$.

We report results below. These results demonstrate that CaDRe achieves the best alignment with the wind-induced causal graph.

| Metric | CaDRe | FCI | CDNOD | PCMCI | LPCMCI |
|---|---|---|---|---|---|
| WSHD↓ | 0.012 | 0.028 | 0.031 | 0.024 | 0.019 |
| WTPR↑ | 0.532 | 0.236 | 0.251 | 0.198 | 0.274 |

L1: How the developed approach may or may not generalize to other domains

Thank you for giving us the opportunity to clarify the generalization capability of CaDRe. While our focus is climate data, CaDRe generalizes well to other time-series domains with latent variables. We evaluate it on 3 standard benchmarks: finance (Exchange), public health (ILI), and traffic monitoring (Traffic), using varied input/output lengths.

As shown below, CaDRe consistently outperforms or matches strong baselines across domains, demonstrating robust generalization and applicability beyond climate. We have added this part in our revised manuscript.

| Dataset | I/O Len | CaDRe (MSE/MAE) | N-Transformer (MSE/MAE) | Autoformer (MSE/MAE) | MICN (MSE/MAE) | TimesNet (MSE/MAE) |
|---|---|---|---|---|---|---|
| ILI | 18-6 | 1.200/0.691 | 1.491/0.757 | 2.637/1.094 | 4.847/1.570 | 2.406/0.840 |
| | 72-24 | 1.856/0.833 | 2.551/1.039 | 2.653/1.116 | 4.776/1.556 | 2.270/0.988 |
| | 144-48 | 1.796/0.878 | 2.227/1.018 | 2.696/1.139 | 4.917/1.584 | 2.978/1.123 |
| | 216-72 | 2.010/0.984 | 2.595/1.081 | 2.960/1.167 | 4.804/1.584 | 2.696/1.098 |
| ECL | 18-6 | 0.114/0.216 | 0.134/0.242 | 0.136/0.254 | 0.250/0.338 | 0.128/0.236 |
| | 72-24 | 0.121/0.220 | 0.140/0.246 | 0.144/0.257 | 0.258/0.342 | 0.134/0.242 |
| | 144-48 | 0.124/0.225 | 0.155/0.260 | 0.163/0.275 | 0.271/0.353 | 0.149/0.256 |
| | 216-72 | 0.131/0.232 | 0.169/0.274 | 0.175/0.287 | 0.279/0.357 | 0.166/0.271 |
| Traffic | 18-6 | 0.487/0.307 | 0.797/0.347 | 0.554/0.322 | 0.475/0.287 | 0.781/0.337 |
| | 72-24 | 0.452/0.303 | 0.625/0.319 | 0.508/0.318 | 0.454/0.276 | 0.608/0.307 |
| | 144-48 | 0.412/0.282 | 0.574/0.314 | 0.497/0.319 | 0.450/0.275 | 0.553/0.296 |
| | 216-72 | 0.400/0.278 | 0.593/0.325 | 0.524/0.330 | 0.473/0.287 | 0.564/0.303 |

L2: The distributional assumptions like A2 and A3 are very strong.

We agree that A2 and A3 are strong in an absolute sense, though they are standard in nonparametric identifiability. However, they are weaker than the assumptions in prior CRL work: previous nonlinear-ICA-based CRL methods often require invertibility. As shown in Appendix A.8, invertible functions form a zero-measure subset of injective operators (A2), making A2 less restrictive. Latent drift (A3) is also weaker than invertibility and can hold under heteroskedastic noise. Both reflect distributional variability, which is plausible in dynamic climate data. We have clarified this in the revised manuscript.

Comment

Dear authors, thank you for your detailed response. You have addressed most of my concerns. The only point I am not convinced of is the response related to the choice of the best-converged result per seed. I think this is a concern general to many papers where more effort should be put into dealing with the impact of initialization on model performance, especially in this field. This does not, however, impede my ability to raise my score.

Comment

Dear Reviewer NFts,

Thank you once again for your time dedicated to reviewing this paper and further engagement and insight. We appreciate you pointing out an open problem involved in training many deep learning models, which deserves more attention in future research, and we are grateful for your understanding.

With best wishes,

The Authors of Submission 9138

Official Review
Rating: 4

This paper introduces CaDRe, which aims to uncover the latent driving forces and the causal relations among observed variables in climate analysis. The paper provides a detailed theoretical analysis and demonstrates the effectiveness of the proposed method through synthetic and real-world experiments.

Strengths and Weaknesses

Strengths

  1. This paper provides novel methods and insights for the application of causal discovery in the field of climate analysis, and achieves solid results.

  2. This paper provides a detailed theoretical analysis of the proposed causal discovery framework.

Weaknesses

  1. This paper uses an MLP to map observed variables into a latent space for causal discovery. However, the paper does not explain whether the latent space has a clear physical meaning and whether it has a consistent meaning across different datasets.

  2. Although the authors mention discussions about higher-order Markov structures in the article, there does not seem to be much experimental verification.

Questions

  1. The authors map observed variables to latent variables, and the dimension changes from $d_x$ to $d_z$. It is expected that latent variables can represent variables with clear meanings, such as pressure and precipitation, but is this representation consistent? Is the physical meaning of latent variables fixed and clear across different datasets?

  2. We understand that, due to the rapid changes in the climate system, time-lagged effects between variables may not require a particularly high order, but using only one lag is not sufficient. Could the authors provide some experimental results under higher-order conditions?

  3. The experimental results in Table 2 show that when $d_x$ becomes larger, the performance with the fixed 3-dimensional latent variables ($d_z = 3$) becomes worse. However, we noticed that when $d_x$ is 100, $d_z$ is also 3, but its metrics are significantly better. How should we understand this?

Limitations

yes.

Final Justification

I have checked the author's rebuttal and other reviewers' comments, and I still tend to accept this paper.

Paper Formatting Concerns

no

Author Response

Dear reviewer G8Kr, we are very grateful for your valuable comments, helpful suggestions, and encouragement. We provide the point-to-point response to your comments below and have updated the paper and appendix accordingly.

W1 & Q1: Physical meaning of the latent space

Thank you for this important question. Latent factors are, by definition, unobserved; in principle, we cannot give them a direct interpretation. This poses a central challenge in the field of CRL. However, once we are sure that latent factors exist and understand how they are related to measured variables, we can begin to interpret them and even devise ways to measure them. For example, given domain knowledge of latent factors from climate experts and scientists, we can match them to meaningful quantities, e.g., precipitation, solar radiation, or components of a climate foundation model representation. This mirrors historical processes in science, such as the discovery of viruses, which were first hypothesized based on indirect evidence and later confirmed and measured directly.

We have included this perspective in our main paper and regard it as our future work.

W1(2): Whether it has a consistent meaning across different datasets

Thank you for raising this issue. For different climate subsystems, latent variables may not share consistent physical meanings across datasets, as observed variables and measurement settings differ. Each dataset yields a distinct observational causal graph governed by its latent confounders; thus, the underlying latent factors are necessarily different.

However, for datasets describing the same climate subsystem, despite differences in measurements due to sampling, the recovered latent factors might consistently represent the same underlying phenomena and complement each other.

We have included this discussion in the revised manuscript.

W2 & Q2: Experimental verifications on the higher-order Markov structure

Thank you for the valuable suggestions! To verify our framework’s capability on higher-order latent dynamics, we conduct experiments using a second-order latent Markov process. Results are reported below:

| $d_z$ | $d_x$ | SHD ($J_{\hat{g}}(\hat{x}_t)$) | TPR | Precision | MCC ($s_t$) | MCC ($z_t$) | SHD ($J_r(\hat{z}_t)$) | SHD ($J_d(\hat{z}_{t-1})$) | $R^2$ |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 3 | 0.31 ± 0.01 | 0.91 ± 0.02 | 0.93 ± 0.03 | 0.9780 ± 0.01 | 0.9825 ± 0.01 | 0.26 ± 0.06 | 0.30 ± 0.04 | 0.93 ± 0.04 |
| 3 | 6 | 0.19 ± 0.07 | 0.81 ± 0.04 | 0.79 ± 0.08 | 0.9560 ± 0.03 | 0.9520 ± 0.01 | 0.25 ± 0.05 | 0.34 ± 0.08 | 0.91 ± 0.02 |
| 3 | 8 | 0.27 ± 0.06 | 0.80 ± 0.04 | 0.70 ± 0.05 | 0.9040 ± 0.09 | 0.9610 ± 0.10 | 0.34 ± 0.09 | 0.32 ± 0.10 | 0.93 ± 0.03 |
| 3 | 10 | 0.45 ± 0.06 | 0.64 ± 0.10 | 0.62 ± 0.13 | 0.8470 ± 0.06 | 0.9630 ± 0.03 | 0.30 ± 0.06 | 0.39 ± 0.06 | 0.91 ± 0.10 |
| 3 | 100* | 0.18 ± 0.03 | 0.81 ± 0.04 | 0.80 ± 0.03 | 0.9100 ± 0.03 | 0.9570 ± 0.02 | 0.22 ± 0.02 | 0.28 ± 0.09 | 0.92 ± 0.13 |

The table shows that CaDRe maintains high CRL quality and accurate causal discovery, confirming its effectiveness under higher-order latent dynamics.

All other settings are aligned with those reported in the main paper, so this experiment isolates how model performance changes under higher-order latent dynamics. To simulate the datasets, the latent process is generated by a leaky nonlinear autoregressive model with $L=2$:

$$ z_t = (I - B)^{-1} \left( \sigma\left( \sum_{\ell=1}^{L} \mathbf{W}^{(\ell)} z_{t-\ell} \right) + \boldsymbol{\epsilon}^z_t \right), \quad \boldsymbol{\epsilon}^z_t \sim \mathcal{N}(0, \sigma_z^2 \mathbf{I}), $$

where $\sigma(\cdot)$ is the leaky ReLU, $\mathbf{W}^{(\ell)}$ are the lag-$\ell$ transition matrices, and $B$ is the instantaneous latent causal adjacency matrix.
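For readers who want to reproduce this setting, the generator can be sketched in Python with NumPy. This is a minimal sketch, not the authors' code: it reads the mixing factor as $(I - B)^{-1}$, the solution of $z_t = B z_t + \sigma(\cdot) + \boldsymbol{\epsilon}^z_t$ for an acyclic $B$, and the leaky-ReLU slope, the way $B$ and $\mathbf{W}^{(\ell)}$ are sampled, the rescaling used for stability, and the name `simulate_latents` are all assumptions.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    """Leaky ReLU activation; the negative slope of 0.2 is an assumption."""
    return np.where(x > 0, x, slope * x)


def simulate_latents(T, d_z=3, L=2, sigma_z=0.1, seed=0):
    """Simulate z_t = (I - B)^{-1} (sigma(sum_l W^(l) z_{t-l}) + eps_t) for T steps."""
    rng = np.random.default_rng(seed)
    # Strictly lower-triangular B encodes an acyclic instantaneous graph,
    # which guarantees that I - B is invertible.
    B = np.tril(rng.uniform(-0.3, 0.3, size=(d_z, d_z)), k=-1)
    # Lag-l transition matrices W^(1), ..., W^(L), rescaled so the lagged drive
    # is contractive (an assumption made here to keep the process stable).
    W = []
    for _ in range(L):
        M = rng.normal(size=(d_z, d_z))
        W.append(M / np.linalg.norm(M, 2) * (0.5 / L))
    mix = np.linalg.inv(np.eye(d_z) - B)
    z = np.zeros((T + L, d_z))
    z[:L] = rng.normal(size=(L, d_z))  # random initial states
    for t in range(L, T + L):
        # W[l] plays the role of W^(l+1) and is applied to z_{t-(l+1)}
        drive = sum(W[l] @ z[t - 1 - l] for l in range(L))
        eps = rng.normal(scale=sigma_z, size=d_z)
        z[t] = mix @ (leaky_relu(drive) + eps)
    return z[L:], B, W
```

Sampling $B$ as strictly lower triangular is one simple way to obtain an acyclic instantaneous graph; any permutation of such a matrix would serve equally well.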

Q3: Why does $d_x=100$ perform significantly better than $d_x=10$?

Thank you for pointing this out. The key reason is that we incorporated a physical prior in the $d_x = 100$ experiment (marked with *), but did not include this prior in the lower-dimensional cases ($d_x = 3, 6, 8, 10$), as described in Main Paper, Lines 275–278 and Appendix, Lines 1283–1287.

This prior reflects a physical assumption in climate systems: instantaneous interactions between distant spatial regions are unlikely. Accordingly, we removed 75% of the edges from the fully connected graph during initialization to enforce sparsity for $d_x = 100$. To further clarify the role of this prior and our setup, we include an ablation study below, which shows that incorporating the physical prior significantly improves performance, particularly in causal discovery.

| $d_z$ | $d_x$ | SHD ($J_{\hat{g}}(\hat{x}_t)$) | TPR | Precision | MCC ($s_t$) | MCC ($z_t$) | SHD ($J_r(\hat{z}_t)$) | SHD ($J_d(\hat{z}_{t-1})$) | $R^2$ |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 10 (w/o prior) | 0.45 ± 0.06 | 0.64 ± 0.10 | 0.62 ± 0.13 | 0.8470 ± 0.06 | 0.9630 ± 0.03 | 0.30 ± 0.06 | 0.39 ± 0.06 | 0.91 ± 0.10 |
| 3 | 10 (w/ prior) | 0.28 ± 0.04 | 0.76 ± 0.05 | 0.74 ± 0.07 | 0.9010 ± 0.03 | 0.9670 ± 0.02 | 0.26 ± 0.04 | 0.31 ± 0.05 | 0.94 ± 0.04 |
| 3 | 100 (w/o prior) | 0.38 ± 0.08 | 0.61 ± 0.06 | 0.57 ± 0.09 | 0.8330 ± 0.07 | 0.9380 ± 0.05 | 0.36 ± 0.10 | 0.42 ± 0.07 | 0.86 ± 0.12 |
| 3 | 100* (w/ prior) | 0.18 ± 0.03 | 0.81 ± 0.04 | 0.80 ± 0.03 | 0.9100 ± 0.03 | 0.9570 ± 0.02 | 0.22 ± 0.02 | 0.28 ± 0.09 | 0.92 ± 0.13 |
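The distance-based edge pruning used as the physical prior can be sketched as a boolean mask over candidate instantaneous edges. This is a minimal sketch, assuming each observed variable has a 2-D coordinate and that "distant" means farther apart than the 25th percentile of pairwise Euclidean distances; the quantile rule and the name `distance_prior_mask` are illustrative, not the authors' implementation.

```python
import numpy as np


def distance_prior_mask(coords, keep_frac=0.25):
    """Boolean (d_x, d_x) mask allowing instantaneous edges only between nearby regions.

    Keeps the off-diagonal pairs whose distance is at most the `keep_frac`
    quantile of all pairwise distances (about 25% by default, so roughly 75% of
    the edges of the fully connected graph are removed; ties may keep a few more).
    """
    coords = np.asarray(coords, dtype=float)
    d = coords.shape[0]
    # Pairwise Euclidean distances between region coordinates
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    off_diag = ~np.eye(d, dtype=bool)  # no self-loops in the initial graph
    thresh = np.quantile(dist[off_diag], keep_frac)
    return (dist <= thresh) & off_diag
```

In a gradient-based learner, such a mask would typically be multiplied elementwise with the learnable adjacency matrix so that pruned entries stay at zero throughout training.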

We have now explicitly clarified this difference in the revised manuscript to avoid confusion.

Comment

Dear Reviewer G8Kr,

Thank you so much for your detailed and constructive review. We are truly grateful for your recognition of the significance and novelty of our work in both causal discovery and climate science.

In our rebuttal, we addressed your comments point by point, covering:

  • W1 & Q1: Clarification of the physical interpretability of latent variables.
  • W1 (2): Discussion on the consistency of the physical meaning of latent variables across different climate datasets.
  • W2 & Q2: Experimental verification of the higher-order Markov structure.
  • Q3: Explanation of why $d_x = 100$ with the physical prior performs better than $d_x = 6$.

We hope these address your concerns. As the discussion phase ends in 2 days, we would greatly appreciate it if you could let us know if you have any remaining questions or suggestions.

Thank you again for your time and thoughtful feedback!

Best,

The Authors of Submission 9138

Comment

I have checked the authors' rebuttal, which has addressed my concerns. Then I still tend to accept this paper.

Comment

Dear Reviewer G8Kr,

Thank you very much for taking the time to review our work. We are sincerely grateful for your constructive comments.

We are delighted that your concerns have been fully addressed. We would be grateful if you would consider updating your rating accordingly. Thank you again for your thoughtful and invaluable feedback!

With best wishes,

The Authors of Submission 9138

Review
5

Causal discovery and representation learning are important approaches to better understand, for instance, Earth's climate or generally (spatio-)temporal data. The work builds a theory of when causal links are identifiable from such purely observational data while considering causal dependencies within and between both the observed and latent factors. The novel framework CaDRe is then composed of standard machine learning components and can uncover such causal graphs. Empirical findings on synthetic and real-world Earth data confirm the model's effectiveness.

Strengths and Weaknesses

Strengths:

  • The paper is of high quality and written clearly.
  • The problem being tackled is important. It is a significant contribution.
  • The provided identification theory on latent variables is a substantial contribution, although in large parts deferred to the appendix. I did not check it in detail.
  • The construction of the CD method is convincing and not more complex than necessary.
  • The empirical findings support the methods' effectiveness.
  • I am not sufficiently familiar with the field of CD/CRL to judge the overall novelty of the method nor the completeness of CD/CRL baselines.

Opportunities for improvement (weaknesses):

  • The time-series forecasting baselines in Sec. 5.2 are not the latest (see, e.g., TimeMixer, xLSTM-Mixer, Chimera, or pretrained models such as Timer-XL). However, this does not substantially weaken the findings.
  • While the key contributions are theoretical and on the modeling side (i.e., not the application), the core motivation for CaDRe was still climate science. However, only a single experiment on it is provided (CD in Sec. 5.2), which could be evaluated in more depth.
  • The results on CD on actual climate data (Sec. 5.2) are solely qualitative. They lack contextualization of how existing models would fare. Additionally, a quantitative measure of correctness would be helpful as an aggregate indicator. This would be helpful since it is very hard to even qualitatively judge the appropriateness of the learned graph. Adding colors to denote correct/incorrect edges could be one more opportunity to improve this directly. The caption of Fig. 6/description in the text is overall rather minimal. See also Q1 below.
  • No next steps (future work) are discussed.

Questions

  • Q1: What is the meaning of the arrow length in Fig. 6: The certainty of the edge existing in that orientation, or the strength of the influence?
  • Q2: What are the latent factors that have been identified by CaDRe? The goal of CD/CRL in climate sciences is to uncover hidden structure, so having a latent representation we can understand is key.

Limitations

The discussion of limitations is very minimal. Aspects that are not sufficiently explored are robustness to noise and, generally, the set of assumptions necessary to run the method.

Final Justification

The rebuttal was convincing. Some points were not fully resolved (e.g., interpretation of latent factors), but it is indeed material for future work.

I maintain my score and moderate confidence.

Paper Formatting Concerns

Minor Comments:

  • LL. 221 and 227: "Figure 4" links to Fig. 3.
  • Table 3 is never referenced. It should be done in ll. 288ff.
  • The table/figure ordering is slightly irritating and can be improved to align with the prose. For instance, Fig. 6 and Tab. 4 are discussed in the text in the opposite order to how they are presented. Which part of the figure shows latent transitions (cf. ll. 398f)?
Author Response

Dear Reviewer 2AQA, we sincerely appreciate your informative feedback that helps clarify our contributions and the completeness of our experiments. Here is the point-to-point response below.

W1 & W2: Lack of latest time-series forecasting baselines & Only a single experiment on climate science

Thank you for raising these points, which help improve the soundness of our experiments. Please note that results on the Weather dataset are already reported in Appendix, Table A10. In light of your suggestions, we further conduct experiments on CESM2, Weather, and an additional dataset, ERSST, and reproduce Timer-XL, TimeMixer, TimeXer [1], and xLSTM-Mixer for comparison. Because no code has been released for Chimera, we are still working on reproducing it and will report its results as soon as possible. As shown in the table, CaDRe achieves competitive MSE and MAE across all datasets and forecast lengths.

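| Dataset | Length | CaDRe MSE | CaDRe MAE | TDRL MSE | TDRL MAE | CARD MSE | CARD MAE | FITS MSE | FITS MAE | MICN MSE | MICN MAE | iTransformer MSE | iTransformer MAE | TimesNet MSE | TimesNet MAE | Autoformer MSE | Autoformer MAE | Timer-XL MSE | Timer-XL MAE | TimeMixer MSE | TimeMixer MAE | TimeXer MSE | TimeXer MAE | xLSTM-Mixer MSE | xLSTM-Mixer MAE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CESM2 | 96 | 0.410 | 0.483 | 0.439 | 0.507 | 0.409 | 0.484 | 0.439 | 0.508 | 0.417 | 0.486 | 0.422 | 0.491 | 0.415 | 0.486 | 0.959 | 0.735 | 0.433 | 0.425 | 0.336 | 0.418 | 0.347 | 0.428 | 0.367 | 0.452 |
| CESM2 | 192 | 0.412 | 0.487 | 0.440 | 0.508 | 0.422 | 0.493 | 0.447 | 0.515 | 1.559 | 0.984 | 0.425 | 0.495 | 0.417 | 0.497 | 1.574 | 0.972 | 0.454 | 0.524 | 0.445 | 0.424 | 0.358 | 0.435 | 0.434 | 0.498 |
| CESM2 | 336 | 0.413 | 0.485 | 0.441 | 0.505 | 0.421 | 0.497 | 0.482 | 0.536 | 2.091 | 1.173 | 0.426 | 0.494 | 0.423 | 0.499 | 1.845 | 1.078 | 0.527 | 0.565 | 0.541 | 0.421 | 0.348 | 0.429 | 0.448 | 0.471 |
| Weather | 96 | 0.157 | 0.203 | 0.442 | 0.511 | 0.423 | 0.497 | 0.172 | 0.221 | 0.199 | 0.256 | 0.168 | 0.214 | 0.180 | 0.231 | 0.225 | 0.259 | 0.367 | 0.252 | 0.367 | 0.252 | 0.367 | 0.252 | 0.367 | 0.252 |
| Weather | 192 | 0.207 | 0.248 | 0.492 | 0.545 | 0.482 | 0.544 | 0.216 | 0.260 | 0.238 | 0.298 | 0.193 | 0.241 | 0.212 | 0.265 | 0.354 | 0.348 | 0.434 | 0.298 | 0.434 | 0.298 | 0.434 | 0.298 | 0.434 | 0.298 |
| Weather | 336 | 0.270 | 0.314 | 0.536 | 0.612 | 0.525 | 0.596 | 0.386 | 0.439 | 0.316 | 0.496 | 0.426 | 0.494 | 0.423 | 0.499 | 0.354 | 0.348 | 0.527 | 0.565 | 0.341 | 0.421 | 0.348 | 0.429 | 0.375 | 0.341 |
| ERSST | 96 | 0.145 | 0.268 | 0.187 | 0.268 | 0.197 | 0.273 | 0.539 | 0.297 | 0.726 | 0.765 | 0.247 | 0.264 | 0.432 | 0.508 | 0.953 | 0.272 | 0.163 | 0.259 | 0.172 | 0.272 | 0.365 | 0.344 | 0.345 | 0.255 |
| ERSST | 192 | 0.208 | 0.307 | 0.214 | 0.293 | 0.233 | 0.375 | 0.226 | 0.752 | 1.263 | 0.892 | 0.251 | 0.535 | 0.452 | 0.585 | 1.024 | 0.908 | 0.210 | 0.294 | 0.214 | 0.302 | 0.372 | 0.367 | 0.371 | 0.297 |
| ERSST | 336 | 0.305 | 0.361 | 0.462 | 0.388 | 0.487 | 0.484 | 0.439 | 0.535 | 1.173 | 1.172 | 0.305 | 0.659 | 0.581 | 0.607 | 1.387 | 1.353 | 0.352 | 0.337 | 0.439 | 0.394 | 0.429 | 0.448 | 0.476 | 0.357 |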

The dataset statistics are as follows:

  • The Weather dataset contains 52,696 time steps starting from 2020-01-01 00:10:00 at regular 10-minute intervals. It includes 22 meteorological variables such as atmospheric pressure, temperature, humidity, vapor pressure, wind speed and direction, radiation, and photosynthetically active radiation (PAR). The data were collected from an automated rooftop station at the Max Planck Institute for Biogeochemistry in Jena, Germany.

  • The ERSST dataset is from the official NOAA GlobalTemp (NOAA/NCEI) website. We use the NOAA Global Temperature Anomaly Dataset (1880–2025), which includes 2,052 monthly steps and 16,020 spatial grid points per step. For time-series forecasting, we use a downscaled version with 100 dimensions, obtained by averaging over block regions.
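The block averaging used to downscale ERSST can be sketched as follows. This is a minimal sketch under assumptions not stated in the response: the field is stored as a (time, lat, lon) array (for example, a hypothetical 89 × 180 grid, since 89 × 180 = 16,020), it is split into a 10 × 10 grid of contiguous blocks to obtain 100 dimensions, and land or missing cells are NaN and skipped via `np.nanmean`. The name `block_average` is illustrative.

```python
import numpy as np


def block_average(field, n_lat_blocks=10, n_lon_blocks=10):
    """Average a (T, n_lat, n_lon) field over contiguous blocks.

    Returns an array of shape (T, n_lat_blocks * n_lon_blocks). Grid sizes that
    are not divisible by the block counts are handled by np.array_split.
    """
    T, n_lat, n_lon = field.shape
    lat_groups = np.array_split(np.arange(n_lat), n_lat_blocks)
    lon_groups = np.array_split(np.arange(n_lon), n_lon_blocks)
    out = np.empty((T, n_lat_blocks * n_lon_blocks))
    k = 0
    for lat_idx in lat_groups:
        for lon_idx in lon_groups:
            block = field[:, lat_idx[0]:lat_idx[-1] + 1, lon_idx[0]:lon_idx[-1] + 1]
            # nanmean skips NaN cells (e.g. land); an all-NaN block yields NaN
            out[:, k] = np.nanmean(block.reshape(T, -1), axis=1)
            k += 1
    return out
```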

[1] Wang, Yuxuan, et al. "TimeXer: Empowering transformers for time series forecasting with exogenous variables." NeurIPS 37 (2024): 469–498.

W3: CD results in Sec. 5.2 are only qualitative, lacks baselines and quantitative metrics; Fig. 6 is hard to interpret.

Thank you for the suggestion. We have added quantitative evaluations using two new metrics based on wind-direction priors: Wind-SHD (WSHD) and Wind-TPR (WTPR). WSHD measures the normalized SHD between the estimated graph $B$ and a wind-induced reference graph $B_{\text{ref}}$, while WTPR computes the recall of $B$ with respect to $B_{\text{ref}}$.
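Both metrics can be computed from binary adjacency matrices. This is a minimal sketch, assuming SHD is counted as the number of mismatched directed off-diagonal entries (so a reversed edge counts twice), that "normalized" means dividing by the $d(d-1)$ possible directed edges, and that the estimated weights are binarized by a threshold; the exact normalization used in the paper is not stated, and the name `wind_metrics` is illustrative.

```python
import numpy as np


def wind_metrics(B_est, B_ref, threshold=0.0):
    """Return (WSHD, WTPR) of an estimated graph against a wind-induced reference graph.

    Convention assumed here: entry [i, j] != 0 denotes a directed edge between
    variables i and j, and both matrices use the same convention.
    """
    est = np.abs(np.asarray(B_est, dtype=float)) > threshold  # binarize estimated weights
    ref = np.asarray(B_ref).astype(bool)
    d = ref.shape[0]
    off_diag = ~np.eye(d, dtype=bool)  # self-loops are ignored
    # SHD as the count of mismatched directed entries (a reversed edge counts twice)
    shd = np.sum((est != ref) & off_diag)
    wshd = shd / (d * (d - 1))  # normalized by the number of possible directed edges
    n_ref = np.sum(ref & off_diag)
    # WTPR: recall of the reference edges recovered by the estimate
    wtpr = np.sum(est & ref & off_diag) / n_ref if n_ref > 0 else float("nan")
    return float(wshd), float(wtpr)
```

For example, with a reference chain 0 → 1 → 2 and an estimate containing 0 → 1 and 2 → 1, one reference edge is recovered (WTPR = 0.5) and two entries differ (WSHD = 2/6).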

These metrics provide meaningful physical correctness scores, allowing baseline comparison. As shown below, CaDRe outperforms all baselines, achieving the best alignment with the wind-induced causal graph:

| Metric | CaDRe | FCI | CDNOD | PCMCI | LPCMCI |
|---|---|---|---|---|---|
| WSHD ↓ | 0.012 | 0.028 | 0.031 | 0.024 | 0.019 |
| WTPR ↑ | 0.532 | 0.236 | 0.251 | 0.198 | 0.274 |

In light of your suggestion, we also revised Fig. 6 by color-coding edges (green: correct, red: incorrect) based on $B_{\text{ref}}$, and expanded both its caption and description to improve interpretability.

W4: No future work

Thank you for pointing this out. Our future work will focus on two directions:

  • More general causal structure: We aim to extend our framework to support time-lagged causal relations in the observed space and sparse transition/generation processes, to better reveal how latent variables govern observations. This includes developing general identifiability guarantees and scalable estimation methods.

  • Scalability with pretrained climate models: We will integrate our framework with pretrained foundation models such as ClimaX [2] and GenCast [3] by introducing our flow-based module and structural constraints as plug-in fine-tuning components. This allows for refining latent representations post-training and uncovering the underlying causal structure in climate data.

We have discussed these topics in detail and highlighted them in our revised manuscript.

[2] Nguyen, Tung, et al. "ClimaX: A foundation model for weather and climate." arXiv preprint arXiv:2301.10343 (2023).

[3] Price, Ilan, et al. "GenCast: Diffusion-based ensemble forecasting for medium-range weather." arXiv preprint arXiv:2312.15796 (2023).

Q1: The meaning of the arrow length in Fig. 6

Thank you for raising this important question. In the visualized wind system (Top of Fig. 6), longer arrow length represents stronger wind speed, indicating the strength of the wind flow.

In the estimated causal graph (bottom of Fig. 6), if $a$ causes $b$, we draw an arrow from the location of $a$ to the location of $b$ on the map. The arrow length reflects the spatial distance between the two regions, not the causal strength.
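The drawing rule can be sketched as a small helper that turns an adjacency matrix into arrow segments. This sketch assumes, hypothetically, that `B[i, j] != 0` means variable i causes variable j and that each variable has a (lon, lat) coordinate; `edges_to_arrows` is an illustrative name. Because each arrow spans the two regions, its length equals their spatial distance and does not depend on the edge weight.

```python
import numpy as np


def edges_to_arrows(B, coords):
    """Convert a causal adjacency matrix into map arrows (x0, y0, dx, dy).

    Assumes B[i, j] != 0 means variable i causes variable j (hypothetical
    convention) and coords[i] = (lon, lat) of the region of variable i.
    """
    B = np.asarray(B)
    coords = np.asarray(coords, dtype=float)
    arrows = []
    for i, j in zip(*np.nonzero(B)):
        if i == j:
            continue  # self-loops are not drawn
        (x0, y0), (x1, y1) = coords[i], coords[j]
        # The arrow spans the two regions, so its length is their spatial
        # distance, regardless of the magnitude of B[i, j].
        arrows.append((x0, y0, x1 - x0, y1 - y0))
    return arrows
```

The returned tuples can be passed to Matplotlib's `quiver` with `angles='xy', scale_units='xy', scale=1` so that arrows are drawn in data coordinates.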

We have clarified this distinction and illustrated the visualization procedure in the revised manuscript.

Q2: Physical meaning of latent factors

Thank you for this important question. Latent factors are, by definition, unobserved; in principle, we cannot give them a direct interpretation. This poses a central challenge in the field of CRL. However, once we are sure that latent factors exist and understand how they are related to measured variables, we can begin to interpret them and even devise ways to measure them. For example, given domain knowledge of latent factors from climate experts and scientists, we can match them to meaningful quantities, e.g., precipitation, solar radiation, or components of a climate foundation model representation. This mirrors historical processes in science, such as the discovery of viruses, which were first hypothesized based on indirect evidence and later confirmed and measured directly.

We have included this perspective in our main paper and regard it as our future work.

Comment (1): Figure/table reference and the order of figures

Thank you for your careful reading and helpful suggestions. We have corrected the figure and table references, and reordered the figures to align with the flow of the text in the revised manuscript.

Comment (2): Which part of the figure shows latent transitions (cf. ll. 398f)?

The middle part of the figure illustrates the latent transitions, which are indicated by the black lines connecting latent states $\hat{z}_{t-1}$ and $\hat{z}_t$.

In light of your comment, we have highlighted this part of the figure for improved clarity in our revised manuscript.

Comment

Thank you very much for the detailed response. It convinced me that I should maintain my score of "5: Accept".

Deciphering latent factors is a substantial research objective on its own, agreed.

Comment

Dear Reviewer 2AQA,

Thank you so much for your time dedicated to reviewing this paper, as well as for your discussion and insightful comments. Indeed, we fully agree with you that deciphering latent variables is a challenging task in climate science, typically requiring domain expertise. We are truly happy that our responses addressed your concerns and are grateful for your insightful review.

With best wishes,

The Authors of Submission 9138

Review
3

This paper introduces a novel framework called CaDRe (Causal Discovery and Representation Learning). Its primary goal is to uncover complex causal structures within climate systems from purely observational time-series data. A core innovation of CaDRe lies in its unification of Causal Representation Learning (CRL) and Causal Discovery (CD), establishing rigorous identifiability conditions for both latent processes and observed causal graphs, even without strong parametric assumptions. The framework leverages a novel theoretical connection between Structural Equation Models (SEM) and Nonlinear Independent Component Analysis (ICA) to achieve this. Methodologically, CaDRe is instantiated as a state-space Variational Autoencoder (VAE) that incorporates flow-based priors and gradient-based structural penalties to ensure the identifiability and sparsity of the learned causal graphs. Empirical validation through synthetic data experiments confirms its theoretical claims, demonstrating CaDRe's ability to recover latent representations and causal structures. On real-world climate datasets (CESM2 sea surface temperature), CaDRe achieves competitive forecasting performance and visualizes causal graphs consistent with domain knowledge, such as wind circulation patterns and land-sea interactions, even revealing structural patterns that may inspire new hypotheses in climate science.

Strengths and Weaknesses

Strengths

  • The paper explicitly identifies the need for a "unified framework". By simultaneously addressing both latent variable discovery (CRL) and observed variable causal discovery (CD), CaDRe tackles a more complete and realistic problem than methods that focus on only one aspect or assume simpler data-generating processes.
  • The paper provides rigorous identifiability proofs in a nonparametric setting, which is crucial for the complex and often unknown functional forms in climate systems. The assumptions are clearly stated and discussed, indicating a deep theoretical understanding.
  • Beyond competitive forecasting performance, CaDRe's ability to generate "visualized causal graphs consistent with domain knowledge" is a major strength for scientific applications.

Weaknesses

  • The method exhibits performance degradation as the data dimensionality increases. This implies that as the number of variables in climate models grows, or when analyzing at finer scales where more interconnected factors need to be considered, CaDRe's efficiency and accuracy might be impacted. This could limit its direct applicability in handling extremely complex and high-dimensional climate datasets.

问题

  • How robust is CaDRe when some of the conditions in Assumptions A1-A5 are partially violated in real-world data? Are there ways to quantify the degree of these violations and assess their impact on the results?
  • How can the "physical interpretability" of the latent variables $z_t$ and noise terms $s_t$ be quantified or evaluated beyond consistency with domain knowledge? Are there more rigorous metrics?

局限性

  • Performance Degradation with Dimensionality: The method shows reduced performance as d increases, though partitioning variables via geographical priors is suggested as a solution.
  • Reliance on Observational Data: The framework assumes access to time-series data, but climate observations are often sparse or noisy, which may impact identifiability in practice.

Paper Formatting Concerns

none

Author Response

Dear Reviewer p8vt, thank you for your constructive comments. Your insights have helped us significantly improve the clarity, empirical validation, and theoretical framing of our work. Below, please see our point-to-point responses.

W1: Performance/efficiency degrades with dimensionality in synthetic data.

That is a great point! We sincerely appreciate the insightful comment about the importance of evaluating the scalability of our approach. Although identifying the causal process becomes harder as dimensionality increases, identification can still be achieved in some cases by using a climate prior. Please note that we have included experiments with $d_x = 100^*$ in Main Paper, Table 2, as follows:

| $d_z$ | $d_x$ | SHD ($J_{\hat{g}}(\hat{x}_t)$) | TPR | Precision | MCC ($s_t$) | MCC ($z_t$) | SHD ($J_r(\hat{z}_t)$) | SHD ($J_d(\hat{z}_{t-1})$) | $R^2$ |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 100* | 0.17 ± 0.02 | 0.80 ± 0.05 | 0.81 ± 0.02 | 0.9131 ± 0.02 | 0.9565 ± 0.02 | 0.21 ± 0.01 | 0.29 ± 0.10 | 0.93 ± 0.03 |

To overcome this challenge, in this setting (*) we mask 75% of the edges in the initial fully connected graph based on spatial distance, under the assumption that distant regions do not directly interact, which is consistent with domain knowledge of climate systems. This prior reduces spurious dependencies and mitigates local minima during optimization, as described in Main Paper, Lines 276–278 and Appendix, Lines 1283–1287. We have emphasized this in our revised manuscript.

To further validate robustness to dimensionality, we conduct additional experiments with $d_x \in \{20, 50, 80, 100, 200\}$, applying the same physical prior in each case. As shown in the table below, our method maintains strong performance across all metrics, with only mild degradation as dimensionality increases. Inference remains efficient because nonlinear-ICA-based structure learning amounts to a single generation step.

| $d_z$ | $d_x$ | SHD ($J_{\hat{g}}(\hat{x}_t)$) | TPR | Precision | MCC ($s_t$) | MCC ($z_t$) | SHD ($J_r(\hat{z}_t)$) | SHD ($J_d(\hat{z}_{t-1})$) | $R^2$ | Inference Time (ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 20 | 0.09 ± 0.01 | 0.92 ± 0.02 | 0.89 ± 0.01 | 0.9573 ± 0.12 | 0.9742 ± 0.08 | 0.10 ± 0.01 | 0.18 ± 0.04 | 0.96 ± 0.01 | 0.89 ± 0.07 |
| 3 | 50 | 0.13 ± 0.02 | 0.87 ± 0.17 | 0.85 ± 0.19 | 0.9318 ± 0.01 | 0.9619 ± 0.01 | 0.16 ± 0.02 | 0.22 ± 0.06 | 0.93 ± 0.02 | 0.99 ± 0.14 |
| 3 | 80 | 0.15 ± 0.02 | 0.84 ± 0.08 | 0.83 ± 0.10 | 0.9223 ± 0.07 | 0.9550 ± 0.09 | 0.18 ± 0.02 | 0.25 ± 0.13 | 0.94 ± 0.03 | 1.07 ± 0.25 |
| 3 | 100 | 0.17 ± 0.02 | 0.80 ± 0.05 | 0.81 ± 0.02 | 0.9131 ± 0.02 | 0.9565 ± 0.02 | 0.21 ± 0.01 | 0.29 ± 0.10 | 0.93 ± 0.03 | 1.25 ± 0.19 |
| 3 | 200 | 0.16 ± 0.07 | 0.74 ± 0.06 | 0.72 ± 0.04 | 0.8950 ± 0.02 | 0.9603 ± 0.03 | 0.22 ± 0.02 | 0.35 ± 0.12 | 0.92 ± 0.04 | 1.45 ± 0.16 |

W2: Sparse or noisy time-series climate data may hinder practical identifiability.

Thank you for raising this important point. We sincerely appreciate the insightful comment that learning identifiable representations from noisy and incomplete climate observations is a significant challenge in practice. Indeed, we have made considerable efforts to address these concerns. In particular:

  • For the sparse data, we do not require invertibility in the generative process, which offers potential resilience to partial observability or missing data.
  • For the noisy data, we allow the data-generating process to be nonparametric (Eq. (1)), which improves robustness to observational noise.

In light of your insight, we will include related discussions about the identifiability under sparse/noisy observations in the revised manuscript.

Q1: How robust is CaDRe when some of the assumptions A1–A5 are violated, and can their violations be quantified or assessed in practice?

Thanks for your insightful question! The assumptions A1–A5 primarily require distributional variability (e.g., nonstationarity and latent-driven dynamics), which is generally present in real-world climate systems due to their inherent physical variability and forcing mechanisms, as discussed in Main Paper, Lines 141–146.

To quantify the degree of assumption violations and evaluate their impact, we conducted controlled simulation experiments in which we violated A2, A3, and A5. A1 (continuous density) and A4 (differentiability) are trivially satisfied in practice via neural network parameterizations, e.g., VAEs.

The table below summarizes performance under these violations for $d_z = 3$ and $d_x = 6$, evaluating both representation quality and causal discovery:

| Assumption | $d_z$ | $d_x$ | SHD ($J_{\hat{g}}(\hat{x}_t)$) | TPR | Precision | MCC ($s_t$) | MCC ($z_t$) | SHD ($J_r(\hat{z}_t)$) | SHD ($J_d(\hat{z}_{t-1})$) | $R^2$ |
|---|---|---|---|---|---|---|---|---|---|---|
| No Violation | 3 | 6 | 0.18 ± 0.06 | 0.83 ± 0.03 | 0.80 ± 0.04 | 0.9583 ± 0.02 | 0.9505 ± 0.02 | 0.24 ± 0.19 | 0.33 ± 0.09 | 0.92 ± 0.01 |
| Violate A2 (Contextual Variability) | 3 | 6 | 0.26 ± 0.07 | 0.71 ± 0.05 | 0.68 ± 0.05 | 0.7563 ± 0.04 | 0.8820 ± 0.04 | 0.36 ± 0.07 | 0.41 ± 0.08 | 0.67 ± 0.03 |
| Violate A3 (Latent Drift) | 3 | 6 | 0.31 ± 0.05 | 0.67 ± 0.13 | 0.64 ± 0.06 | 0.8645 ± 0.07 | 0.8478 ± 0.08 | 0.39 ± 0.12 | 0.46 ± 0.14 | 0.78 ± 0.21 |
| Violate A5 (Generation Variability) | 3 | 6 | 0.35 ± 0.11 | 0.65 ± 0.12 | 0.60 ± 0.10 | 0.7052 ± 0.13 | 0.9325 ± 0.03 | 0.41 ± 0.08 | 0.47 ± 0.10 | 0.85 ± 0.02 |

The violations were implemented as follows:

  • A2 (Contextual Variability) was violated by using a non-injective latent transition $z_t = z_{t-1} + \varepsilon^z_t$ with uniform noise $\varepsilon^z_t$.
  • A3 (Latent Drift) was violated by modifying the generation to the nonlinear form $x_t = W z_t^2 + s_t$ with a linear matrix $W$.
  • A5 (Generation Variability) was violated by using the simplified linear additive form $s_t = z_t + \varepsilon^x_t$.

These results show that CaDRe maintains meaningful performance under moderate violations, with MCC ($z_t$) above 0.84 and $R^2$ of at least 0.67 in every setting. The magnitude of performance degradation provides a practical means to quantify robustness: violations of A2 and A3 mainly impair latent representation identification, while A5 primarily affects causal discovery.

We have included these results and an extended discussion in the revised manuscript to clarify the empirical and theoretical robustness of our framework.

Q2: Physical interpretability of the latent variables and noise

We appreciate this important question, which has given us the opportunity to deepen our analysis. We want to clarify that latent factors are, by definition, unobserved; in principle, we cannot give them a direct interpretation. This poses a central challenge in the field of CRL. However, once we are sure that latent factors exist and understand how they are related to measured variables, we can begin to interpret them and even devise ways to measure them. For example, given domain knowledge of latent factors from climate experts and scientists, we can match them to meaningful quantities, e.g., precipitation, solar radiation, or components of a climate foundation model representation.

The noise terms capture aggregated uncertainties from factors such as human activity, measurement error, and unmodeled dynamics. Although they are not directly interpretable, their distributional behavior (e.g., variability aligned with the latent evolution) can still reveal useful scientific insights, such as how ocean currents influence regional variability.

These directions mirror historical processes in science, such as the discovery of viruses, which were first hypothesized from indirect evidence and later confirmed and measured directly. We have included this perspective in the main paper and regard it as future work.

Comment

Dear Reviewer p8vt,

We are grateful for the time you have spent on our paper, your constructive comments, and your recognition of the significance and novelty of our work. Could you please look at our response and let us know whether your concerns have been addressed regarding:

  • W1: Additional simulation results on higher-dimensional datasets to assess performance and efficiency scalability, showing that our model does not degrade in higher-dimensional settings with the physical prior.
  • W2: Discussion of the robustness of our identifiability theory under sparse and noisy time-series climate data.
  • Q1: Quantifying assumption violations through concrete cases, evaluating their impact on results via simulation experiments, which demonstrates the robustness of CaDRe.
  • Q2: Clarification of the physical interpretability of latent variables and noise.

Your further feedback would be highly appreciated.

Best,

The Authors of Submission 9138

Comment

Thanks for the authors' detailed rebuttal. I'll keep my score.

Comment

Dear Reviewer p8vt,

Thank you for taking the time to read our rebuttal. We sincerely appreciate your effort in reviewing our work.

We would highly appreciate it if you could elaborate on the points you are not satisfied with, so that we have the opportunity to provide further clarification and refine our work.

With best regards,

The Authors of Submission 9138

Comment

Dear Reviewer p8vt,

Thanks again for the time you dedicated to reviewing this paper and for the strengths you identified. From our perspective, your comments are overall rather positive, with only two questions and two comments on the limitations. As you can see, we have provided detailed responses to them. We would highly appreciate it if you could take another look at our responses to confirm that all points have been addressed and consider updating your rating.

Sincerely,

The Authors of Submission 9138

Comment

Dear all reviewers,

The author rebuttal period has now concluded, and authors' responses are available for the papers you are reviewing. The Author-Reviewer Discussion Period has started, and runs until August 6th AoE.

Your active participation during this phase is crucial for a fair and comprehensive evaluation. Please take the time to:

  • Carefully read the author responses and all other reviews.
  • Engage in a constructive dialogue with the authors, clarifying points, addressing misunderstandings, and discussing any points of disagreement.
  • Prioritize responses to questions specifically addressed to you by the authors.
  • Post your initial responses as early as possible within this window to allow for meaningful back-and-forth discussion.

Your insights during this discussion phase are invaluable. Thank you for your continued commitment to the NeurIPS review process.

Best,
Your AC

Comment

Dear Reviewers and AC,

We sincerely thank you for your time, effort, and constructive feedback during the review and discussion phases. Your comments have been invaluable in improving the clarity, empirical validation, and theoretical framing of our work. We are grateful for the recognition of our contributions and for the suggestions that motivated new experiments, clarifications, and discussions in the revised manuscript.

As most concerns have been addressed, we provide below, for the convenience of the AC and reviewers, a concise summary of our general response; full details, tables, and derivations are available in our original point-by-point replies.

I. Clarification on Theoretical Contributions

  • Robust identifiability in CRL & CD: A nonparametric identifiability theory for latent-variable CRL and CD in climate time series, enabling recovery of latent representations and observational causal graphs without assuming invertibility and accommodating sparse/noisy data (W2, Q1 to p8vt).
  • Why train an ICA instead of directly training a SEM: In nonparametric SEM $X = f(X, E)$, noise $E$ cannot be separated from $X$, making direct SEM training ill-posed. Nonlinear ICA can recover the latent sources (noise) under identifiability guarantees, enabling principled causal graph estimation (Experiments & Discussions in W4 to 1X1S).
  • Assumptions Discussion: Assumptions A1–A5 capture climate properties such as 1st-order (extendable) Markov structure and local variability in latent dynamics. Compared to prior CRL works, A2 & A3 are much weaker than the common invertibility requirement (Appendix A.8). We evaluated robustness under controlled violations of A2, A3, A5 to assess their impact. CaDRe maintained high $R^2$ and MCC scores, with degradation patterns depending on which assumption was violated (Q1 to p8vt, L2 to NFts).
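The SEM-versus-ICA argument above can be made concrete with a standard reduction, sketched here under assumptions that are ours rather than the paper's exact conditions: the SEM over observed variables is acyclic, the noise terms are mutually independent, and each $f_i$ is invertible in its noise argument $E_i$. Recursive substitution along a topological order then expresses $X$ as an invertible function of independent sources, which is a nonlinear ICA model, and the sparsity pattern of the unmixing Jacobian reflects the graph:

```latex
\begin{aligned}
&X_i = f_i\big(X_{\mathrm{pa}(i)},\, E_i\big), \quad i = 1, \dots, d_x, \qquad E_1, \dots, E_{d_x} \ \text{mutually independent},\\
&\text{acyclicity} \;\Longrightarrow\; X = h(E) \quad \text{(a nonlinear ICA model with sources } E\text{)},\\
&E_i = f_i^{-1}\big(X_i;\, X_{\mathrm{pa}(i)}\big) \;\Longrightarrow\; \frac{\partial E_i}{\partial X_j} \neq 0 \ \text{only if } j = i \ \text{or } j \in \mathrm{pa}(i).
\end{aligned}
```

Reading off the graph therefore reduces to estimating the unmixing map $h^{-1}$: its off-diagonal Jacobian support is contained in the edge set of the causal graph, with equality under a generic-dependence assumption.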

II. Extended Empirical Validation

  • New datasets and baselines: We have added Weather and ERSST datasets, and reproduced recent strong baselines — Timer-XL, TimeMixer, TimeXer, and xLSTM-Mixer for forecasting; TCDF, IDOL, and TDRL for causal discovery (W1–W3 to 2AQA, W2 to 1X1S).
  • Quantitative causal discovery metrics: Introduced Wind-SHD and Wind-TPR to assess alignment with wind-induced reference graphs, showing CaDRe achieves the best scores across baselines (W3 to 2AQA).
  • Robustness tests: Conducted evaluations on higher-order latent dynamics, controlled violations of assumptions (A2, A3, A5), and scaling to $d_x \in \{20, 50, 80, 100, 200\}$, demonstrating only mild performance degradation (Q1 to p8vt, Q2 to G8Kr).
  • Efficiency analysis: Measured training time, memory usage, and inference latency, showing CaDRe is competitive or faster than baselines in both forecasting and causal discovery tasks (W5 to 1X1S).
  • Dimensionality degradation and physical priors: In high-dimensional climate datasets, both simulated and real-world, we incorporate spatial distance-based physical priors into CaDRe. This reduces spurious dependencies, mitigates local minima, and preserves performance when scaling to large $d_x$ (W1 to p8vt, Q3 to G8Kr).
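Wind-SHD and Wind-TPR are described only at a high level in this summary. As a rough sketch, assuming they apply the standard structural Hamming distance (SHD) and true positive rate (TPR) to binary adjacency matrices, with the wind-induced reference graph built in a way not specified here, the two metrics could be computed as follows. The function names and the convention that a reversed edge counts as a single error are assumptions.

```python
def shd(est, ref):
    """Structural Hamming distance between two directed graphs.

    est, ref: d-by-d 0/1 adjacency matrices (lists of rows), where
    adj[i][j] == 1 means an edge i -> j. Each unordered pair {i, j} whose
    edge status differs (missing, extra, or reversed) adds 1.
    """
    d = len(ref)
    return sum(
        1
        for i in range(d)
        for j in range(i + 1, d)
        if (est[i][j], est[j][i]) != (ref[i][j], ref[j][i])
    )


def tpr(est, ref):
    """True positive rate: share of reference edges recovered with the same direction."""
    d = len(ref)
    ref_edges = [(i, j) for i in range(d) for j in range(d) if i != j and ref[i][j]]
    if not ref_edges:
        return float("nan")  # TPR is undefined for an empty reference graph
    hits = sum(1 for i, j in ref_edges if est[i][j])
    return hits / len(ref_edges)


# Hypothetical 3-node example: reference 0 -> 1 -> 2, estimate reverses 1 -> 2.
ref = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
est = [[0, 1, 0], [0, 0, 0], [0, 1, 0]]
print(shd(est, ref), tpr(est, ref))  # 1 0.5
```

Lower SHD and higher TPR both indicate closer agreement with the reference graph; reporting them together separates missing or reversed edges (which lower TPR) from extra edges (which raise SHD without affecting TPR).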

III. Physical Interpretation & Visualization

  • Physical interpretation of latent variables: Latent variables are unobserved by definition; their interpretation requires domain knowledge, similar to historical scientific discoveries. Once their existence and relation to observed variables are established, they can be matched to meaningful physical quantities (e.g., precipitation, solar radiation) with expert input. Noise terms capture aggregated uncertainties such as human activity and measurement error (Q2 to p8vt, W1 & Q1 to G8Kr, W3(3) to 1X1S).
  • Visualization: Visualizations distinguish wind-field arrows, where length indicates wind speed, from estimated causal graph edges, where length indicates spatial distance between connected regions. Latent transitions are explicitly highlighted to clarify their role in the dynamic process (Q1 to 2AQA, W3(4) to 1X1S).

IV. Future Work

We will:

  • Extend CaDRe to more general structures: Support time-lagged observed causal structures and sparse generation processes, with corresponding new identifiability results (W4 to 2AQA).
  • Integration with pretrained climate models: Incorporate CaDRe into pretrained climate foundation models (e.g., ClimaX, GenCast) as a fine-tuning module for post-training causal structure recovery (W4 to 2AQA).

V. Closing

With added datasets and baselines, new physical metrics, robustness and scalability tests, efficiency analysis, and clarifications on identifiability and assumptions, we believe all major concerns are addressed. We again thank the AC and reviewers for their constructive feedback and welcome any final questions.

Final Decision

This paper presents CaDRe, a novel and ambitious framework that aims to unify causal representation learning and causal discovery for time-series data, with a strong motivation from climate analysis. The authors provide a theoretical foundation for their approach, establishing identifiability conditions for simultaneously learning a latent dynamic process and the causal graph over observed variables in a nonparametric setting.

The reviewers recognised the importance of the problem and appreciated the theoretical contributions and the clarity of the presentation. However, significant concerns were raised during the initial review phase. The primary weaknesses identified were the limited empirical validation, which included outdated baselines, a single real-world dataset, and a lack of quantitative metrics for the causal discovery claims. Furthermore, reviewers questioned the practical scalability of the method and, critically, the physical interpretability of the learned latent variables, which may be of particular concern to readers focused on scientific discovery. The AC also noticed a perceived disconnect between the theoretical identifiability analysis and the general construction and running of the learning algorithm. The empirical evaluation suffers from a mismatch between the paper's core contribution, causal structure learning, and the chosen baselines. The authors primarily compare against models designed for time-series forecasting (using forecasting metrics), while omitting a significant number of classical and state-of-the-art methods for causal discovery. As a result, despite the results in Fig. 5, the performance evaluation for the central task of causal graph inference is inadequate, and the paper's claims in this area are not sufficiently substantiated.

The authors provided an exceptionally thorough rebuttal, adding a large volume of new experiments with more datasets, updated baselines, new metrics, and scalability analysis. While this effort was commendable and swayed several reviewers, it did not lead to a clear consensus. Moreover, the crucial issue of latent variable interpretability remains largely unaddressed and is deferred to future work, which could temper the paper's claimed impact on climate science. One reviewer remained unconvinced by the rebuttal, maintaining a negative score. This lack of consensus, combined with the major revisions required to address the initial evaluation's shortcomings, suggests the paper may not yet be ready for publication. The AC acknowledges the promising and ambitious nature of the work presented. However, after careful evaluation and in light of the highly competitive submission environment and the unfortunately limited acceptance ratio, it was collectively determined that this paper, regrettably, cannot be accepted for this round. We highly encourage the authors to build upon the promising foundation of this work and consider resubmission to a future venue, addressing any specific feedback from the reviewers.