PaperHub
6.6 / 10
Poster · 5 reviewers
Ratings: 5, 6, 8, 6, 8 (lowest 5, highest 8, standard deviation 1.2)
Average confidence: 3.2 · Correctness: 2.8 · Contribution: 2.6 · Presentation: 2.8
ICLR 2025

Identification of Intermittent Temporal Latent Process

OpenReview · PDF
Submitted: 2024-09-14 · Updated: 2025-03-02

Abstract

Keywords: unsupervised representation learning

Reviews and Discussion

Review (Rating: 5)

This paper proposes a meaningful study on learning latent variables and their identifiability theory for intermittent temporal latent processes. The author establishes a set of novel identifiability results for intermittent latent temporal processes, extending identifiability theory to scenarios where latent factors may be missing or inactive at different time steps. However, although the proposed identifiability theory is relatively sophisticated, the theory and the proposed network structure are relatively isolated, and the network does not adapt well to the identifiability assumptions.

Strengths

  1. The author presents a promising research scenario, namely intermittent temporal latent processes.

  2. The author proposes a complex identifiability theory in which both stationary and non-stationary transition processes are applicable.

  3. The author shows that under certain assumptions, latent variables can be identified block-wise. By further assuming conditional independence, each latent variable can even be recovered up to a component-wise transformation.

Weaknesses

  1. In this paper, the author considers latent variables as causal variables. Does the "causality" here refer to Granger causality? If not, what is the difference between Granger causality and the causality among latent variables?

This looks more like a blind source separation task. How does it relate to mainstream structural causal models or potential outcome models? Or what is the connection with traditional Granger causality?

  2. The proposed theory and the proposed model are not closely connected, making it difficult to see the relationship between the network design and the assumptions given by the theorems.

  3. The same type of variational autoencoder, such as "a transition prior module based on normalizing flows," has been used in many papers [LEAP, TDRL, etc.], and there appears to be no significant difference or improvement in the network design compared to other methods. Does this mean that all baseline variational autoencoders have the identification capability proposed in this paper?

  4. In the experiments, it is also difficult to directly identify where the synthesized dataset meets the identifiability assumptions proposed in the paper.

Reference:

[LEAP] Weiran Yao, Yuewen Sun, Alex Ho, Changyin Sun, and Kun Zhang. Learning temporally causal latent processes from general temporal data. In International Conference on Learning Representations, 2022.

[TDRL] Weiran Yao, Guangyi Chen, and Kun Zhang. Temporally disentangled representation learning. Advances in Neural Information Processing Systems, 35:26492–26503, 2022.

Questions

  1. This article presents a promising theory, but the network design is no different from most papers (e.g., LEAP, TDRL, etc.), and it is also unclear which modules are designed to meet specific identifiability conditions. This appears very disjointed and results in a mismatch between theory and methodology.

  2. What are the limitations of this article, and does it mean that any variational autoencoder model is applicable to any data in any situation?

  3. The author should further clarify where the given synthetic data generation satisfies the identifiability assumptions, and should provide a detailed description of how it is combined with the proposed theory in the network design, rather than simply listing the network structure.

  4. The author should clarify whether the causal variable belongs to the Granger causality category or other categories.

Comment

Q3: The same type of variational autoencoder, such as "a transition prior module based on normalizing flows," has been used in many papers [LEAP, TDRL, etc.], and there appears to be no significant difference or improvement in the network design compared to other methods. Does this mean that all baseline variational autoencoders have the identification capability proposed in this paper?

Q5: This article presents a promising theory, but the network design is no different from most papers (e.g., LEAP, TDRL, etc.), and it is also unclear which modules are designed to meet specific identifiability conditions. This appears very disjointed and results in a mismatch between theory and methodology.

A: Given the similarity between these two questions, we answer them together here.

The key distinction between our work and previous works, such as LEAP and TDRL, lies in our focus on identifying representations of intermittent temporal latent processes, which are characterized by two crucial properties:

  1. Temporal Intermittency: At any time step, arbitrary subsets of latent factors may be missing during the nonlinear time-delayed data generation process;

  2. Unknown Active Factors: The specific set of active latent factors at each time step is not known a priori.

LEAP, TDRL, and similar variational autoencoders fundamentally differ from our approach in how the learning objective handles intermittent processes: our approach explicitly incorporates sparsity regularization terms ($\|J_{\hat{g},t}\|_{2,1}$, $\|J_{\hat{f},t}\|_{1,1}$, $\|J_{\hat{f},t}\|_{2,1}$) in Eq. 10 to learn and account for missingness patterns. In contrast, LEAP, TDRL, and our ablation baselines (W/O s of f and W/O s of g) lack mechanisms to handle intermittent observations.

We have discussed how our proposed data-generating process encompasses LEAP and TDRL as special cases in lines 241-245 of the paper revision. However, these models cannot adequately capture the dynamics of intermittent temporal latent processes. Our experiments in Figure 3 confirm the superiority of InterLatent over LEAP and TDRL.
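To make the role of these sparsity regularization terms concrete, here is a minimal PyTorch-style sketch of how Jacobian-based penalties of this kind can be computed; the function and variable names are illustrative assumptions, not the paper's actual implementation:

```python
import torch
from torch.func import jacrev, vmap

def jacobian_sparsity_penalties(decoder, transition, z_prev, z_t):
    """Illustrative sketch: L2,1 / L1,1 norms of decoder and transition Jacobians.

    decoder:     callable mapping a single latent z_t (N,) to an observation (K,),
                 standing in for the estimated mixing function g-hat
    transition:  callable mapping (z_prev, z_t), each (N,), to a residual (N,),
                 standing in for the estimated transition f-hat
    z_prev, z_t: batches of latent samples, shape (B, N)
    """
    # Per-sample Jacobians via autodiff: J_g has shape (B, K, N), J_f has shape (B, N, N).
    J_g = vmap(jacrev(decoder))(z_t)
    J_f = vmap(jacrev(transition, argnums=0))(z_prev, z_t)

    # ||J_g||_{2,1}: L2 norm over output rows per latent column, summed over columns,
    # so that entire columns (i.e., individual latent variables) are encouraged to vanish.
    l21_g = J_g.pow(2).sum(dim=1).sqrt().sum(dim=-1).mean()
    # ||J_f||_{1,1}: elementwise L1 norm of the transition Jacobian.
    l11_f = J_f.abs().sum(dim=(-2, -1)).mean()
    # ||J_f||_{2,1}: column-wise L2 norms of the transition Jacobian, summed.
    l21_f = J_f.pow(2).sum(dim=-2).sqrt().sum(dim=-1).mean()
    return l21_g, l11_f, l21_f
```

Penalizing these norms pushes whole Jacobian columns toward zero, which is how a latent variable can be rendered inactive at a given time step.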

Q4: In the experiments, it is also difficult to directly identify where the synthesized dataset meets the identifiability assumptions proposed in the paper.

Q7: The author should further clarify where the given synthetic data generation satisfies the identifiability assumptions, and should provide a detailed description of how it is combined with the proposed theory in the network design, rather than simply listing the network structure.

A: We appreciate your questions. Our synthesized dataset incorporates the assumptions from the identifiability results as follows:

  1. Smoothness and Positivity (Assumption i):

We implement the transition function $f$ as $f(z_{t-1}, \epsilon_t) = z_{t-1} \cdot \sinh(\epsilon_t)$, where $\epsilon_t \sim \mathcal{N}(0, 0.1)$ enters non-additively through multiplication. For missing components, the transition function is $f(\epsilon_t) = \sinh(\epsilon_t)$, which is both infinitely differentiable and invertible. Initial states are drawn from $z_0 \sim U(0,1)$, ensuring positive measure. The mixing function $g$ is implemented as $g(z) = \sinh(z)$. These functions satisfy the twice-differentiability requirement.

  2. Path-connectedness assumption:

In our data-generating process, the functions $f(z_{t-1}, \epsilon_t) = z_{t-1} \cdot \sinh(\epsilon_t)$ and $g(z) = \sinh(z)$ preserve this property by being continuous mappings between real spaces of $z_t$. Therefore, path-connectedness is guaranteed.

  3. Sufficient variability assumption:

The transition function $f$ ensures sufficient variability through the strict monotonicity of $\sinh$ over $\mathbb{R}^N$. For variables in the support set, multiplication with $z_{t-1}$ provides rich transitions, while the nonlinear $\sinh$ transformation ensures the Hessian has full rank over $\mathbb{R}^{d_t \times d_t}$.

Additionally, both $f$ and $g$ are invertible through $\operatorname{arcsinh}$, ensuring unique recovery of both latent states and noise terms.
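For reference, a minimal NumPy sketch of this generating process; the latent dimension, sequence length, and the rate at which factors are active are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_sequence(N=4, T=100, active_rate=0.7):
    """Sketch of the sinh-based intermittent process described above (assumed settings)."""
    z = rng.uniform(0.0, 1.0, size=N)                 # z_0 ~ U(0, 1)
    zs, xs, supports = [], [], []
    for _ in range(T):
        eps = rng.normal(0.0, np.sqrt(0.1), size=N)   # eps_t ~ N(0, 0.1), variance 0.1 assumed
        s_t = rng.random(N) < active_rate             # which factors are active (illustrative)
        z = np.where(s_t,
                     z * np.sinh(eps),                # active:  f(z_{t-1}, eps_t) = z_{t-1} * sinh(eps_t)
                     np.sinh(eps))                    # missing: f(eps_t) = sinh(eps_t)
        x_t = np.sinh(z)                              # mixing:  g(z) = sinh(z)
        zs.append(z); xs.append(x_t); supports.append(s_t)
    return np.stack(zs), np.stack(xs), np.stack(supports)

z_seq, x_seq, s_seq = generate_sequence()
```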

Comment

In the paper, the network structure presented by the author is almost identical to existing works (e.g., LEAP, TDRL, etc.). Does this mean that the same network structure can solve all the problems related to the temporal causality defined by the author in temporal data?

To be more straightforward: has the author designed a unique, perhaps non-obvious module that makes the network structure different from previous work? And can this change solve the "intermittent" problem proposed in the paper, while a network without this module cannot?

Comment

Q2: The proposed theory and the proposed model are not closely connected, making it difficult to see the relationship between the network design and the assumptions given by the theorems.

Thank you for your comment.

Since our primary contribution lies in establishing identifiability theory for intermittent temporal latent processes, these identifiability results are fundamentally estimator-agnostic: they characterize the conditions under which the true latent variables can be recovered, and they can be leveraged by a wide range of estimators. In our specific setting, we only require the estimator (InterLatent) to impose sparsity regularization and to match the observational distribution. Specifically, the encoder acquires latent causal representations by inferring $q_\omega(\hat{z}_t \mid x_t)$ from observations. These learned latent variables are then used by the decoder $p_\gamma(\hat{x}_t \mid \hat{z}_t)$ to reconstruct the observations, implementing the mixing function $g$ in Eq. 1. To learn the latent variables, we constrain them through the KL divergence between their posterior distribution and a prior distribution, which is estimated using a normalizing flow that converts the prior into Gaussian noise in Eq. 8. For the ELBO in Eq. 10, $L_\text{Recon}$ measures reconstruction quality between the ground-truth and reconstructed observations from the decoder; $L_\text{KLD}$ enforces the learned posterior of $\hat{z}_t$ to match the temporal prior distribution of $z_t$; and the sparsity regularization terms ($\|J_{\hat{g},t}\|_{2,1}$, $\|J_{\hat{f},t}\|_{1,1}$, $\|J_{\hat{f},t}\|_{2,1}$) implement the support sparsity, ensuring a proper support structure by promoting sparsity in both the decoder and transition-function Jacobians.
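As a rough illustration of how such an objective can be assembled, the following sketch combines the three parts described above; the tensor shapes, weighting coefficients, and the Monte Carlo form of the KL term are assumptions for illustration, not the paper's exact Eq. 10:

```python
import math
import torch.nn.functional as F

def modified_elbo(x, x_hat, q_logvar, log_prior, sparsity_terms,
                  beta=1.0, lam_g=0.1, lam_f1=0.1, lam_f2=0.1):
    """Illustrative assembly of L_Recon + L_KLD + Jacobian sparsity penalties.

    x, x_hat:       observations and reconstructions, shape (B, T, K)
    q_logvar:       log-variances of the Gaussian posterior q(z_t | x_t), shape (B, T, N)
    log_prior:      log-density of sampled z_t under the flow-based temporal prior, shape (B, T)
    sparsity_terms: tuple (l21_g, l11_f, l21_f) of scalar Jacobian penalties
    """
    recon = F.mse_loss(x_hat, x)                                        # L_Recon
    # Monte Carlo estimate of KL(q || p): -H[q] - E_q[log p], with the Gaussian
    # entropy available in closed form from the posterior log-variances.
    entropy = 0.5 * (q_logvar + 1.0 + math.log(2 * math.pi)).sum(-1).mean()
    kld = -log_prior.mean() - entropy                                   # L_KLD (sketch)
    l21_g, l11_f, l21_f = sparsity_terms
    return recon + beta * kld + lam_g * l21_g + lam_f1 * l11_f + lam_f2 * l21_f
```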

At the same time, the assumptions stated in Theorem 1 (smoothness, path-connectedness, and sufficient variability) are properties of the true data-generating process, and they are used in generating the synthetic datasets that validate our theory. We also list the specific data-generating process for the simulation below, in case that is helpful:

  • We implement the transition function $f$ as $f(z_{t-1}, \epsilon_t) = z_{t-1} \cdot \sinh(\epsilon_t)$, where $\epsilon_t \sim \mathcal{N}(0, 0.1)$ enters non-additively through multiplication. For missing components, the transition function is $f(\epsilon_t) = \sinh(\epsilon_t)$, which is both infinitely differentiable and invertible. Initial states are drawn from $z_0 \sim U(0,1)$, ensuring positive measure. Both $f$ and $g$ are continuous mappings on the real spaces of $z_t$ and $x_t$, which establishes path-connectedness. The mixing function $g$ is implemented as $g(z) = \sinh(z)$. These functions satisfy the twice-differentiability requirement. Also, the transition function $f$ ensures sufficient variability through the strict monotonicity of $\sinh$ over $\mathbb{R}^N$. For variables in the support set, multiplication with $z_{t-1}$ provides rich transitions, while the nonlinear $\sinh$ transformation ensures the Hessian has full rank over $\mathbb{R}^{d_t \times d_t}$. Additionally, both $f$ and $g$ are invertible through $\operatorname{arcsinh}$, ensuring unique recovery of both latent states and noise terms.

We hope the added discussion further clarifies our task. Please feel free to let us know if you have any further questions; we would be more than happy to address them.

Comment

Q6: What are the limitations of this article, and does it mean that any variational autoencoder model is applicable to any data in any situation?

A: In terms of CRL, VAE frameworks must be tailored to specific data-generating processes, such as temporal data modeling [2,3,4], transfer learning [5,6], etc. In our case, handling intermittent temporal processes requires explicit modeling of missingness through sparsity regularization. This principle is clearly demonstrated by comparing our method to LEAP and TDRL in Figure 3 of the paper. While these methods also use VAE frameworks, they cannot handle intermittent processes. This illustrates that simply using a VAE framework is insufficient: the model architecture and training objectives must match the underlying data-generating mechanism. Therefore, our work precisely shows that VAE models must be carefully designed for their specific applications, rather than being universally applicable.

As stated in our conclusion, while we have demonstrated the effectiveness of our approach on the visual group activity recognition task, the lack of other applications is a limitation of this work. This is mainly because we focus on the identifiability theory. We plan to apply our theory to a wider range of applications in the future.

[1] Scholkopf, et al. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021.

[2] Yao, et al.. Learning temporally causal latent processes from general temporal data. In International Conference on Learning Representations, 2022.

[3] Yao, et al.. Temporally disentangled representation learning. Advances in Neural Information Processing Systems, 35:26492–26503, 2022

[4] Chen, et al.. Caring: Learning temporal causal representation under non-invertible generation process. In Forty-first International Conference on Machine Learning, 2024

[5] Kong, et al.. Partial disentanglement for domain adaptation. In Proceedings of the 39th International Conference on Machine Learning, 2022

[6] Li, et al.. Subspace identification for multi-source domain adaptation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

Comment

Q1: In this paper, the author considers latent variables as causal variables. Does the "causality" here refer to Granger causality? If not, what is the difference between Granger causality and the causality among latent variables? This looks more like a blind source separation task. How does it relate to mainstream structural causal models or potential outcome models? Or what is the connection with traditional Granger causality?

Q8: The author should clarify whether the causal variable belongs to the Granger causality category or other categories.

A: Given the similarity between these two questions, we would like to address them together.

In this work, we build upon the context of causal representation learning (CRL) [1], where "causal" has a specific structural and generative meaning that differs fundamentally from Granger causality.

In our temporal CRL framework, we model latent variables $z_t$ as true causal factors that form a structured temporal system. These latent variables follow transition dynamics captured by the nonlinear transition function $f_n$. The observations $x_t$ are generated through a nonlinear mixing function $g$. The Jacobian matrices explicitly encode temporal causal influences between latent states, how latent variables causally generate observations, and sparsity patterns that reveal causal pathways.

Our work stands in contrast to Granger causality. While Granger causality mainly considers temporal relationships between observed variables and defines causality through predictive improvement using historical data, our temporal CRL models explicit causal mechanisms through structured latent variables $z_t$. We capture both the temporal dynamics through the transition function $f_n$ and the generating process through the mixing function $g$, allowing us to identify causal factors that may not be directly observable. Importantly, our framework represents causality through mechanistic generation rather than mere prediction.

Additionally, CRL shares mathematical foundations with independent component analysis (ICA), which is employed for the task of blind source separation. However, our work extends ICA by:

  1. Adding temporal structure to the latent variables through the transition function;
  2. Imposing sparse causal structure to the data generating process;
  3. Allowing the dimension of $z_t$ to be much smaller than that of $x_t$.

These points create a temporal SCM where:

  1. Latent variables represent true causal factors (as in SCMs);
  2. The temporal evolution follows causal mechanisms (through structured transitions);
  3. The mixing process defines how causes generate effects (through structured generation).
Comment

"While the Granger causality mainly considers temporal relationships between observed variables and defines causality through predictive improvement using historical data, our temporal CRL models explicit causal mechanisms through structured latent variables ztz_t. "

In my opinion, this paper also discusses the causal relationship between variables, which seems to be no different from Granger causality. Granger causality focuses on the impact of historical variables on the future and does not consider instantaneous causality; the causality mentioned in this paper is the same. Across the different modeling methods, it is nothing more than discovering which variables in time-series data can be called causal variables and how influence is transmitted between them. The relationship between the causality discussed in this paper and Granger causality remains very vague.

For blind source separation, the goal is to identify the true cause of observation generation, which is also the primary task of causal discovery. However, the author did not mention causal discovery.

So, does the causality proposed by the author belong to the SCM category or the Granger causality category?

Comment

Q: To be more straightforward: has the author designed a unique, perhaps non-obvious module that makes the network structure different from previous work? And can this change solve the "intermittent" problem proposed in the paper, while a network without this module cannot?

A: We would like to clarify two points regarding our work:

First, we understand that our network architecture is similar to TDRL. However, the key distinction lies in the objective function, which is grounded by our theorems. Specifically, we introduce an additional sparsity constraint in Equation 10, which serves as a unique module to differentiate our method from approaches like LEAP or TDRL.

Second, sharing the same architecture does not imply a lack of innovation. On the contrary, adding sparsity constraints with the same architecture highlights the effectiveness of this module. As demonstrated in Table 1, incorporating these constraints yields significant performance improvements compared to TDRL methods.

Furthermore, we want to emphasize that innovation can arise not only from changes in network architecture but also from carefully designed loss objectives. Numerous impactful works have maintained the same network structure while introducing principled constraints to address specific challenges. For instance:

  • ArcFace [1] incorporates an additive angular margin to enhance representation discriminability for large-scale face recognition.

  • Focal Loss [2] introduces a simple factor $(1-p)^{\gamma}$ to the standard cross-entropy criterion to address the foreground-background class imbalance in object detection.

We hope this explanation adequately addresses concerns regarding our unique contributions. If there are any further questions or clarifications needed, please feel free to reach out.

[1] Deng, et al. Arcface: Additive angular margin loss for deep face recognition. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019;

[2] Mukhoti, et al. Calibrating deep neural networks using focal loss. Advances in Neural Information Processing Systems 33 (2020): 15288-15299.

Comment

Q: The relation between Granger causality and our work;

Q: The connection between causal discovery and our work;

A: Thank you for your active engagement and thoughtful discussion! We greatly appreciate the opportunity to delve deeper into these concepts and clarify them further.

Granger causality: We understand that our framework generalizes Granger causality. Specifically, we assume the absence of instantaneous relationships among the latent variables $\mathbf{z}_t$, aligning with the general definition of Granger causality, which considers that past variables provide statistically significant information about future variables. In our previous response, we aimed to show the difference between our framework and traditional Granger-causality methods, which are typically formulated by modeling relations among the observed variables, as in vector autoregressive (VAR) models [1,2,3]. Unlike these traditional methods, such as VAR, our approach models the data generation process using latent variables rather than relying solely on observed variables.

We have added this discussion in Section E.1 of the paper revision.

Causal discovery: Causal representation learning can be seen as a generalization of causal discovery, as stated in [10]. We have included the following discussion of the relation between causal discovery and causal representation learning in Section E.2 of the revised paper:

"A majority of causal discovery methods for time-series data focus on identifying causal relationships within observed variables in an unsupervised manner [5,6,7,8,9]. These methods are limited when handling complex real-world scenarios like images and videos where causal effects operate in latent spaces. Our work addresses this limitation by focusing on identifying the latent causal variables that generate observations. "

SCM category or the Granger causality category: In our view, SCM and Granger causality are not inherently contradictory. SCM represents causal relationships between variables through structural equations, offering a framework for understanding and analyzing causal mechanisms. In contrast, Granger causality emphasizes predictive relationships, based on the assumption that future variables can be predicted from past variables. Notably, some approaches, such as [4], integrate SCM to model Granger causality, demonstrating their compatibility in certain contexts.

We would greatly value your thoughts on the distinctions between these terms like SCM and Granger causality. Your insights would be immensely helpful in clarifying these concepts and articulating this point effectively.

[1] H. Lutkepohl. New Introduction to Multiple Time Series Analysis. Springer, 2007. Section 2.3.1.

[2] Tank, Alex, et al. Neural granger causality. IEEE Transactions on Pattern Analysis and Machine Intelligence 44.8 (2021): 4267-4279. Eq. 1 and Eq.4

[3] Lozano, Aurelie C., et al. Grouped graphical Granger modeling methods for temporal causal modeling. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. 2009. Eq. 1

[4] Marcinkevičs, Ričards, and Julia E. Vogt. Interpretable Models for Granger Causality Using Self-explaining Neural Networks. International Conference on Learning Representations. Eq. 1

[5] Entner, et al. On causal discovery from time series data using fci. Probabilistic graphical models, pages 121–128, 2010.

[6] Murphy, et al. Dynamic bayesian networks. Probabilistic Graphical Models, M. Jordan, 7:431, 2002.

[7] Pamfil, et al. Dynotears: Structure learning from time-series data. In International Conference on Artificial Intelligence and Statistics, pages 1595–1605. PMLR, 2020.

[8] Daniel, et al. Causal structure learning from multivariate time series in settings with unmeasured confounding. In Proceedings of 2018 ACM SIGKDD Workshop on Causal Discovery, pages 23–47. PMLR, 2018.

[9] Daniel, et al. Learning the structure of a nonstationary vector autoregression. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2986–2994. PMLR, 2019.

[10] Morioka, et al. Causal representation learning made identifiable by grouping of observational variables. In Forty-first International Conference on Machine Learning, 2024.

Comment

Dear Reviewer WTHB,

Thank you for the time you've dedicated to reviewing our paper. As the discussion period deadline approaches, we would like to know if our response and revised paper have adequately addressed your concerns. If you have any additional feedback or suggestions, we are keen to hear them and respond accordingly.

The authors

Review (Rating: 6)

The authors propose a setting where observations are produced by a set of latent factors. These latent factors, however, may or may not be active from period to period. The authors suggest that under certain conditions these latent factors can be identified. Their theory informs a very specific architecture, and they demonstrate that the method works on synthetic data and on a real-world video dataset.

Strengths

I like and understand the setup - it is indeed reminiscent of many real-world problems. I also like how tightly the theory and the architecture are connected. Finally, I find the synthetic experiments persuasive - they do show that the proposed method indeed works.

Weaknesses

I have 2 main concerns: (1) there is no attempt to show that this method scales to high dimensions, and (2) real-world dataset experiments are woefully insufficient.

Questions

I don't have specific questions - the paper is written clearly. But scalability needs to be addressed. And demonstrating that the method works on only one dataset is completely unacceptable. The LEAP paper (one of the benchmarks), for example, has 3 datasets. The setup lends itself well to the time series modality. The authors themselves mention applicability to finance - they should consider pointing this machinery at that type of data. Medicine, or other settings with sensors, could work well too.

Comment

Q1: There is no attempt to show that this method scales to high dimensions

Thanks for raising this point. Following [1,2,3,4], we aim to learn a set of low-dimensional latent variables that generates high-dimensional observations. Therefore, we did not originally use high-dimensional latent vectors to evaluate our method.

In light of your suggestion, we validate our method's scalability to higher dimensions through experiments with $N \in \{8, 12, 18\}$, as shown in Table A. We observe that, even with higher dimensions, we still achieve better identification results than CaRiNG [4]. Furthermore, we demonstrate scalability in real-world scenarios with $N=20$ for the Volleyball dataset (as stated in Section D.2) and $N=12$ for SSv2, both achieving state-of-the-art results.

Table A: Ablation study results on the scalability of N

| Method | N=8 | N=12 | N=18 |
|---|---|---|---|
| CaRiNG | 0.574 ± 0.03 | 0.491 ± 0.02 | 0.428 ± 0.03 |
| InterLatent | 0.818 ± 0.01 | 0.655 ± 0.02 | 0.626 ± 0.01 |

These experimental results and discussions are detailed in Sections D.4 and D.5 of the paper revision.

Q2: real-world dataset experiments are woefully insufficient.

A: Our work's primary contribution is theoretical - establishing identifiability guarantees for intermittent temporal latent processes. While additional real-world experiments could provide further validation, evaluating identifiability on real-world data is inherently challenging due to the absence of ground-truth latent variables. This is a common challenge in the field; previous works like LEAP and TDRL primarily rely on semi-synthetic datasets (e.g., KiTTiMask, Mass-Spring System) where ground-truth latent variables are available, with CMU-Mocap being their only real-world application without latent ground truth.

Following your suggestion, we have expanded our real-world experiments. While our attempt to use the medical dataset from [5] was unsuccessful due to licensing constraints, we conducted additional experiments on the Something-Something V2 (SSv2) dataset [6] for action recognition. SSv2 contains 174 action categories of common human-object interactions. It includes 220,847 videos, with 168,913 in the training set, 24,777 in the validation set, and 27,157 in the test set. In each video sequence, there may be occlusion between human and object; thus, this dataset provides solid ground for our experiments. InterLatent adopts a pretrained ViT-B/16 [7] as the backbone to obtain $x_t$. Regarding the hyperparameters, we set $N = 12$ in Eq. 1. Also, we use the same two-phase training strategy as in our experiments on the Volleyball dataset.

To evaluate the efficacy of identifying intermittent temporal latent processes, we benchmark InterLatent against both causal representation learning methods (TDRL[3] and CaRiNG [4]) and state-of-the-art action recognition approaches (SViT [8], VideoMAE [9], CAST [10], StructVit [11]). The Top-1 accuracy results in Table B demonstrate that InterLatent outperforms all competing methods, validating its effectiveness.

Table B: Experimental results on SSv2

| Method | Top-1 |
|---|---|
| SViT | 65.8 |
| VideoMAE | 70.8 |
| CAST | 71.6 |
| StructVit | 71.5 |
| TDRL | 71.5 |
| CaRiNG | 72.0 |
| InterLatent | 72.7 |

[1] Scholkopf, et al. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021

[2] Yao, et al. Learning temporally causal latent processes from general temporal data. ICLR, 2022

[3] Yao, et al. Temporally disentangled representation learning. NeurIPS, 2022

[4] Chen, et al. Caring: Learning temporal causal representation under non-invertible generation process. ICML, 2024

[5] Levine, et al. Genome-wide association studies and polygenic risk score phenome-wide association studies across complex phenotypes in the human phenotype project. Med, 5(1):90–101, 2024

[6] Goyal, et al. The” something something” video database for learning and evaluating visual common sense. ICCV, 2017

[7] Radford, et al. Learning transferable visual models from natural language supervision. ICML, 2021

[8] Ben Avraham, et al. Bringing image scene structure to video via frame-clip consistency of object tokens. NeurIPS, 2022

[9] Tong, et al. Videomae: Masked autoencoders are data efficient learners for self-supervised video pre-training. NeurIPS, 2022

[10] Lee, et al. Cast: cross-attention in space and time for video action recognition. NeurIPS, 2024

[11] Kim, et al. Learning correlation structures for vision transformers. CVPR, 2024.

Comment

Still a bit light experimentally, but the overall idea is worth publishing, so I'm raising my score a notch.

Comment

Thank you so much for your updated rating. Also, we deeply appreciate your support and feedback that helps improve our work.

Review (Rating: 8)

This paper introduces a new class of discrete-time stochastic processes with a series of latent variables that are allowed to (i) vary over time, (ii) be uninformative to the observed data and/or the subsequent latent values, and (iii) be identifiable when informative. Along with defining this class of processes, the paper also proposes a variational inference method for learning and modeling them, which is able to adequately represent the latent and observed sequential values.

Strengths

I found the general proposed process to be simple yet powerful in practice. The authors were able to derive useful theoretical findings with minimal assumptions made on the underlying process, and the proposed model itself appears to be straightforward in design. I believe that this work is of interest in general to the graphical modeling community, with advances towards interpretable latent dynamics.

Weaknesses

While a lot of time was spent on the general class of processes, I feel like the modeling (section 4) was a bit rushed and could benefit from more explanations and commentary on the design decisions made. For instance, why was a variational approach chosen over a sampling-based one? It is clear some inference procedure is needed to account for $z$ values being latent, but not much discussion is given to justify the choices made here. Additionally, the sparsity regularization is thrown in as part of the loss without much discussion around it. I am assuming that this is to encourage latent values to be "missing" when possible but that is just speculation.

Questions

I would personally rebrand describing when the Jacobian results in a 0 as the corresponding latent value being "missing" to rather be described as "uninformative" or something similar. The reason being that the latent values are always missing / never observed. Should $p(x|z_1,z_2)=p(x|z_1)$, then that is a matter of independence rather than missingness. I am curious on your thoughts on this, or if I missed something concerning this.

Comment

Q1: For instance, why was a variational approach chosen over a sampling-based one? It is clear some inference procedure is needed to account for values being latent, but not much discussion is given to justify the choices made here.

A: Thank you for the comment. Following the previous works on causal representation learning [1,2,3,4,5], we chose the variational approach over sampling-based methods for the following reasons:

  1. Computational Efficiency: The variational approach allows for efficient training through gradient-based optimization, which is particularly important in our setting where we need to:

    • Handle both transition and mixing functions simultaneously;
    • Process temporal dependencies across multiple time steps;
    • Deal with varying support sets $s_t$ across time.
  2. Structured Posterior Approximation: Our variational framework naturally accommodates:

    • The conditional independence assumption required by Theorem 2;
    • The sparsity constraints on both mixing and transition functions;
    • The temporal dependencies in the latent process.

While sampling-based methods could potentially be used, they might face challenges in efficiently handling the sparsity constraints and temporal dependencies that are central to our framework. The variational approach allows us to directly incorporate these structural assumptions into the optimization objective through our modified ELBO loss (Equation 10).

Q2: Additionally, the sparsity regularization is thrown in as part of the loss without much discussion around it. I am assuming that this is to encourage latent values to be "missing" when possible but that is just speculation.

A: Thanks for your observation and comment. These terms are designed to encourage the discovery of missingness patterns, and we provide more clarity about their role in the following:

  1. Unsupervised Learning of Missingness: The sparsity regularization terms serve to:

    • Learn $s_t$ and $s^c_t$ in an unsupervised manner;
    • Encourage sparse Jacobian structures that align with our theoretical framework in Section 2;
    • Enable the model to discover which latent variables are inactive at different time steps;
  2. While this approach learns $s_t$ and $s^c_t$ in an unsupervised manner, our synthetic experiments in Section 4.1 demonstrate that it effectively recovers the true missingness patterns. The approach achieves higher Mean Correlation Coefficient scores compared to baselines that do not model missingness.

In light of your suggestion, we conduct further experiments with two new baselines: "WS", in which $s_t$ and $s^c_t$ are accessible to $f$ and $g$ during both training and inference, and "ES", which estimates $s_t$ using the Gumbel-Softmax trick. The results in the modified Figure 3 in the paper revision demonstrate that our method obtains MCC scores on par with the "WS" baseline. Also, InterLatent outperforms "ES" across all experiments. These findings suggest that we successfully approximate the true $s_t$ and $s^c_t$ using the sparsity regularization terms in Eq. 10.

Q3: I would personally rebrand describing when the Jacobian results in a 0 as the corresponding latent value being "missing" to rather be described as "uninformative" or something similar. The reason being that the latent values are always missing / never observed. Should $p(x|z_1,z_2)=p(x|z_1)$, then that is a matter of independence rather than missingness. I am curious on your thoughts on this, or if I missed something concerning this.

A: Thank you for this suggestion about terminology.

Our motivation for using "missing" rather than "uninformative" stems from two key aspects:

  1. The complete absence of influence (zero Jacobian entries in both f and g);
  2. The potential for these variables to become active/inactive over time in our nonstationary setting.

In a stationary sequence where $s_t$ remains invariant across time, latent variables that are permanently inactive could indeed be equally well described as "uninformative" or "missing." However, our framework also encompasses nonstationary sequences where the support set varies over time. In these cases, there may not exist any latent variables that are permanently inactive across all time steps. Therefore, we chose the term "missing" as it better captures the potentially temporary nature of inactivity and generalizes to both stationary and nonstationary settings.

We appreciate your suggestion and welcome further discussion on terminology that best serves the understanding of the intermittent temporal latent process.

Comment

[1] Khemakhem,et al.. Variational autoencoders and nonlinear ica: A unifying framework. In International conference on artificial intelligence and statistics, pp. 2207–2217. PMLR, 2020

[2] Kong, et al.. Partial disentanglement for domain adaptation. In Proceedings of the 39th International Conference on Machine Learning, 2022

[3] Yao, et al.. Learning temporally causal latent processes from general temporal data. In International Conference on Learning Representations, 2022.

[4] Yao, et al.. Temporally disentangled representation learning. Advances in Neural Information Processing Systems, 35:26492–26503, 2022

[5] Chen, et al.. Caring: Learning temporal causal representation under non-invertible generation process. In Forty-first International Conference on Machine Learning, 2024

Comment

Thank you for responding to my questions. After reading your reply and the other comments elsewhere in the thread, I have decided to maintain my original score. I enjoyed the ideas presented in the paper and hope to see them published soon.

Comment

Thank you for the time and effort you have dedicated to reviewing and providing feedback on our submission. We greatly appreciate your insights and help.

Review (Rating: 6)

This paper introduces InterLatent, a framework for learning latent variables under an intermittent generative process. Intermittence is defined as some variables being "switched off" in both the transition and generation processes. The authors include theoretical analysis that demonstrates identifiability of the latent variables up to permutation and non-linear scaling, and provide some interesting applications to realistic domains where the missingness assumption is useful.

Strengths

  • The idea of "switching off" variables from the generative process at certain time steps is very interesting.
  • The theoretical analysis establishes identifiability up to permutation and nonlinear scaling.
  • The paper introduces the idea very clearly and states its importance in contrast to other very recent works.
  • The proposed estimation method outperforms recent approaches and demonstrates applicability on realistic domains.

Weaknesses

Summary:

This paper presents an interesting exploration of both applications and theoretical results. However, I have identified potential technical limitations in terms of formulation, theoretical rigor, and estimation methods. My comments below are organised into Problem statement, Theory, and Estimation; they refer to sections 2, 3, and 4 in the main paper. I would be glad to reassess my score if the authors address the issues outlined here.

Problem setting:

  • Line 90: It is unclear if $f_n$ is truly invertible, given the setup described. If the noise variable's dimensionality matches the output variable's, and at least one parent is from $z_n$, then the input dimensionality may exceed the output's, making invertibility challenging. Could the authors clarify this statement or adjust the formulation to account for dimensionality constraints?
  • Line 98: Does the transition function $f^u$ have any equivalence to the previous $f_n$?
  • Line 99: "This implies that when $z^u_t$ is missing, it does not influence $z_{t-1}$, $z_{t+1}$, or $x_t$". How is it possible that $z^u_t$ would have any effect on $z_{t-1}$ in the first place, considering time moves forward? I believe the authors mean $z_{t-1}$ does not affect $z^u_t$ when the latter is missing, could this be clarified?
  • Line 101, Equations (2) and (3): The definitions of $s$ are not complementary. Note that NOT (a AND b) = (NOT a) OR (NOT b). In your Eqs. you have AND in both cases. Given missingness in Eq. (3) is what you want, the authors might want to reformulate Eq. (2).
  • Definition of Missingness: The paper could benefit from a more formal presentation of how missingness affects the injectivity of the mixing function. Since the dimensionality of input variables varies based on the cardinality of $s_t$, explicitly defining the generative process after establishing missingness might clarify the setup.

Theory:

  • Figure 2 and Injectivity under Missingness: Figure 2 introduces a mixing process affected by missingness, yet it is unclear how this interacts with the injectivity of the mixing function $g$. Specifically, if $g$ is injective at $t=1$ with $|s_t| = 2$, how would this property hold at $t=2$ with $|s_t| = 3$? Such cases seem to challenge the theoretical claims unless clarified.
  • Dimensionality of $d_t$: If $d_t$ is fixed, the above argument is not problematic. However, Figure 2 suggests that $d_t$ can vary, making the injectivity claim potentially problematic. Could the authors specify this constraint in the theoretical statements if $d_t$ is indeed fixed?
  • Possible Extension to Time-Dependent $d_t$: One way I can think of to incorporate a time-dependent $d_t$ is to introduce a mixture distribution for $x_t$, where each mixture component allows different values of $d_t$. You could incorporate results from identifiability in mixture models for sequences, such as switching dynamical systems, by conditioning on the mixture component at each time step. However, implementing this change would likely require substantial modifications to the theoretical framework.
  • Validity of Assumptions: It would be helpful if the authors could provide empirical or theoretical justifications for the assumptions (i-iii) in Theorem 1, specifically how they ensure consistency with the InterLatent model.

Estimation:

  • Estimating $s_t$: The inference process feels incomplete due to the absence of information on how to estimate $s_t$, which is central to the framework's operation. This aspect isn't fully explained in the main text, and more details on how $s_t$ is computed or estimated would greatly clarify the estimation procedure.
  • Framework Adjustments in Figure 2: Given that missingness affects the data generation process by altering the mixing function, it would be useful to understand how this variation is incorporated into the learning method. Could the authors expand on this?
  • Clarifying the Role of Sparsity Regularization (Eq. 10): If the authors intend for sparsity regularization to automatically account for missingness, this point could benefit from a clear explanation. Without an explicit estimate of $s_t$, it's challenging to understand the approach used for computing Eqs. (8) and (9). Some added details could help readers follow the inference method more easily.

Questions

Below are some miscellaneous comments:

  • line 59 typo: “... has yet to fully addressed these challenges.
  • Consider using \citet instead of \citep in some cases. Examples:
    • line 59: “(Wiedemer et al., 2024) relies on … “
    • line 60: “(Lachapelle et al., 2023; Fumero et al., 2023; Xu et al., 2024) are restricted to linear … “
  • Line 64: Would it be possible to briefly define block-wise identifiability? Similarly for component-wise identifiability.
  • Lines 144-150: Can you define the domain for $h$ in both cases?
  • Line 186: I believe you refer to pdf instead of cdf in both cases.
  • Eq (7). With LeakyReLU(MLP(x)) you can get negative covariances. Is this expected?
  • Line 263: Typo in $\hat{z}_t | x_t$
Comment

miscellaneous comments:

Q: line 59 typo: ``... has yet to fully addressed these challenges.''

A: We have corrected the typo to ``The existing literature has yet to fully address these challenges''.

Q: Consider using \citet instead of \citep in some cases. Examples: 1. line 59: "(Wiedemer et al., 2024) relies on …"; 2. line 60: "(Lachapelle et al., 2023; Fumero et al., 2023; Xu et al., 2024) are restricted to linear …"

A: Thanks for pointing them out for us. We have replaced \citep with \citet to comply with the ICLR 2025 template.

Q: Line 64: Would it be possible to briefly define block-wise identifiability? Similarly for component-wise identifiability.

A: Block-wise identifiability means that given a block of true latent variables, there exists a unique partitioning of the estimated latent variables that matches the true block up to a permutation and invertible transformation. Similarly, component-wise identifiability means that each individual component of the latent variables can be recovered up to a permutation and an invertible transformation.

Please refer to Definitions 2 and 3 for their formal definitions.

Q: Lines 144-150: Can you define the domain for $h$ in both cases?

A: The domain of block-wise identifiability is $z_t^\mathcal{B} \in \mathbb{R}^{|\mathcal{B}|}$, which is the subset of values that the block of true latent variables can take. $|\mathcal{B}|$ is the cardinality of the block.

Similarly, the domain of component-wise identifiability is $z_t \in \mathbb{R}^N$, which is a single latent variable.

Q: Line 186: I believe you refer to pdf instead of cdf in both cases.

A: Thanks for your observation and pointing it out for us. Yes, we meant to use pdf instead of cdf. This typo has been corrected in the paper.

Q: Eq (7). With LeakyReLU(MLP(x)) you can get negative covariances. Is this expected?

A: Thanks for your question. In practice, the output of Eq. 7 is the mean and the logarithm of the variance. We use $\sigma$ in Eq. 7 to be consistent with $\mu$.
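For instance, a common parameterization (a generic sketch, not necessarily the paper's exact implementation) predicts the log-variance, so the variance is positive by construction even when the network output is negative:

```python
import torch
import torch.nn as nn

class GaussianEncoderHead(nn.Module):
    """Generic sketch: predict mean and log-variance, then reparameterize."""
    def __init__(self, hidden_dim, latent_dim):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, h):
        mu, logvar = self.mean(h), self.logvar(h)
        std = torch.exp(0.5 * logvar)           # positive even if logvar < 0
        z = mu + std * torch.randn_like(std)    # reparameterization trick
        return z, mu, logvar
```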

Q: Typo

A: Thanks for pointing it out. We have corrected the typo to $\hat{z}_t | x_t$.

[1] Yao, et al.. Learning temporally causal latent processes from general temporal data. In International Conference on Learning Representations, 2022.

[2] Yao, et al.. Temporally disentangled representation learning. Advances in Neural Information Processing Systems, 35:26492–26503, 2022

[3] Chen, et al.. Caring: Learning temporal causal representation under non-invertible generation process. In Forty-first International Conference on Machine Learning, 2024

[4] Lachapelle, et al.. Additive decoders for latent variables identification and cartesian-product extrapolation. Advances in Neural Information Processing Systems, 36, 2024

[5] Zheng, et al.. On the identifiability of nonlinear ica: Sparsity and beyond. Advances in neural information processing systems, 2022.

[6] Zheng, et al.. Generalizing nonlinear ica beyond structural sparsity. Advances in Neural Information Processing Systems, 36, 2023

[7] Zhang, et al. Causal representation learning from multiple distributions: A general setting. In Forty-first International Conference on Machine Learning, 2024

Comment

Summary: Thank you very much for addressing my concerns. I still find the injectivity of the mixing function a bit confusing, so I hope we could clarify some points if the discussion period allows it. I am raising my score to 6 given your efforts to improve clarity, especially the modifications to the Estimation section.

Problem setting: Thank you for clarifying the concerns regarding the injectivity of the mixing function. I am hoping we could continue the discussion to clarify some points regarding this, as I still consider that your main text (particularly the Problem setting section) should address this for improved clarity.

  • I understand that you are working in an undercomplete setting where the observations lie in a higher-dimensional space compared to the latents. However, I try to think about injectivity in the following way, where $\dim(x_t) = \dim(z_t)$, which in other words implies that $x_t$ can be represented by a lower-dimensional manifold which is $z_t$. I believe this is the standard point of view on this type of identifiability problem.

  • Now, when de-activating latent variables, my thought process tells me I am removing information from $z_t$, and therefore $\dim(z_t)$ is lower given a de-activation, but $\dim(x_t)$ remains the same because $g$ is maintained according to Eq. (1). I can see that this is probably not a good point of view.

  • From your rebuttal, I understand that when there's missingness, the model forces the Jacobian of $g$ to zero out for the missing variable at that time step, which maintains the injectivity of $g$. I agree that this is reasonable, but this basically implies that your mixing function $g$ is time-dependent. My confusion comes from this part. Would it be possible to re-formulate Eq. (1) in the Problem Setting after the definition of the missingness mechanism? Otherwise, I believe readers might have similar confusions as mine. For example, in your updated rebuttal the sparsity term in Eq. (10) for $\hat{g}$ seems to incorporate this time dependence of the mixing function.

  • I appreciate your efforts to reply to Q9.

Estimation: I believe the updated rebuttal describes your method clearly now. Thank you for addressing this, as from the main text it was not clear how $s_t$ was being treated. (There's a typo in line 267 in $p_{\gamma}$.)

Re Questions:

Thank you for addressing the questions. Let me clarify some of the points I made here:

Q: Line 64: Would it be possible to briefly define block-wise identifiability? Similarly for component-wise identifiability.

Q: Lines 144-150: Can you define the domain for in both cases?

By this I am not interested particularly in the answer itself, but I was rather asking if you could make that clear in the text for clarity. I am saying this because when reading the introduction, the concept of block-wise identifiability pops out without prior explanation, and I believe this can be confusing to some readers at this conference. Same for Lines 144-150. Apologies if this was not clear.

Comment

Estimation:

Q10: Estimating $s_t$: The inference process feels incomplete due to the absence of information on how to estimate $s_t$, which is central to the framework's operation. This aspect isn't fully explained in the main text, and more details on how $s_t$ is computed or estimated would greatly clarify the estimation procedure.

Q12: Clarifying the Role of Sparsity Regularization (Eq. 10): If the authors intend for sparsity regularization to automatically account for missingness, this point could benefit from a clear explanation. Without an explicit estimate of $s_t$, it's challenging to understand the approach used for computing Eqs. (8) and (9). Some added details could help readers follow the inference method more easily.

A: Given the overlap between Q10 and Q12, we would like to answer them together.

To address your questions, we would like to first clarify that we use a similar sparsity-constraint strategy to [5,6,7], which does not require estimating $s_t$. The reasons are:

  1. Our key contribution lies in how we formulate the identifiability of intermittent sequences as a support sparsity problem. Rather than treating $s_t$ as a variable to estimate in our data-generating process in Eq. 1, we characterize missingness through the Jacobian structures of both the mixing and transition functions (Eqs. 2 and 3).

  2. The sufficient variability assumption (Theorem 1.iii) is defined on the Hessian matrices of log-transition probabilities, and the conditional independence assumption (Theorem 2.i) is specified directly on $z_t$. In conclusion, our work allows us to identify $z_t$ without explicitly estimating $s_t$, because neither the data-generating process nor any assumption requires estimating $s_t$ in its formulation.

Given our support sparsity formulation, the sparsity regularization terms in Eq. 10 directly enforce the Jacobian structures that characterize missingness. According to Theorem 1, sparsity in the Jacobians of both the mixing and transition functions is sufficient for block-wise identifiability of $z_t$. This means we only need to enforce the support sparsity structure in Eqs. 2 and 3 through regularization on the Jacobians to learn $z_t$. In other words, explicit estimation of $s_t$ is not needed, since we can identify $z_t$ through the Jacobians of $g$ and $f$.

If we would like to estimate $s_t$ for some task, that is entirely possible. We summarize our proposition as follows. We propose to reformulate the support sparsity as the temporal support sparsity $\mathbb{E}(\hat{d}_{1:T}) \leq \mathbb{E}(d_{1:T})$ in Proposition 1 of the paper revision, which requires computing $p(s_t)$ in our identifiability results. The detailed proof is in Sec. B.3. In Sec. B.4, we estimate $s_t$ using the Gumbel-Softmax trick. Accordingly, our ELBO has been modified as in Eq. 28 in the paper revision.

We also conduct experiments with a baseline using temporal support sparsity, namely "ES". The results on the synthesized dataset are shown in the modified Figure 3 in the paper revision. We observe that InterLatent outperforms the "ES" baseline across all settings.
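To illustrate what such a Gumbel-Softmax-based estimate of $s_t$ could look like, here is a minimal PyTorch sketch; the logit network, temperature, and the way the mask would be consumed downstream are assumptions, not the exact module used for the "ES" baseline:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupportEstimator(nn.Module):
    """Hypothetical module producing a relaxed binary support mask s_t from x_t."""
    def __init__(self, obs_dim, latent_dim, tau=0.5):
        super().__init__()
        self.logit_net = nn.Linear(obs_dim, latent_dim)   # per-latent "active" logits
        self.tau = tau

    def forward(self, x_t, hard=True):
        logits = self.logit_net(x_t)                                   # (B, N)
        # Two-class Gumbel-Softmax over {active, inactive} per latent dimension;
        # hard=True gives a straight-through binary mask with relaxed gradients.
        two_class = torch.stack([logits, torch.zeros_like(logits)], dim=-1)
        probs = F.gumbel_softmax(two_class, tau=self.tau, hard=hard, dim=-1)
        return probs[..., 0]                                           # estimated mask s_t
```

The estimated mask could then gate the latent variables, e.g. `z_masked = s_t * z_t`, before they enter the transition prior and decoder.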

Q11: Framework Adjustments in Figure 2: Given that missingness affects the data generation process by altering the mixing function, it would be useful to understand how this variation is incorporated into the learning method. Could the authors expand on this?

A: Our learning method InterLatent captures the variation of the data-generating process across time through the sparsity regularization terms in Equation 10, which encourage the model to learn the appropriate Jacobian sparsity patterns at each time step $t$. This design allows the model to discover $s_t$ and $s^c_t$ in an unsupervised way by learning which Jacobian entries should be zero (indicating missingness) versus non-zero (indicating active influence).

Therefore, while the functions themselves remain fixed, InterLatent learns which latent variables are influential at each time step through the learned Jacobian structure. This aligns with our theoretical framework where missingness is characterized through Jacobian patterns rather than through modifications to the underlying functions.

Comment

Q9: Validity of Assumptions: It would be helpful if the authors could provide empirical or theoretical justifications for the assumptions (i-iii) in Theorem 1, specifically how they ensure consistency with the InterLatent model.

A: We organize our response in the following:

  1. Justifications of the assumptions in Theorem 1:
  • Smoothness and positivity properties are essential for applying the change of variables formula in Eq. 14 and deriving the Hessian relationship in Eq. 16. Without twice differentiability, we cannot establish the key relationship between true and estimated transition probabilities. Positivity ensures the probability densities are well-defined throughout the latent space.

  • Since any real space $\mathbb{R}^N$ is naturally path-connected, both $J_{h^{-1}}(\hat{z}_t)$ and $J_{h^{-1}}(\hat{z}_{t-1})$ inherit this property. This ensures the permutation matrix in Eq. 18 remains invariant for all $t \in [1,T]$. While this assumption is naturally satisfied in our $\mathbb{R}^N$ setting, we make it explicit in Theorem 1 following [4] to ensure mathematical rigor.

  • Sufficient variability ensures that the span of Hessian matrices of log-transition probabilities covers the full space of possible transitions within the support $s_t$. This is crucial for establishing Eqs. 17-19 in our proof. It allows us to uniquely identify the block structure of the transformation $h$ through the relationship between the true and estimated transition probabilities, leading to block-wise identifiability.

  2. The design of InterLatent:

In this work, we focus on the identifiability theory, which aims to study under what conditions the underlying data-generating process can be recovered with guarantees. Thus, it is inherently estimator-agnostic.

What InterLatent implements are the structural assumptions from the data generating process (Eq. 1), using the ELBO to approximate observational equivalence. More specifically, the encoder acquires latent causal representations by inferring $q_\omega(\hat{z}_t|x_t)$ from observations. These learned latent variables are then used by the decoder $p_\gamma(\hat{x}_t|\hat{z}_t)$ to reconstruct the observations, implementing the mixing function $g$ in Eq. 1. To learn the latent variables, we constrain them through the KL divergence between their posterior distribution and a prior distribution, which is estimated using a normalizing flow that converts the prior into Gaussian noise in Eq. 8. For the ELBO in Eq. 10, $L_\text{Recon}$ measures reconstruction quality between the ground truth and the reconstructed observations from the decoder; $L_\text{KLD}$ enforces the learned posterior of $\hat{z}_t$ to match the temporal prior distribution of $z_t$; and the sparsity regularization terms ($\|J_{\hat{g},t}\|_{2,1}$, $\|J_{\hat{f},t}\|_{1,1}$, $\|J_{\hat{f},t}\|_{2,1}$) implement the support sparsity by promoting sparsity in both the decoder and transition function Jacobians.

  3. Data synthesis under the assumptions of Theorem 1:

Since the assumptions in Theorem 1 are made on the true $z_t$ across time, we enforce these assumptions in our synthesized data:

  • We implement the transition function $f$ as $f(z_{t-1}, \epsilon_t) = z_{t-1} \cdot \sinh(\epsilon_t)$, where $\epsilon_t \sim N(0, 0.1)$ enters non-additively through multiplication. For missing components, the transition function is $f(\epsilon_t) = \sinh(\epsilon_t)$, which is both infinitely differentiable and invertible. Initial states are drawn from $z_0 \sim U(0,1)$, ensuring positive measure. The mixing function $g$ is implemented as $g(z) = \sinh(z)$. These functions ensure the twice-differentiability requirement (a minimal generator sketch is given after this list).

  • In our data-generating process, the functions $f(z_{t-1}, \epsilon_t) = z_{t-1} \cdot \sinh(\epsilon_t)$ and $g(z) = \sinh(z)$ preserve this property, being continuous mappings between the real spaces of $z_t$. Therefore, path-connectedness is guaranteed.

  • Sufficient variability assumption: the transition function $f$ ensures sufficient variability through the strict monotonicity of $\sinh$ over $\mathbb{R}^N$. For variables in the support, multiplication with $z_{t-1}$ provides rich transitions, while the nonlinear $\sinh$ transformation ensures the Hessian has full rank over $\mathbb{R}^{d_t\times d_t}$. Additionally, both $f$ and $g$ are invertible through $\operatorname{arcsinh}$, ensuring unique recovery of both latent states and noise terms.
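For concreteness, below is a minimal NumPy sketch of the generating process described above; the latent dimension, sequence length, missingness schedule, and the reading of $N(0, 0.1)$ as a standard deviation are illustrative assumptions rather than the exact setup used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_sequence(N=3, T=10, missing=None):
    """Toy generator: z_0 ~ U(0,1); active dims follow z_t = z_{t-1} * sinh(eps_t),
    missing dims follow z_t = sinh(eps_t) with eps_t ~ N(0, 0.1); observations are
    x_t = sinh(z_t). `missing` maps a time step to the set of missing latent indices."""
    missing = missing or {}
    z = rng.uniform(0.0, 1.0, size=N)
    zs, xs = [], []
    for t in range(1, T + 1):
        eps = rng.normal(0.0, 0.1, size=N)       # noise scale interpreted as std (assumption)
        z_next = z * np.sinh(eps)                # transition for active components
        for j in missing.get(t, set()):
            z_next[j] = np.sinh(eps[j])          # missing components ignore z_{t-1}
        z = z_next
        zs.append(z.copy())
        xs.append(np.sinh(z))                    # mixing function g(z) = sinh(z)
    return np.stack(zs), np.stack(xs)

# Example: the third latent is missing during the first five time steps
z_seq, x_seq = generate_sequence(missing={t: {2} for t in range(1, 6)})
```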

评论

Theory:

Q6: Figure 2 and Injectivity under Missingness: Figure 2 introduces a mixing process affected by missingness, yet it is unclear how this interacts with the injectivity of the mixing function $g$. Specifically, if $g$ is injective at $t=1$ with $|s_t| = 2$, how would this property hold at $t=2$ with $|s_t| = 3$? Such cases seem to challenge the theoretical claims unless clarified.

A: Our framework maintains the injectivity of $g$ through an undercomplete setting where:

  • The dimension of the observations $K$ is greater than or equal to the full latent dimension $N$
  • That is, $K \geq N$ for all $t$
  • This ensures $g: \mathbb{R}^N \rightarrow \mathbb{R}^K$ remains injective regardless of which latent variables are in $s_t$ or $s^c_t$.

In the specific example mentioned ($t = 1$ with $|s_t| = 2$ vs. $t = 2$ with $|s_t| = 3$):

  • The injectivity of $g$ is preserved because $K \geq N$ holds throughout
  • The varying size of $s_t$ does not affect the injectivity of $g$, since missingness only removes elements from its domain and does not change elements of its codomain
  • The missingness mechanism (through zero Jacobian columns) determines which latent variables influence the output, but does not affect the injectivity of $g$.

In light of your suggestion, we have stated in the paper that ``In this work, we work on the undercomplete case, where $K\geq N$, to ensure the injectivity of $g$''.
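As a toy numerical illustration of this point (our own construction with an assumed linear mixing, not the paper's $g$), injectivity under $K \geq N$ depends only on the mixing function having a full column-rank Jacobian, not on how many latent variables are active:

```python
import numpy as np

# Undercomplete toy mixing: K = 3 observations, N = 2 latents (illustrative values).
# g(z) = A z is injective whenever A has full column rank, independent of |s_t|.
A = np.array([[1.0, 0.5],
              [0.2, 1.0],
              [0.3, 0.7]])
g = lambda z: A @ z

assert np.linalg.matrix_rank(A) == 2     # full column rank, so g is injective

# Missingness only zeroes coordinates of z (elements of the domain);
# distinct latent vectors still map to distinct observations.
z_a = np.array([0.4, 0.0])               # second latent "missing" (inactive)
z_b = np.array([-0.1, 0.0])
assert not np.allclose(g(z_a), g(z_b))
```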

Q7: Dimensionality of $d_t$: If $d_t$ is fixed, the above argument is not problematic. However, Figure 2 suggests that $d_t$ can vary, making the injectivity claim potentially problematic. Could the authors specify this constraint in the theoretical statements if $d_t$ is indeed fixed?

A: We suspect that our statement ``we maintain constant values of $d_t$'' in Section 5.1 left the impression that $d_t$ is fixed. However, we do not fix $d_t$ for our identifiability results.

In our work, the identifiability results in Theorem 1 allow $d_t$ to vary across time, as long as the sufficient variability assumptions hold. Therefore, Figures 1 and 2 are correct. The injectivity of the mixing function $g$ holds as long as the data generating process is undercomplete, i.e., $N \leq K$.

In light of your suggestion, we recognize that further experiments with varying $d_t$ within a sequence are needed. Thus, we also conduct experiments on newly synthesized data in which $d_t$ varies across time. In particular, we synthesize another 10,000 sequences with a procedure identical to Section D.1. The only difference is that we introduce the missingness by choosing $d_t=1$ for $t\in[1, 5]$ and $d_t=2$ for $t\in[5, 9]$. The experimental results have been added as Figure 6 in the paper revision. Despite WS having access to the ground-truth values of $s_t$ and $s^c_t$ across all time steps, InterLatent achieves comparable performance, ranking second best among all comparisons. This demonstrates the effectiveness of applying sparsity regularization to the Jacobians of both functions $f$ and $g$ in Eq. 10.

Also, we have removed the description ``Additionally, we maintain constant values of $d_t$ and $d_t^c$, respectively, across all time steps $t = 1, \ldots, T$.'' from the paper to avoid any confusion.

Q8: Possible Extension to Time-Dependent $d_t$: One way I can think of incorporating a time-dependent $d_t$ is to introduce a mixture distribution for $x_t$, where each mixture component allows different values of $d_t$. You could incorporate results from identifiability in mixture models for sequences, such as switching dynamical systems, by conditioning on the mixture component at each time step. However, implementing this change would likely require substantial modifications to the theoretical framework.

A: Notably, our theoretical framework allows $d_t$ to vary across time steps, as discussed above.

By extending $d_t$, we understand that you are interested in whether $s_t$ can be estimated, since $d_t$ is the cardinality of $s_t$. While incorporating switching dynamical systems as suggested would be interesting, it presents a fundamentally different challenge. Such an approach would require the simultaneous identification of both $s_t$ and $z_t$ in the data generating process, representing a substantial modification of our current framework. To address the core question of estimating $s_t$, we have added Proposition 1 to the appendix in the paper revision, which provides theoretical guarantees for estimating $s_t$ by leveraging temporal support sparsity.

评论

Problem setting:

Q1: Line 90: It is unclear if $f_n$ is truly invertible, given the setup described. If the noise variable's dimensionality matches the output variable's, and at least one parent is from $z_n$, then the input dimensionality may exceed the output's, making invertibility challenging. Could the authors clarify this statement or adjust the formulation to account for dimensionality constraints?

A: Thank you for the observation and for pointing this out. When we refer to the invertibility of $f_n$, we specifically mean the ability to recover the noise term $\epsilon^n_t$ from $z^n_t$ through $f^{-1}_n$. Formally, for all $j\in\hat{s}_t$, we formulate the prior module as $\hat{\epsilon}^j_{t}=\hat{f}^{-1}_j(\hat{z}_{t}^j \mid \hat{z}_{t-1})$, which is used in our normalizing flow-based prior estimation to obtain Eq. 8. We discuss this in Section 4.1, ``Temporal Prior Estimation''. This is different from requiring $f_n$ to be invertible as a mapping in Eq. 1.

A similar approach can be found in [1,2,3]. In light of your suggestion, we have removed ``$f_n$ is invertible'' from our paper to avoid any confusion.
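To illustrate the idea, here is a minimal sketch (our own simplification, not the paper's module) in which an affine transform conditioned on $\hat{z}_{t-1}$ plays the role of $\hat{f}^{-1}_j$, and the prior log-density follows from the change of variables; the architecture and names are assumptions.

```python
import math
import torch
import torch.nn as nn

class TemporalFlowPrior(nn.Module):
    """Sketch of a flow-based temporal prior. A conditioner maps z_{t-1} to a
    per-dimension shift and log-scale, so eps_hat_t^j = (z_t^j - shift_j) * exp(-log_scale_j)
    acts as f_hat_j^{-1}(z_t^j | z_{t-1}). Illustrative only."""
    def __init__(self, latent_dim, hidden=32):
        super().__init__()
        self.cond = nn.Sequential(nn.Linear(latent_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 2 * latent_dim))

    def log_prob(self, z_t, z_prev):
        shift, log_scale = self.cond(z_prev).chunk(2, dim=-1)
        eps_hat = (z_t - shift) * torch.exp(-log_scale)                    # estimated noise terms
        log_p_eps = -0.5 * eps_hat.pow(2) - 0.5 * math.log(2 * math.pi)    # standard normal density
        # change of variables: log p(z_t | z_{t-1}) = log p(eps_hat) + log |d eps_hat / d z_t|
        return (log_p_eps - log_scale).sum(dim=-1)

# Usage: prior term for the KL divergence in the ELBO (batch of 4, latent dim 3)
prior = TemporalFlowPrior(latent_dim=3)
z_t, z_prev = torch.randn(4, 3), torch.randn(4, 3)
log_p = prior.log_prob(z_t, z_prev)        # shape (4,)
```

A more expressive normalizing flow could replace the affine transform; the change-of-variables computation stays the same.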

Q2: Line 98: Does the transition function $f^u$ have any equivalence to the previous $f_n$?

A: Thank you for pointing this out. We apologize for the typo: $f^u$ should be $f_n$ in line 98. The typo has been corrected in the paper.

Q3: Line 99: "This implies that when $z^u_t$ is missing, it does not influence $z_{t-1}$, $z_{t+1}$, or $x_t$". How is it possible that $z^u_t$ would have any effect on $z_{t-1}$ in the first place, considering time moves forward? I believe the authors mean $z_{t-1}$ does not affect $z^u_t$ when the latter is missing, could this be clarified?

A: Thank you for bringing this to our attention, and we apologize for the confusion. What we intended to express is that $z^u_t$ is not influenced by $z_{t-1}$, and it does not influence $z_{t+1}$ or $x_t$.

To clarify, we have revised our paper at Line 101 to state: ``This implies that when $z^u_t$ is missing, it neither receives influence from $z_{t-1}$ nor exerts influence on $z_{t+1}$ or $x_t$ in the data generation process.''

Q4: Line 101, Equations (2) and (3): The definitions of $s$ are not complementary. Note that NOT (a AND b) = (NOT a) OR (NOT b). In your Eqs. you have AND in both cases. Given missingness in Eq. (3) is what you want, the authors might want to reformulate Eq. (2).

A: Thank you for your observation. The AND operators $\wedge$ in both equations are essential for our data generating process: for both the support $s_t$ and the missingness $s^c_t$, we require all three conditions to be met to consider a latent variable fully active in both the observation and the temporal dynamics. To avoid confusion, we have removed the word ``complement''.

Q5: Definition of Missingness: The paper could benefit from a more formal presentation of how missingness affects the injectivity of the mixing function. Since the dimensionality of input variables varies based on the cardinality of $s_t$, explicitly defining the generative process after establishing missingness might clarify the setup.

A: In this work, the injectivity of the mixing function holds for the undercomplete case, where the dimension of the observations is at least as large as that of the latent variables for all time steps.

Since the number of observed variables exceeds that of latent variables in our setting, it is natural to assume that our mixing function is injective. At the same time, since the missingness influences only the latent variables, it can only remove/deactivate elements in the domain of the mixing function, and thus the injectivity is always preserved.

评论

Q: Time-dependent mixing function

A: Thank you for your positive feedback; we are also grateful for your further questions, which help us improve the clarity of our paper.

According to your insightful suggestion, we have highlighted the following after introducing the missingness in the problem setting:

``The mixing function $g$ itself remains unchanged across time steps; only its Jacobian structure varies based on the missingness in $z_t$. To illustrate this, we give an example with our implementation of synthesized data. Let $g(z_t) = \sinh(z_t)$ be our mixing function. When certain components of $z_t$ are missing, the corresponding columns in the Jacobian become zero, but $g$ remains the same function. For instance, let $N=2$ and $K=2$; if $z_t^2$ is missing at time $t$, the Jacobian of the mixing function is $$\frac{\partial g}{\partial z_t} = \begin{bmatrix} \frac{\partial x_t^1}{\partial z_t^1}=\cosh(z_t^1) & 0 \\ \frac{\partial x_t^2}{\partial z_t^1}=\cosh(z_t^1) & 0 \end{bmatrix}.$$''

We hope this could help to avoid potential confusion. Please feel free to let us know if you have any further suggestions. If needed, we are always more than happy to make any further changes.

Q: Typos and introduction revision:

A: In light of your suggestions, we have incorporated clearer definitions of block-wise and component-wise identifiability, along with definitions of their domains, into our paper revision. We appreciate your suggestions that helped improve the clarity of these fundamental concepts.

审稿意见
8

The work proposes a method for identifying latent causal variables for observed time sequences, motivated by sparsity of the causal connections. The identifiability of the latent variables is shown. Unlike most previous work, the considered setting allows the support of the mixing function to change over time.

优点

The considered setting is interesting and well-motivated.

缺点

  1. What does $\mathcal{Z}$ refer to in condition (ii) of Theorem 1? Specifically, is $\mathcal{Z}$ a fixed subset of the support of $\mathbf{z}_{t}$ or of the estimate? Additionally, where exactly is condition (ii) used in the proof? Finally, why isn't the support sparsity assumption included explicitly in the statement of Theorem 1?

  2. There should be more in-depth discussion on condition (iii) and the support sparsity assumption in Theorem 1, as these are critical for the identifiability results. Currently, it is unclear how strong these assumptions are. Including simple examples where condition (iii) holds naturally could be helpful.

  3. The presentation of Section 4.1 and 4.2 needs to be improved. The relationships between the components in Section 4.1, the loss function, and the illustration in Figure 2 are currently difficult to follow. The rationale behind the loss design is not clearly explained.

问题

See weaknesses 1 and 2

评论

Q3: The presentation of Section 4.1 and 4.2 needs to be improved. The relationships between the components in Section 4.1, the loss function, and the illustration in Figure 2 are currently difficult to follow. The rationale behind the loss design is not clearly explained.

A: Thank you for this valuable feedback.

  1. The relationships between components in Section 4.1:

The architecture of InterLatent comprises three key components. The encoder acquires latent causal representations by inferring $q_\omega(\hat{z}_t|x_t)$ from observations. These learned latent variables are then used by the decoder $p_\gamma(\hat{x}_t|\hat{z}_t)$ to reconstruct the observations, implementing the mixing function $g$ in Eq. 1. To learn the latent variables, we constrain them through the KL divergence between their posterior distribution and a prior distribution, which is estimated using a normalizing flow that converts the prior into Gaussian noise in Eq. 8.

  2. The rationale behind the loss design

The ELBO loss in Eq. 10 approximates Eq. 6 to implement the observational equivalence in Eq. 1. $L_\text{Recon}$ measures reconstruction quality between the ground truth and the reconstructed observations from the decoder; $L_\text{KLD}$ enforces the learned posterior of $\hat{z}_t$ to match the temporal prior distribution of $z_t$; and the sparsity regularization terms implement the support sparsity by promoting sparsity in both the decoder and transition function Jacobians. Therefore, by optimizing Eq. 10, we obtain a $\hat{z}_t$ that satisfies our identifiability results for the intermittent temporal latent process. (A schematic sketch of this composition is given at the end of this response.)

  3. Revision of the caption of Figure 2

In light of your suggestions, we have incorporated the previous discussions into our revision. We have also rewritten the caption of Fig. 2 as follows: ``The overall framework of InterLatent consists of: (1) an encoder that maps observations $x_t$ to latent variables $\hat{z}_t$ ($t\in[1,T]$), (2) a decoder that reconstructs observations $\hat{x}_t$ ($t\in[1,T]$) from $\hat{z}_t$, and (3) a temporal prior estimation module that models the transition dynamics between latent states. We train InterLatent with $L_\text{Recon}$ along with $L_\text{KLD}$. $\hat{\epsilon}_t$ ($t\in[1,T]$) denotes the estimate of the true noise terms $\epsilon_t$ ($t\in[1,T]$).''
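As a schematic summary of the loss rationale discussed above (a sketch under assumed weights and helper names, not the released implementation), the objective in Eq. 10 could be composed as follows:

```python
import torch.nn.functional as F

def interlatent_objective(x_t, x_hat_t, kld_term, J_g_hat, J_f_hat,
                          lam_g=1.0, lam_f1=1.0, lam_f2=1.0):
    """Schematic composition of Eq. 10. The Jacobians J_g_hat (K x N, decoder) and
    J_f_hat (N x N, transition) and the weights lam_* are illustrative assumptions."""
    L_recon = F.mse_loss(x_hat_t, x_t)                      # reconstruction quality
    L_kld = kld_term                                        # posterior vs. flow-based temporal prior
    L_sparse = (lam_g * J_g_hat.norm(p=2, dim=0).sum()      # ||J_g_hat,t||_{2,1}
                + lam_f1 * J_f_hat.abs().sum()              # ||J_f_hat,t||_{1,1}
                + lam_f2 * J_f_hat.norm(p=2, dim=0).sum())  # ||J_f_hat,t||_{2,1}
    return L_recon + L_kld + L_sparse
```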

[1] Lachapelle, et al.. Additive decoders for latent variables identification and cartesian-product extrapolation. Advances in Neural Information Processing Systems, 36, 2024.

评论

The clarifications have addressed my concerns, and I will adjust my score accordingly.

评论

We would like to express our gratitude for your constructive feedback and approval of our work. Your comments have been invaluable. We believe that incorporating your suggestions has significantly enhanced the quality of our submission. Thank you for your support again.

评论

Q1: What does $\mathcal{Z}$ refer to in condition (ii) of Theorem 1? Specifically, is $\mathcal{Z}$ a fixed subset of the support of $\mathbf{z}_t$ or of the estimate? Additionally, where exactly is condition (ii) used in the proof? Finally, why isn't the support sparsity assumption included explicitly in the statement of Theorem 1?

A: In our theorem, $\mathcal{Z}$ denotes $\mathbb{R}^N$, the space where all latent variables $z_t$ ($t\in[1,T]$) reside. Since any real space is naturally path-connected, $J_{h^{-1}}(z_t)$ and $J_{h^{-1}}(z_{t-1})$ inherit this property from $\mathcal{Z}$, and the proof is based on it. For instance, the permutation matrix in Eq. 18 remains invariant for all $t\in[1,T]$; otherwise, the path-connectedness of $\mathcal{Z}$ would be violated. Following [1], we make this assumption explicit in Theorem 1 to ensure mathematical rigor.

We formulate the support sparsity through the sparsity constraint $\hat{d}_t\leq d_t$ in our proof of Theorem 1. We did not include it in Theorem 1 because we treat the support sparsity as a regularization rather than as an explicit assumption. In light of your suggestion, we have added it to the statement of Theorem 1 for better clarity as follows: (Support sparsity regularization) for any time step $t$, $s_t$ is not an empty set and $\hat{d}_t \leq d_t$.

Q2: There should be more in-depth discussion on condition (iii) and the support sparsity assumption in Theorem 1, as these are critical for the identifiability results. Currently, it is unclear how strong these assumptions are. Including simple examples where condition (iii) holds naturally could be helpful.

A: Thank you for the question. To evaluate the strength of condition (iii) and support sparsity, we analyze them for establishing block-wise identifiability:

  1. Condition (iii):
  • The sufficient variability assumptions guarantee unique transition patterns for latent variables within their respective supports. This uniqueness enables the crucial connection between $s_t$ and $\hat{s}_t$ through Eq. 17.

  • Without sufficient variability, the span conditions would exhibit rank deficiency, making Eq. 17 impossible to satisfy and preventing the identification of unique mappings between true and estimated latent variables.

  2. Support sparsity ($\hat{d}_t \leq d_t$):
  • Combined with condition (iii), this constraint enables the construction of a permutation $\sigma$ that establishes the crucial relationship $\hat{s}_t = \sigma(s_t)$ in Eq. 20.

  • Without this constraint, we can only obtain Eq. 19. The permutation relationship in Eq. 20 between $s_t$ and $\hat{s}_t$ cannot be established, making block-wise identifiability impossible.

Both condition (iii) and support sparsity are essential: condition (iii) ensures distinguishability of latent variables, while support sparsity enables proper mapping between true and estimated supports. The absence of either condition would make block-wise identifiability unattainable.

Let us now illustrate its strength through an example from human pose estimation, where different body parts become temporarily invisible due to occlusion. The sufficient variability condition requires that the Hessian matrices of the log transition probabilities span the full space of the support of $z_t$. Consider a pose representation $z=\{z^1,z^2,z^3,z^4,z^5\}$, where $z^1,z^2$ represent the arms (left, right), $z^3$ represents the torso, and $z^4,z^5$ represent the legs (left, right).

At time $t=1$, due to the camera angle, only the right side is visible. The support is $s_1 = \{z^2, z^5\}$ with $d_1 = 2$. When the camera viewpoint changes at $t=2$ to capture the left side, the support shifts to $s_2 = \{z^1, z^4\}$. The Hessian spanning condition ensures that the transitions from $\{z^2, z^5\}$ to $\{z^1, z^4\}$ are sufficiently rich. Therefore, the left arm and left leg are identifiable at $t=2$.

This example demonstrates how our condition naturally handles temporal changes in visibility patterns through the spanning requirement on the Hessian matrices, enabling identification even when different parts of the system become observable at different times.

AC 元评审

The paper introduces a framework for learning latent variables with an intermittent generative process where variables can be switched off at different time steps.

Strengths:

  • meaningful problem setup that reflects real-world scenarios where latent factors may be intermittently active

  • identifiability results that apply to both stationary and non-stationary transition processes

  • empirical validation on synthetic data experiments

Weaknesses:

  • Disconnect between the theoretical framework and the practical implementation/network design; the architecture seems also similar to existing approaches

  • Limited experimental validation; scalability to high dimensions was not sufficiently demonstrated

  • Technical gaps in the formulation, particularly around the estimation of missingness indicators and unclear dimensionality constraints in the mixing function

审稿人讨论附加意见

All reviewers are in favor of acceptance; they generally saw merit in the theoretical contribution but had concerns about practical implementation and validation.

最终决定

Accept (Poster)