PaperHub
7.8 / 10
Spotlight · 4 reviewers
Scores: 4, 3, 4, 5 (min 3, max 5, std dev 0.7)
ICML 2025

$K^2$VAE: A Koopman-Kalman Enhanced Variational AutoEncoder for Probabilistic Time Series Forecasting

OpenReview · PDF
Submitted: 2025-01-17 · Updated: 2025-07-26

Abstract

Probabilistic Time Series Forecasting (PTSF) plays a crucial role in decision-making across various fields, including economics, energy, and transportation. Most existing methods excel at short-term forecasting, while overlooking the hurdles of Long-term Probabilistic Time Series Forecasting (LPTSF). As the forecast horizon extends, the inherent nonlinear dynamics have a significant adverse effect on prediction accuracy, and make generative models inefficient by increasing the cost of each iteration. To overcome these limitations, we introduce $K^2$VAE, an efficient VAE-based generative model that leverages a KoopmanNet to transform nonlinear time series into a linear dynamical system, and devises a KalmanNet to refine predictions and model uncertainty in such a linear system, which reduces error accumulation in long-term forecasting. Extensive experiments demonstrate that $K^2$VAE outperforms state-of-the-art methods in both short- and long-term PTSF, providing a more efficient and accurate solution.
Keywords
Time Series Probabilistic Forecasting

Reviews and Discussion

Review (Rating: 4)

This study presents $K^2$VAE, an efficient variational autoencoder (VAE)-based generative model. It utilizes a KoopmanNet to convert nonlinear time series into a linear dynamical system. Furthermore, it designs a KalmanNet to enhance predictions and model uncertainty in this linear system, thereby reducing error accumulation in long-term forecasting.

Questions for Authors

None

Claims and Evidence

The statement that "The model exhibits robust generative capability and performs well in both short- and long-term probabilistic forecasting" is strongly supported by experimental findings on real-world datasets.

Methods and Evaluation Criteria

There are detailed running examples and a complexity comparison covering 9 baselines, 8 datasets, and 2 metrics. The selection of baselines aligns with the domain of time series forecasting.

However, why were Continuous Ranked Probability Score (CRPS) and Normalized Mean Absolute Error (NMAE) chosen as the main evaluation metrics?

Theoretical Claims

The equations are correct and align with the code submitted.

Experimental Design and Analysis

The design of the experiments is clear and comprehensive, and supports the claims of $K^2$VAE.

Although the authors analyzed the overall performance, key components, and efficiency of $K^2$VAE, there is a lack of analysis on parameter sensitivity. It is hoped that the authors can add an analysis of key parameters.

Supplementary Material

Source code and the dataset are available on the submission page. I have checked the $K^2$VAE backbone code.

Relation to Existing Literature

Transferring data from a nonlinear space to a linear space can resolve non-stationarity issues, thereby effectively alleviating the predominantly non-stationary nature of real-world datasets. This approach is highly recommended for broad application.

Essential References Not Discussed

I think there are no essential references that have not been discussed.

Other Strengths and Weaknesses

Strengths:

S1. The paper is well-motivated, as effectively modeling both short- and long-term multivariate probabilistic time series forecasting remains challenging.

S2. This paper is well written. The notations are clear.

S3. Experimental results demonstrate that $K^2$VAE consistently outperforms state-of-the-art baselines across multiple real-world datasets.

Weaknesses:

W1. Why were CRPS and NMAE chosen as the main evaluation metrics? Why not use other evaluation criteria?

W2. The conclusion of the paper exceeds two lines and needs to be adjusted to meet the final format requirements of ICML.

W3. For input token embeddings, the conventional patch partitioning method was not used, and no detailed explanation was provided, which is confusing.

W4. Although the authors analyzed the overall performance, key components, and efficiency of $K^2$VAE, there is a lack of analysis on parameter sensitivity. It is hoped that the authors can add an analysis of key parameters.

Other Comments or Suggestions

It appears that there is an inconsistency between Equation 12 and Figure 3. The more accurate representation should use Equation 12 to express $Res$. Please correct this and unify the representations in these two places.

Author Response

Reply to W1. Why were CRPS and NMAE chosen as the main evaluation metrics? Why not use other evaluation criteria?

Thank you for your valuable comments. In probabilistic forecasting, evaluation metrics including CRPS (Continuous Ranked Probability Score), CRPS_sum, NMAE (Normalized Mean Absolute Error), NMAE_sum, NRMSE (Normalized Root Mean Square Error), and NRMSE_sum are widely adopted as standard performance metrics [1] [2] [3]. From a comprehensive evaluation perspective, CRPS and its aggregated variant CRPS_sum quantify the distributional approximation quality, while NMAE/NMAE_sum and NRMSE/NRMSE_sum assess point estimation accuracy. Typically, one metric from each category is selected for evaluation. The ProbTS benchmark [3], recognized as a comprehensive framework for probabilistic forecasting models, employs CRPS and NMAE as primary evaluation criteria. This benchmark configuration, having undergone extensive empirical validation across numerous algorithms, was adopted in our study to ensure methodological consistency and equitable comparison with baseline models. (A small sketch of these estimators follows the references below.)

[1] Kollovieh, Marcel, et al. "Predict, refine, synthesize: Self-guiding diffusion models for probabilistic time series forecasting." Advances in Neural Information Processing Systems 36 (2023): 28341-28364.

[2] Rasul, K., Seward, C., Schuster, I., and Vollgraf, R. "Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting." In ICML, volume 139, pp. 8857–8868, 2021.

[3] Zhang, Jiawen, et al. "ProbTS: Benchmarking point and distributional forecasting across diverse prediction horizons." Advances in Neural Information Processing Systems 37 (2024): 48045-48082.
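For concreteness, here is a minimal NumPy sketch of the standard sample-based estimators behind these two metrics. The helper names `crps_sample` and `nmae` are ours, not the paper's evaluation code; CRPS uses the energy form $\mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$.

```python
import numpy as np

def crps_sample(samples: np.ndarray, y: float) -> float:
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

def nmae(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Normalized MAE: mean absolute error scaled by the mean absolute target."""
    return np.mean(np.abs(y_pred - y_true)) / np.mean(np.abs(y_true))

rng = np.random.default_rng(0)
forecast_samples = rng.normal(loc=1.0, scale=0.5, size=200)  # 200 draws for one target point
print(crps_sample(forecast_samples, y=1.2))
print(nmae(np.array([1.1, 0.9]), np.array([1.0, 1.0])))
```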

Reply to W2. The conclusion of the paper exceeds two lines and needs to be adjusted to meet the final format requirements of ICML.

Thanks! We will fix this issue.

Reply to W3. For input token embeddings, the conventional patch partitioning method was not used, and no detailed explanation was provided, which is confusing.

  1. The conventional patching strategy, as introduced in PatchTST [4], aims to uniformly partition multivariate time series across channels to facilitate subsequent channel-wise independent modeling. The dimensional transformation process operates as: $X \in \mathbb{R}^{B \times N \times T} \to X' \in \mathbb{R}^{B \times N \times n \times p} \to X^{patch} \in \mathbb{R}^{(B \times N) \times n \times p} \to X^{embedding} \in \mathbb{R}^{(B \times N) \times n \times d}$, where $n$ denotes the number of patches and $p$ denotes the patch size. This architecture collapses the batch dimension $B$ and variate dimension $N$ into a single axis, effectively enforcing univariate modeling at the patch level.
  2. In contrast, our adopted patching mechanism maps multivariate subsequences to high-dimensional representations compatible with Koopman operator theoretic frameworks, enabling systematic state transition modeling. The modified transformation pipeline is formalized as: $X \in \mathbb{R}^{B \times N \times T} \to X' \in \mathbb{R}^{B \times N \times n \times p} \to X^{patch} \in \mathbb{R}^{B \times n \times (N \times p)} \to X^{embedding} \in \mathbb{R}^{B \times n \times d}$, where $n$ and $p$ are as above. Our architecture collapses the patch dimension $p$ and variate dimension $N$ into a single axis (see the reshape sketch after the reference below).

[4] Nie, Yuqi, et al. "A Time Series is Worth 64 Words: Long-term Forecasting with Transformers." The Eleventh International Conference on Learning Representations.
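To make the two pipelines concrete, here is a small PyTorch sketch of the reshaping involved. The dimensions and the linear embedding layers are illustrative assumptions, standing in for whatever projection each model actually uses.

```python
import torch

B, N, T, p, d = 32, 7, 96, 8, 256   # assumed sizes for illustration
n = T // p                           # number of patches
X = torch.randn(B, N, T)

# PatchTST-style channel-independent patching: (B, N, T) -> (B*N, n, p) -> (B*N, n, d)
X_ci = X.reshape(B, N, n, p).reshape(B * N, n, p)
emb_ci = torch.nn.Linear(p, d)(X_ci)

# Multivariate patching described above: (B, N, T) -> (B, n, N*p) -> (B, n, d)
X_mv = X.reshape(B, N, n, p).permute(0, 2, 1, 3).reshape(B, n, N * p)
emb_mv = torch.nn.Linear(N * p, d)(X_mv)

print(emb_ci.shape, emb_mv.shape)  # torch.Size([224, 12, 256]) torch.Size([32, 12, 256])
```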

Reply to W4. Although the authors analyzed the overall performance, key components, and efficiency of $K^2$VAE, there is a lack of analysis on parameter sensitivity. It is hoped that the authors can add an analysis of key parameters.

Thanks for your valuable comments. Following your suggestions, we systematically evaluated three critical hyperparameters: the patch size $p$, the dimensionality $d$ of hidden layers in the Measurement Function, and the number of hidden layers $l$ in the Measurement Function. Specific experimental results are presented in the following tables:

https://anonymous.4open.science/r/K2VAE-D957/sensitivity.md

These empirical findings substantiate that $K^2$VAE maintains robust performance across different hyperparameter configurations. To achieve the best performance, we recommend a group of stable hyperparameters for both short-term and long-term probabilistic forecasting:

  1. Patch Size:
    • Short-horizon forecasting tasks achieve peak performance with patch size $p = 8$
    • Long-horizon forecasting benefits from extended context capture with patch size $p = 24$
  2. Network Architecture:
    • Number of hidden layers $l$: 2-3 layers yield the optimal performance-efficiency balance
    • Dimensionality of hidden layers $d$: 256 provides sufficient representational capacity

Thanks again for Reviewer auu7's valuable comments!

Review (Rating: 3)

This paper points out that traditional probabilistic methods suffer from collapsing uncertainty estimates when forecasting long-term series, which provides a new perspective. To overcome these limitations, the paper introduces $K^2$VAE, an efficient VAE-based generative model that leverages a KoopmanNet to transform nonlinear time series into a linear dynamical system, and devises a KalmanNet to refine predictions and model uncertainty in such a linear system, which reduces error accumulation in long-term forecasting.

Questions for Authors

  1. Why can the proposed method solve the problem of error accumulation well?

  2. According to my understanding, this paper constructs a probabilistic time series model based on Koopa [1]. This should be further clarified. Why is the experiment not compared with Koopa?

  3. Is the proposed method for more accurate prediction of long time series, or for effective uncertainty modeling over longer periods of time? If the former, you could add any probabilistic model with Koopa.

[1] Liu, Yong, Chenyu Li, Jianmin Wang, and Mingsheng Long. "Koopa: Learning non-stationary time series dynamics with koopman predictors." Advances in Neural Information Processing Systems 36 (2023): 12271-12290.

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

Yes

Experimental Design and Analysis

Yes

Supplementary Material

N/A

Relation to Existing Literature

N/A

Essential References Not Discussed

The connection and difference between the proposed method and [1] should be further clarified.

[1] Liu, Yong, Chenyu Li, Jianmin Wang, and Mingsheng Long. "Koopa: Learning non-stationary time series dynamics with koopman predictors." Advances in Neural Information Processing Systems 36 (2023): 12271-12290.

Other Strengths and Weaknesses

Strengths:

  1. The experimental results show the advantages of the proposed model in terms of inference time and computing resources.

  2. The experimental results show that the proposed method is effective.

  3. This paper points out the uncertainty-estimation collapse problem of traditional probabilistic methods in long-term time series forecasting, providing a new point of view.

  4. K2VAE outperforms state-of-the-art baselines, showing notable improvements in predictive performance.

Weaknesses:

Other Comments or Suggestions

  1. Figure 3 is complicated and difficult to understand.

Author Response

Reply to Q1: Why can the proposed method solve the problem of error accumulation well?

Thank you for your valuable comments. In $K^2$VAE, the KoopmanNet models time series in linear dynamical systems, where uncertainties are represented as deviations from the linear system's predictions. The KalmanNet then works in a two-phase operation:

  1. Prediction Phase: Outputs state predictions and covariance matrices of uncertainty distributions for the linear system: $\hat{z}_k = A z_{k-1} + B u_k$, $\hat{\mathrm{P}}_k = A \mathrm{P}_{k-1} A^T + Q$.

  2. Update Phase: Integrates observations to compute the Kalman gain $K_k$, refining predictions and covariance via: $K_k = \hat{\mathrm{P}}_k H^T (H \hat{\mathrm{P}}_k H^T + R)^{-1}$, $z_k = \hat{z}_k + K_k(\hat{x}_k^H - H \hat{z}_k)$, $\mathrm{P}_k = (I - K_k H) \hat{\mathrm{P}}_k$.

The prediction and update phases fuse the information from the Integrator ($u_k$) and the KoopmanNet ($\hat{x}^H_k$) to effectively eliminate error accumulation in the prediction $\hat{z}_k \to z_k$ and the covariance matrix $\hat{\mathrm{P}}_k \to \mathrm{P}_k$. To demonstrate KalmanNet's effectiveness, we include an ablation study of it in https://anonymous.4open.science/r/K2VAE-D957/ablations.md. The experiments demonstrate the importance of the KalmanNet in multiple PTSF tasks, particularly in tasks with L=336 and 720.
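For readers less familiar with Kalman filtering, the equations above instantiate the generic textbook predict/update step. The NumPy sketch below shows that step only for illustration; in $K^2$VAE these matrices are produced by learned networks, and the function name and shapes here are our own assumptions.

```python
import numpy as np

def kalman_step(z, P, u, x_hat_H, A, B, H, Q, R):
    # Prediction phase: propagate state and covariance through the linear system
    z_hat = A @ z + B @ u
    P_hat = A @ P @ A.T + Q
    # Update phase: fuse the observation x_hat_H via the Kalman gain
    K = P_hat @ H.T @ np.linalg.inv(H @ P_hat @ H.T + R)
    z_new = z_hat + K @ (x_hat_H - H @ z_hat)
    P_new = (np.eye(P_hat.shape[0]) - K @ H) @ P_hat
    return z_new, P_new
```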

Reply to Q2: According to my understanding, this paper constructs a probabilistic time series model based on Koopa. This should be further clarified. Why is the experiment not compared with Koopa?

We first explain the differences between $K^2$VAE and Koopa in the following points:

  1. Koopa applies Koopman Theory to address the non-stationarity issue in point forecasting tasks, while $K^2$VAE utilizes the KoopmanNet to make it easier to model the uncertainties. Koopa is a model completely based on Koopman Theory, whereas our proposed $K^2$VAE benefits from Koopman Theory, the Kalman Filter, and the VAE.
  2. Koopa is a deterministic model while $K^2$VAE is a non-deterministic model. Koopa is an MLP-based model, whereas $K^2$VAE adopts the generative VAE architecture and excels at modeling the distributions over the target horizon.
  3. Koopa assumes that the time series can be accurately modeled by pure Koopman structures, so it utilizes multi-scale MLPs to simulate the Koopman process, trying to model an unbiased linear dynamical system in the measurement space. $K^2$VAE, in contrast, is a model tailored for PTSF tasks: it starts from analyzing the "bias" in the linear dynamical system modeled by the KoopmanNet, using a KalmanNet to model these uncertainties.
  4. A key design in Koopa is that it utilizes multiple MLP layers to continuously model the residual parts. In our assumptions, such residual parts may be uncertainties that are hard to model. Therefore, $K^2$VAE not only designs a KoopmanNet to linearize the time series, but also models the uncertainties through the subsequent KalmanNet.

We include Koopa in our baselines. Specifically, we equip it with a Gaussian distribution, which is consistent with other point-forecasting baselines such as FITS, iTransformer, and PatchTST. The experimental results are shown in https://anonymous.4open.science/r/K2VAE-D957/comare_with_koopa.md. In both long-term and short-term prediction tasks, Koopa lags behind $K^2$VAE.

Reply to Q3: Is the proposed method for more accurate prediction of long time series, or for effective uncertainty modeling over longer periods of time? If the former, you could add any probabilistic model with Koopa.

$K^2$VAE focuses on probabilistic forecasting, which is in line with "effective uncertainty modeling over longer periods of time." Combined with the answers to Q1 and Q2, $K^2$VAE employs the KalmanNet to achieve better uncertainty modeling. Based on the generative model VAE, $K^2$VAE accurately describes the variational distribution in the measurement space constructed by the KoopmanNet, thus enhancing its ability to construct the target distribution of the forecasting horizon. Koopa, on the other hand, is constructed for deterministic long-term point forecasting tasks; its characteristic lies in constructing a multi-scale MLP structure and modeling the time series from the perspective of decomposition. $K^2$VAE does not directly apply the structure of Koopa; instead, it leans more toward the application of the original Koopman Theory, and models the uncertain part with the KalmanNet. We will discuss the differences from Koopa in the Related Works and will include Koopa as an additional baseline (as elaborated in Q2) in the paper.

Reply to Other Problems

Figure 3 is redrawn in https://anonymous.4open.science/r/K2VAE-D957/README.md. Thanks again for the valuable comments! May we ask whether our responses have resolved your questions? Or we can have further discussions!

Review (Rating: 4)

This study presents an efficient framework named K2VAE, which transforms nonlinear time series into a linear dynamical system. By predicting and refining the process uncertainty within the system, K2VAE showcases powerful generative capabilities and excels in both short- and long-term probabilistic forecasting.

Questions for Authors

Koopman theory requires that the measurement function maps inputs to several observable variables to construct a linear dynamical process in that space. Traditional methods typically choose fixed basis functions to meet this requirement, but this paper adopts a learnable measurement function. How can we ensure that the constructed space is within a linear dynamical system?

Claims and Evidence

Claim 1: This study asserts that KoopmanNet can fully leverage the inherent linear dynamical properties within the measurement function space, streamline modeling processes, and thereby enhance model efficiency.

Claim 2: This study contends that KalmanNet can effectively mitigate error accumulation in long-term predictive state forecasting.

Both claims have been substantiated through ablation experiments.

Methods and Evaluation Criteria

This paper compares with the state-of-the-art (SOTA) in probabilistic forecasting, utilizing common evaluation metrics.

Theoretical Claims

Most equations and derivations can be followed.

Experimental Design and Analysis

Although the author has provided a detailed analysis of the overall performance, there is a lack of sensitivity analysis regarding the parameters.

Supplementary Material

I have checked the appendix and the code, but did not have time to check them in detail.

Relation to Existing Literature

The author's methodology is related to time series analysis.

Essential References Not Discussed

Related works seem to be covered.

Other Strengths and Weaknesses

Strengths:

S1. Multivariate time series probabilistic forecasting is important to time-series analysis.

S2. This work focuses on an important problem that could have real-world applications.

S3. The tables and figures used in this work are clear and easy to read.

Weaknesses:

W1. Koopman theory requires that the measurement function maps inputs to several observable variables to construct a linear dynamical process in that space. Traditional methods typically choose fixed basis functions to meet this requirement, but this paper adopts a learnable measurement function. How can we ensure that the constructed space is within a linear dynamical system?

W2. In the ablation experiments for the components of KoopmanNet in Table 3, why does using the local Koopman operator lead to numerical instability?

W3. There is a lack of sensitivity analysis regarding the parameters, which obscures understanding of the behavior of the proposed method under different hyperparameter settings.

Other Comments or Suggestions

Some citations should be updated to published versions instead of preprints.

Author Response

Reply to W1. Koopman theory requires that the measurement function maps inputs to several observable variables to construct a linear dynamical process in that space. Traditional methods typically choose fixed basis functions to meet this requirement, but this paper adopts a learnable measurement function. How can we ensure that the constructed space is within a linear dynamical system?

Thank you for your valuable comments. Koopman Theory requires that the measurement function possess sufficient projection capability to map nonlinear inputs into a high-dimensional measurement space where they can be modeled by a linear dynamical system. Traditional methods use fixed functions, such as polynomial bases, because these functions have strong high-order fitting capabilities. Similarly, in deep learning, the universal approximation theorem guarantees that multi-layer perceptrons (MLPs) have strong fitting capabilities and can theoretically approximate any continuous function to arbitrary precision. $K^2$VAE, as a deep learning model, likewise adopts MLPs to fit the measurement function. The KoopmanNet can construct a linear dynamical system because it designs the Measurement Function $\psi$ and the Decoder $\psi^{-1}$ in a fully symmetric manner, and uses a reconstruction loss $\mathcal{L}_{rec}$ to keep the outputs of the linear system constructed by the Koopman Operator $\mathcal{K}$ close to the contextual time series in the original space. Additionally, the Koopman Operator itself is linear, which aligns with the assumptions of Koopman Theory.
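A minimal PyTorch sketch of this symmetric encoder/operator/decoder pattern follows. The layer sizes, activation, MSE objective, and single-step transition are our own illustrative assumptions, not the exact $K^2$VAE implementation.

```python
import torch
import torch.nn as nn

d_in, d = 56, 256   # assumed: flattened N*p patch size, measurement dimension

psi = nn.Sequential(nn.Linear(d_in, d), nn.GELU(), nn.Linear(d, d))      # measurement function
psi_inv = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d_in))  # symmetric decoder
K_op = nn.Linear(d, d, bias=False)                                       # linear Koopman operator

x_t, x_next = torch.randn(8, d_in), torch.randn(8, d_in)
# Reconstruction loss pulls the linear one-step transition back toward the original space
loss_rec = nn.functional.mse_loss(psi_inv(K_op(psi(x_t))), x_next)
loss_rec.backward()
```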

Reply to W2. In the ablation experiments for the components of KoopmanNet in Table 3, why does using the local Koopman operator lead to numerical instability?

This is because we use one-step eDMD to accelerate the computation of $\mathcal{K}_{loc}$, as shown in the following steps:

$X^{P^\ast}_{back} = \left[x^{P^\ast}_1, x^{P^\ast}_2, \cdots, x^{P^\ast}_{n-1}\right]$, $X^{P^\ast}_{fore} = \left[x^{P^\ast}_2, x^{P^\ast}_3, \cdots, x^{P^\ast}_{n}\right]$, $\mathcal{K}_{loc} = X^{P^\ast}_{fore} \left(X^{P^\ast}_{back}\right)^{\dagger}$

By applying a one-step offset and then computing the pseudo-inverse, we estimate $\mathcal{K}_{loc}$. This process is prone to matrix computation errors when the difference between the contextual length and the horizon length is large (e.g., 96 and 720). This phenomenon often occurs when the model has just begun optimization and the space constructed by the measurement function is not yet good enough. Theoretically, although the Moore-Penrose pseudo-inverse always exists, computing it relies on Singular Value Decomposition (SVD), which can be numerically unstable and lead to computation failures. Therefore, we need a learnable $\mathcal{K}_{glo}$ to ensure computational stability at the beginning of training.
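A sketch of this estimation step, under the definitions above (the function name is ours, and the snapshot layout is an assumption for illustration):

```python
import torch

def local_koopman(XP: torch.Tensor) -> torch.Tensor:
    """One-step eDMD estimate of K_loc from measurement-space snapshots.

    XP: (d, n) matrix whose columns are x^{P*}_1, ..., x^{P*}_n.
    """
    X_back, X_fore = XP[:, :-1], XP[:, 1:]
    # pinv is computed via SVD, which is where the numerical instability arises
    return X_fore @ torch.linalg.pinv(X_back)
```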

Reply to W3. There is a lack of sensitivity analysis regarding the parameters, which obscures understanding of the behavior of the proposed method under different hyperparameter settings.

Thanks for your valuable comments. Following your suggestions, we systematically evaluated three critical hyperparameters: the patch size $p$, the dimensionality $d$ of hidden layers in the Measurement Function, and the number of hidden layers $l$ in the Measurement Function. Specific experimental results are presented in the following tables:

https://anonymous.4open.science/r/K2VAE-D957/sensitivity.md

These empirical findings substantiate that $K^2$VAE maintains robust performance across different hyperparameter configurations. To achieve the best performance, we recommend a group of stable hyperparameters for both short-term and long-term probabilistic forecasting:

  1. Patch Size:
    • Short-horizon forecasting tasks achieve peak performance with patch size $p = 8$
    • Long-horizon forecasting benefits from extended context capture with patch size $p = 24$
  2. Network Architecture:
    • Number of hidden layers $l$: 2-3 layers yield the optimal performance-efficiency balance
    • Dimensionality of hidden layers $d$: 256 provides sufficient representational capacity

Reply to Other Problems

We will carefully examine and correct the citation formatting. Thank you for pointing this out!

Thanks again for Reviewer rXKc7's valuable comments! We have addressed W1-W3 and provided analysis and evidence from the perspectives of model design and experiments. Does our reply resolve your questions? If there are any other questions, we can discuss them further!

Reviewer Comment

There are still some doubts about "linear dynamical systems." The Measurement Function seems only to map patches to high-dimensional spaces, but how can we ensure that these high-dimensional vectors lie in a linear dynamical system and can be described by a linear operator $\mathcal{K}$?

Author Comment

Thank you for the question! This issue should be viewed from the perspective of optimization. When the model begins training, the transition process of the high-dimensional vectors constructed by the Measurement Function is difficult to describe precisely with the linear operator $\mathcal{K}$. Therefore, we mentioned in the text that the linear dynamical system constructed in this way is "biased". Specifically, this bias reflects the fact that the Measurement Function $\psi$ may not yet be well fitted, making it hard for the constructed high-dimensional vectors to be accurately modeled by the linear system described by $\mathcal{K}$. In other words, the operator $\mathcal{K}$ describes the deterministic part, while the bias denotes the uncertain or non-deterministic part, which needs to be characterized using probabilistic methods. Subsequently, the KalmanNet, through its iterations, can effectively enhance the deterministic prediction part and use the covariance $\mathrm{P}$ to describe the uncertain part. Meanwhile, as the network is gradually optimized through backpropagation, the Measurement Function $\psi$ converges, and the generated high-dimensional vectors become easier to describe with $\mathcal{K}$ in the linear dynamical system. At this point, the deterministic part is enhanced, and the uncertain part also becomes easier to describe. Additionally, we use a reconstruction loss $\mathcal{L}_{rec}$ to keep the linear system constructed by the Koopman Operator close to the contextual time series in the original space. The structures of the KoopmanNet and the Decoder are similar to autoencoders, which have been applied and proven effective in [1] [2] [3].

[1] Azencot, Omri, et al. "Forecasting sequential data using consistent koopman autoencoders." International Conference on Machine Learning. PMLR, 2020.

[2] Otto, Samuel E., and Clarence W. Rowley. "Linearly recurrent autoencoder networks for learning dynamics." SIAM Journal on Applied Dynamical Systems 18.1 (2019): 558-593.

[3] Liu, Yong, et al. "Koopa: Learning non-stationary time series dynamics with koopman predictors." Advances in neural information processing systems 36 (2023): 12271-12290.

Thanks again for Reviewer rXKc7's valuable comments! Does our reply resolve your question? If there are any other questions, we can discuss them further!

Review (Rating: 5)

This study introduces $K^2$VAE, a VAE-based probabilistic forecasting model designed to address PTSF. By leveraging the KoopmanNet, $K^2$VAE converts nonlinear time series into a linear dynamical system, enabling a more effective representation of state transitions and inherent process uncertainties. Additionally, the KalmanNet models uncertainty within this linear dynamical system, reducing error accumulation in long-term forecasting tasks.

Questions for Authors

The experimental design of this paper is very comprehensive, evaluating multiple datasets on both long and short step tasks. However, in Table 5, some datasets have the same name but different actual lengths. What is the reason for this?

Claims and Evidence

This work is dedicated to addressing the challenges of nonlinear phenomena in time series probabilistic forecasting and the cumulative errors in long-step predictions. Theoretically, it proposes corresponding modules for improvement based on Koopman theory and the Kalman Filter, respectively. Experimentally, it has been proven effective through evaluations under numerous settings and also demonstrated that the model is lighter compared to generative models based on Diffusion and Flow, suggesting a promising application outlook.

Methods and Evaluation Criteria

The design of $K^2$VAE appears to be quite sophisticated, with the proposed KoopmanNet and KalmanNet having logically reasonable connections. The KoopmanNet models nonlinear time series as a linear transition process between measurements, while the KalmanNet is suitable for handling error accumulation issues in linear dynamical systems. Extensive experimental evidence has also demonstrated the outstanding performance of $K^2$VAE in both short-term and long-term probabilistic prediction scenarios, while maintaining a lightweight structure.

Theoretical Claims

I have checked the proofs of Theorems 3.1-3.2 in the appendix, and they are all correct and consistent with the purpose of the article.

Experimental Design and Analysis

Although the ablation studies of key components in KoopmanNet and KalmanNet are discussed, the complete removal of these two modules to analyze their impacts has not been considered.

Supplementary Material

The Supplementary Material of this paper, like that of many others, provides a detailed introduction to the data, comprehensive experimental results, and visualizations of the prediction performance.

Relation to Existing Literature

This work has inspired the design of more lightweight probabilistic forecasting models in the field of time series, for full-scenario temporal probabilistic forecasting tasks. Previous Diffusion-based works generally caused a large amount of computational resource overhead and seemed unable to model probability distributions well over longer forecasting windows. The proposal of $K^2$VAE has inspired researchers to shift their focus to the design of VAE, spending more effort on designing proper structures that better conform to the inductive biases of time series.

Essential References Not Discussed

The related work section discusses the application of VAE in time series data but seems to omit some classic algorithms, such as $D^3$VAE. It is suggested to include the discussion and experiments of these algorithms.

Other Strengths and Weaknesses

Strengths:

S1. The paper introduces an efficient framework called $K^2$VAE, which transforms nonlinear time series into a linear dynamical system by predicting and refining the process uncertainty of the system.

S2. $K^2$VAE demonstrates strong generative capability and excels in both short- and long-term probabilistic forecasting.

Weaknesses:

W1. The experimental design of this paper is very comprehensive, evaluating multiple datasets on both long and short step tasks. However, in Table 5, some datasets have the same name but different actual lengths. What is the reason for this?

W2. The related work section discusses the application of VAE in time series data but seems to omit some classic algorithms, such as $D^3$VAE. It is suggested to include the discussion and experiments of these algorithms.

W3. Although the ablation studies of key components in KoopmanNet and KalmanNet are discussed, the complete removal of these two modules to analyze their impacts has not been considered.

W4. The conclusion of the article needs to be adjusted to meet the final format requirements of ICML, as it currently exceeds two lines.

Other Comments or Suggestions

The conclusion of the article needs to be adjusted to meet the final format requirements of ICML, as it currently exceeds two lines.

Author Response

Reply to W1. The experimental design of this paper is very comprehensive, evaluating multiple datasets on both long and short step tasks. However, in Table 5, some datasets have the same name but different actual lengths. What is the reason for this?

Thank you for your valuable comments. We follow the evaluation protocol in ProbTS, a well-known benchmark for probabilistic forecasting tasks. Datasets with the same name but different suffixes, e.g., Electricity-S and Electricity-L, are different datasets with different lengths and channels. The ETT datasets are shared between the short- and long-term probabilistic forecasting tasks. We have summarized the details of the datasets in our appendix, and we list the table here:

https://anonymous.4open.science/r/K2VAE-D957/datasets_info.md

Reply to W2. The related work section discusses the application of VAE in time series data but seems to omit some classic algorithms, such as $D^3$VAE. It is suggested to include the discussion and experiments of these algorithms.

Thanks for the suggestions. Although $D^3$VAE is closer to a Diffusion model, it does indeed follow the VAE paradigm. Specifically, $D^3$VAE is a bidirectional variational auto-encoder that combines diffusion, denoising, and factorization. By coupling diffusion probabilistic models, it augments time series data, reduces data uncertainty, and simplifies the inference process. Furthermore, it treats latent variables as multivariate and minimizes total correlation to disentangle them, thereby enhancing the interpretability and stability of predictions. Both $D^3$VAE and $K^2$VAE aim to better model the uncertainty in the target window through decoupling methods, and we will discuss the similarities and differences in the Related Works section. In terms of experiments, we also provide a detailed comparison with $D^3$VAE, assessing it in both short- and long-term PTSF tasks:

https://anonymous.4open.science/r/K2VAE-D957/compare_with_d3vae.md

We keep the contextual length equal to the horizon length to match the setting of $D^3$VAE. The results show that $K^2$VAE outperforms $D^3$VAE in both short- and long-horizon scenarios, and $D^3$VAE also seems less suitable for long-term PTSF tasks.

Reply to W3. Although the ablation studies of key components in KoopmanNet and KalmanNet are discussed, the complete removal of these two modules to analyze their impacts has not been considered.

KoopmanNet and KalmanNet act as two key components of $K^2$VAE. We further analyze the impact of these two modules on PTSF tasks, and supplement the ablation experiments using the same datasets as in the paper:

https://anonymous.4open.science/r/K2VAE-D957/ablations.md

The results demonstrate that KalmanNet is more critical for LPTSF tasks. When KalmanNet is removed, the accuracy breaks down sharply, indicating the importance of KalmanNet in effectively eliminating the cumulative errors in LPTSF tasks. Removing KoopmanNet, on the other hand, leads to a performance decline across all tasks. Without KoopmanNet, the nonlinear time series is hard to model, and the uncertainties in such a nonlinear system are also difficult to capture through KalmanNet, which further degrades the model's capabilities. The experiments show that both KoopmanNet and KalmanNet are critical and indispensable in our design.

Reply to W4. The conclusion of the article needs to be adjusted to meet the final format requirements of ICML, as it currently exceeds two lines.

Thanks for your reminder! We will fix it.

Thanks again for Reviewer iEFP's valuable comments! We have addressed W1-W4 and provided analysis and evidence from the perspectives of model design and experiments. Does our reply resolve your questions? If there are any other questions, we can discuss them further!

Reviewer Comment

Thanks for the clarifications. The additional analysis, particularly regarding the dataset splits, D3VAE comparisons, and the role of KoopmanNet, addresses my concerns. I've accordingly raised my score from 4 to 5.

Author Comment

We are thrilled that our responses have effectively addressed your questions and comments. We would like to express our sincerest gratitude for taking the time to review our paper and provide us with such detailed feedback.

Final Decision

This paper presents $K^2$VAE, a novel variational autoencoder framework that combines Koopman operator theory and Kalman filtering for probabilistic time series forecasting. The proposed model effectively transforms nonlinear dynamics into a linear latent space via KoopmanNet and refines uncertainty through KalmanNet, addressing long-standing challenges in long-term forecasting such as error accumulation and inefficiency in generative models. The method is theoretically sound, and the implementation is carefully designed and well-motivated. Extensive experiments across diverse datasets show that $K^2$VAE consistently outperforms strong baselines, including recent diffusion-based and transformer-based approaches, while remaining lightweight and efficient. The authors provide thorough ablations and comparative studies, including with $D^3$VAE and Koopa, and respond convincingly to reviewer concerns. Overall, this is a strong, well-executed contribution to the field, with both theoretical depth and practical relevance. I recommend acceptance.