PaperHub
Rating: 7.2/10 · Spotlight · 4 reviewers (lowest 3, highest 4, std 0.4); individual ratings: 4, 3, 4, 4
ICML 2025

Non-stationary Diffusion For Probabilistic Time Series Forecasting

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-07-24

Keywords

Probabilistic Time Series Forecasting, Denoising Diffusion Process Model, Non-stationary Forecasting

Reviews and Discussion

Review (Rating: 4)

This paper introduces NsDiff, a novel diffusion-based framework for probabilistic time series forecasting that explicitly addresses non-stationary uncertainty. Recognizing that conventional DDPMs rely on a fixed variance assumption from the additive noise model (ANM), the authors propose the integration of a Location-Scale Noise Model (LSNM) to allow the variance to vary with the input data. NsDiff combines a pre-trained conditional mean and variance estimator with an uncertainty-aware noise schedule that dynamically adapts noise levels at each diffusion step. Extensive experiments on nine real-world and synthetic datasets demonstrate that NsDiff significantly outperforms existing methods, such as TimeGrad and TMDM, especially in capturing changing uncertainty patterns. Although the paper provides thorough theoretical derivations and promising empirical results, some aspects—such as the integration of the pre-trained estimators and the robustness of the noise schedule—could benefit from further clarification.

  1. Why does NsDiff not adopt a fully end-to-end joint optimization approach? Is it necessary to pre-train the two networks separately?

  2. Figure 2 contains numerous curved lines. While I understand that this may reflect the authors’ intentional design style, it appears rather unappealing and should be modified.

  3. Although the task focuses on probabilistic MTS forecasting, I recommend additionally reporting metrics such as MSE and MAE, since the mean is an important characteristic of the distribution.

Questions for Authors

See the summary.

Claims and Evidence

See the summary.

Methods and Evaluation Criteria

See the summary.

Theoretical Claims

See the summary.

Experimental Design and Analysis

See the summary.

Supplementary Material

See the summary.

Relation to Prior Literature

See the summary.

Missing Essential References

See the summary.

Other Strengths and Weaknesses

See the summary.

Other Comments or Suggestions

See the summary.

Author Response

Q1: Why does NsDiff not adopt a fully end-to-end joint optimization approach? Is it necessary to pre-train the two networks separately?

A1: Yes, the networks can be trained jointly without a large performance loss. Below is an example of training on ETTh1 with and without pretraining, where we report the CRPS metric.

epoch    pretrain    joint train
1        0.4181      0.4407
2        0.4041      0.4227
3        0.3977      0.4045
4        0.3926      0.4004
5        0.3889      0.3868
6        0.3795      0.3873

(Early stopping is based on MSE.)

As can be seen, although joint training experiences a slight performance degradation (1.86%), it still outperforms the previous state-of-the-art TMDM (0.452). However, compared to pretraining, co-training is slightly harder to converge. We will clarify this in the updated version.
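For reference, the CRPS reported above can be estimated from forecast samples with the standard empirical estimator CRPS ≈ E|X − y| − ½E|X − X′|. The sketch below is illustrative only (array shapes and names are hypothetical, not our evaluation code):

```python
import numpy as np

def crps_empirical(samples: np.ndarray, y_true: np.ndarray) -> float:
    """Sample-based CRPS, averaged over all target entries.

    samples: (n_samples, ...) forecast draws from the model.
    y_true:  (...) observed values matching the trailing shape.
    Implements CRPS ~= E|X - y| - 0.5 * E|X - X'|.
    """
    term1 = np.abs(samples - y_true[None]).mean()
    # Pairwise term: O(n^2) in the number of draws, fine for ~100 samples.
    term2 = np.abs(samples[:, None] - samples[None]).mean()
    return float(term1 - 0.5 * term2)

# Illustrative usage with a hypothetical horizon of 24 and 7 series.
rng = np.random.default_rng(0)
y = rng.normal(size=(24, 7))
draws = rng.normal(size=(100, 24, 7))
print(crps_empirical(draws, y))
```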

Q2: Figure 2 should be modified, e.g. remove curved lines.

A2: Thanks for this suggestion; we will modify Figure 2 according to your advice in the latest version.

Q3: Report additional metrics such as MSE/MAE.

A3: Thanks for your insight. We give the results on real and synthetic datasets in the following tables, with an additional baseline, CSBI (as requested by Reviewer gSNF); the settings are consistent with the main text. The repo has been updated accordingly to include the CSBI code.

We present the MSE/MAE results on the synthetic datasets as follows:

Variance       Linear             Quadratic
Models         MSE      MAE       MSE      MAE
TimeGrad       1.546    3.726     1.626    4.173
CSDI           1.516    3.641     1.546    3.768
TimeDiff       1.537    3.776     1.559    3.762
DiffusionTS    1.738    4.766     1.689    4.823
TMDM           1.514    3.639     1.493    3.568
NsDiff         1.512    3.616     1.479    3.448

As seen in this table, NsDiff still achieves SOTA under varying-uncertainty conditions.

We present the MSE/MAE results on the real datasets as follows:

Models               Metric   ETTh1   ETTh2   ETTm1   ETTm2   ECL     EXG     ILI     Solar   Traffic
TimeGrad (2021)      MSE      0.813   1.496   0.831   0.967   0.504   1.058   1.414   0.446   0.535
                     MAE      1.062   3.462   1.218   1.690   0.505   1.567   4.197   0.475   0.983
CSDI (2022)          MSE      0.708   0.900   0.752   1.069   0.822   1.081   1.481   0.675   0.925
                     MAE      0.949   1.226   1.002   1.723   1.007   1.701   4.515   0.763   1.731
CSBI (2023)          MSE      0.634   0.820   0.757   0.636   0.783   0.897   1.438   0.651   0.848
                     MAE      0.762   0.659   0.526   0.841   0.923   0.746   4.344   0.748   1.527
TimeDiff (2023)      MSE      0.479   0.485   0.477   0.333   0.764   0.446   1.169   0.713   0.784
                     MAE      0.517   0.456   0.537   0.268   0.879   0.402   3.958   0.821   1.350
DiffusionTS (2024)   MSE      0.774   1.411   0.744   1.232   0.856   1.564   1.788   0.740   0.815
                     MAE      1.089   3.273   1.030   2.372   1.072   3.628   6.053   0.749   1.473
TMDM (2024)          MSE      0.607   0.490   0.455   0.395   0.359   0.430   1.175   0.316   0.425
                     MAE      0.696   0.512   0.494   0.315   0.257   0.334   3.636   0.250   0.679
NsDiff (ours)        MSE      0.523   0.490   0.455   0.352   0.306   0.412   0.985   0.307   0.373
                     MAE      0.594   0.514   0.488   0.281   0.209   0.300   2.846   0.242   0.637

Note: TimeDiff is a model specifically designed for long-term point forecasting.

As shown in this table, NsDiff still achieves SOTA on datasets with high non-stationarity, which we attribute to the dynamic mean and variance endpoint and the uncertainty-aware noise schedule.

Here we give the results of the additional baseline CSBI, where NsDiff still remains SOTA.

Models        Metric   ETTh1   ETTh2   ETTm1   ETTm2   ECL     EXG     ILI     Solar    Traffic
CSBI (2023)   CRPS     0.552   0.571   0.502   0.491   0.585   0.659   1.109   0.498    0.875
              QICE     6.141   5.230   3.471   8.918   7.982   6.870   7.175   10.830   11.382
Review (Rating: 3)

This paper introduces a new probabilistic time series forecasting method based on non-stationary diffusion, estimating the step-wise means and variances. The proposed method is validated on different real-world datasets.

Questions for Authors

Given the substantial computational cost of training this proposed method compared to other diffusion-based models, is the performance improvement significant enough to justify the expense?

Claims and Evidence

No. Below are some of my concerns.

1) Estimating the variance of time series is numerically tricky. Please clarify how to predict the variances in a numerically stable way using pretraining in Sec. 4.3. And what if the variance predicted via MLE is too large (when some spiky data points appear)? Using sliding windows might not address this issue perfectly.

2) From Algorithm 1, it seems that the proposed training method is in fact based on fine-tuning, which requires path sampling during training. Is this too computationally expensive for diffusion-based time series forecasting models?

Methods and Evaluation Criteria

No. Apart from the results reported in Table 3, it would be more convincing if the authors could report the relevant RMSEs and MAEs to show that the proposed method is an unbiased point-wise predictor. In my opinion, PICP and QICE may be more suitable for evaluating uncertainty quantification tasks than forecasting tasks.

Theoretical Claims

Yes.

Experimental Design and Analysis

Yes.

1)The CRPS and QICE may not be sufficient to evaluate the model’s forecasting performance.

2)The authors may also need to report the time efficiency and memory costs of the proposed training and inference procedures.

Supplementary Material

Yes, all the parts.

Relation to Prior Literature

The proposed method is largely built on the following paper.

Reference: [1] Li, Y., Chen, W., Hu, X., Chen, B., Zhou, M., et al. Transformer-modulated diffusion models for probabilistic multivariate time series forecasting. In The Twelfth International Conference on Learning Representations, 2024.

The derivations in Appendix A of this paper (which are the core of the proposed method) are directly adapted from Appendix D in [1].

Missing Essential References

No.

Other Strengths and Weaknesses

Strengths:

  1. The writing is clear and easy to follow.

  2. Including uncertainty estimation in time series diffusion models is reasonable.

  3. The relevant arithmetic proof regarding the computation of the step-wise variances is given.

Weaknesses:

  1. The novelty of the proposed method is limited as it heavily builds upon the existing method [1].

  2. The numerical stability of the proposed method is somewhat questionable.

  3. The computational cost of the proposed method is high compared to existing diffusion methods, e.g., CSDI.

  4. Some important evaluation metrics for forecasting are missing, e.g., MAE and RMSE.

Other Comments or Suggestions

Please include a detailed account of how the proposed method builds upon and differs from prior works, e.g., [1], with appropriate citations, in both the Appendix and the main text.

Author Response

Q1: The differences from TMDM.

A1: We believe there are some misunderstandings regarding the relationship between our method and TMDM, particularly regarding the contributions and derivations. We believe the differences from TMDM are clearly presented throughout the paper. To aid clarity, we list some key differences between NsDiff and TMDM along with the parts of the paper where they are discussed. We hope the following summary helps the reviewer quickly identify and understand these distinctions.

  1. TMDM uses the ANM assumption, while NsDiff uses the LSNM assumption (Figure 1, right; lines 23-27).
  2. TMDM uses the endpoint $\mathcal{N}(f_\phi(x),\mathbf{I})$, while NsDiff uses the endpoint $\mathcal{N}(f_\phi(x),g_\psi(x))$ (Figure 1, left; lines 57-74).
  3. TMDM uses a traditional noise schedule, while NsDiff introduces an uncertainty-aware noise schedule (Section 4.6 shows how NsDiff degenerates to TMDM; Table 5 further gives experimental results for different noise schedules).
  4. TMDM does not optimize the variance in the loss function (Eq. 13, where TMDM has only the left term).
  5. TMDM does not estimate the reverse variance (line 284 in Algorithm 2).
  6. TMDM cannot handle non-stationary variance (Tables 3/4 give the experimental results on real/synthetic datasets; Figures 3/4 give a visualized illustration).

We believe that in the derivation presented in Appendix A, all the relevant parts listed above differ from those in TMDM. For example, NsDiff infers the reverse distribution variance by solving Eq. 38, while TMDM does not introduce this step.

Q2: The method is based on fine-tuning, which requires path sampling during training. Is this computationally expensive for DDPMs in time series?

A2: In fact, no path sampling is required during training. The training and inference procedures are the same as those of a standard DDPM, except for the use of a different endpoint and noise schedule. We introduce only some basic operations during training and inference, so the overall computational complexity remains unchanged. See Q5 for experimental results.
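To make "standard DDPM procedure, different endpoint" concrete, here is a minimal sketch of the forward corruption step; f_mean and g_var stand in for the pretrained estimators $f_\phi$ and $g_\psi$, and the mixing coefficients are simplified placeholders rather than the paper's exact schedule:

```python
import torch

def q_sample(y0, x, t, alpha_bar, f_mean, g_var):
    """Draw a corrupted sample drifting toward the endpoint N(f(x), g(x)).

    A standard DDPM uses sqrt(ab)*y0 + sqrt(1-ab)*eps (endpoint N(0, I));
    here the mean moves toward f_mean(x) and the injected noise is scaled
    by g_var(x). Simplified placeholder schedule, not the paper's Eq. 7.
    """
    ab = alpha_bar[t].view(-1, 1, 1)   # cumulative schedule value at step t
    eps = torch.randn_like(y0)
    mean = ab.sqrt() * y0 + (1 - ab.sqrt()) * f_mean(x)
    std = ((1 - ab) * g_var(x)).sqrt()
    return mean + std * eps, eps        # eps remains the regression target
```

As in a standard DDPM, the denoiser is trained on (corrupted sample, eps) pairs; only the endpoint statistics differ, which is why no path sampling is needed.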

Q3: What if the variance predicted via MLE is too large (when some spiky data points appear)? Using sliding windows might not address this issue perfectly.

A3: We do introduce a design to address this issue: an uncertainty-aware noise schedule, which incorporates the true variance into the diffusion training process. This design reduces reliance on the variance predicted by the estimator (e.g., via MLE), which can be overly large in the presence of spiky or noisy data. See Table 5 for the experimental results, where it can be seen that by explicitly learning from the true variance, NsDiff becomes more robust to such cases, rather than relying solely on sliding-window heuristics or estimator outputs.

Q4: The numerical stability of NsDiff.

A4: In NsDiff, the only potential numerical issue is the solvability of the equation in Eq. 15. We have already provided the conditions for solvability in Eq. 17. As stated in the paper (lines 269-272), this equation always has a solution. Therefore, theoretically, NsDiff does not introduce additional numerical instability. In practice, no numerical instability has been observed in our experiments. We kindly invite the reviewers to check and run our code to verify this.

Q5: The efficiency and memory costs in training and inference phases vs. performance improvements.

A5: As shown in the following table, compared to TMDM, NsDiff achieves SOTA and has smaller memory costs and higher efficiency. This is because NsDiff does not introduce additional hidden variables and only adds a small number of basic operations.

Model         Mem. Train (MB)   Mem. Inference (MB)   Time Train (ms)   Time Inference (ms)   CRPS    QICE
TimeGrad      27.47             8.61                  47.89             8319.29               0.606   6.731
CSDI          109.81            22.61                 60.50             446.70                0.492   3.107
TimeDiff      15.66             3.40                  33.93             238.78                0.465   14.931
DiffusionTS   65.03             79.23                 94.51             8214.53               0.603   6.423
TMDM          221.58            213.46                33.26             237.37                0.452   2.821
NsDiff        68.20             57.75                 32.13             208.07                0.392   1.470

The results are tested on ETTh1, with 100 diffusion steps.

Q6: Some metrics are missing, e.g., MAE.

A6: NsDiff still achieves SOTA on these metrics; see Reviewer ZzLi, Q3.

Reviewer Comment

Thanks for the detailed response. My recommendation score has been updated.

Review (Rating: 4)

In this paper, the authors considered modeling uncertainty quantification when applying diffusion models to time-series forecasting tasks. The authors first demonstrated with a toy case study that the DDPM may not perform well on uncertainty prediction due to the traditional Additive Noise Model (ANM) scheme. They then designed a location-scale noise model to alleviate this issue and proposed the Non-stationary Diffusion Model (NsDiff) framework, in which the forward noise process is redesigned and the backward generation process is rigorously reformulated. Finally, the authors conducted various experiments to demonstrate the efficacy of the proposed approach.

Questions for Authors

See the abovementioned chat window.

Claims and Evidence

The claims made in this submission are supported by rigorous and convincing evidence. However, the reviewer has the following two issues:

  1. On page 3, Eq. 1. Given that diffusion models' inference is solving the SDE/ODE, should this part be given as the integral form?
  2. In Figure 2: to the reviewer's understanding, the proposed paper mainly focuses on the uncertainty prediction of time series. However, the picture of the inference stage does not include the uncertainty interval.

Methods and Evaluation Criteria

Yes, the methods and evaluation criteria make sense for demonstrating the derivation of convergence analysis. Nevertheless, the reviewer still has the following concerns:

  1. To the best of the reviewer's knowledge, the authors attempt to transform some predicted value (distribution 1) to another similar distribution, which includes the uncertainty information. Based on this, to the reviewer's understanding, this problem can be treated as a kind of Schrodinger bridge problem. In addition, related works have applied the Schrodinger bridge [1] in the imputation procedure, which is similar to the CSDI model.
  2. Regarding the evaluation criteria, to the best of the reviewer's knowledge, time-series forecasting mainly focuses on prediction accuracy. The authors have not included related evaluation metrics like mean square error or mean absolute error [2]. It would be better to clarify this issue.

References:
[1]. Provably Convergent Schrodinger Bridge with Applications to Probabilistic Time Series Imputation, ICML-2023
[2]. Transformers in Time Series: A Survey

Theoretical Claims

The reviewer attempts to check the derivation of the theorem and it seems that nearly all the derivations look good to the best of the reviewer's knowledge.

Experimental Design and Analysis

  1. In the supplementary material, Figure 5, the TMDM model seems to perform well on the ETT datasets compared to the proposed NsDiff; what causes this result?
  2. The computational time has not been reported.
  3. Is it possible to apply diffusion model solvers like DPM-Solver during the model inference stage?
  4. As mentioned above, the baseline comparison lacks related baseline models like Schrodinger bridge imputation models.

References:
[1]. DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps, NeurIPS 2022

Supplementary Material

The reviewer reviewed the supplementary material. It would be better to add an illustration of the derivation of Eq. 21.

Relation to Prior Literature

The diffusion model is of great importance, and applying diffusion models to capture uncertainty information in the context of time-series forecasting is of great necessity. Thus, the key contributions of the paper, in relation to the broader scientific literature, are of great importance.

Missing Essential References

As mentioned above, related works from the following two aspects have not been discussed:

  1. Bridge-based models: The diffusion models designed via diffusion bridges [1,2] have not been well discussed. The DDPM can be treated as a special kind of Ornstein–Uhlenbeck bridge [3].
  2. Noise editing: It seems that the noise is modified to delineate the predicted time-series result. Thus, related works on noise selection [4] could be considered to some extent.

References:
[1]. Provably Convergent Schrodinger Bridge with Applications to Probabilistic Time Series Imputation, ICML-2023
[2]. Flow Matching for Generative Modeling, ICLR 2023
[3]. Image Restoration Through Generalized Ornstein-Uhlenbeck Bridge, ICML 2024
[4]. Xiefan Guo et al., InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization, CVPR 2024

Other Strengths and Weaknesses

Strengths

  1. The topic is related to the ICML conferences.
  2. The proposed approach is interesting.
  3. The derivation is rigorous.

Weaknesses

  1. Major weaknesses have been listed in the abovementioned items.
  2. To the reviewer's knowledge, the initial value is of great importance when solving an ODE. What would happen if the pre-trained model does not predict well (in the context of ODEs, we call this stiffness [1])? It would be better to report results under various random seeds to demonstrate the robustness of the proposed approach.
  3. The convergence of the proposed approach has not been discussed.

References:
[1]. Numerical Methods for Ordinary Differential Equations

Other Comments or Suggestions

See the abovementioned chat window.

Author Response

Q1: Should Eq. 1 be given in ODE/SDE integral form?

A1: Thanks for the insight. However, we believe the reviewer may be referring to Eq. 7 rather than Eq. 1: Eq. 1 describes the LSNM and is not a stochastic process, so it cannot be written as an SDE; Eq. 7 defines the data-perturbation process and can be written as an SDE. We give a theoretical discussion as follows. First, the Euler-Maruyama discrete form of NsDiff is

$$\mathbf{Y}(t+\triangle t) = \mathbf{Y}(t) - \frac{1}{2}\beta(t)\big(\mathbf{Y}(t) - f_\phi(\mathbf{X})\big)\triangle t + \sqrt{\sigma_{Y_0} - \beta(t)\big(\sigma_{Y_0} - g_\psi(\mathbf{X})\big)\triangle t}\,\sqrt{\beta(t)\triangle t}\,\mathbf{z}(t)$$

The diffusion coefficient depends on $\triangle t$, which is non-Itô-integrable, making it intractable to define a clean continuous-time reverse SDE for the process. To give more theoretical insight, we resort to the perfect-estimator assumption (i.e., $\sigma_{Y_0}=g_\psi(\mathbf{X})$) and give the forward/reverse SDEs as follows:

$$d\mathbf{Y} = -\frac{1}{2}\beta(t)\big(\mathbf{Y} - f_\phi(\mathbf{X})\big)\,dt + \sqrt{g_\psi(\mathbf{X})\beta(t)}\,d\mathbf{w}$$

$$d\mathbf{Y} = \Big[-\frac{1}{2}\beta(t)\big(\mathbf{Y} - f_\phi(\mathbf{X})\big) - g_\psi(\mathbf{X})\beta(t)\nabla_{\mathbf{Y}}\log p_t(\mathbf{Y})\Big]dt + \sqrt{g_\psi(\mathbf{X})\beta(t)}\,d\bar{\mathbf{w}}$$

We will include more detail in the latest version.
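For illustration, the Euler-Maruyama recursion above can be simulated directly. The sketch below uses our own variable names (not the paper's code) and assumes a uniform step $\triangle t$:

```python
import numpy as np

def forward_em(y0, f_x, g_x, sigma_y0, betas, dt, seed=0):
    """Simulate the forward Euler-Maruyama recursion written above.

    y0, f_x, g_x, sigma_y0: arrays of the same shape (one entry per
    target dimension); betas[k] = beta(t_k). Illustrative only.
    """
    rng = np.random.default_rng(seed)
    y = y0.copy()
    for beta in betas:
        z = rng.standard_normal(y.shape)
        drift = -0.5 * beta * (y - f_x) * dt
        # Per-step noise std: exactly the two square-root factors above.
        diff = np.sqrt(sigma_y0 - beta * (sigma_y0 - g_x) * dt) * np.sqrt(beta * dt)
        y = y + drift + diff * z
    return y
```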

Q2: Improve Figure 2 and the description of Eq. 21.

A2: Thanks for this suggestion, we will revise the paper according to your advice.

Q3: SB problem and include SB-based baselines.

A3: Indeed, the problem can be viewed as an SB problem. Taking CSBI as an example, we can introduce an uncertainty-aware prior for it by replacing its $p_\text{prior}$ with the prior of NsDiff, $\mathcal{N}(f_\phi(X),g_\psi(X))$, to combine the advantages of NsDiff. Compared to CSBI, NsDiff can estimate the true variance $\sigma_{Y_0}$ via $\sigma_\theta$ (see App. A.4) to learn uncertainty more accurately (as reflected in the results of Sec. 5.4). We include the SB-based baseline CSBI in Tables 3 and 4; see Reviewer ZzLi, Q3.
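Schematically, this swap changes only where the reverse process starts. The helper below is hypothetical (not CSBI's actual API), assuming f_mean and g_var are the pretrained estimators:

```python
import torch

def sample_prior(x, f_mean, g_var, n_samples):
    """Draw reverse-process starting points.

    A CSBI-style model starts from a fixed p_prior (e.g., a standard
    Gaussian); the uncertainty-aware variant sketched above would
    instead start from N(f(x), g(x)). Hypothetical helper, not CSBI's API.
    """
    mu, var = f_mean(x), g_var(x)
    return mu + var.sqrt() * torch.randn(n_samples, *mu.shape)
```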

Q4: MSE/MAE, computational time.

A4: NsDiff still achieves SOTA on MSE/MAE. Furthermore, compared to the previous SOTA, NsDiff has smaller memory costs and higher efficiency, because it does not introduce additional hidden variables and only adds a small number of basic operations.

Q5: Figure 5, the TMDM model seems to perform better, why?

A5: In Figure 5, TMDM actually performs worse than NsDiff. Of course, the difference is less significant than on Traffic, as the ETT datasets have relatively small uncertainty variation, as shown in Table 2 (e.g., 1.2 for ETTm1, 1.3 for ETTm2). Specifically, Figure 5 shows that TMDM produces a less accurate mean prediction, leading to larger MAE and MSE than NsDiff; also, TMDM's predictions do not sufficiently cover the true values, resulting in poorer CRPS and QICE scores.

Q6: Is it possible to apply diffusion model solvers during the inference stage?

A6: Yes, following A1, NsDiff has the following ODE form:

$$d\mathbf{Y} = \Big[-\frac{1}{2}\beta(t)\big(\mathbf{Y} - f_\phi(\mathbf{X})\big) - \frac{1}{2}g_\psi(\mathbf{X})\beta(t)\nabla_{\mathbf{Y}}\log p_t(\mathbf{Y})\Big]dt$$

where the ODE has a semi-linear structure and thus remains compatible with DPM-Solver. However, since time series tasks typically require only a few steps (fewer than 100) for effective performance, the inference efficiency is already sufficient. Therefore, we do not recommend applying DPM-Solver at the cost of prediction accuracy, given the additional assumption made in A1.
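As a reference point, the ODE above can be integrated with a naive first-order Euler scheme; a DPM-Solver-style method would instead exploit the semi-linear drift to take larger steps. The sketch below is illustrative only, with score_fn standing in for the learned score:

```python
import numpy as np

def probability_flow_euler(yT, f_x, g_x, betas, dt, score_fn):
    """Naive Euler integration of the ODE above, from t = T back to t = 0.

    score_fn(y, k) stands in for the learned score at step k.
    Illustrative only; not the paper's sampler.
    """
    y = yT.copy()
    for k in range(len(betas) - 1, -1, -1):
        beta = betas[k]
        vel = -0.5 * beta * (y - f_x) - 0.5 * g_x * beta * score_fn(y, k)
        y = y - vel * dt  # step backward in time
    return y
```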

Q7: Bridge-based Models [1-3] and noise editing [4] should be discussed.

A7: Thanks for these references. We have discussed SB-based methods in Q3. The noise-editing method is interesting and relevant: while NsDiff gives a distribution-wise improvement on the initial noise, the given paper considers a sample-wise perspective, optimizing the initial noise toward a more reasonable space. We will provide more discussion in the latest version.

Q8: What would happen if the pre-trained model does not predict well (stiffness). Include various random seeds to demonstrate the robustness.

A8: NsDiff incorporates the true variance $\sigma_{Y_0}$ into the learning process to alleviate this stiffness problem. Although the pre-trained model $g_\psi(X)$ may not predict well, NsDiff uses Eq. 18 to estimate the true variance. As shown in the ablation experiments in Table 5 (we report the mean and std over various seeds), incorporating this variance estimation improves both the performance and the robustness of our method.

Q9: The convergence of NsDiff has not been discussed.

A9: We acknowledge that convergence analysis is a critical problem, and it is missing in basically all previous works, e.g., TimeGrad, TMDM, etc. We will explore this further in future work.

Reviewer Comment

The reviewer appreciates the authors' detailed and thoughtful rebuttal. However, the reviewer would like to suggest a few additional revisions to further strengthen the rigor and clarity of the work:

  1. It would enhance the rigor of the manuscript to include a demonstration of convergence results across epochs.
  2. Based on the descriptions of Algorithm 1 and Algorithm 2, it seems beneficial to summarize them into a combined Algorithm 3. This would help rectify and streamline the workflow of the proposed approach.
  3. Since Table 5 presents experiments conducted with various seeds, it would be more robust to include a paired-sample $t$-test to statistically validate the results.
Author Comment

Q10: A demonstration of convergence results across epochs.

A10: Following your advice, we provide a visual illustration of the training loss/test results across epochs for $f_\phi(x)$, $g_\psi(x)$, and NsDiff at https://1drv.ms/i/c/f104f0574e8cb377/EQz0OZHb9ahKtv2ZU9tA5HIBVklJYSOJVnRlFwQjq42HLw?e=MUC2Rn.

As can be seen, NsDiff helps improve uncertainty prediction by combining the two mean/variance estimators. We will include the figure in the updated version to demonstrate the convergence.

Q11: Summarize Algorithms 1 and 2 into a combined Algorithm 3.

A11: Thanks for this advice; we will provide a comprehensive Algorithm 3 in the updated version.

Q12: Paired-sample t-tests for the ablation experiments in Table 5.

A12: We thank the reviewer for this valuable suggestion. The results are evident and well supported by the visualized results (Figures 3 and 4), which is why we did not initially consider statistical testing. To ensure reliability across experiments, we used identical seeds (1, 2, 3) throughout the paper. To address your concern, we performed the statistical analysis with additional seeds [1, 2, 3, 4, 5, 6] for the paired-sample t-tests. The results of these tests are summarized in the table below:

Comparison                   t-statistic   p-value
CRPS: NsDiff vs w/o LSNM     -3.4549       0.0181
CRPS: NsDiff vs w/o UANS     -3.9949       0.0104
QICE: NsDiff vs w/o LSNM     -3.0978       0.0269
QICE: NsDiff vs w/o UANS     -4.2117       0.0084

Both CRPS and QICE comparisons for NsDiff against the ablation variants (w/o LSNM and w/o UANS) yielded statistically significant results, indicating a consistent performance advantage of the full NsDiff model. We will include these updated results in the revised version of the paper.
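For reproducibility, the paired-sample t-test reduces to a one-liner with SciPy; the per-seed values below are hypothetical placeholders, not the actual numbers behind the table above:

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed CRPS values for seeds [1..6].
crps_full = np.array([0.390, 0.393, 0.391, 0.394, 0.392, 0.390])
crps_wo_lsnm = np.array([0.401, 0.405, 0.399, 0.404, 0.402, 0.400])

t_stat, p_val = stats.ttest_rel(crps_full, crps_wo_lsnm)
print(f"t = {t_stat:.4f}, p = {p_val:.4f}")  # negative t: full model scores lower (better)
```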

Review (Rating: 4)

The paper introduces a novel diffusion-based probabilistic forecasting framework, called NsDiff, which is designed to address the non-stationary nature of uncertainty in time series data. Traditional Denoising Diffusion Probabilistic Models (DDPMs) typically rely on an Additive Noise Model (ANM) with fixed variance, limiting their ability to capture the dynamic uncertainty observed in many real-world applications. To overcome this limitation, the authors propose incorporating a Location-Scale Noise Model (LSNM) that allows the noise variance to vary with the input context. The authors also validate their results on a wide range of datasets, with comparisons to some existing methods.

Questions for Authors

I don't have any questions for authors.

Claims and Evidence

I think the claims made in the submission are clear and well-supported by the evidence detailed in the main text and the supplementary material. It is nice that the authors open-sourced their code.

Methods and Evaluation Criteria

I think the proposed methods make sense for the applications of probabilistic forecasting of non-stationary systems.

Theoretical Claims

I looked at the proof in Appendix A, which is straightforward and easy to follow.

Experimental Design and Analysis

I think the experiments done by the authors are impressive and comprehensive as they test the proposed method on many real-world datasets of different nature and dimensionalities.

Supplementary Material

I reviewed the proof and the experimental details in the supplementary material.

Relation to Prior Literature

I think the proposed method has a potential of being applied to many scientific applications, including weather forecasting, forecasting of physical systems (e.g. fluid dynamics), and epidemics.

Missing Essential References

I believe the paper would benefit from incorporating several additional essential references that provide valuable context and complementary perspectives. For example:

[1] Chen, Y., Goldstein, M., Hua, M., Albergo, M. S., Boffi, N. M., & Vanden-Eijnden, E. (2024). Probabilistic Forecasting with Stochastic Interpolants and Föllmer Processes. arXiv preprint arXiv:2403.13724. This work is relevant because it also leverages diffusion models for probabilistic predictions, proposes innovative loss functions for optimizing noise schedules, and offers flexible alternatives for the base measure within diffusion models.

[2] Jiang, R., Lu, P. Y., Orlova, E., & Willett, R. (2023). Training Neural Operators to Preserve Invariant Measures of Chaotic Attractors. Advances in Neural Information Processing Systems, 36, 27645–27669. This reference is pertinent as it addresses similar applications where noise induces non-stationarity in observations and presents methods to enhance long-term prediction accuracy.

Incorporating these references would strengthen the discussion by situating the proposed work within the broader context of recent advances in diffusion models and long-term forecasting under non-stationary conditions.

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response

Q1: Additional key references [1-2].

A1: Thanks for your recognition of our work. We agree that the references [1–2] provide meaningful context and will help strengthen our discussion. In particular, we find several aspects of these works especially relevant to our setting:

The work by Chen et al. [1] proposes a novel probabilistic forecasting framework based on Föllmer processes and stochastic interpolants. Their approach to learning noise schedules through tailored loss functions resonates with our motivation to adapt diffusion endpoints for better uncertainty modeling, especially under non-stationary conditions.

Jiang et al. [2] address the challenge of long-term forecasting in chaotic systems by preserving invariant measures. Their use of contrastive learning to stabilize dynamics over time without requiring domain-specific priors is an inspiring direction that aligns with our interest in modeling non-stationary behavior robustly.

We will cite and briefly discuss these works in the revised version to better contextualize our contributions within the broader landscape of diffusion-based and non-stationary forecasting techniques.

[1] Chen, Y., Goldstein, M., Hua, M., Albergo, M. S., Boffi, N. M., & Vanden-Eijnden, E. (2024). Probabilistic Forecasting with Stochastic Interpolants and Föllmer Processes. arXiv preprint arXiv:2403.13724.

[2] Jiang, R., Lu, P. Y., Orlova, E., & Willett, R. (2023). Training Neural Operators to Preserve Invariant Measures of Chaotic Attractors. Advances in Neural Information Processing Systems, 36, 27645–27669.

Final Decision

The work under review proposes to explicitly model the time-dependent uncertainty inherent to time series forecasting in diffusion models. A location-scale noise model is used to flexibilize the usual simple additive noise model.

All reviewers agree about the soundness and value of this contribution. The method is theoretically sound, widely applicable, and extensively tested empirically. Given the relevance and prominence of diffusion-based methods, this work has potential to impact many others.

Please do include the point-based results (MSE, MAE) requested by multiple reviewers in the final version.