Collaborative Deterministic–Probabilistic Forecasting for Real-World Spatiotemporal Systems
We propose a collaborative approach that combines a deterministic model and a diffusion model, leveraging their complementary strengths for probabilistic forecasting.
Abstract
Reviews and Discussion
This paper introduces CoST, a collaborative deterministic–probabilistic forecasting framework designed for real-world spatiotemporal systems. Recognizing the complexity of fully probabilistic modeling using diffusion models alone, CoST employs a mean-residual decomposition approach, where a deterministic model predicts the conditional mean, and a diffusion model focuses on residual uncertainties. A scale-aware diffusion mechanism addresses spatial heterogeneity, enhancing the model's flexibility and accuracy. Experimental evaluations across various domains—including climate, energy, communication, and urban systems—demonstrate that CoST significantly outperforms existing state-of-the-art methods, achieving an average improvement of 25% on key metrics, while notably reducing computational costs.
Strengths and Weaknesses
The innovations highlighted by the authors appear to overlap significantly with previously published works, such as TMDM. The authors should explicitly clarify how CoST differs from TMDM.
The model configuration proposed by the authors closely resembles that of TMDM; the distinctions need to be clearly articulated.
Numerous existing studies have already investigated signal decomposition into two or more components for forecasting. The authors must specify how CoST differentiates itself from these studies [2,3].
The experiments need to include comprehensive comparisons with various existing methods, as the current evaluation is insufficient.
[1] Li, Y., Chen, W., Hu, X., Chen, B., & Zhou, M. (2024). Transformer-modulated diffusion models for probabilistic multivariate time series forecasting. In The Twelfth International Conference on Learning Representations.
[2] Wang, S., Wu, H., Shi, X., Hu, T., Luo, H., Ma, L., ... & Zhou, J. (2024). Timemixer: Decomposable multiscale mixing for time series forecasting. arXiv preprint arXiv:2405.14616.
[3] Yu, G., Zou, J., Hu, X., Aviles-Rivero, A. I., Qin, J., & Wang, S. (2024). Revitalizing multivariate time series forecasting: Learnable decomposition with inter-series dependencies and intra-series variations modeling. arXiv preprint arXiv:2402.12694.
Questions
See summary or Weaknesses.
Limitations
See summary or Weaknesses.
Justification of Final Rating
Thank you very much for the authors' reply. I believe that, in terms of innovation, this article still has some limitations. Taking into account the opinions of the other reviewers, I have decided to maintain my score.
Formatting Concerns
NO
We sincerely thank Reviewer b7LL for the valuable comments. Below, we provide detailed responses to your questions.
Q1. Clarification on the differences with TMDM
Thanks for your feedback. While both CoST and TMDM aim to combine deterministic and diffusion models for improved probabilistic forecasting, we would like to emphasize that the two differ significantly. (We have briefly discussed the differences between CoST and TMDM in Lines 107–108. To better clarify their distinctions, we have now expanded Section 2.3 in the revised manuscript to provide a more detailed comparison.)
1. Decomposition vs. Modulation
(i) CoST adopts a mean–residual decomposition framework: the deterministic model predicts the conditional mean, while the diffusion model learns only the residual distribution. This simplifies the diffusion objective, is supported by the law of total variance [1] (a one-line derivation is given below), and avoids modeling full temporal dependencies from scratch.
(ii) TMDM, in contrast, uses a modulation approach, where the deterministic output is injected as a prior to guide the diffusion model in generating the entire target sequence. Thus, the diffusion component in TMDM still learns the full complex distribution.
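For completeness, the variance-reduction argument is the one-line law-of-total-variance identity below (standard probability, not a new result; the notation Y = μ(X) + R with μ(X) = E[Y | X] matches the decomposition above):

```latex
\operatorname{Var}(Y)
  = \underbrace{\mathbb{E}\big[\operatorname{Var}(Y \mid X)\big]}_{\text{carried entirely by the residual } R}
  + \underbrace{\operatorname{Var}\big(\mathbb{E}[Y \mid X]\big)}_{\text{explained by the deterministic mean } \mu(X)}
```

Since Var(R | X) = Var(Y | X), the diffusion model only has to cover the first term, which is strictly smaller than Var(Y) whenever μ(X) varies with X.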
2. Spatiotemporal-Specific Design
CoST targets spatiotemporal forecasting, and introduces a scale-aware diffusion mechanism that computes location-specific fluctuation scales (via FFT) to address spatial heterogeneity. TMDM, as a time-series model, does not include spatial modeling or similar mechanisms.
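To make this concrete, here is a minimal sketch of how a per-location fluctuation scale could be extracted with an FFT; the function name, the cutoff `k_low`, and the use of the high-frequency standard deviation are illustrative assumptions, not CoST's exact implementation:

```python
import torch

def fluctuation_scale(history: torch.Tensor, k_low: int = 3) -> torch.Tensor:
    """Per-unit fluctuation scale from the high-frequency part of the history.

    history: (T, N) tensor of past observations for N spatial units.
    Returns an (N,) tensor with one scale per unit.
    """
    spec = torch.fft.rfft(history, dim=0)     # per-unit frequency content
    spec[:k_low] = 0                          # suppress trend / low frequencies
    fluct = torch.fft.irfft(spec, n=history.shape[0], dim=0)
    return fluct.std(dim=0)                   # location-specific scale
```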
3. Clarification on Model Configuration
(i) In the deterministic component, both CoST and TMDM adopt well-established architectures from their respective domains. We test STID, ConvLSTM, STNorm, and iTransformer in our framework, while TMDM uses Transformer, Informer, Autoformer, and NSformer.
(ii) For the diffusion model, both models build upon the conditional injection mechanism from CARD [2]. However, their purposes differ: TMDM uses it to incorporate the deterministic model's output into the diffusion process, whereas CoST leverages it to better integrate the customized spatial fluctuation scale Q into the residual learning process, thereby enhancing spatial modeling. Therefore, any architectural similarity reflects a shared technique, not shared objectives.
[1] Introduction to Probability. Athena Scientific, 2008
[2] CARD: Classification and Regression Diffusion Models. NeurIPS, 2022
Q2. Distinction from Decomposition-Based Methods (Leddam & TimeMixer)
We thank the reviewer for their insightful comment and for prompting us to clarify the novelty of our work in relation to prior decomposition-based models such as Leddam and TimeMixer. While we agree that signal decomposition is a well-established concept in time series analysis, we respectfully argue that CoST introduces a fundamentally different paradigm, particularly in its objective, decomposition formulation, and architectural design.
1. Forecasting Objective: Probabilistic vs. Deterministic Forecasting
(i) CoST: CoST is designed for probabilistic forecasting, aiming to model the full predictive distribution p(Y | X). This enables robust uncertainty quantification, which is critical in high-stakes, real-world decision-making. Accordingly, we evaluate CoST using distributional metrics such as CRPS, QICE, and IS.
(ii) Leddam and TimeMixer: They are focused exclusively on deterministic forecasting. They are optimized to minimize point-wise errors (e.g., MSE, MAE), and do not generate uncertainty estimates, nor are they evaluated under probabilistic criteria.
2. Decomposition Strategy: Mean-Residual vs. Trend-Seasonal
(i) CoST: We propose a mean–residual decomposition, Y = μ(X) + R. This is a functional decomposition. We task a powerful deterministic model with capturing the conditional mean μ(X) = E[Y | X], which represents the predictable, regular patterns. A lightweight diffusion model is then tasked only with learning the distribution of the residual R = Y − μ(X), which represents the stochastic, unpredictable variations. This division of labor is the cornerstone of our framework (a minimal training sketch follows this comparison).
(ii) Leddam & TimeMixer: Both models use a traditional trend-seasonal decomposition. This is a signal processing technique aimed at disentangling different components of the historical signal to create better features for a single, unified forecasting model.
3. Model Architecture: Collaborative vs. Unified
(i) CoST employs a collaborative architecture, wherein:
- A deterministic model learns the conditional mean, and
- A diffusion model learns the distribution of the residual.
This division of labor not only improves modeling efficiency, but also allows the diffusion model to focus solely on capturing uncertainty, leading to more calibrated forecasts.
(ii) Leddam and TimeMixer: In contrast, Leddam and TimeMixer use a single deterministic model that internally integrates decomposed components (e.g., trend and seasonal signals) as features. These components are not separately modeled or interpreted, and the model ultimately outputs only a point estimate.
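To ground this division of labor, below is a minimal, hedged sketch of one training step under a mean–residual decomposition; `mean_model`, `denoiser`, and the linear DDPM noise schedule are illustrative assumptions rather than CoST's actual code:

```python
import torch
import torch.nn.functional as F

def residual_diffusion_step(x_hist, y_true, mean_model, denoiser, T=1000):
    """One illustrative epsilon-prediction training step on residuals."""
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    with torch.no_grad():                     # deterministic mean is frozen
        mu = mean_model(x_hist)               # conditional-mean prediction
    residual = y_true - mu                    # diffusion target: residual only

    t = torch.randint(0, T, (residual.shape[0],))
    noise = torch.randn_like(residual)
    a = alpha_bar[t].view(-1, *([1] * (residual.dim() - 1)))
    noisy = a.sqrt() * residual + (1.0 - a).sqrt() * noise

    eps_hat = denoiser(noisy, t, x_hist)      # conditioned on the history
    return F.mse_loss(eps_hat, noise)
```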
Q3. More Baselines.
Thank you for the suggestion. In addition to the original six probabilistic forecasting baselines (D3VAE, DiffSTG, TimeGrad, CSDI, DYffusion, NPDiff), we have further included four more probabilistic models (GP, DeepAR, DeepState, and TMDM) as additional baselines in our revised experiments.
| | Climate | | | MobileSH | | | ETTh1 | | | ETTh2 | | |
| 64→64 | CRPS | QICE | IS | CRPS | QICE | IS | CRPS | QICE | IS | CRPS | QICE | IS |
| GP | 0.086 | 0.146 | 9.18 | 0.537 | 0.112 | 7.13 | 0.7056 | 0.114 | 6.1 | 0.379 | 0.093 | 4.1 |
| DeepAR | 0.035 | 0.036 | 7.21 | 0.401 | 0.049 | 0.764 | 0.505 | 0.015 | 23.6 | 0.7421 | 0.072 | 203.0 |
| DeepState | 0.031 | 0.018 | 6.09 | 0.707 | 0.066 | 0.924 | 0.577 | 0.033 | 27.7 | 0.610 | 0.121 | 129.4 |
| TMDM | 0.0926 | 0.115 | 7.36 | 0.799 | 0.127 | 16.1 | 0.395 | 0.041 | 4.8 | 0.196 | 0.018 | 2.2 |
| CoST | 0.024 | 0.011 | 4.87 | 0.158 | 0.016 | 0.218 | 0.311 | 0.007 | 1.6 | 0.109 | 0.007 | 0.78 |
To ensure a fairer comparison, we evaluate these models on long-term forecasting tasks, as TMDM’s original experimental setting focuses on longer prediction horizons. Specifically, we also use the ETTh1 and ETTh2 datasets provided by the TMDM paper, and CoST achieves the best performance on both. We conjecture that TMDM performs suboptimally on spatiotemporal data for two reasons: (1) It lacks explicit spatial modeling mechanisms, and (2) Spatiotemporal data often have high spatial dimensionality, which can be challenging for models like TMDM to learn effectively, especially when the dataset size is limited.
Thank you very much for the authors' reply. I believe that, in terms of innovation, this article still has some limitations. Taking into account the opinions of the other reviewers, I have decided to maintain my score.
Dear Reviewer b7LL,
We sincerely thank the reviewer for their valuable time and feedback. Your comments helped us better understand your concerns regarding novelty, and we greatly appreciate the opportunity to clarify the unique contributions of our work. Our primary contribution lies in proposing CoST, a principled and generalizable collaborative framework for probabilistic spatiotemporal forecasting, which explicitly bridges deterministic prediction and probabilistic uncertainty modeling. We acknowledge that our original manuscript may not have sufficiently emphasized these contributions. Below, we summarize how CoST differs from prior works in several key aspects:
1. Compared to TMDM: CoST introduces a mean–residual decomposition, where the diffusion model learns only the residual distribution (a lower-variance and more tractable target) instead of modeling the full data distribution as in TMDM. This not only simplifies optimization but also avoids the limitations of diffusion models in capturing temporal dynamics. Moreover, CoST generalizes beyond spatiotemporal data: it performs strongly even on the purely temporal datasets used in TMDM.
2. Compared to Leddam and TimeMixer: These are purely deterministic models with signal-level trend-seasonal decomposition. In contrast, CoST focuses on distributional modeling with a residual-based diffusion process, enabling probabilistic forecasting.
3. Compared to DiffCast and CasCast: While these works also combine deterministic and diffusion models, they are task-specific (precipitation nowcasting) and do not rigorously decompose the predictive distribution, thereby increasing the complexity of the diffusion learning process. CoST is a general-purpose framework with a rigorous decomposition, and uniquely introduces unit-specific fluctuation priors to model spatial heterogeneity, a feature absent in prior methods.
We acknowledge that our initial draft may not have clearly conveyed these distinctions. In response, we have substantially revised the manuscript, especially in the Introduction and Related Work sections, to better clarify our contributions and differentiate CoST from prior works. We sincerely appreciate the reviewer’s time and thoughtful comments, which have helped us improve the clarity of our paper.
If you have any further questions or concerns, we would sincerely welcome the opportunity to engage in further discussion to help clarify our work.
This paper presents CoST, a novel collaborative framework for spatiotemporal forecasting that unifies deterministic and probabilistic modeling in a synergistic way. CoST decomposes the spatiotemporal forecasting task into a deterministic prediction (capturing the main trend) and a probabilistic residual prediction (modeling uncertainty and stochastic variation). Extensive experiments across six spatiotemporal benchmarks demonstrate that CoST significantly outperforms SOTA baselines in terms of accuracy, uncertainty quantification, and robustness.
Strengths and Weaknesses
Strengths:
S1. This paper has a clear motivation and well-defined problem setting. It addresses a fundamental trade-off in spatiotemporal forecasting: accuracy vs. uncertainty, and proposes a hybrid framework that leverages the strengths of both deterministic and probabilistic modeling. The ability to provide accurate predictions along with calibrated uncertainty estimates is essential for real-world applications in spatiotemporal systems.
S2. Theoretical analysis of the mean-residual decomposition offers support for the collaborative design of CoST. This theoretical grounding enhances the credibility of the proposed architecture.
S3. The empirical evaluation is extensive and convincing. Experiments span diverse domains, including urban systems, communication networks, climate systems, and renewable energy, and demonstrate consistent improvements over both classical and SOTA baselines.
S4. The paper is well-written and logically structured.
Weakness:
The joint training of the deterministic model and the diffusion model introduces added complexity and may increase optimization time.
Questions
Q1. I am particularly interested in how CoST performs under extreme cases. Could the authors elaborate on the mechanisms that allow it to handle such cases effectively?
Q2. Does the probabilistic component always refine the deterministic output? Are there failure cases where they conflict?
Q3. How can the learned uncertainty estimates be effectively used in downstream decision-making tasks, such as traffic management, public safety alerts, or ride-sharing dispatch?
Limitations
The authors have discussed the limitations of this approach. Besides, inference cost of diffusion models may pose challenges in real-time applications without acceleration strategies.
Justification of Final Rating
After reading the rebuttal and other reviews, I still keep my original positive score.
Formatting Concerns
NA.
We sincerely thank Reviewer pqiK for the valuable comments. Below, we provide detailed responses to your questions.
Q1. Concern about computational cost.
Our method significantly reduces modeling time via the mean–residual decomposition, which simplifies the learning task by focusing the diffusion model on residuals. This enables the use of a lightweight denoising network, thereby reducing training complexity and time. We have conducted additional efficiency evaluations, with key results summarized below. CoST achieves strong computational efficiency: 2.5 minutes for training and 43 seconds for inference on the CrowdBJ dataset, enabling real-world deployment. More results and analysis are in Section 5.4 and Appendix Table 10.
| CrowdBJ | Training | Inference |
| D3VAE | 4min | 2min37s |
| DiffSTG | 32min | 25min |
| TimeGrad | 5min | 57s |
| CSDI | 3h | 1h20min |
| Dyffusion | 28h | 5h |
| CoST | 2min30s | 43s |
Q2. How CoST performs under extreme cases.
Thank you for the question. CoST handles extreme cases effectively through its mean–residual decomposition: the deterministic model captures regular patterns, while the diffusion model focuses on residual uncertainty, which typically increases during extreme events. This separation enables CoST to better model rare, high-variance deviations. Additionally, the scale-aware diffusion enhances sensitivity to localized fluctuations. As shown in Section 5.7 (Figures 6c, 6f), CoST outperforms baselines in capturing sudden spikes or sharp fluctuations. Since these cases are hard to quantify, we have supplemented the analysis with additional visualizations and qualitative assessment in the appendix.
Q3. Does the probabilistic component always refine the deterministic output?
Thank you for the question. Our diffusion component is not introduced to refine the deterministic output, but to provide probabilistic forecasts by modeling the residual uncertainty. To assess whether any conflict exists between the two components, we compared the number of samples where CoST (with both components) outperforms the diffusion-only variant, using the Interval Score (IS, a metric that evaluates the quality of probabilistic forecasts at the sample level; a sample-based sketch of IS follows the table below). We find that the vast majority of samples benefit from the addition of the deterministic component, showing improved IS. While a small number of cases experience marginal degradation, the overall results confirm that the two components collaborate effectively rather than conflict.
| | Worsened samples | Improved samples | Improvement ratio |
| MobileNJ | 1002 | 644118 | 99.84% |
| MobileSH | 123842 | 1273916 | 91.14% |
| CrowdBJ | 0 | 1745280 | 100% |
| Climate | 0 | 3834000 | 100% |
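For reference, the interval score used above can be estimated from forecast samples as follows (a minimal sketch; the 90% central interval, i.e. alpha = 0.1, is an assumption):

```python
import numpy as np

def interval_score(samples, y, alpha=0.1):
    """Sample-based interval score for the central (1 - alpha) interval.

    samples: (S, ...) forecast draws; y: observations broadcastable to one
    draw. Lower is better.
    """
    lo = np.quantile(samples, alpha / 2, axis=0)
    hi = np.quantile(samples, 1 - alpha / 2, axis=0)
    return ((hi - lo)
            + (2.0 / alpha) * np.clip(lo - y, 0.0, None)   # penalty below
            + (2.0 / alpha) * np.clip(y - hi, 0.0, None))  # penalty above
```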
Q4. How can the learned uncertainty estimates be effectively used in downstream decision-making tasks?
Thank you for the question. CoST produces not only a point forecast but also a predictive distribution, which reflects how confident the model is and provides probabilistic bounds (e.g., upper/lower quantiles or intervals) for possible outcomes. This enables risk-aware decision-making:
1. Traffic management: If CoST predicts heavy congestion with high uncertainty in a certain area, traffic control systems can adopt conservative routing policies (e.g., increase buffer time, suggest detours, prepare contingency plans) to reduce risk. In contrast, low-uncertainty predictions can be used more aggressively.
2. Public safety alerts: For crowd events or storms, if the upper bound of predicted density or precipitation crosses a safety threshold—even if the mean doesn’t—early warnings can be issued preemptively, improving preparedness without overreacting.
3. Ride-sharing dispatch: Predictive intervals help platforms assess best- and worst-case demand, allowing for dynamic driver reallocation or pricing that balances service quality and operational cost under uncertainty.
In short, CoST doesn’t just tell you how confident it is; it also gives a range of possible outcomes, like the best and worst cases. This helps decision-makers plan ahead under uncertainty, which is especially important in real-world systems like traffic, public safety, or energy, where acting early can make a big difference.
Thanks for your rebuttal and I will maintain my positive score on this submission.
This paper introduces CoST, a spatiotemporal forecasting method that hybridizes a deterministic forecasting model and a diffusion model. CoST uses a deterministic model to capture the conditional mean of the spatiotemporal data, and applies the diffusion model to learn the residuals. Since the residual distribution is not i.i.d., the authors propose the "scale-aware diffusion process", which employs FFT to extract the fluctuation components for each spatial unit. The variance of the residual fluctuations is used to formulate the prior in the diffusion model. The framework works for both regular grid data and data on graphs due to the adoption of FFT. The authors conducted experiments on five real-world spatiotemporal forecasting datasets and reported performance boosts over pure diffusion models via CoST.
Strengths and Weaknesses
Strengths:
- The CoST framework works for both grid-structured and graph-structured spatiotemporal data and consistently boosts performance over pure diffusion models, as shown in Table 3.
- The authors demonstrate that the "scale-aware diffusion process" boosts model performance in Figure 8 in the Appendix.
- The author adopts probabilistic evaluation metrics like QICE and IS to evaluate the model performance. The analysis in Section 5.3 is intuitive and demonstrates the CoST captures the probabilistic distribution better.
Weakness:
- The idea of hybridizing deterministic forecasting model and diffusion model has been explored in DiffCast (https://openaccess.thecvf.com/content/CVPR2024/papers/Yu_DiffCast_A_Unified_Framework_via_Residual_Diffusion_for_Precipitation_Nowcasting_CVPR_2024_paper.pdf) and CasCast (https://arxiv.org/pdf/2402.04290). Hybridizing these two types of forecasting model is not a novel idea.
- Following the first weakness, the authors also did not compare with other methods for handling non-i.i.d. residual errors. For example, DiffCast builds an auto-regressive model that predicts the residual map at timestamp t from the predicted residual map at timestamp t−1.
Also, minor typo in line 199. reprent --> represent.
Questions
I may have misunderstood the results, but how did you remove the scale-aware diffusion process (w/o s) while keeping the customized fluctuation scale as a prior (w/o q)? I need more clarification about the experiment setting used in Figure 8. According to Section 4.3, the customized fluctuation scale is part of the scale-aware diffusion process. I'm not sure how to apply the scale-aware diffusion process without the customized fluctuation scale, or how to apply the customized fluctuation scale without the scale-aware diffusion process.
Limitations
Yes, the limitations are mentioned in Section 6.
Justification of Final Rating
Share the concern with Reviewer b7LL on novelty of the paper. Leaning towards rejection.
Formatting Concerns
No concern.
We sincerely thank Reviewer pFy7 for the valuable comments. Below, we provide detailed responses to your questions.
Q1. Key Differences from CasCast and DiffCast.
Response: We thank the reviewer for their insightful feedback and for referencing the important works of DiffCast and CasCast. We agree that the high-level concept of hybridizing deterministic and diffusion models has been explored. However, we respectfully argue that CoST is not an incremental work but introduces a distinct and novel approach in its core methodology, technical contributions, and demonstrated generality.
1. Fundamentally Different Methodological Formulation.
(i) CoST introduces a principled Mean-Residual Decomposition, which is theoretically motivated by variance reduction. This simplifies the learning task for the diffusion model by having it only capture the residual distribution, whose variance is smaller than that of the original data. (ii) DiffCast, in contrast, applies a residual diffusion on top of a deterministic model to refine the deterministic output, but does not explicitly formulate or isolate the conditional mean. This does not decouple uncertainty estimation cleanly and still burdens the diffusion model with learning complex spatiotemporal structures.
(iii) CasCast follows a cascaded pipeline, where a deterministic model first predicts coarse global structures and a diffusion model generates refined results in a latent space. However, the diffusion model operates on full future states rather than isolated residuals.
2. Generality and Scope of Application.
(i) We validate CoST as a general-purpose framework on 10 diverse, real-world datasets spanning four different domains (climate, energy, urban systems, and communication). This demonstrates a far broader applicability and generality compared to DiffCast and CasCast, which are specialized models designed and tested for the specific task of precipitation nowcasting on radar data.
(ii) CoST, is fundamentally focused on delivering well-calibrated probabilistic forecasts and uncertainty quantification for general spatiotemporal systems. In contrast, DiffCast and CasCast are primarily designed to enhance the visual fidelity of precipitation nowcasting. They use their diffusion components differently: DiffCast corrects a deterministic output via an additive residual, while CasCast refines it through a guided generation process in a latent space. CoST's goal is not just to produce a sharper image, but to learn the entire predictive distribution.
3. Handling Spatial Heterogeneity via Scale-Aware Diffusion.
CoST introduces a scale-aware diffusion mechanism, guided by customized spatial fluctuation statistics derived from Fourier analysis. This explicitly addresses spatial heterogeneity, a feature not present in either DiffCast or CasCast.
We're grateful to the reviewer for bringing these two very insightful works to our attention. We have now included a specific subsection in Section 2 Related Work, to elaborate on these methods that combine diffusion with deterministic models.
Q2. More Baselines
Response: We chose not to emphasize DiffCast as a strong baseline because our approach fundamentally differs. While DiffCast uses diffusion components as a corrector for its backbone network, our method leverages diffusion for probabilistic distribution modeling, which is central to our work. Furthermore, DiffCast's performance was poor on our benchmarks (we report a subset of results below), likely due to its domain-specific architecture (U-Net, PhyDNet) tailored for precipitation nowcasting, which is suboptimal for the general sequence dynamics emphasized in our datasets.
| | Climate | | | BikeDC | | |
| 12→12 | CRPS | QICE | IS | CRPS | QICE | IS |
| DiffCast | 0.174 | 0.176 | 63.7 | 0.883 | 0.179 | 8.86 |
| +STID (Deterministic Model) | 0.021 | 0.009 | 4.04 | 0.419 | 0.028 | 3.45 |
Q3. Minor typo in line 199. reprent --> represent.
Response: We've corrected it and thoroughly reviewed the manuscript for any other errors.
Q4. Clarification about the experiment setting used in Figure 8 (Ablation Study).
Response: To address your query, it's important to understand that we constructed these model variants by progressively removing key components, rather than simply removing a single module in isolation. Here's a breakdown of what "(w/o s)" and "(w/o q)" signify:
(w/o s): This variant means we do not use the scale-aware diffusion process to integrate the prior information from the customized fluctuation scale. Instead, the customized fluctuation scale is simply provided as a conditional input to the denoising network.
(w/o q): This variant builds upon the "(w/o s)" setting. In addition to not using the scale-aware diffusion process, we further remove the customized fluctuation scale as a conditional input to the denoising network. This means the model operates without any information about the customized fluctuation scale, neither through the scale-aware diffusion process nor as a direct conditional input.
We hope this explanation clarifies the experimental setup for our ablation study. We've revised the description in the ablation study section to make it clearer.
Thanks for the rebuttal.
I read all reviews and author's reply. Similar to Reviewer b7LL, I think the paper has limited novelty. The idea of decomposing the signals for forecasting has been explored by lots of prior works. Thus, I do not believe the paper meets the acceptance standards of NeurIPS.
This paper proposes a simple approach for multivariate probabilistic forecasting by decoupling the learning process into two distinct components. First, a deterministic model is trained to predict the mean of the forecast distribution. Subsequently, a diffusion model is conditioned on this predicted mean to capture the complex, time-varying uncertainties and dependencies inherent in the multivariate data. The experimental results demonstrate that this decomposition strategy leads to improved performance on univariate evaluation metrics.
Strengths and Weaknesses
Strengths
(S1) The proposed method is straightforward and well-explained, making it easy for other researchers to understand and replicate the experiments.
(S2) The paper is clearly written and logically structured, which effectively communicates the authors' ideas and findings.
Weaknesses
(W1) The primary goal of the paper is multivariate probabilistic forecasting, yet the evaluation relies solely on univariate scoring rules. This is a significant limitation, as it fails to assess the model's ability to capture the joint distribution and dependencies between the different time series. The inclusion of multivariate scoring rules, such as the Energy Score or Variogram Score, is essential for a complete and accurate evaluation.
(W2) The paper focuses heavily on diffusion models but overlooks a number of widely used and relevant baseline models for probabilistic multivariate forecasting. A more thorough comparison against established methods like DeepAR, DeepStateSpace, or models based on Normalizing Flows would provide a more convincing case for the proposed model's effectiveness and better contextualize its contributions.
(W3) The core weakness lies in the justification for using a diffusion model. While diffusion models excel at capturing highly complex, multimodal distributions, the paper does not provide strong evidence that short-term time series forecasting benefits from such a powerful and computationally intensive model. The motivation would be stronger if the authors could demonstrate, for instance, that the forecasting task involves complexities that simpler models like Gaussian processes or traditional state-space models cannot handle.
Questions
(Q1) The results suggest that conditioning on a predicted mean is beneficial. Could a simpler, non-parametric mean achieve similar results? For instance, have you considered using a simple historical average (e.g., the value from the previous season) or a global mean as the conditional input for the diffusion model? An ablation study with these simpler mean functions would help clarify whether the performance gain comes from the sophisticated deterministic model or simply from providing mean guidance.
(Q2) Could you please elaborate on how the "ideal" multimodal distribution in Figure 4 was derived? In a real-world forecasting task, the ground truth is a single future value, not a distribution. Understanding how this target distribution was generated is crucial for interpreting the model's qualitative performance.
(Q3) Could you specify how the CRPS was calculated? Sample-based or closed-form? For other metrics, are 50 samples sufficient in obtaining a reliable metric?
Limitations
yes.
Justification of Final Rating
I would like to thank the authors for their efforts in addressing my comments. However, I agree with other reviewers that the novelty of this solution is limited. In terms of interpretation, the solution is inconsistent as the deterministic part is trained on minimizing MSE that gives an isotropic Gaussian noise assumption on the error term. In this case, I would not refer to the deterministic term as the mean. I decide to keep my current evaluation.
Formatting Concerns
NA
We sincerely thank Reviewer kEBZ for the valuable comments. Below, we provide detailed responses to your questions.
Q1. Lack of multivariate scoring rules.
Response: Thank you for your question. We address your concern from two perspectives below:
- Following Established Precedent: Our evaluation protocol aligns with the common practice adopted in prior influential works [1–4]. These works also used only univariate evaluation metrics, without incorporating multivariate assessments.
[1] Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting. ICML 2021
[2] CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation. NeurIPS 2021
[3] Transformer-Modulated Diffusion Models for Probabilistic Multivariate Time Series Forecasting. ICLR 2024
[4] Dynamical Diffusion: Learning Temporal Dynamics with Diffusion Models. ICLR 2025
- Adopting Suggestion and Adding New Experiments: Nevertheless, we fully agree that multivariate scoring rules are crucial for a comprehensive evaluation. To address this, we have already conducted supplementary experiments using these metrics on both the baselines you mentioned and selected models from our paper. Due to time constraints, we report a subset of results below (a sketch of the energy-score estimator follows the table).
We will add more results and a discussion on multivariate evaluation to the revised manuscript's experiments section. We believe this addition enhances our work's rigor and helps establish a more complete benchmark for future research in this area.
| | Climate | | MobileSH | | TaxiBJ | | ETTh2 | |
| 12→12 | ES | VS | ES | VS | ES | VS | ES | VS |
| GP | 2.04 | 10.2 | 0.702 | 10.67 | 2.83 | 6.19 | 0.236 | 0.561 |
| DiffSTG | 1.62 | 9.4 | 1.25 | 2.50 | 1.28 | 1.35 | - | - |
| NPDiff | 1.59 | 10.8 | 1.77 | 4.27 | 1.15 | 1.24 | 0.478 | 6.32 |
| DeepAR | 1.82 | 11.8 | 1.67 | 3.83 | 0.941 | 1.07 | 0.805 | 8.70 |
| DeepState | 1.77 | 21.5 | 1.58 | 7.02 | 1.37 | 2.65 | 0.214 | 1.15 |
| CoST | 1.37 | 9.8 | 0.611 | 1.66 | 0.546 | 0.634 | 0.085 | 0.152 |
DiffSTG is excluded on ETTh2 because the dataset lacks an adjacency graph.
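For readers unfamiliar with the metric, the Energy Score above admits a simple sample-based estimator, sketched below (a naive O(S²) version that averages over all sample pairs, including identical ones, as a simplification):

```python
import numpy as np

def energy_score(samples, y):
    """Naive sample-based Energy Score: E||X - y|| - 0.5 * E||X - X'||.

    samples: (S, D) draws from the joint predictive distribution;
    y: (D,) observation. Lower is better.
    """
    term1 = np.linalg.norm(samples - y, axis=1).mean()
    pair = samples[:, None, :] - samples[None, :, :]   # all S x S differences
    term2 = np.linalg.norm(pair, axis=-1).mean()
    return term1 - 0.5 * term2
```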
Q2. More Baselines
Response: Thank you for your valuable feedback. As suggested, we've added GP, DeepAR, and DeepStateSpace as new baselines. Part of the results is shown below, and CoST still achieves the best overall performance. In the revised version, we will provide full experimental results and discussions of these works in the Related Work section.
| | Climate | | | MobileSH | | | TaxiBJ | | |
| 12→12 | CRPS | QICE | IS | CRPS | QICE | IS | CRPS | QICE | IS |
| GP | 0.083 | 0.158 | 9.98 | 0.495 | 0.120 | 6.90 | 0.217 | 0.137 | 258.9 |
| DeepAR | 0.030 | 0.029 | 5.56 | 0.422 | 0.052 | 0.810 | 0.217 | 0.021 | 159.1 |
| DeepState | 0.027 | 0.010 | 5.05 | 0.441 | 0.043 | 0.651 | 0.384 | 0.050 | 470.23 |
| CoST | 0.021 | 0.009 | 4.04 | 0.147 | 0.014 | 0.215 | 0.100 | 0.023 | 95.3 |
Q3. Justification for using a diffusion model.
Response: Thank you for raising this important point. We provide our justification for using diffusion models from four key perspectives:
- Complexity of real-world spatiotemporal systems. These systems (e.g., traffic, climate) often exhibit complex dependencies and inherently multimodal, stochastic behavior. Many studies have shown that diffusion models benefit both short-term and long-term forecasting in such systems [1–3].
[1] Non-autoregressive Conditional Diffusion Models for Time Series Prediction. ICML 2023
[2] DiffSTG: Probabilistic Spatio-Temporal Graph Forecasting with Denoising Diffusion Models. SIGSPATIAL 2023
[3] Transformer-Modulated Diffusion Models for Probabilistic Multivariate Time Series Forecasting. ICLR 2024
- Limitation of GPs and SSMs. GPs are limited to unimodal predictive distributions due to their Gaussian assumptions, which restrict their ability to model multiple plausible futures. Classical SSMs assume linear dynamics and Gaussian noise, and even their nonlinear variants (e.g., EKF, particle filters) typically produce poorly calibrated or unimodal predictions and are sensitive to tuning.
- Empirical evidence. Our study covers both short-term and long-term forecasting. We have added empirical comparisons with GP and DeepStateSpace (SSM-based) in both settings. As shown in Table Q2 (short-term) and Tables 8–9 (long-term), diffusion models consistently outperform these baselines.
- Advancing the frontier. Despite higher computational costs, exploring diffusion models in this context pushes the boundaries of spatiotemporal modeling and offers promising new capabilities.
| | Climate | | | MobileSH | | | CrowdBJ | | |
| 64→64 | CRPS | QICE | IS | CRPS | QICE | IS | CRPS | QICE | IS |
| GP | 0.086 | 0.146 | 9.18 | 0.537 | 0.112 | 7.13 | 0.660 | 0.143 | 9.40 |
| DeepAR | 0.035 | 0.036 | 7.21 | 0.401 | 0.049 | 0.764 | 0.559 | 0.042 | 33.9 |
| DeepState | 0.031 | 0.018 | 6.09 | 0.707 | 0.066 | 0.924 | 0.925 | 0.073 | 42.4 |
| CoST | 0.024 | 0.011 | 4.87 | 0.158 | 0.016 | 0.218 | 0.217 | 0.011 | 11.5 |
Q4. Ablation studies on conditional mean.
Response: Thank you for the insightful suggestion. We conducted an ablation study using the historical average as the mean component. As shown in the results below, conditioning on simple historical statistics does provide some improvement over a single diffusion model. However, using a learned deterministic model (STID) to generate the conditional mean yields better performance (a sketch of the historical-average baseline follows the table).
| | Climate | | | BikeDC | | |
| 12→12 | CRPS | QICE | IS | CRPS | QICE | IS |
| Diffusion | 0.030 | 0.030 | 6.58 | 1.090 | 0.059 | 12.6 |
| +Historical Average | 0.022 | 0.017 | 4.40 | 0.9193 | 0.018 | 7.46 |
| +STID (Deterministic Model) | 0.021 | 0.009 | 4.04 | 0.419 | 0.028 | 3.45 |
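For clarity, the historical-average mean used in this ablation can be computed as below; the daily period of 24 steps, the slot-averaging scheme, and the assumption that the history starts at slot 0 and spans whole cycles are all illustrative:

```python
import numpy as np

def historical_average(history, period=24, horizon=12):
    """Non-parametric conditional mean: average each slot of the cycle over
    all past cycles, then read off the next `horizon` slots."""
    T = history.shape[0]                      # assumed to be a multiple of period
    folds = history.reshape(-1, period, *history.shape[1:])
    per_slot = folds.mean(axis=0)             # mean value for each cycle slot
    idx = np.arange(T, T + horizon) % period  # slots of the forecast window
    return per_slot[idx]
```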
Q5. How was the "ideal" multimodal distribution in Figure 4 derived?
Response: Thank you for the question. We agree it is challenging to obtain a full target distribution in real-world forecasting, as only one future outcome per instance is observed. However, spatiotemporal systems often exhibit multimodal behavior due to complex, latent conditions. For example, traffic volume at a specific road segment can vary drastically under different latent conditions (e.g., normal vs. accident scenarios). A particular time and location may yield distinct values depending on unobserved events, leading to multimodal behavior even if we can only observe one outcome at a time. In Figure 4, the “ideal” distribution is not a true conditional target but an empirical marginal distribution at a fixed location across time, used to qualitatively illustrate the multimodal nature of the data. While it does not represent a ground-truth future distribution, it helps highlight whether the model captures uncertainty beyond a single mode. We will clarify this in the revised version.
Q6. How was the CRPS calculated?
Response: The CRPS is calculated using a sample-based approach. We follow prior work [1], which conducted empirical analysis on spatiotemporal forecasting tasks and found that 50 samples are sufficient to obtain a stable and reliable estimate of CRPS. Additionally, we performed an ablation study on our datasets to evaluate the impact of the number of samples on CRPS values. The results show that increasing the sample size beyond 50 yields negligible improvements (a minimal estimator sketch follows the table).
[1] DiffSTG: Probabilistic Spatio-Temporal Graph Forecasting with Denoising Diffusion Models. SIGSPATIAL 2023
| Samples | 10 | 30 | 50 | 70 | 100 |
| SST | 0.0227 | 0.0214 | 0.0212 | 0.0211 | 0.0211 |
| MobileSH | 0.156 | 0.149 | 0.147 | 0.147 | 0.146 |
| TaxiBJ | 0.117 | 0.111 | 0.100 | 0.099 | 0.099 |
| CrowdBJ | 0.230 | 0.218 | 0.215 | 0.215 | 0.214 |
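For completeness, the sample-based estimator referred to above has the standard form CRPS ≈ E|X − y| − 0.5·E|X − X'|; a minimal sketch for a single scalar target (the naive pairwise term is our simplification) is:

```python
import numpy as np

def crps_from_samples(samples, y):
    """Sample-based CRPS estimate for one scalar observation y."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.abs(samples - y).mean()                          # E|X - y|
    term2 = np.abs(samples[:, None] - samples[None, :]).mean()  # E|X - X'|
    return term1 - 0.5 * term2
```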
Thank you for addressing my questions. I appreciate that the authors have adopted the Energy Score (ES) as a benchmark metric and that the primary goal of the paper is multivariate forecasting. However, my methodological concerns largely remain.
First, as noted, the simple historical average already provides substantial improvement, suggesting that the deterministic component is heavily driven by the periodic nature of the underlying time series. In this case, the deterministic mean serves as a proper trend term that can be pre-specified outside the diffusion model (no need for joint learning). Also, I suspect that incorporating a stronger mean/deterministic model could further enhance performance—for example, by introducing low-frequency Fourier components or dynamic modes as the mean function.
In addition, some points remain unclear regarding the role of the deterministic mean. Such a mean is likely to help when the target variable exhibits a unimodal distribution (e.g., a dominant periodic signal with random perturbations). However, long-term forecasting often involves multimodal distributions, which are more challenging. For instance, as traffic prediction is mentioned in the rebuttal, a 2‑hour‑ahead forecast is often either congested or free-flowing, with few intermediate states. In such cases, relying on the mean may not provide meaningful predictive or interpretative value.
Given these, I will keep my current score.
Dear Reviewer kEBZ,
Thank you very much for your thoughtful reply. It has greatly helped us better understand the core of your concerns. After carefully reviewing your comments, we believe there might be some misunderstandings about some aspects of our method. We would like to clarify our motivation and design choices through the following points:
Q1. Clarification on the Motivation and Design of Our Method.
Response: We fully agree with your observation that an effective deterministic component plays a crucial role in improving overall forecasting performance. In fact, this insight is precisely the foundation of our method's design. You mentioned that the deterministic mean could be viewed as a pre-specified trend term external to the diffusion model. This is exactly how we treat it. Our core motivation is to decompose the complex probabilistic forecasting problem into two simpler and more tractable subproblems: (i) Deterministic modeling: We utilize a strong, well-established spatiotemporal forecasting model (e.g., STID) to predict the conditional mean μ(X). This model is pretrained independently and kept frozen during the training of the diffusion model. It does not participate in 'joint learning', but instead serves as a trend-removal or de-meaning tool, providing a stable anchor for the residual modeling.
(ii) Diffusion modeling: With the conditional mean fixed, the diffusion model focuses solely on modeling the residual distribution p(Y − μ(X) | X). This simplification greatly reduces the variance and complexity of the modeling target, allowing the diffusion model to leverage its powerful distribution-learning capabilities without being burdened by the challenges of learning long-range spatiotemporal dynamics from scratch.
Q2. On the Importance of Stronger Mean Models.
Response: We fully agree with your suggestion that stronger deterministic models (such as those based on low-frequency Fourier components) could further improve performance. In fact, this directly supports the flexibility and generality of our CoST framework. One of CoST's core strengths is its plug-and-play modularity: it can seamlessly benefit from any deterministic model used for mean estimation. A stronger mean predictor leads to: (i) a more accurate approximation of the conditional mean; (ii) lower variance in the residuals, thus reducing the modeling burden on the diffusion module.
To validate this, we conducted additional experiments comparing the use of the Historical Average (HA), low-frequency FFT components, and a modern spatiotemporal deterministic model (STID) as mean estimators. The results are clear. While HA and FFT do improve performance to some extent, demonstrating the general validity of the mean–residual decomposition, stronger deterministic models (e.g., STID) bring significant additional gains, especially on datasets with weak periodicity or complex spatiotemporal patterns (e.g., Los-Speed, BikeDC, CrowdBM), where the improvement brought by simple mean functions like HA or FFT is marginal. These results confirm that our choice of using state-of-the-art spatiotemporal models as the mean function is both justified and beneficial.
| | Climate | | BikeDC | | Los-Speed | | CrowdBM | |
| 12→12 | CRPS | IS | CRPS | IS | CRPS | IS | CRPS | IS |
| Diffusion | 0.030 | 6.58 | 1.09 | 12.6 | 0.065 | 35.5 | 0.303 | 45.7 | |
| +HA | 0.022 | 4.40 | 0.92 | 7.46 | 0.064 | 34.9 | 0.316 | 44.2 | |
| +FFT | 0.022 | 4.17 | 1.05 | 9.89 | 0.060 | 33.8 | 0.302 | 44.4 | |
| +STID (Deterministic Model) | 0.021 | 4.04 | 0.419 | 3.45 | 0.056 | 31.9 | 0.256 | 37.8 |
Q3. Role of the Conditional Mean under Multimodal Distributions.
Response: This is a particularly insightful question, and it directly touches on the core motivation behind our framework. As you correctly pointed out, in long-horizon forecasting with inherently multimodal target distributions (e.g., traffic states being either "congested" or "free-flowing"), the conditional mean may fall into a low-probability region, offering limited predictive value on its own. This is a well-known issue with deterministic models trained under L1 or L2 losses. Our design explicitly addresses this limitation:
- The deterministic model captures coarse-grained, high-level patterns (such as daily/weekly cycles or seasonal flows) serving as a structural scaffold.
- The diffusion model then builds upon this scaffold by modeling the residual uncertainty, capturing fine-grained, non-Gaussian, and even multimodal behaviors that the deterministic component inherently cannot express.
Lastly, we sincerely thank you for taking the time to review our response amid your busy schedule. Should you have any further concerns or questions, please do not hesitate to let us know; we would be happy to provide additional details to help clarify our work.
I would like to thank the authors for their efforts in addressing my comments. However, I agree with other reviewers that the novelty of this solution is limited. In terms of interpretation, the solution is inconsistent as the deterministic part is trained on minimizing MSE that gives an isotropic Gaussian noise assumption on the error term. In this case, I would not refer to the deterministic term as the mean. I decide to keep my current evaluation.
Summary of Rebuttal
We sincerely appreciate the thoughtful comments and constructive suggestions provided by all reviewers. During the rebuttal period, we have made every effort to thoroughly address the concerns raised, supported by new experiments, detailed clarifications, and substantive revisions. Below is a summary of our key responses:
- Strengthened Experimental Evaluation: To address concerns about baselines and evaluation metrics, we have significantly expanded our empirical validation:
(i) Incorporated multivariate scoring rules (Energy Score and Variogram Score).
(ii) In addition to the original six probabilistic baselines (D3VAE, DiffSTG, TimeGrad, CSDI, DYffusion, NPDiff), we have added four new probabilistic baselines as suggested: GP, DeepAR, DeepState, and TMDM.
- Clarified Novelty and Contributions: We further clarified the core contributions of our work, particularly the mean–residual decomposition framework, where a deterministic model predicts the conditional mean and a diffusion model focuses on modeling the residual distribution for probabilistic forecasting.
We now provide detailed comparisons that explicitly differentiate CoST from prior works, including:
- Decomposition-based methods (e.g., Leddam, TimeMixer)
- Modulation-based methods (e.g., TMDM)
- Hybrid deterministic–diffusion approaches (e.g., DiffCast, CasCast)
- Additional Ablation Studies: To further support our design choices, we conducted new ablation studies:
(i) Validated the superiority of using a learned deterministic component (STID) over simpler alternatives such as historical average (HA) and FFT components.
(ii) Analyzed the computational efficiency of CoST, showing it is suitable for real-world deployment.
(iii) Provided quantitative evidence that the deterministic and probabilistic components work collaboratively rather than conflicting.
We have done our utmost to address all the reviewers’ concerns and suggestions. We are confident that our revised manuscript now presents a clearer, stronger, and more generalizable framework for probabilistic spatiotemporal forecasting. Thank you again for your time and consideration.
This paper focuses on enhancing existing point-estimate time-series forecasters with diffusion-based distributional estimation, which is an important and practically valuable problem. The AC agrees with the authors that this topic is of high significance, though it is often misunderstood or undervalued by reviewers. Some of the initial reviews clearly reflected this misunderstanding, and the AC appreciates the authors’ thorough and constructive rebuttals, which helped clarify the core contributions of the paper.
Despite the low ratings, the AC finds the paper well written, carefully benchmarked, and addressing a problem of genuine importance for the field. It is not uncommon for strong works of this nature, those that respect prior art, perform thorough comparisons, and tackle impactful yet underappreciated challenges, to be judged harshly under the criterion of "novelty."
That said, the AC also recognizes why the reviewers might perceive the contribution as limited. The shift from conditioning a diffusion model directly on deterministic forecasts (as in modulation-based approaches) to modeling de-meaned residuals could appear incremental at first glance. However, this work goes beyond that: it carefully integrates spatiotemporal considerations and introduces mechanisms absent in closely related previous efforts such as CARD or TMDM.
Balancing these perspectives, the AC finds it difficult to recommend acceptance in the face of the current reviewer consensus. Nevertheless, the AC encourages the authors to continue pursuing this line of research, as it provides a valuable position for advancing the field of spatiotemporal probabilistic forecasting. A venue such as TMLR—where methodological soundness and significance can outweigh “perceived novelty”—may be more receptive to the important contributions of this work.