PaperHub
Rating: 6.0/10 · Poster · 3 reviewers
Scores: 6 / 6 / 6 (min 6, max 6, std 0.0) · Confidence: 4.0
ICLR 2024

MG-TSD: Multi-Granularity Time Series Diffusion Models with Guided Learning Process

OpenReview · PDF
Submitted: 2023-09-21 · Updated: 2024-03-16
TL;DR

We introduce the MG-TSD model, which leverages multiple granularity levels within the data to guide diffusion models for probabilistic time series forecasting.

Keywords

denoising diffusion models · multi-granularity · time series forecasting

Reviews and Discussion

Official Review (Rating: 6)

This work proposes a Multi-Granularity Time Series Diffusion (MG-TSD) model for time series prediction. In general, MG-TSD controls the learning process of the diffusion model by leveraging the temporal signals in time series data at different granularity levels. In particular, the authors link the forward process of the diffusion model to the data smoothing process. Motivated by this, this work develops a multi-granularity guidance diffusion loss function, such that the inherent features within data can be preserved and a regularized sampling path can be achieved. Experiments on real-world time series datasets demonstrate the effectiveness of the proposed approach.

Strengths

  1. In the context of time series forecasting, it is a good idea to stabilize the diffusion model with the help of coarse-grained temporal signals in time series data.
  2. The derivation of the multi-granularity guidance loss function is solid. Besides, the learning procedure devised in this work is practical.

Weaknesses

  1. The assumption that "... forward process of the diffusion model ... intuitively aligns with the process of smoothing fine-grained data into a coarser-grained representation ..." is not verified through theoretical analysis or empirical study.

  2. Experimental settings for the time series forecasting task are unclear. For example, how are the context interval and prediction interval constructed in a time series dataset? Do you use a sliding window to roll over the time series to build the context and prediction intervals? Do consecutive context/prediction intervals overlap?

  3. More experimental results are expected. For instance, the authors evaluate the time series forecasting methods under only one length setting per dataset, e.g., the context-24-predict-24 setting for the Solar, Electricity, Traffic, Taxi, and KDD-Cup datasets.

  4. The reproducibility of this work is a concern.

======================

After rebuttal.

Most of my concerns are clarified.

Questions

According to the inference procedure, the proposed time series forecasting approach can predict one future horizon time step at a time. How to apply this predictive approach for the long-term time series forecasting task effectively?

Comment

We sincerely appreciate your thorough review and valuable feedback. Enclosed are our responses to your insightful questions and concerns.

Q1. According to the inference procedure, the proposed time series forecasting approach can predict one future horizon time step at a time. How to apply this predictive approach for the long-term time series forecasting task effectively?

We are grateful to the reviewer for the insights.

  • In our Temporal Process Module, we utilize an RNN as the backbone model to encode the temporal dependence sequentially. Therefore, predicting a horizon of m steps involves applying the n-to-1 predictor (n timesteps predict one timestep) m times autoregressively, where each inference step incorporates previous predictions into the context (as sketched below). Based on the extra experiment concerning long-term forecasting, our method experiences only a moderate decrease in performance as the prediction length extends.
  • It is worth noting that our Temporal Process Module is adaptable and can accommodate other encoders capable of handling variable-length time steps. This includes, but is not limited to, autoregressive RNNs in an n-to-1 manner or non-autoregressive methods of an n-to-m manner such as transformers.
  • In certain scenarios, a Transformer-based TPM might be a more efficient solution, producing all m timesteps in one step. However, Transformers have limitations regarding the maximum length of the time series, which means the context length and prediction length cannot be increased indefinitely. Additionally, for high-dimensional time series data (such as the Wikipedia dataset with 2000 dimensions), the attention mechanism might incur substantial training and inference overhead.

Since the primary focus of this paper is multi-granularity guidance, we did not evaluate a Transformer variant of the TPM, and we leave it as important future work.
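For concreteness, here is a minimal sketch of the autoregressive rollout described in the first bullet; `model.encode` and `model.sample_next` are hypothetical stand-ins for the TPM and the guided diffusion sampler, not the authors' actual API.

```python
import torch

def autoregressive_forecast(model, context, pred_len, num_samples=100):
    """Roll an n-to-1 predictor forward for pred_len steps.

    model.encode summarizes the history into a hidden state, and
    model.sample_next draws one future timestep from the diffusion head;
    both are hypothetical stand-ins for the paper's modules.
    """
    # Replicate the context so each sample path evolves independently.
    paths = context.unsqueeze(0).repeat(num_samples, 1, 1)   # (S, T, D)
    for _ in range(pred_len):
        hidden = model.encode(paths)            # encode history so far
        next_step = model.sample_next(hidden)   # (S, 1, D) draw for t+1
        # Feed the draw back in: it becomes context for the next step.
        paths = torch.cat([paths, next_step], dim=1)
    return paths[:, -pred_len:]                 # (S, pred_len, D) forecasts
```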

W1. The assumption that "... forward process of the diffusion model ... intuitively aligns with the process of smoothing fine-grained data into a coarser-grained representation ..." is not verified through theoretical analysis or empirical study.

Thanks to the reviewer for the valuable insights. We would like to point out that the results depicted in Figure 3 of our manuscript provide empirical support for our assumption. Please note that, in Figure 3, the x-axis denotes the denoising diffusion steps, and the y-axis denotes the $\text{CRPS}_{\text{sum}}$ values.

Figure 3 illustrates an investigation into the effect of the share ratio, where we evaluate the performance of MG-TSD using various share ratios across different coarse granularities.

In Figure 3(a)-(d), the dashed blue curve in each plot represents the $\text{CRPS}_{\text{sum}}$ values between the coarse-grained targets and the 1-hour samples generated by the 1-gran (finest-gran) model at each intermediate denoising step. Each point on the orange polylines represents the $\text{CRPS}_{\text{sum}}$ value of 1-hour predictions by 2-gran MG-TSD models (where one coarse granularity is utilized to guide the learning process for the finest-grained data), with share ratios ranging over [0.2, 0.4, 0.6, 0.8, 1.0]; the lowest point of each line segment characterizes the most suitable share ratio for the corresponding granularity.

The four subplots from (a) to (d) illustrate a gradual smoothing transformation of the distribution of increasingly coarser targets. A key observation is that from the left to the right panel, the distribution of coarser targets gradually aligns with the distribution of intermediate samples at larger diffusion steps. More specifically, as granularity transitions from fine to coarse (4h→6h→12h→24h), the denoising steps at which the distribution most resembles the coarse-grained targets increase (approximately at steps 20→40→60→60). This comparison underscores the similarity between the diffusion process and the smoothing process from the finest-grained data to coarse-grained data, both of which involve a gradual loss of finer characteristics from the finest-grained data through a smooth transformation.

Comment

I am confused: as the four subplots from (a) to (d) in Figure 3 suggest, the denoising diffusion steps at which the distribution most resembles the coarse-grained targets decrease, i.e., 80→60→40→40, which is opposite to your statement. Please correct me if I missed any key points.

Comment

Thank you very much for your response, and you are correct. There is indeed a typographical error in our previous message. In the statement "denoising steps at which the distribution most resembles the coarse-grained targets increase (approximately at steps 20→40→60→60)", the term "denoising step" should indeed be corrected to "diffusion step". What we intended to express was that the corresponding steps in the diffusion process are approximately 20→40→60→60, whereas in the denoising process, the corresponding steps are approximately 80→60→40→40, exactly as you pointed out. We sincerely apologize for this oversight in our response. Please be assured that our expressions in the manuscript are correct.

Comment

Thanks for your clarification.

Again, I am still not well convinced that, regarding a time series, the connection between the diffusion process (forward process) and the smoothing process does hold. Could you provide more convincing evidence?

From the point of view of time series decomposition, a time series $X$ can be approximated as $X = T + S + \epsilon$, where $T$ and $S$ denote trend and seasonality, respectively, and $\epsilon$ corresponds to the noisy part, which can also be deemed the details of $X$. As the diffusion steps increase, $X$ will be corrupted into $X' = T + S + \epsilon + \epsilon'$; how can you prove that $X'$ represents a coarser version of $X$?

Comment

Thanks for the quick response and your insights. We would like to highlight that the similarity of the two processes lies in the fact that both the forward diffusion and the smoothing process lead to a loss of the finer features of the original complex distribution of the fine-grained data.

  • In real-world applications, the finest-grained time series data exhibit significant fluctuations and complex temporal dynamics. According to the additive decomposition $X = T + S + \epsilon$, after extracting the trend and seasonality, the delicate fine features of the distribution are mainly encoded in the seasonality term $S$ and the residual term $\epsilon$. During the smoothing process from fine-grained to coarse-grained, the distributions of $S$ and $\epsilon$ are susceptible to shifts and corruption, while the trend $T$ may be preserved when the window size is not too large. The superposition of diverse seasonality patterns and noise can cancel out, gradually losing the finer features in the decomposition.

  • As for the forward diffusion process, the decomposition can be rewritten as $X' = \sqrt{\bar{\alpha}_n}(T + S + \epsilon) + \sqrt{1-\bar{\alpha}_n}\,\epsilon' = \sqrt{\bar{\alpha}_n}\,T + \sqrt{\bar{\alpha}_n}\,S + \sqrt{\bar{\alpha}_n}\,\epsilon + \sqrt{1-\bar{\alpha}_n}\,\epsilon'$. The seasonality term $S$ can be further decomposed into a low-frequency part $S_{\text{low}}$ and a high-frequency part $S_{\text{high}}$. The delicate fine features of the distribution are mainly encoded in $S_{\text{high}}$. During the forward diffusion process, $S_{\text{high}}$ is gradually overwhelmed by increasingly large white noise and is susceptible to corruption within the first few diffusion steps, while the low-frequency seasonality and the trend term are preserved for a certain number of diffusion steps.

  • For better illustration, we sampled series from the Solar dataset and conducted a Fast Fourier Transform to extract the frequency components of the original series, of the samples at different granularities, and of the corresponding noisy samples along the forward diffusion process. Results are appended in Appendix B.2.3 of our manuscript. As depicted in Figure 7(a), as granularity becomes coarser, the components at all prominent frequencies decrease, while the high-frequency peaks (around 125 and 80) diminish faster than the lower-frequency peak (around 45). Figure 7(b) shows the distribution of frequency components of the same noisy series with gradually increasing forward diffusion steps, and the same pattern is observable. This result further indicates the connection between the two processes in losing finer informative features, which motivated us to utilize coarse-granularity data as guidance; the main experiment in our manuscript demonstrates the performance gain brought by incorporating multi-granularity guidance. A toy numerical illustration of this point is sketched below.
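The following is a hedged toy illustration of the frequency-domain argument, not the paper's experiment; the series, window size, and noise level are invented. Window averaging attenuates the high-frequency peak directly, while forward-diffusion noising leaves both peaks scaled equally but raises the noise floor, which buries the smaller, high-frequency peak first as $\bar{\alpha}_n$ decreases.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(512)
# Toy series: trend + low-frequency and high-frequency seasonality + noise.
x = (0.01 * t + np.sin(2 * np.pi * t / 64)
     + 0.5 * np.sin(2 * np.pi * t / 8) + 0.2 * rng.standard_normal(t.size))

def peak_report(name, series):
    mag = np.abs(np.fft.rfft(series - series.mean()))
    floor = np.median(mag)  # rough estimate of the white-noise floor
    # Bin 8 = period-64 (low-freq) component, bin 64 = period-8 (high-freq).
    print(f"{name:>9}: low-freq peak {mag[8]:6.1f}, "
          f"high-freq peak {mag[64]:6.1f}, noise floor {floor:5.1f}")

# Smoothing: non-overlapping window average, replicated to keep the length.
coarse = np.repeat(x.reshape(-1, 4).mean(axis=1), 4)

# Forward diffusion at one step n: x' = sqrt(abar)*x + sqrt(1-abar)*eps.
abar = 0.6
noisy = np.sqrt(abar) * x + np.sqrt(1 - abar) * rng.standard_normal(t.size)

for name, s in [("original", x), ("smoothed", coarse), ("diffused", noisy)]:
    peak_report(name, s)
```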

Comment

W2. Experimental settings for the time series forecasting task are unclear.

We thank the reviewer for the question. To clarify, we generate coarse-grained data from the finest-grained time series data over the entire timeline. We rely on the GluonTS[1] library for data splitting and creating the training and testing instances; the GluonTS library is widely used in time series forecasting. Further explanations are provided below:

  • Generation of the coarse-grained data: Section 3.1 details how multi-granularity data are generated. A sliding window with a pre-defined size $s^g$ for granularity $g$ is applied to the finest-grained data across the entire timeline, where $g = 1, 2, \ldots, G$. These sliding windows are intentionally non-overlapping. Within each window, we smooth the finest-grained data by averaging and replicate the average $s^g$ times to align with the timeline $[1, T]$ (see the sketch after this list).
  • Dataset splitting: All the datasets we used in the benchmark experiments are public and available in GluonTS[1]. The library has these datasets pre-split into training and testing sets, and we follow this splitting method to obtain our training and testing datasets.
  • Creation of training and testing instances: we randomly sample the context window, followed by the prediction window, from the complete training data. This process can be viewed as applying a moving window to auto-regressively roll through the entire timeline, with consecutive time intervals overlapping. Furthermore, for different datasets, the context length and prediction length vary, such as 24-hour-24-hour for Electricity, and 30-Day-30-Day for Wikipedia, as detailed in Appendix C.1.
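A minimal sketch of the averaging-and-replication scheme described in the first bullet; the handling of a ragged tail is our assumption, not specified in the paper.

```python
import numpy as np

def make_coarse_grained(x, window):
    """Smooth the finest-grained series with non-overlapping windows of
    size `window` (s^g in Section 3.1) and replicate each average
    `window` times so the result stays aligned with the [1, T] timeline."""
    T = x.shape[0]
    usable = T // window * window            # drop a ragged tail, if any
    means = x[:usable].reshape(-1, window).mean(axis=1)
    coarse = np.repeat(means, window)
    if usable < T:                           # pad so lengths match (assumption)
        coarse = np.concatenate([coarse, np.full(T - usable, means[-1])])
    return coarse

hourly = np.arange(12, dtype=float)          # toy 1-hour series
print(make_coarse_grained(hourly, 4))        # 4-hour granularity
# -> [1.5 1.5 1.5 1.5 5.5 5.5 5.5 5.5 9.5 9.5 9.5 9.5]
```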

W3. More experimental results about different prediction lengths are expected.

We thank the reviewer for the insightful suggestion. We have conducted more experiments to test the performance of the MG-TSD method with different prediction lengths.

Experiment setting: In the current experiment, we assess the long-term prediction performance of MG-TSD by comparing it with the baseline TimeGrad, chosen for its competitive performance in the main experiment (Table 1) and its shared characteristic of forecasting in an autoregressive manner. For the Solar and Electricity datasets (with a frequency of 1 hour), we set the context length to 24 hours and evaluate the performance of both methods with prediction lengths of 24, 48, 96, and 144 hours. The average $\text{CRPS}_{\text{sum}}$, $\text{NRMSE}_{\text{sum}}$, and $\text{NMAE}_{\text{sum}}$ metrics are computed for both MG-TSD and the baseline over 10 independent runs, with error bars indicating the corresponding standard deviations. (We will include additional baselines in the plots later to strengthen our conclusions in the updated paper.)

Experiment results and findings: The preliminary results indicate that MG-TSD performs well in long-term forecasting: as the prediction length increases, the performance of our proposed method stays robust, exhibiting no sudden decline. Furthermore, our method consistently outperforms the competitive baseline, with no indication of convergence between the approaches as the horizon grows.

Table G: Results for Solar Dataset

| Prediction Length | Method | $\text{CRPS}_{\text{sum}}$ | $\text{NRMSE}_{\text{sum}}$ | $\text{NMAE}_{\text{sum}}$ |
|---|---|---|---|---|
| 24h | TimeGrad | 0.3335±0.0653 | 0.6952±0.0644 | 0.3637±0.0665 |
| 48h | TimeGrad | 0.3615±0.0298 | 0.7392±0.0566 | 0.4070±0.0298 |
| 96h | TimeGrad | 0.3737±0.0213 | 0.7905±0.0481 | 0.4113±0.0238 |
| 144h | TimeGrad | 0.4301±0.0140 | 0.9285±0.0219 | 0.4768±0.0109 |
| 24h | MG-TSD | 0.3178±0.0342 | 0.6591±0.0503 | 0.3480±0.0356 |
| 48h | MG-TSD | 0.3401±0.0271 | 0.7234±0.0398 | 0.3862±0.0217 |
| 96h | MG-TSD | 0.3500±0.0270 | 0.7395±0.0439 | 0.3909±0.0264 |
| 144h | MG-TSD | 0.3659±0.0311 | 0.8179±0.0600 | 0.4226±0.0292 |

Table H: Results for Electricity Dataset

| Prediction Length | Method | $\text{CRPS}_{\text{sum}}$ | $\text{NRMSE}_{\text{sum}}$ | $\text{NMAE}_{\text{sum}}$ |
|---|---|---|---|---|
| 24h | TimeGrad | 0.0205±0.0033 | 0.0348±0.0057 | 0.0266±0.0049 |
| 48h | TimeGrad | 0.0264±0.0020 | 0.0474±0.0034 | 0.0343±0.0026 |
| 96h | TimeGrad | 0.0304±0.0048 | 0.0558±0.0092 | 0.0407±0.0065 |
| 144h | TimeGrad | 0.0532±0.0090 | 0.0953±0.0153 | 0.0674±0.0096 |
| 24h | MG-TSD | 0.0174±0.0042 | 0.0296±0.0086 | 0.0226±0.0071 |
| 48h | MG-TSD | 0.0212±0.0028 | 0.0334±0.0045 | 0.0279±0.0042 |
| 96h | MG-TSD | 0.0224±0.0069 | 0.0376±0.0103 | 0.0286±0.0086 |
| 144h | MG-TSD | 0.0341±0.0091 | 0.0609±0.0142 | 0.0473±0.0116 |

Action taken: We have included the experiments and results in Appendix B.2. Figure 5 in the appendix visualizes the results in Table G and Table H provided here, providing a clearer presentation of our conclusions.

Comment

W4. Reproducibility

We plan to release our code soon. We are intensively working on cleaning up our code to ensure it is well-documented, thoroughly tested, and user-friendly. Additionally, to guarantee the reproducibility of our experiments, we performed 10 independent runs for each setting and reported both the average values and the standard deviation (std).

References:

[1] Alexandrov A, Benidis K, Bohlke-Schneider M, et al. GluonTS: Probabilistic and neural time series modeling in Python[J]. The Journal of Machine Learning Research, 2020, 21(1): 4629-4634.

Official Review (Rating: 6)

Diffusion probabilistic models, which can generate high-fidelity samples, retain a stochastic nature. This characteristic makes them less effective in probabilistic time series forecasting tasks. To improve the efficiency of diffusion probabilistic models, this paper introduces a novel MG-TSD model with an innovatively designed multi-granularity guidance loss function that efficiently guides the diffusion learning process. To effectively utilize coarse-grained data across various granularity levels, this paper proposes a concise implementation method. Moreover, the approach does not rely on additional external data, making it versatile and applicable across various domains. Extensive experiments conducted on real-world datasets demonstrate the superiority of the proposed model, which achieves the best performance compared to state-of-the-art methods.

Strengths

  1. In the context of time series forecasting, where fixed observations exclusively serve as objectives, diffusion probabilistic models can suffer from forecasting instability and inferior prediction performance. Rather than constraining the intermediate states during the sampling process, this paper creatively leverages multiple granularity levels within the data to guide the learning process of diffusion models.
  2. This paper provides a series of ablation experiments on the effects of the share ratio and the number of granularities, evaluating the performance of MG-TSD under various share ratios across different coarse granularities and varying numbers of granularity levels.
  3. Clarity: The paper offers a clear presentation of the model architecture with a good explanation of the methodology.

Weaknesses

  1. This paper lacks an evaluation of the time complexity of the model. It would be more convincing to add experiments measuring memory and time consumption.
  2. MG-TSD consists of a Multi-granularity Data Generator, a Temporal Process Module (TPM), and a Guided Diffusion Process Module. However, the ablation study lacks performance testing of each module, especially the Multi-granularity Data Generator.

Questions

  1. In Equation 7 of Section 3.2.1, is the distribution of $\epsilon$ consistent with the distribution of $x_N$? Does the distribution of $\epsilon$ obey the normal distribution? The distribution and meaning of $\epsilon$ are not pointed out.
  2. Compared with existing diffusion probabilistic models, does the MG-TSD model framework differ only in the Guided Diffusion Process Module?

Ethics Review Details

No ethics review needed.

Comment

W2. Concern about the ablation study for the Multi-granularity Data Generator module and TPM module

Thanks to the reviewer for the comment.

We would like to highlight that Table 3 in our manuscript serves as an ablation study of the Multi-granularity component in the MG-TSD model. The experiment, presented in Table 3, explores the effect of varying the number of granularity levels on the model's performance. Testing the performance of the model with an additional granularity level inherently assesses the effectiveness of all three modules. More specifically, the Multi-granularity Data Generator Module is executed to generate the data at a new granularity level. The Temporal Process Module (TPM) incorporates an additional RNN cell to encode the temporal pattern within the added granularity data, while the Guided Diffusion Process Module adapts the learning and generation process with the additional coarse-grained data. The results in Table 3 reveal that increasing the number of granularity levels typically improves the performance of the MG-TSD model. This finding confirms that the mechanism currently designed to generate coarse-grained data is effective in enhancing performance. Theoretically, the MG-TSD model can handle an unlimited increase in granularity levels. However, in practical scenarios, we observe that the marginal benefit tends to diminish as the number of granularity levels increases. Our findings suggest that employing four to five granularity levels typically suffices to achieve optimal performance.

To further clarify, the Multi-granularity Generator Module functions as a predefined data pre-processing step and is utilized to generate coarse-grained data at various granularity levels. It does not involve parameter optimization during the training stage. The Temporal Process Module serves as an encoder in the model, which captures and compresses the temporal dependencies in time series data up to a certain timestep. The RNN is adopted in our model as it presents a feasible and convenient method of implementation for this module. The choice of RNN as the backbone encoder in TPM aligns with various previous works, including TimeGrad[1] and GP-copula[2]. These studies utilize RNN for modeling temporal dependencies, which attests to the effectiveness of RNN. Since the primary focus of this paper is on multi-granularity guidance, we did not conduct extra ablation studies on the designs of these components.

References:

[1] Rasul K, Seward C, Schuster I, et al. Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting[C]//International Conference on Machine Learning. PMLR, 2021: 8857-8868.

[2] Salinas D, Bohlke-Schneider M, Callot L, et al. High-dimensional multivariate forecasting with low-rank Gaussian copula processes[J]. Advances in neural information processing systems, 2019, 32.

Comment

We are grateful for your constructive comments. Below we address each question and concern.

Q1: In Equation 7 of Section 3.2.1, is the distribution of $\epsilon$ consistent with the distribution of $x_N$? Does the distribution of $\epsilon$ obey the normal distribution? The distribution and meaning of $\epsilon$ are not pointed out.

We thank the reviewer for bringing attention to this. The $\epsilon$ in Equation (7) follows the standard normal distribution, consistent with the backbone denoising diffusion models, and has the same dimension as $x^g$; see the sketch below.
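For concreteness, a minimal sketch of the standard forward-noising step this clarification refers to, using generic names rather than the authors' code:

```python
import torch

def forward_noising(x_g, alpha_bar_n):
    """Sample q(x_n | x_0) for granularity g in the standard DDPM
    parameterization; eps is standard normal with the same shape as x^g."""
    eps = torch.randn_like(x_g)  # eps ~ N(0, I), same dimension as x^g
    x_n = (alpha_bar_n ** 0.5) * x_g + ((1.0 - alpha_bar_n) ** 0.5) * eps
    return x_n, eps
```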

Action taken: We have incorporated this clarification into our manuscript.

Q2: Compared with the existing diffusion probabilistic models, does the MG-TSD model framework differ only in the Guided Diffusion Process Module?

Thanks for this question. We would like to further clarify on our module design.

  • The core difference between MG-TSD and previous work lies principally in the approach to incorporating information from data. We propose a multi-granularity guidance approach that naturally exploits intrinsic coarse-to-fine features within the data to stabilize the samples from the diffusion model. This is facilitated by the Guided Diffusion Process Module.
  • The multi-granularity data generator is unique in our work for the preprocessing of original data into multiple alternatives with different granularities. This module collaborates synergistically with the Guided Diffusion Process Module, enabling it to adapt to multi-granularity requirements.
  • Although the Temporal Process Module is similar to previous work[1][2] in its function of extracting historical context, it was specifically designed to adapt to the multi-granularity case. More specifically, the Temporal Process Module employs a separate RNN submodule for each granularity, without parameter sharing across granularities (a minimal sketch follows this list).
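A hedged sketch of this per-granularity design; the cell type, sizes, and interface are our assumptions rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class MultiGranularityTPM(nn.Module):
    """One independent recurrent encoder per granularity level, with no
    parameter sharing across granularities, mirroring the description above."""

    def __init__(self, input_dim: int, hidden_dim: int, num_granularities: int):
        super().__init__()
        self.rnns = nn.ModuleList(
            nn.GRU(input_dim, hidden_dim, batch_first=True)
            for _ in range(num_granularities)
        )

    def forward(self, xs):
        # xs: one (batch, time, input_dim) tensor per granularity level.
        # Each RNN encodes its own granularity; return the final hidden states.
        return [rnn(x)[1].squeeze(0) for rnn, x in zip(self.rnns, xs)]
```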

W1. Concern about the time complexity and memory of the model.

Experiments have been conducted to evaluate the time and memory usage of the MG-TSD model during training across various granularities.

Experiment setting: These experiments were executed using a single A6000 card with 48G memory capacity. The Solar dataset was utilized in this context, with a batch size of 128, input size of 552, 100 diffusion steps, and 30 epochs.

Experiment results and findings: As illustrated in the corresponding graph and table, memory consumption increases linearly with the number of granularities, and a slight increase in training time is also observed. These findings are consistent with the architecture of our model: each additional granularity introduces an extra RNN in the Temporal Process Module and additional computation within the Guided Diffusion Process Module, so, as theoretically expected, resource consumption grows linearly. Moreover, since an excessive increase in granularity may not notably boost the final prediction results, the number of granularity levels should be kept within a certain range; therefore, memory consumption will not rise indefinitely.

Table F: Comparison of Time and Memory Consumption at Different Granularity Levels in MG-TSD Model Training

| Granularity | Memory (GB) | Run Time (min) | Relative Memory Occupancy | Relative Run Time |
|---|---|---|---|---|
| 2 | 6.06 | 25.93 | 100% | 100% |
| 3 | 13.45 | 27.48 | 222% | 106% |
| 4 | 20.03 | 33.33 | 331% | 129% |
| 5 | 29.09 | 33.52 | 480% | 129% |

Summary: In summary, while it is evident that an increase in granularity escalates the consumption of both time and memory, these increases are reasonable and fall within acceptable boundaries.

Action taken: We have included the experiments and results in Appendix B.2. Figure 6 in the appendix visualizes the results in Table F provided here, providing a clearer presentation of our conclusions.

Official Review (Rating: 6)

This paper employs the diffusion model for time series forecasting and introduces the Multi-Granularity Time Series Diffusion model, which comprises three key components: (1) the Multi-Granularity Data Generator, responsible for generating multi-granularity data; (2) the Temporal Process Module, which utilizes an RNN architecture to capture temporal dynamics; and (3) the Guided Diffusion Process Module, aimed at generating stable time-series predictions. This model leverages various levels of granularity within the data to guide the forward process of the diffusion model. Additionally, the paper designs a multi-granularity guidance loss function and explores optimal configurations for different granularity levels, proposing a practical rule of thumb. Extensive experiments are conducted to showcase its precision and effectiveness.

Strengths

  1. The research problem addressed in this study is of paramount significance and holds great interest. Accurate time series prediction has broad applications, including tasks like anomaly detection and energy consumption control.
  2. The paper introduces a novel and intriguing approach by linking various granularities in the time series with the forward process in the diffusion model.
  3. The paper is excellently written and presented in a clear and comprehensible manner.

Weaknesses

  1. Some related works can be further discussed.
  2. There is only one metric in the main experiment, which is not enough.
  3. Compared with the baseline, the performance improvement is not obvious.
  4. The use of the RNN architecture requires further explanation.

After rebuttal

Most of my concerns have been addressed.

Questions

  1. There have been some similar works, such as TimeDiff[1] and D3VAE[2], which also apply the diffusion model. What are the technical advantages over these studies?
  2. The paper designed MG-TSD based on the diffusion model. Why is it not compared with diffusion-based models in the baseline? Besides, some newer Transformer-based models, such as PatchTST[3] and Autoformer[4], should be compared in your experiments.
  • [1] Shen L, Kwok J. Non-autoregressive conditional diffusion models for time series prediction[J]. arXiv preprint arXiv:2306.05043, 2023.
  • [2] Li Y, Lu X, Wang Y, et al. Generative time series forecasting with diffusion, denoise, and disentanglement[J]. Advances in Neural Information Processing Systems, 2022, 35: 23009-23022.
  • [3] Nie Y, Nguyen N H, Sinthong P, et al. A time series is worth 64 words: Long-term forecasting with transformers[C]//The Eleventh International Conference on Learning Representations. 2022.
  • [4] Wu H, Xu J, Wang J, et al. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting[J]. Advances in Neural Information Processing Systems, 2021, 34: 22419-22430.
Comment

We provide more statistics summaries to evaluate the performance improvement. We have recorded the rankings of different methods based on their performance on various datasets. The results are shown in table below.

Table D: Performance ranking of methods across datasets

| Method | $\text{CRPS}_{\text{sum}}$ Avg Rank | $\text{NRMSE}_{\text{sum}}$ Avg Rank | $\text{NMAE}_{\text{sum}}$ Avg Rank |
|---|---|---|---|
| Vec-LSTM ind-scaling | 8.8 | 7.2 | 7.3 |
| GP-Scaling | 7.3 | 6.3 | 6.0 |
| GP-Copula | 5.7 | 4.0 | 3.3 |
| Autoformer | 5.3 | 5.2 | 6.5 |
| PatchTST | 4.0 | 5.3 | 5.3 |
| D3VAE | 2.2 | 8.3 | 8.5 |
| TimeDiff | 10.0 | 8.3 | 8.0 |
| TimeGrad | 9.7 | 2.3 | 2.5 |
| TACTiS | 8.2 | 6.8 | 6.3 |
| MG-TSD | 1.0 | 1.2 | 1.2 |

From the results, our method achieved the best rank among all baselines on the six datasets in terms of $\text{CRPS}_{\text{sum}}$. As for the $\text{NRMSE}_{\text{sum}}$ and $\text{NMAE}_{\text{sum}}$ metrics, we achieved the best performance on all datasets except KDD-cup, where the GP-Copula method was slightly better.

Additionally, we have recorded the percentage improvement of the MG-TSD method relative to the top-performing baseline in the table below for $\text{CRPS}_{\text{sum}}$.

Table E: Relative improvement of MG-TSD method over top-performing baseline

| Metric | Solar | Electricity | Traffic | KDD-cup | Taxi | Wikipedia |
|---|---|---|---|---|---|---|
| $\text{CRPS}_{\text{sum}}$ | 7.6% | 35.8% | 22.0% | 36.7% | 7.6% | 4.7% |

On the six datasets, we achieved $\text{CRPS}_{\text{sum}}$ improvements ranging from 4.7% to 35.8%. The most modest improvement was on the Wikipedia dataset, which might be due to its high-dimensional nature. The improvement of our model is non-trivial, given the challenges posed by these datasets.

Action Taken: In Tables 1, 4, and 5, we have updated the results to reflect the best performance of the MG-TSD model with multiple granularities, rather than just two.

W4. The use of the RNN architecture requires further explanation.

Thanks for the question. To clarify, the Temporal Process Module serves as an encoder in the model, which captures and compresses the temporal dependencies in time series data up to a certain timestep. The Recurrent Neural Network (RNN) is adopted in our model as it presents a feasible and convenient method of implementation for this module. It is worth noting that our Temporal Process Module is adaptable and can accommodate other encoders capable of handling variable-length time steps. This includes, but is not limited to, autoregressive RNNs in an n-to-1 manner or transformers in an n-to-m manner.

Furthermore, the choice of RNN as the backbone encoder in TPM aligns with various previous works, including TimeGrad[8] and GP-copula[9]. These studies utilize RNN for modeling temporal dependencies, which attests to the effectiveness of RNN. Notably, differing from these existing works, which typically use a single RNN for single-granularity data, we employ multiple RNNs in the model to leverage the pattern and temporal information in time series data at various granularity levels. These RNNs operate without parameter sharing and are trained simultaneously with the Guided Diffusion Process Module.

References:

[1] Bjerregård M B, Møller J K, Madsen H. An introduction to multivariate probabilistic forecast evaluation[J]. Energy and AI, 2021, 4: 100058.

[2] Rasul K, Sheikh A S, Schuster I, et al. Multivariate probabilistic time series forecasting via conditioned normalizing flows[J]. arXiv preprint arXiv:2002.06103, 2020.

[3] Lin L, Li Z, Li R, et al. Diffusion models for time series applications: A survey[J]. arXiv preprint arXiv:2305.00624, 2023.

[4] Shen L, Kwok J. Non-autoregressive conditional diffusion models for time series prediction[J]. arXiv preprint arXiv:2306.05043, 2023.

[5] Li Y, Lu X, Wang Y, et al. Generative time series forecasting with diffusion, denoise, and disentanglement[J]. Advances in Neural Information Processing Systems, 2022, 35: 23009-23022.

[6] Nie Y, Nguyen N H, Sinthong P, et al. A time series is worth 64 words: Long-term forecasting with transformers[C]//The Eleventh International Conference on Learning Representations. 2022.

[7] Wu H, Xu J, Wang J, et al. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting[J]. Advances in Neural Information Processing Systems, 2021, 34: 22419-22430.

[8] Rasul K, Seward C, Schuster I, et al. Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting[C]//International Conference on Machine Learning. PMLR, 2021: 8857-8868.

[9] Salinas D, Bohlke-Schneider M, Callot L, et al. High-dimensional multivariate forecasting with low-rank Gaussian copula processes[J]. Advances in Neural Information Processing Systems, 2019, 32.

Comment

Table A: Comparison of $\text{CRPS}_{\text{sum}}$ (smaller is better) of models on six real-world datasets. The reported mean and standard error are obtained from 10 independent re-training and evaluation runs.

| Method | Solar | Electricity | Traffic | KDD-cup | Taxi | Wikipedia | Avg Rank |
|---|---|---|---|---|---|---|---|
| Vec-LSTM ind-scaling | 0.4825±0.0027 | 0.0949±0.0175 | 0.0915±0.0197 | 0.3560±0.1667 | 0.4794±0.0343 | 0.1254±0.0174 | 8.8 |
| GP-Scaling | 0.3802±0.0052 | 0.0499±0.0031 | 0.0753±0.0152 | 0.2983±0.0448 | 0.2265±0.0210 | 0.1351±0.0612 | 7.3 |
| GP-Copula | 0.3612±0.0035 | 0.0287±0.0005 | 0.0618±0.0018 | 0.3157±0.0462 | 0.1894±0.0087 | 0.0669±0.0009 | 5.7 |
| LSTM-MAF | 0.3427±0.0082 | 0.0312±0.0046 | 0.0526±0.0021 | 0.2919±0.1486 | 0.2295±0.0082 | 0.0763±0.0051 | 5.3 |
| Transformer-MAF | 0.3532±0.0053 | 0.0272±0.0017 | 0.0499±0.0011 | 0.2951±0.0504 | 0.1531±0.0038 | 0.0644±0.0037 | 4.0 |
| TimeGrad | 0.3335±0.0653 | 0.0232±0.0035 | 0.0414±0.0112 | 0.2902±0.2178 | 0.1255±0.0207 | 0.0555±0.0088 | 2.2 |
| D3VAE | 0.4449±0.0375 | 0.1424±0.0883 | 0.3967±0.1165 | 0.4861±0.0517 | 0.5909±0.4645 | 1.9950±1.9874 | 10.0 |
| TimeDiff | 1.3323±0.0305 | 0.3505±0.0075 | 0.4778±0.0058 | 0.3622±0.0127 | 0.4517±0.0101 | 0.1140±0.0105 | 9.7 |
| TACTiS | 0.4209±0.0330 | 0.0259±0.0019 | 0.1093±0.0076 | 0.5406±0.1584 | 0.2070±0.0159 | - | 8.2 |
| MG-Input | 0.3239±0.0427 | 0.0238±0.0035 | 0.0658±0.0065 | 0.2977±0.1163 | 0.1592±0.0087 | 0.0567±0.0091 | 3.8 |
| MG-TSD | 0.3081±0.0099 | 0.0149±0.0017 | 0.0323±0.0125 | 0.1837±0.0865 | 0.1159±0.0132 | 0.0529±0.0054 | 1.0 |

Table B: Comparison of $\text{NRMSE}_{\text{sum}}$ (smaller is better) of models on six real-world datasets. The reported mean and standard error are obtained from 10 independent re-training and evaluation runs.

| Method | Solar | Electricity | Traffic | KDD-cup | Taxi | Wikipedia | Avg Rank |
|---|---|---|---|---|---|---|---|
| Vec-LSTM ind-scaling | 0.9952±0.0077 | 0.1439±0.0228 | 0.1451±0.0248 | 0.4461±0.1833 | 0.6398±0.0390 | 0.1618±0.0162 | 7.2 |
| GP-Scaling | 0.9004±0.0095 | 0.0811±0.0062 | 0.1469±0.0181 | 0.3445±0.0621 | 0.3598±0.0285 | 0.1710±0.1006 | 6.3 |
| GP-Copula | 0.8279±0.0053 | 0.0512±0.0009 | 0.1282±0.0033 | 0.2605±0.0227 | 0.3125±0.0113 | 0.0930±0.0076 | 4.0 |
| Autoformer | 0.7046±0.0000 | 0.0475±0.0000 | 0.0951±0.0000 | 0.8984±0.0000 | 0.3498±0.0000 | 0.1052±0.0000 | 5.2 |
| PatchTST | 0.7270±0.0000 | 0.0474±0.0000 | 0.1897±0.0000 | 0.5137±0.0000 | 0.3690±0.0000 | 0.0915±0.0000 | 5.3 |
| D3VAE | 0.7472±0.0508 | 0.1640±0.0928 | 0.4722±0.1197 | 0.5628±0.0419 | 0.7624±0.5598 | 2.2094±2.1646 | 8.3 |
| TimeDiff | 1.5985±0.0359 | 0.3714±0.0073 | 0.5520±0.0087 | 0.4955±0.0147 | 0.5479±0.0084 | 0.1412±0.0099 | 8.3 |
| TimeGrad | 0.6953±0.0845 | 0.0348±0.0057 | 0.0653±0.0244 | 0.4092±0.1332 | 0.2365±0.0386 | 0.0870±0.0106 | 2.3 |
| TACTiS | 0.8532±0.0851 | 0.0427±0.0023 | 0.2270±0.0159 | 0.6513±0.1767 | 0.3387±0.0097 | - | 6.8 |
| MG-TSD | 0.6178±0.0418 | 0.0241±0.0030 | 0.0563±0.0230 | 0.3001±0.0997 | 0.2334±0.0313 | 0.0810±0.0057 | 1.2 |

Table C: Comparison of $\text{NMAE}_{\text{sum}}$ (smaller is better) of models on six real-world datasets. The reported mean and standard error are obtained from 10 independent re-training and evaluation runs.

| Method | Solar | Electricity | Traffic | KDD-cup | Taxi | Wikipedia | Avg Rank |
|---|---|---|---|---|---|---|---|
| Vec-LSTM ind-scaling | 0.5091±0.0027 | 0.1261±0.0211 | 0.1042±0.0228 | 0.4193±0.1902 | 0.4974±0.0351 | 0.1416±0.0180 | 7.3 |
| GP-Scaling | 0.4945±0.0065 | 0.0648±0.0046 | 0.0975±0.0163 | 0.2892±0.0550 | 0.2867±0.0264 | 0.1452±0.1029 | 6.0 |
| GP-Copula | 0.4302±0.0046 | 0.0312±0.0007 | 0.0769±0.0022 | 0.2140±0.0124 | 0.2390±0.0098 | 0.0659±0.0061 | 3.3 |
| Autoformer | 0.6368±0.0000 | 0.0388±0.0000 | 0.0684±0.0000 | 0.7658±0.0000 | 0.2652±0.0000 | 0.1239±0.0000 | 6.5 |
| PatchTST | 0.4351±0.0000 | 0.0350±0.0000 | 0.1219±0.0000 | 0.4497±0.0000 | 0.2887±0.0000 | 0.0625±0.0000 | 5.3 |
| D3VAE | 0.4457±0.0377 | 0.1434±0.0892 | 0.3992±0.1177 | 0.4874±0.0520 | 0.6080±0.5061 | 2.0151±2.0005 | 8.5 |
| TimeDiff | 1.3343±0.0305 | 0.3519±0.0075 | 0.4782±0.0058 | 0.3630±0.0127 | 0.4521±0.0102 | 0.1146±0.0106 | 8.0 |
| TimeGrad | 0.3694±0.0400 | 0.0266±0.0049 | 0.0410±0.0089 | 0.3614±0.1334 | 0.1365±0.0193 | 0.0631±0.0080 | 2.5 |
| TACTiS | 0.4448±0.0313 | 0.0310±0.0015 | 0.1352±0.0159 | 0.6078±0.1718 | 0.2244±0.0036 | - | 6.3 |
| MG-TSD | 0.3347±0.0220 | 0.0178±0.0018 | 0.0370±0.0140 | 0.2463±0.0865 | 0.1300±0.0150 | 0.0601±0.0057 | 1.2 |

W3. Concerns about the significance of the model’s improvements compared to baselines.

For clarity, the original Table 1 results illustrate the MG-TSD model's performance with two granularities. Our results in Table 3 suggest that employing more granularity levels could lead to further performance improvements. We have updated the results in Tables 1, 4, and 5 to better reflect the MG-TSD model's optimal performance using multiple granularities.

Comment

Thank you very much for your careful review and constructive suggestions! Please find our response to your questions and concerns.

Q1&W1. Some related works can be further discussed. There have been some similar works, such as TimeDiff[1] and D3VAE[2], which also apply the diffusion model. What are the technical advantages over these studies?

Diffusion-based methods that use generations from conditional distributions as predictions are typically probabilistic methods. In contrast, sequence-based (including Transformer-based) models are mostly deterministic methods.

Autoformer and PatchTST are deterministic time-series forecasting models built on purpose-designed Transformers, and thus differ from MG-TSD in the scope of prediction. D3VAE and TimeDiff are probabilistic time-series forecasting methods, which are closer to the scope of our work.

Both D3VAE and TimeDiff involve diffusion probabilistic models. D3VAE utilizes only the forward diffusion process, and the prediction stage is largely handled by a VAE architecture, which is fundamentally different from how MG-TSD generates forecasts.

In contrast, both TimeDiff and MG-TSD utilize a conditional denoising process over time intervals to perform forecasting. The major difference between TimeDiff and MG-TSD lies in how information within the data is incorporated: TimeDiff is trained with a mix-up of hidden contexts and future ground truths as sample conditioning, while MG-TSD utilizes multi-granularity guidance, leveraging intrinsic coarse-to-fine features within the data.

The technical advantages of diffusion-based time series forecasting are summarized below:

  • Uncertainty quantification. A primary advantage of probabilistic models is their capability to capture the data distribution instead of just point estimates. This allows for the convenient construction of prediction intervals from multiple outputs of the diffusion model. Knowing the data distribution at a specific timestamp, one can quantify the prediction uncertainty and provide more reliable forecasts; it also enables convenient evaluation of the probability of extreme events. For instance, in the wind power sector, reliable forecasts are essential: an unexpected extreme event causing a wind farm to shut down can wipe out months of revenue. This underscores the importance of probabilistic forecasting, which considers both the expected power output and the uncertainty of the forecast, thereby aiding in the minimization of such risks [1]. (A minimal sketch of constructing prediction intervals from samples follows this list.)

  • The capability to model arbitrary distributions without parametric assumptions. This capacity of diffusion models also applies to modeling the distribution of time series. A flaw in other distribution-modeling methods is that they are strictly constrained by the functional structure of their target distributions. For example, a previous choice, Transformer-MAF[2], models multivariate time series with an autoregressive deep learning model in which the data distribution is expressed by a conditional normalizing flow. Diffusion-based methods, conversely, can offer a less restrictive solution, as indicated in [3].
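As an illustration of the first point above, here is a minimal sketch of building central prediction intervals from repeated diffusion draws; the helper is hypothetical, not the paper's code.

```python
import numpy as np

def prediction_interval(samples, level=0.9):
    """Build a central prediction interval from diffusion-model draws.

    samples: (num_samples, horizon) array of sampled future trajectories.
    Returns (lower, median, upper) arrays of length `horizon`.
    """
    lo, hi = (1 - level) / 2, 1 - (1 - level) / 2
    return (np.quantile(samples, lo, axis=0),
            np.quantile(samples, 0.5, axis=0),
            np.quantile(samples, hi, axis=0))
```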

Action Taken: We have included additional experimental results with respect to these mentioned works, on datasets of different dimensions and predicting lengths. Overall, the performance of MG-TSD is superior. Please refer to the reply to Q2&W2 below for numerical results. We have also included these results in the updated manuscript.

Q2&W2. The paper designed MG-TSD based on the diffusion model. Why is it not compared with diffusion-based models in the baseline? Besides, some newer Transformer-based models, such as PatchTST[6] and Autoformer[7], should be compared in your experiments. There is only one metric in the main experiment, which is not enough.

Thanks for suggesting additional references. In our original benchmark experiment, the TimeGrad method included in our baseline is a diffusion-based model. To ensure comprehensive comparisons, we have now incorporated all the mentioned works, TimeDiff[4], D3VAE[5], PatchTST[6], and Autoformer[7], into our baselines.

Furthermore, we have broadened our evaluation metrics to include $\text{NMAE}_{\text{sum}}$ (Normalized MAE) and $\text{NRMSE}_{\text{sum}}$ (Normalized RMSE). Detailed results for these additional metrics can now be found in the updated appendix, particularly in Tables 4 and 5.

Here, we attach the benchmark experiment results for the three metrics $\text{CRPS}_{\text{sum}}$, $\text{NMAE}_{\text{sum}}$, and $\text{NRMSE}_{\text{sum}}$ below. These additional experiments validate that our method consistently delivers state-of-the-art performance. (A sample-based sketch of the $\text{CRPS}_{\text{sum}}$ computation is given below.)
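For reference, a hedged, sample-based sketch of this metric using the common empirical CRPS estimator $\mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$; the sum-then-score and normalization conventions follow common usage in this literature and may differ in detail from the paper's implementation.

```python
import numpy as np

def crps_empirical(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    term1 = np.abs(samples - y).mean()
    term2 = np.abs(samples[:, None] - samples[None, :]).mean()
    return term1 - 0.5 * term2

def crps_sum(samples, target):
    """samples: (S, T, D) forecast draws; target: (T, D) ground truth.
    Sum across dimensions first, score each timestep, then normalize
    by the summed absolute target."""
    s_sum, y_sum = samples.sum(-1), target.sum(-1)   # (S, T) and (T,)
    scores = [crps_empirical(s_sum[:, t], y_sum[t]) for t in range(y_sum.size)]
    return np.sum(scores) / np.abs(target).sum()
```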

Comment

We thank all the reviewers for the effort engaged in the review phase. We really appreciate those constructive comments and insights! We are truly grateful for the reviewers' positive feedback on the innovative nature of our methods, our presentation and derivations, and our detailed ablation study.

Based on these valuable comments, we have made the following revisions in our updated manuscript.

  • We have conducted additional benchmark experiments. The new results are summarized and now included in the revised supplementary material, Appendix B.1 (Tables 4 and 5). Below is a brief summary of what we have done:
  • We have added four new baselines: TimeDiff, D3VAE, PatchTST, and Autoformer.
  • We have included two additional evaluation metrics in the main experiment, $\text{NRMSE}_{\text{sum}}$ and $\text{NMAE}_{\text{sum}}$, for a more comprehensive comparison.
  • We have conducted more extensive testing of our method:
  • We added an experiment to test the performance of our method over a longer forecast time horizon. The experiment settings, results, and findings are included in Appendix B.2.1.
  • We conducted another experiment to evaluate the time and memory usage of the MG-TSD model during training. The experiment settings, results, and findings are included in Appendix B.2.2.

We have carefully proofread our manuscript and corrected any typos, as well as imprecise labels or captions in the plots.

We are intensively working on revisions. Our code will be made available to the public soon.

AC Meta-Review

The paper presents an intriguing approach to time series forecasting with the MG-TSD model, leveraging diffusion models and multi-granularity data. The strengths lie in the innovative methodology, clear presentation, and robust experiments. The weaknesses, such as limited metric diversity and unclear performance improvement over baselines, need to be addressed for a more comprehensive evaluation in the final version of the draft. Furthermore, comparisons with relevant existing works and Transformer-based models would enhance the paper's contribution and practical relevance.

Why Not a Higher Score

  1. Some related works could be further discussed, and additional metrics in the main experiment could provide a more comprehensive evaluation.
  2. The performance improvement over baseline models is not as pronounced as expected, and the use of the RNN architecture requires further explanation.

Why Not a Lower Score

All reviewers reached a consensus to (marginally) accept this paper.

Final Decision

Accept (poster)