Enhancing Foundation Models for Time Series Forecasting via Wavelet-based Tokenization
We develop a wavelet-based tokenizer and pretrain a foundation model for time series forecasting on time-localized frequencies. Results show excellent generalization performance and superior ability to capture complex patterns of practical relevance.
Abstract
Reviews and Discussion
To build an effective discrete vocabulary for a real-valued sequential input, this paper develops WaveToken, a wavelet-based tokenizer that allows models to learn complex representations directly in the space of time-localized frequencies. The proposed method performs well while using a much smaller vocabulary and exhibits superior generalization capabilities.
update after rebuttal
Thank you very much for your responses. I would like to keep my rating.
Questions for Authors
Scaling:
For the relationship between model size and model performance, most of the results are as expected. But in zero-shot settings with VRSE as the metric, the smaller WaveToken (Small) is much better than the larger WaveToken (Base); can you provide some discussion of this?
Efficiency:
In line 435, the authors point out the slower decoding and inference time as a limitation of the proposed method. Can you provide some detailed comparison results about inference time? This can give the reader a fuller understanding of the advantages and disadvantages of the proposed method.
Visualization:
Figure 1 only shows the visualization results on synthetically generated time series. Please provide the visualization results on some benchmark datasets to make the evaluation results in Section 3 more persuasive.
Claims and Evidence
n/a
Methods and Evaluation Criteria
This paper conducts experiments on many datasets for evaluation. These datasets make sense for the problem or application at hand.
Theoretical Claims
n/a
Experimental Designs or Analyses
Baselines:
- Please provide comparisons with some of the latest time series forecasting foundation models, such as Timer [1] and TimeMoE [2].
- This paper claims that the proposed method is better than some task-specific models. To better support this claim, can you provide comparisons with more recent task-specific models, such as ModernTCN [3] and iTransformer [4]?
[1] Timer: Generative Pre-trained Transformers Are Large Time Series Models
[2] Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts
[3] ModernTCN: A Modern Pure Convolution Structure for General Time Series Analysis
[4] iTransformer: Inverted Transformers Are Effective for Time Series Forecasting
Metrics:
The evaluation results vary under different metrics. To make the results persuasive, please provide more discussion on the evaluation metrics.
- For in-domain settings, why does WaveToken only perform better than Chronos on VRSE?
- For zero-shot settings, PatchTST performs on par with WaveToken on MASE and TFT performs on par with WaveToken on WQL, but they lag far behind on other metrics. What causes this difference?
Supplementary Material
I checked the appendix for more results.
Relation to Existing Literature
n/a
Essential References Not Discussed
n/a
Other Strengths and Weaknesses
Strengths:
- This paper develops WaveToken, a novel wavelet-based tokenizer that allows models to learn complex representations directly in the space of time-localized frequencies.
- The proposed method performs well while using a much smaller vocabulary and exhibits superior generalization capabilities.
Weaknesses:
- More discussion on the evaluation metrics is needed. Please see Experimental Designs Or Analyses.
- Efficiency. In line 435, the authors point out the slower decoding and inference time as a limitation of the proposed method.
Other Comments or Suggestions
n/a
Thank you so much for these useful comments and questions that helped improve our paper. Below we provide point-by-point answers to each section in your review.
Additional baselines
Based on your and Reviewer BBt2’s suggestion, we evaluated TimeMOE and TTM-R2 models. WaveToken clearly outperforms these models on both benchmarks. See our reply to Reviewer BBt2 for specific details.
Task-specific baselines: We already reported results for TFT and PatchTST, which are known state-of-the-art models. Recent independent work found PatchTST performs comparably to iTransformer on many datasets. Additional task-specific results are also available in Ansari et al. (2024). We thank you for suggesting iTransformer, but due to the limited rebuttal time, we couldn’t finalize experiments on all 42 datasets. We’ll include these results in the final manuscript.
For in-domain settings, why does WaveToken only perform better than Chronos on VRSE?
Comparisons with Chronos have to be made by looking at model pairs of the same size. Doing so, WaveToken achieves lower (i.e., better) scores than Chronos 75% of the time across all metrics, and achieves the best average rank across all metrics for both Benchmarks I and II (Figures 9 and 10).
WaveToken’s superiority in VRSE specifically arises from its wavelet decomposition, explicitly capturing complex time-frequency structures. VRSE measures forecast-truth similarity in terms of frequency magnitude, aligning naturally with wavelet-based tokenization.
The core motivation of our work is to develop a general purpose tokenizer that seamlessly captures global and local patterns. Qualitative results (Figure 1 and Section 3.3) demonstrate the impressive performance of WaveToken on complex edge cases that are pervasive in practical applications, where existing foundation models fail almost completely.
For zero-shot settings, PatchTST performs on par with WaveToken on MASE and TFT performs on par with WaveToken on WQL, but they lag far behind on other metrics. What causes this difference?
This is due to the original design principles of these models. PatchTST was originally a point forecaster. We use the implementation from GluonTS, which adapts PatchTST for probabilistic forecasting. However, it is possible that some design elements of PatchTST are better suited for point predictions, hence its superior performance on the MASE. Similarly, TFT was originally designed to be a probabilistic forecaster trained with quantile regression, which is similar to the WQL up to scaling.
Neither PatchTST nor TFT directly aim at producing forecasts whose frequency content closely matches that of the true time series, which is the target property of the VRSE metric.
[...] in zero-shot settings with VRSE as the metric, the smaller WaveToken (Small) is much better than the larger WaveToken (Base); can you provide some discussion of this?
We indeed noticed this pattern as well while conducting experiments. Chronos (Ansari et al. (2024)) exhibits a similar phenomenon. Both Chronos and WaveToken share the same architecture (T5), so the answer might lie in this model. However, we do not currently have a definitive explanation.
Can you provide some detailed comparison results about inference time?
The wavelet tokenizer does not influence inference speed and adds only a minimal overhead, since it is a linear-time convolution-and-downsampling operation. Inference times (average of 10 runs, batch of 32 series, context=512, horizon=64) for both Chronos Base and WaveToken Base — which only differ in the tokenizer — are reported below. The difference is negligible, and could be further reduced by applying the DWT on a GPU.
- Chronos: 6.56s +/- 1.02ms
- WaveToken: 6.86s +/- 1.14ms
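For completeness, a minimal sketch of how the tokenizer overhead can be timed in isolation is shown below. It uses PyWavelets' `wavedec` as a stand-in for the DWT step of our tokenizer; the wavelet family, decomposition level, and batch shape are illustrative assumptions rather than our exact benchmark configuration.

```python
# Minimal timing sketch (not our benchmark script). Only the wavelet
# "tokenization" step is timed here, since the T5 decoding dominates anyway.
import time
import numpy as np
import pywt

def tokenize_batch(batch, wavelet="bior2.2", level=3):
    """Per-series DWT: a linear-time convolution-and-downsampling operation."""
    return [pywt.wavedec(series, wavelet, level=level) for series in batch]

rng = np.random.default_rng(0)
batch = rng.normal(size=(32, 512))  # 32 series, context length 512

start = time.perf_counter()
for _ in range(10):
    tokenize_batch(batch)
elapsed = (time.perf_counter() - start) / 10
print(f"avg DWT time per batch: {elapsed * 1e3:.2f} ms")
# On CPU this is on the order of milliseconds, i.e. negligible relative to
# several seconds of autoregressive decoding.
```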
The main bottleneck remains the autoregressive T5 structure. Future work could explore integrating wavelets with PatchTST or TimesFM architectures so that they can process wavelet coefficients and retain the expressivity of the time-frequency domain, while achieving significant speed improvements.
Please provide the visualization results on some benchmark datasets to make the evaluation results in Section 3 more persuasive.
Figure 1 demonstrates WaveToken’s superiority over existing models on controlled practical scenarios (e.g., non-stationarity, sparse spikes). Due to the tight rebuttal deadline, we prioritized additional baseline evaluations requested by Reviewers BBt2 and Yi8n. We'll include further visualizations on a diverse set of real-world datasets in the final manuscript, so as to strengthen the qualitative evaluation of Section 3 as you suggested.
Thank you again for your time and engagement in the review process. We hope our responses above have satisfactorily addressed your concerns. If so, we request you to consider raising your score to reflect that. If you have further questions, we would be happy to respond to them.
Thank you very much for your responses. I would like to keep my rating.
This paper introduces WaveToken, a wavelet-based tokenization framework designed to enhance foundation models for time series forecasting. The approach leverages the multi-resolution properties of wavelets to transform real-valued time series into compact, expressive token sequences, enabling efficient learning of temporal patterns across diverse domains.
Questions for Authors
- The autoregressive nature of the T5 architecture paired with wavelet tokenization may lead to slower inference times compared to non-autoregressive models (e.g., patch-based approaches like TimesFM). This could hinder real-time applications requiring rapid predictions.
- The choice of wavelet family (e.g., Biorthogonal-2.2) and thresholding method (e.g., no-thresholding) is task-dependent and may not universally optimize performance across all datasets. Hyperparameter sensitivity could limit reproducibility in diverse real-world scenarios.
- While WaveToken performs well for H×2 horizons, its WQL performance slightly lags behind TimesFM for H×3 forecasts. This suggests potential limitations in extrapolating patterns over very long horizons, which may require architectural adjustments (e.g., hierarchical attention).
Claims and Evidence
Claims that the wavelet decomposition provides a compact and expressive representation are supported by the superior performance of WaveToken on 42 datasets, particularly in capturing complex patterns like exponential trends and sparse spikes (Figure 1). Meanwhile, the authors claim that WaveToken generalizes well to unseen datasets in both in-domain and zero-shot settings. This is supported by empirical results showing WaveToken outperforming state-of-the-art models like Chronos and TimesFM.
Methods and Evaluation Criteria
The proposed wavelet decomposition, thresholding, and quantization pipeline is justified for encoding temporal and frequency information. The metrics (WQL, MASE, VRSE) cover probabilistic accuracy, point forecasting, and frequency-domain fidelity.
Theoretical Claims
The paper focuses on empirical validation rather than theoretical proofs. Key claims (e.g., wavelet sparsity, multi-scale learning) are supported by prior wavelet theory (e.g., Mallat, 2009) but lack formal derivations.
Experimental Designs or Analyses
Ablation studies (vocabulary size, wavelet family, decomposition level) systematically validate hyperparameters. Results are averaged over three seeds, and comparisons include both foundation models and task-specific models. However, the use of T5’s encoder-decoder architecture may introduce bias in autoregressive performance.
Supplementary Material
Appendices provide technical details on wavelets, thresholding methods, evaluation metrics, and dataset splits.
Relation to Existing Literature
For Wavelet forecasting, this paper extends prior work (e.g., Sasal et al., 2022) by integrating wavelets into a foundation model framework.
Essential References Not Discussed
For non-autoregressive models, PatchTST (Nie et al., 2022) is cited but not discussed in the context of inference speed.
Other Strengths and Weaknesses
- This paper proposes an innovative wavelet-based tokenization, which uses wavelets to decompose time series into hierarchical frequency components (approximations and details), enabling the model to capture both global trends and local patterns efficiently.
- The proposed method achieves high expressiveness via sparse wavelet coefficients while reducing computational overhead, using only 1024 tokens (1/4 of Chronos's vocabulary).
- The attention mechanism analysis reveals the model’s ability to exploit wavelet coefficient hierarchies, enhancing transparency.
Other Comments or Suggestions
The code is not publicly available, hindering replication.
update after rebuttal
I appreciate the authors' efforts in addressing my concerns, particularly regarding the inference efficiency, the rationale behind wavelet tokenizer design choices, and the scope of zero-shot generalization. I believe the authors have addressed my questions, and I maintain my original score.
Thank you for these useful comments and questions that helped improve our paper. Below we provide point-by-point answers to each section in your review.
The code is not publicly available, hindering replication.
We plan to release a user-friendly research package complete with details on how to train and evaluate our tokenizer and models. Unfortunately, we could not share our code for review at this stage due to pending legal approvals.
The autoregressive nature of the T5 architecture paired with wavelet tokenization may lead to slower inference times compared to non-autoregressive models (e.g., patch-based approaches like TimesFM). This could hinder real-time applications requiring rapid predictions.
We agree that competitive inference times are paramount to achieve an effective utilization of these models in real-time applications. We note that in practice the wavelet tokenizer does not influence inference speed and adds only a minimal overhead, since it is implemented as a convolution-and-downsampling operation which runs in linear time. Below we report running times (averaged over 10 repetitions) for both Chronos (Base) and WaveToken (Base) — which use the same underlying architecture and only differ in the tokenization pipeline — when forecasting a batch of 32 time series, with context_length=512 and prediction_length=64.
- Chronos: 6.56s +/- 1.02ms
- WaveToken: 6.86s +/- 1.14ms
The difference in inference times is negligible, and could be further reduced by applying the wavelet decomposition in parallel on the GPU.
Similar to Chronos, what limits inference speed for WaveToken is the autoregressive nature of forecast generation. Crucially, this is the reason why patch-based methods such as TimesFM are faster than WaveToken. It would be interesting to apply ideas from our work to patch-based architectures as part of future work. In this way, one would retain the expressivity of the time-frequency domain while achieving significant speed improvements.
The choice of wavelet family (e.g., Biorthogonal-2.2) and thresholding method (e.g., no-thresholding) is task-dependent and may not universally optimize performance across all datasets. Hyperparameter sensitivity could limit reproducibility in diverse real-world scenarios.
As detailed in Section 3.4, we choose hyperparameters by analyzing their probabilistic and point forecasting performance across all the tasks in the in-domain and zero-shot benchmarks, not just on a specific task/dataset. These benchmarks are already quite large (42 datasets in total), hence they ensure that the chosen hyperparameters indeed provide optimal performance for a variety of real-world domains, seasonalities and frequencies. These choices lead to superior generalization performance, as can be seen from the results on the Zero-Shot benchmark of Figure 3 (panel B), where WaveToken even outperforms (or is competitive with) task-specific models that were trained separately on each single dataset.
While WaveToken performs well for H×2 horizons, its WQL performance slightly lags behind TimesFM for H×3 forecasts. This suggests potential limitations in extrapolating patterns over very long horizons, which may require architectural adjustments (e.g., hierarchical attention).
We note that TimesFM seems to outperform WaveToken for longer (H×3) horizons only with respect to the weighted quantile loss, which measures the accuracy of probabilistic forecasts. On the other two metrics, namely the mean absolute scaled error and the visual relative squared error (VRSE), WaveToken outperforms all other models.
Thank you again for your time and engagement in the review process. We hope our responses above have satisfactorily addressed your concerns. If so, we request you to consider raising your score to reflect that. If you have further questions, we would be happy to respond to them.
Thank you for the detailed response. I appreciate the authors' efforts in addressing my concerns, particularly regarding the inference efficiency, the rationale behind wavelet tokenizer design choices, and the scope of zero-shot generalization. I believe the authors have addressed my questions, and I maintain my original score.
This paper introduces WaveToken, a wavelet-based tokenization method for time series forecasting. It decomposes time series into wavelet coefficients, which are then used to autoregressively predict future values. The method involves standardizing, decomposing, thresholding, and quantizing the coefficients, and training with a cross-entropy loss. WaveToken performs well across various datasets, especially in zero-shot settings, using a smaller vocabulary while achieving competitive forecasting performance. It effectively handles complex patterns like non-stationary signals and trends, offering an efficient approach to time series forecasting.
Questions for Authors
- See above questions in Methods And Evaluation Criteria and Experimental Designs Or Analyses.
Claims and Evidence
Claims are well supported.
Methods and Evaluation Criteria
Overall, the methods and evaluation are convincing. I have a few questions as follows:
- Since the wavelet transformation is closely related to the input length, and the experiments were conducted under a fixed setting of input length 512 and output length 64, I would like to know if the proposed method can handle variable-length inputs and outputs with a single model.
- Regarding thresholding, are the filtered coefficients set to zero or completely removed? In other words, does the number of input coefficients remain C? Additionally, how are excessively large coefficients handled?
- The paper tested various thresholding methods and provided comprehensive experimental results, with different model settings requiring different thresholding techniques. Is there a possibility to allow the model to dynamically select the thresholding method based on the data characteristics?
- From both theoretical and experimental perspectives, why is quantization necessary? After all, one could also directly use the raw coefficients as input and perform autoregressive prediction using continuous loss functions such as MSE or MAE.
- Do outputs share the same quantization bins with inputs?
Theoretical Claims
Not applicable; no new theoretical claims are proposed.
Experimental Designs or Analyses
An ablation on quantization is required, i.e., directly using the raw coefficients as input without quantization and performing continuous autoregressive prediction with an MSE loss.
Supplementary Material
I have reviewed the entire appendix.
Relation to Existing Literature
The objective of this paper is to enhance the efficiency and generalization of time-series foundation models by using wavelet-based discrete tokenization.
Essential References Not Discussed
Not applicable, references are generally comprehensive.
Other Strengths and Weaknesses
The structure of the paper is very clear and easy to follow.
Other Comments or Suggestions
See my comments and suggestions above.
Thank you for these useful comments and questions that helped improve our paper. Below we provide point-by-point answers to each section in your review.
I would like to know if the proposed method can handle variable-length inputs and outputs with a single model.
The current implementation can readily handle variable-length inputs and outputs. More specifically:
- At training time, any input of size up to 512 and output of size up to 64 can be processed directly. If the inputs or outputs are shorter, we simply pad them with NaNs to ensure consistency of the shapes. In practice, we observed that these maximum lengths cover the vast majority (if not all) of practical applications. Note that other popular foundation models for time series forecasting adopt the same strategy, e.g. Chronos by Ansari et al. (2024) and TimesFM by Das et al. (2024).
- At test time, the model can provide forecasts of unlimited length due to its autoregressive nature.
Regarding thresholding, are the filtered coefficients set to zero or completely removed? In other words, does the number of input coefficients remain C? Additionally, how are excessively large coefficients handled?
The filtered coefficients are set to zero. Removing them would lead to variable-length coefficient vectors and a potential mis-representation of different groups of coefficients. In other words, since the model processes concatenated groups of wavelet coefficients, it is important to keep the length of each coefficient group fixed (i.e. no removals) so that the model can learn to forecast (and attend to) approximation coefficients differently from details coefficients.
Is there a possibility to allow the model to dynamically select the thresholding method based on the data characteristics?
All the thresholding methods we tested do indeed try to adapt to the data characteristics. CDF-thresholding, for example, directly looks at the empirical distribution of the detail coefficients (in absolute value) to determine the cutoffs. VisuShrink, on the other hand, applies a threshold based on an estimate of the variance of the input time series. In general, one could in principle choose entirely different thresholding methods for different datasets or domains. Although we have not implemented this feature yet, it represents an interesting area for future work. Thank you for raising this point!
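To make these rules concrete, below is a minimal sketch of a percentile-based (CDF-style) cutoff and a VisuShrink-style universal threshold applied to a vector of detail coefficients. The specific percentile, the MAD-based noise estimate, and the toy data are illustrative assumptions, not our exact implementation.

```python
# Minimal sketch of two data-adaptive thresholding rules for detail coefficients.
# Thresholded entries are set to zero (lengths are preserved); the cutoff choices
# below are illustrative, not our exact settings.
import numpy as np

def cdf_threshold(detail, percentile=90.0):
    """Zero out detail coefficients whose magnitude falls below an empirical
    percentile of |detail| (a CDF-based cutoff)."""
    cutoff = np.percentile(np.abs(detail), percentile)
    return np.where(np.abs(detail) >= cutoff, detail, 0.0)

def visushrink_threshold(detail, n):
    """VisuShrink-style universal threshold sigma * sqrt(2 * log(n)), with sigma
    estimated from the median absolute deviation of the detail coefficients."""
    sigma = np.median(np.abs(detail)) / 0.6745
    cutoff = sigma * np.sqrt(2.0 * np.log(n))
    return np.where(np.abs(detail) >= cutoff, detail, 0.0)

rng = np.random.default_rng(0)
d = rng.normal(scale=0.1, size=256)    # mostly noise...
d[[10, 100, 200]] = [2.0, -1.5, 3.0]   # ...plus a few informative spikes

print(np.count_nonzero(cdf_threshold(d)))                 # keeps ~10% of entries
print(np.count_nonzero(visushrink_threshold(d, n=512)))   # keeps only the spikes
```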
From both theoretical and experimental perspectives, why is quantization necessary?
and
An ablation on quantization is required, i.e., directly using the raw coefficients as input without quantization and performing continuous autoregressive prediction with an MSE loss.
The quantization step is not strictly necessary to apply the ideas presented in our paper. However, the primary focus of this work was to improve tokenization in the context of time series foundation models with discrete vocabularies. Therefore, we need the quantization step to construct a finite vocabulary of tokens that can be directly processed by the T5 architecture without substantial modifications. We focus on models with discrete vocabularies because they present an exciting alternative method to address forecasting problems compared to traditional regression-based approaches and have demonstrated superior performance in prior works.
In principle, one could also sidestep the quantization process by feeding the raw coefficients as embeddings to the transformer, or, as you suggested, by learning to forecast the raw coefficients directly via a continuous-input loss such as MSE/MAE. Both these approaches are perfectly valid ideas. However, they entail completely different models with different analyses and are better suited for independent future works rather than ablations of WaveToken. It would also be interesting to explore ideas based on Wavelets in the context of models that operate on patches (e.g., TimesFM and Moirai). We are hopeful that the community will build upon our work to investigate these ideas.
Do outputs share the same quantization bins with inputs?
Yes, the vocabulary is shared between inputs and outputs. Both are mapped to and from the same bins using the procedure described in Section 2.2 - Quantization.
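A minimal sketch of this shared mapping is shown below, assuming for illustration a uniform set of bin edges; our actual bins are derived from the empirical coefficient distribution on the training corpus, as described in Section 2.2 - Quantization.

```python
# Minimal sketch of a shared coefficient vocabulary: the same bin edges map
# coefficients to token ids for inputs and outputs, and the bin centers map
# token ids back to approximate coefficient values. Uniform bins are assumed
# here for illustration only.
import numpy as np

edges = np.linspace(-5.0, 5.0, num=1023)   # 1023 edges -> 1024 bins/tokens
centers = np.concatenate(([edges[0]], (edges[:-1] + edges[1:]) / 2, [edges[-1]]))

def encode(coeffs):
    """Element-wise quantization: each scalar coefficient gets one token id."""
    return np.digitize(coeffs, edges)

def decode(token_ids):
    """Map token ids back to representative coefficient values (bin centers)."""
    return centers[token_ids]

w = np.array([0.03, -1.2, 4.7, 0.0])              # a few wavelet coefficients
tokens = encode(w)                                # same vocabulary for the context...
reconstructed = decode(tokens)                    # ...and for the forecasted horizon
print(tokens, np.abs(w - reconstructed).max())    # round-trip error below bin width
```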
Thank you again for your time and engagement in the review process. We hope our responses above have satisfactorily addressed your concerns. If so, we request you to consider raising your score to reflect that. If you have further questions, we would be happy to respond to them.
Thank you for your response. I will keep my score at 3 and suggest acceptance. However, I recommend further studying the rationale and necessity of quantization for time series in future work, rather than justifying it simply by "alignment with T5".
The manuscript describes a method to tokenize univariate time series data using the wavelet transform. The proposed method consists of a discrete wavelet transformation followed by a quantization, which generates a number of tokens equal to the length of the input data. The model is then trained to forecast the wavelet coefficients in the forecasting horizon. It is shown that the proposed method achieves accuracy comparable to other state-of-the-art models.
update after rebuttal
I believe that the authors tried to address my concerns. I updated the score, as I believe the manuscript is worth presenting at ICML.
Questions for Authors
- In the first paragraph on page 4, it is mentioned that "wavelet transform leads to compact representations". But as mentioned in the following sentence, "a signal of size N results in N coefficients", which does not look like a compact representation. Can you clarify in what sense the representation is compact?
- I think it would be helpful to elaborate on the quantization, probably adding more details in the SM. In the quantization section, I guess $w \in [[a_k]_J, [d_k]_j]$ should be $w \in [[a_k]_J, [d_k]_{j=1}^{J}]$. Right? Anyway, $w$ is a vector, e.g., $[a_k]_J$. Then how do you quantize a vector?
- Also, for the quantization, how did you decide on a bin size that can be universally applied to diverse data sets? The bin size essentially determines the approximation accuracy. When the data-generating distribution changes, a set of bins that is optimal for one data set may perform terribly for another.
Claims and Evidence
NA
Methods and Evaluation Criteria
The proposed method is relatively straightforward. While there appear to be some limitations, such as the extension to multivariate time series and variable-length forecasts, I think the proposal is interesting and worth reporting to the ML community. The evaluation criteria also make sense.
Theoretical Claims
This paper is empirical.
Experimental Designs or Analyses
It should be noted that the experimental results are obtained by training T5 with the features obtained by WaveToken. Even though the model achieved good accuracy, it is unclear whether this is due to the use of WaveToken or due to T5. A more straightforward approach would be training one of the models used in the benchmark with WaveToken and comparing accuracy. For example, Chronos is anyway trained with the cross-entropy loss. Then, why not train Chronos with WaveToken?
The experiments do not include some of the best performing models, such as TabPFN-TS, TTM, and MOMENT.
Supplementary Material
NA
Relation to Existing Literature
Time series tokenization is of current interest, not only because it allows use of LLM for time series problems, but also because it has shown that tokenizing time series seems to help the models better learn the representation. I believe that the idea of wavelet-based tokenization is timely.
Essential References Not Discussed
The idea of quantizing time series data was proposed long before Ansari et al. (2024). It was proposed in Yeo and Melnyk (JCP 2019), together with a method to impose a structural constraint on the distribution.
Other Strengths and Weaknesses
Strength: The paper is well written and the experiments are well designed to demonstrate the strength of the proposed method.
Other Comments or Suggestions
I find that a few notations are used without being clearly defined. For example, there is $[a_k]_J$ on page 3: what is the dimension of $[a_k]_J$? On page 4, in CDF-thresholding, what is the inverse CDF and what is the thresholding percentile? The coefficients of the context are defined, but what denotes the coefficients of the horizon? While it may not be difficult to guess what they are, I would suggest proofreading the manuscript and clearly defining every mathematical notation to improve clarity.
Thank you for your review and comments that have helped us improve our manuscript. Below we provide point-by-point answers to each section in your review.
why not train Chronos with WaveToken?
Chronos itself is based on the T5 architecture (Ansari et al. (2024)) and we use the same model hyper-parameters (Table 8, Appendix H in our paper). Thus, Chronos effectively serves as an ablation of WaveToken with respect to a different tokenization method (wavelet tokenizer vs simple discretization). Hence, the performance gain of WaveToken can be attributed entirely to the wavelet tokenizer.
The experiments do not include some of the best performing models, such as TabPFN-TS, TTM, and MOMENT.
Thank you for these suggestions. We ran preliminary experiments for the pre-trained TTM-R2 and TimeMOE models (also suggested by Reviewer Yi8n). TTM-R2 and TimeMOE are point forecasters, so the WQL metric is not meaningful for them.
| Model | Bench. I (MASE) | Bench. II (MASE) |
|---|---|---|
| TTM-R2 | 1.029 | 1.114 |
| TimeMOE | 0.887 | 0.973 |
| WaveToken (Large) | 0.698 | 0.810 |
Both TTM-R2 and TimeMOE exhibit significantly inferior performance compared to WaveToken. These findings are consistent with the limited zero-shot capabilities of these models noted on benchmarks like GIFT-Eval.
- MOMENT: We faced some challenges setting up experiments correctly with its linear probing setup. We will finalize the experiments for the final manuscript. We note that while MOMENT has performed well in contexts beyond forecasting (e.g., classification and anomaly detection), there isn’t enough evidence in existing benchmarks showing that MOMENT is a state-of-the-art zero-shot forecasting model.
- TabPFN-TS: The paper and codebase were released publicly on Jan 5, 2025. This falls well within the four-month time frame specified by the concurrent-work guidelines for ICML 2025. Nevertheless, we thank the reviewer for this suggestion and will include the results for TabPFN-TS in the final version of our paper.
I find that a few notations are used without being clearly defined.
We agree that some of the notation is not clear. Thank you for pointing this out. We will make sure to update the final version of the manuscript with the clarifications below:
- In $[a_k]_J$, the symbol $k$ indexes the approximation coefficients from the DWT. For an input of size (say) 512, we get roughly $512/2^J$ approximation coefficients (depending on the wavelet basis).
- In CDF-thresholding, the function in question is the (empirical) inverse CDF of the absolute values of the detail coefficients $[d_k]_j$. The thresholding percentile grows exponentially in the decomposition level $j$, so that finer detail coefficients, which tend to capture more noise, are thresholded more aggressively.
- The first of the two remaining symbols denotes the concatenated wavelet coefficients obtained after decomposing the time series context.
- The second denotes the coefficients obtained from the time series horizon at training time, which are used to compute the loss.
Can you clarify in what sense the representation is compact?
The "compactness" of wavelets manifests itself in two ways:
- Wavelets concentrate most of the signal energy (input variance) on a few coefficients of high magnitude. Thus, after thresholding, the effective number of non-zero coefficients can be much lower than the size of the input.
- Empirically, WaveToken has excellent performance while using a much smaller vocabulary (1024 tokens) relative to Chronos (4096 tokens), which instead directly quantizes the time series.
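The first point can be illustrated with the short sketch below, which measures how much of the coefficient energy is carried by the largest wavelet coefficients of a toy signal; the signal, wavelet family, and decomposition level are illustrative assumptions.

```python
# Minimal sketch of wavelet energy concentration: a small fraction of
# large-magnitude coefficients carries most of the coefficient energy.
import numpy as np
import pywt

t = np.linspace(0, 1, 512)
signal = np.sin(2 * np.pi * 7 * t) + 0.5 * (t > 0.6)   # seasonality + level shift

coeffs = np.concatenate(pywt.wavedec(signal, "bior2.2", level=5))
energy = np.sort(coeffs**2)[::-1]
top_k = 32                                              # ~6% of the coefficients
share = energy[:top_k].sum() / energy.sum()
print(f"top {top_k} coefficients carry {share:.1%} of the energy")
```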
I guess $w \in [[a_k]_J, [d_k]_j]$ should be $w \in [[a_k]_J, [d_k]_{j=1}^{J}]$. Right? Anyway, $w$ is a vector, e.g., $[a_k]_J$. Then how do you quantize a vector?
Thanks for catching the typo! Yes, $w \in [[a_k]_J, [d_k]_{j=1}^{J}]$ is correct. Also, note that each $w$ is a single coefficient, without distinguishing between approximation and detail. In other words, we quantize all the coefficients element-wise by assigning each to a bin based on their empirical distribution.
Also, for the quantization, how did you decide on a bin size that can be universally applied to diverse data sets?
We construct the vocabulary by scanning the training set and choosing the optimal bin size according to Freedman and Diaconis (1981), which selects the bin width minimizing the reconstruction error. Clearly, the universality of this binning scheme relies on having a representative training corpus, which is why we trained WaveToken on 28 different real-world datasets from different domains.
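For reference, a minimal sketch of the Freedman-Diaconis rule applied to a pooled sample of coefficients is shown below; pooling everything into a single array and using a synthetic heavy-tailed sample are illustrative simplifications of our vocabulary-construction step.

```python
# Minimal sketch of the Freedman-Diaconis rule: bin width = 2 * IQR * n^(-1/3),
# computed over a pooled sample standing in for training wavelet coefficients.
import numpy as np

def freedman_diaconis_width(x):
    q75, q25 = np.percentile(x, [75, 25])
    return 2.0 * (q75 - q25) * len(x) ** (-1.0 / 3.0)

rng = np.random.default_rng(0)
train_coeffs = rng.standard_t(df=5, size=100_000)   # heavy-ish tails, like details
width = freedman_diaconis_width(train_coeffs)
edges = np.arange(train_coeffs.min(), train_coeffs.max() + width, width)
print(f"bin width = {width:.4f}, vocabulary size ~ {len(edges) + 1} tokens")
```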
Thanks for pointing out the reference to Yeo and Melnyk (JCP 2019) regarding prior quantization approaches. We’ll add this reference in the final version.
Thank you again for your time and engagement in the review process. We hope our responses above have satisfactorily addressed your concerns. If so, we request you to consider raising your score to reflect that. If you have further questions, we would be happy to respond to them.
- This is a minor point, but it would be better to avoid saying "compact". It simply means that the wavelet coefficients decay relatively fast, so that one can achieve a good approximation by truncation. This argument is also not general, as how fast the coefficients decay depends on the characteristics of the signal. To avoid confusion with "compactness", I would suggest elaborating or using different terminology.
- The algorithm needs to be elaborated to make it clearer. In particular, the part about constructing the training data, i.e., computing the tokens for the forecasting window, is not well explained.
- One question I raised is that the model is trained to make a prediction of the coefficient vector, and each such vector is 512-dimensional. It is then unclear how the 512-dimensional vector is discretized and how it is used in the model. In other words, it is unclear whether the model predicts each dimension one at a time and auto-regressively predicts all of them, or whether it predicts the 512-dimensional vector at once and then moves on to the next step.
Anyway, I believe that it will be beneficial to add a detailed algorithm in the appendix.
Thanks again for your engagement and additional questions! Below are our answers:
- Thank you for the suggestion, we agree with your comment. To avoid confusion, we will make sure to update the final version of the manuscript by clarifying and elaborating more on this property of the wavelet vocabulary.
- We will add an algorithm box in the appendix with the step-by-step procedure.
- Let us provide a practical example to clarify both discretization and forecasting. At training time:
- Suppose we have a time series input divided into a context window (of size 512) and a horizon window (of size 64).
- We first re-scale input and target as detailed in Section 2.2 - Scaling.
- For simplicity, suppose that we are applying the discrete wavelet transform (DWT) only up to the first level. That is, we are only decomposing the time series into one level of approximation and detail coefficients. The steps below apply for deeper decomposition levels too.
- The DWT yields roughly 256 approximation coefficients and 256 detail coefficients for the context, while for the horizon we get roughly 32 approximation coefficients and 32 detail coefficients. This is due to the convolution and downsampling operations through which the DWT is implemented, as detailed in the second-to-last paragraph of Appendix A.2.
- We then threshold some of the detail coefficients to zero, as described in Section 2.2 - Thresholding.
- After that, the inputs and targets for the model are the concatenated context and horizon coefficients, respectively. That is, the inputs are the concatenated approximation and detail coefficients of the context, and the targets are the concatenated approximation and detail coefficients of the horizon.
- To clarify, note that each approximation and each detail coefficient is a scalar (possibly a 0 if thresholded).
- Before feeding these coefficients to the model, we quantize/discretize them by mapping each coefficient (regardless of whether it is an approximation or detail) to a bin, obtained as detailed in Section 2.2 - Quantization. Each bin has an index: this index is the token that we feed to the model. In other words, we quantize each coefficient separately. There is no need to quantize the entire coefficient vector at once.
- The model is trained to forecast the 64 horizon tokens (the indexes of the corresponding bins) from the 512 context tokens. The cross-entropy loss is computed over the bin indices (see the equation on line 224-225).
The procedure follows analogously at test/inference time, the only difference being that the ground truth horizon tokens are not available. Given the context tokens/bins, the model forecasts the horizon tokens/bins, which we then map to the wavelet coefficients using the bin centers and to the time series forecast by inverting the DWT.
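While the full algorithm box will appear in the appendix, the following minimal sketch walks through the pipeline described above. The mean-absolute scaling, the 80th-percentile threshold, the uniform 1024-token toy vocabulary, and the level-1 bior2.2 DWT are illustrative stand-ins for the exact choices detailed in Section 2.2, not our released implementation.

```python
# Minimal sketch of the training-time tokenization pipeline described above:
# scale -> DWT -> threshold details -> quantize coefficients into a shared vocabulary.
# All specific choices (scaling, threshold, bins) are illustrative stand-ins.
import numpy as np
import pywt

WAVELET, LEVEL = "bior2.2", 1
EDGES = np.linspace(-10.0, 10.0, num=1023)                    # 1024-token toy vocabulary
CENTERS = np.concatenate(([EDGES[0]], (EDGES[:-1] + EDGES[1:]) / 2, [EDGES[-1]]))

def tokenize(series, scale, percentile=80.0):
    approx, detail = pywt.wavedec(series / scale, WAVELET, level=LEVEL)
    cutoff = np.percentile(np.abs(detail), percentile)
    detail = np.where(np.abs(detail) >= cutoff, detail, 0.0)  # set to zero, not removed
    coeffs = np.concatenate([approx, detail])                 # approx block, then details
    return np.digitize(coeffs, EDGES), (len(approx), len(detail))

def detokenize(tokens, scale, lengths):
    coeffs = CENTERS[tokens]                                  # token ids -> bin centers
    approx, detail = coeffs[:lengths[0]], coeffs[lengths[0]:]
    return pywt.waverec([approx, detail], WAVELET) * scale    # invert the DWT

rng = np.random.default_rng(0)
context, horizon = rng.normal(size=512), rng.normal(size=64)
scale = np.mean(np.abs(context)) + 1e-8                       # toy scaling, shared

ctx_tokens, ctx_lens = tokenize(context, scale)               # model inputs
hor_tokens, hor_lens = tokenize(horizon, scale)               # training targets (bin ids)
print(len(ctx_tokens), ctx_lens)                              # roughly 512 = ~256 + ~256
print(len(hor_tokens), hor_lens)                              # roughly 64 = ~32 + ~32

# At inference, forecasted token ids are mapped back through the bin centers and the
# inverse DWT to obtain the time series forecast.
print(detokenize(hor_tokens, scale, hor_lens).shape)          # (~64,)
```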
We hope that this can help in clarifying the step-by-step procedure. We realize we should have been clearer in our exposition, and we will do our best to incorporate these additional explanations (along with an algorithm box) to allow readers to quickly understand our method.
Thank you again for giving us a chance to improve the paper! We hope our additional responses above have satisfactorily addressed your additional concerns. If so, we request you to consider raising your score to reflect that. Thanks for your time and engagement in the review process.
The paper proposes a method to tokenize univariate time series data using the wavelet transform. It decomposes time series into wavelet coefficients, which are used autoregressively to predict future values. The idea was deemed interesting and straightforward but lacks a formal derivation. While the results are systematic and extensive, there are concerns regarding the key element that contributes to the success of the method, given the reliance on the T5 architecture.
Despite these concerns, the authors replied to most of the issues raised by the reviewers. The remaining concerns raised by the reviewers are minimal, and while the paper could benefit from more descriptions, the existing contribution will be beneficial for the community.
Thus, I recommend its acceptance.