OLinear: A Linear Model for Time Series Forecasting in Orthogonally Transformed Domain
OLinear is a linear-based time series forecasting model that outperforms state-of-the-art Transformer-based methods with significantly higher efficiency.
Abstract
Reviews and Discussion
The paper introduces OLinear, a multivariate time-series forecaster that operates in an orthogonally transformed domain rather than the raw time domain. For each input window, the authors compute the temporal Pearson-correlation matrix, perform an eigen-decomposition, and project the signals onto the resulting orthogonal basis. They then replace self-attention with NormLin. Extensive experiments indicate improvements over strong baselines, and the accompanying code is available.
Strengths and Weaknesses
Strengths
-
The authors evaluate OLinear on 24 publicly available datasets spanning electricity, traffic, weather and financial domains, and consider multiple forecast horizons for each dataset. Reporting 99 % confidence intervals indicates an effort to convey result variability rather than single-run numbers.
-
The paper is well structured: problem statement, method, ablations and main tables flow logically. Figures are labelled and referenced properly, and notation is introduced before use, which makes the technical content easy to follow.
-
Code and data-processing scripts are released, allowing reviewers and future researchers to reproduce the results or to integrate the proposed method into their own pipelines with minimal effort.
-
The plug-and-play design lowers the entry barrier for practitioners who already have Transformer-based models and wish to test the proposed ideas without redesigning their entire architecture.
-
While Fourier and wavelet bases dominate current transform-domain forecasters, constructing an orthogonal basis from the data’s own temporal correlation matrix has been less explored. This adaptive approach could inspire further work.
Weaknesses
-
Results seem to degrade when the look-back window exceeds 96, casting doubt on long-horizon robustness. For example, with 15-minute data, 96 steps cover only 24 h, yet the model may need to predict up to 720 steps (≈ 1 week) ahead in long-term forecasting, raising concerns about interpretability.
-
Many of the hyper-parameters, such as model dimension, embedding size, number of blocks and learning rate, are dataset-dependent.
-
The assumption of decorrelation after OrthoTrans is not demonstrated. Two series can be decorrelated in time yet correlated in frequency; an orthogonal transform does not necessarily remove such cross-frequency dependencies. Moreover, decorrelating in the time domain removes second-order dependencies but says nothing about higher-order or non-linear dependencies.
-
The novelty remains unclear. The manuscript does not cite the existing work [1]. That work already establishes that forecasting Transformers suffer from excessive sharpness, and that this can be mitigated by (i) using a different optimiser and (ii) reducing the architecture to a single channel-wise attention head, whose complexity matches that of NormLin (see Table 1). These two changes were shown to raise the attention-matrix rank and therefore boost the model's expressivity. The present study re-uses those insights in a purely linear setting, which by construction loses the non-linear capacity of a Transformer, yet none of this prior art is acknowledged. Consequently, the claims regarding higher rank, multivariate correlations, and expressivity are not novel in the context of multivariate time-series forecasting.
-
NormLin trades away non-linearity while retaining the same cost.
References
[1] SAMformer: Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention. ICML 2024.
Questions
-
Could you run a statistical significance test across ≥ 5 random seeds for every dataset–horizon pair and report the resulting p-values?
-
The existing work already demonstrates that a specific optimiser combined with a single channel-wise attention head raises the attention-matrix rank and boosts accuracy, with the channel-wise attention capturing multivariate correlations. Please clarify the novelty and compare against this existing work. In particular, how does the model compare to a shallow Transformer trained with strong regularization such as SAM and equipped with a channel-wise attention mechanism?
-
The claim that OrthoTrans removes step-wise correlations needs quantitative confirmation in both domains. Could you provide a metric that shows (i) a drop in the off-diagonal magnitude of the time-domain covariance matrix and (ii) that no new dependencies appear in the frequency domain—for instance, by comparing the Frobenius norm of the off-diagonal entries in the covariance matrix and in the cross-spectral density before and after applying OrthoTrans?
-
NormLin removes non-linearity yet keeps quadratic cost. Is there a way to reduce this cost?
-
In the experiments, you use a different learning rate, model dimension, embedding size, and number of blocks for each dataset. Could you please freeze a single architecture, run it on all datasets, and compare its performance with the best architecture for each dataset?
Limitations
While the authors include a brief limitations paragraph in the Appendix, a fuller discussion in the main paper, covering details such as compute cost and dataset-specific hyper-parameters, is needed.
Final Justification
The authors have addressed my initial concerns. They have also committed to revise the narrative on weight-matrix rank in the camera-ready.
Formatting Concerns
No concern
Thank you for your comments. We sincerely appreciate the reviewer’s recognition of our "well-structured" presentation, the plug-and-play modules that "lower the entry barrier", and an adaptive design that could "inspire further work". Below are our responses:
Q1: On the p-value
- We conduct a significance test using Student's t-test (over 7 random seeds) between OLinear and iTransformer across various datasets and prediction horizons. As shown below, the better model is shown in bold when the improvement is statistically significant at the 0.05 level (p-value < 0.05). In other words, OLinear exhibits a clear performance advantage over iTransformer.
| Setting | OLinear (MAE) | iTrans.(MAE) | p-value between OLinear and iTrans. |
|---|---|---|---|
| ECL: 96 | 0.221±4e-4 | 0.240±4e-4 | 5.12E-11 |
| ECL: 192 | 0.238±1e-3 | 0.253±2e-3 | 3.06E-06 |
| ECL: 336 | 0.254±1e-3 | 0.269±1e-3 | 9.96E-07 |
| ECL: 720 | 0.279±2e-3 | 0.317±7e-3 | 5.59E-05 |
| Traffic: 96 | 0.226±2e-4 | 0.268±1e-3 | 1.61E-12 |
| Traffic: 192 | 0.241±4e-4 | 0.276±1e-3 | 9.48E-10 |
| Traffic: 336 | 0.250±3e-4 | 0.283±4e-4 | 1.87E-13 |
| Traffic: 720 | 0.270±4e-4 | 0.302±4e-4 | 6.15E-13 |
| Weather: 96 | 0.190±1e-3 | 0.214±3e-4 | 1.29E-08 |
| Weather: 192 | 0.235±2e-3 | 0.254±1e-3 | 1.04E-06 |
| Weather: 336 | 0.280±2e-3 | 0.296±1e-3 | 1.93E-06 |
| Weather: 720 | 0.333±2e-3 | 0.349±4e-4 | 5.68E-06 |
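For reference, a minimal sketch (illustrative numbers and names, not the authors' released script) of how such a per-setting significance test can be computed with SciPy:

```python
import numpy as np
from scipy import stats

# MAE scores from 7 random seeds for one dataset/horizon setting
# (illustrative placeholder numbers, not the reported results)
olinear_mae = np.array([0.2208, 0.2211, 0.2205, 0.2213, 0.2209, 0.2207, 0.2212])
itrans_mae  = np.array([0.2398, 0.2402, 0.2395, 0.2405, 0.2401, 0.2399, 0.2403])

# Two-sample Student's t-test; the improvement is called significant
# when p < 0.05 and the mean MAE of OLinear is lower.
t_stat, p_value = stats.ttest_ind(olinear_mae, itrans_mae)
print(f"t = {t_stat:.3f}, p = {p_value:.2e}")
```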
-
We report the performance with increasing lookback lengths in Appendix I.3. OLinear demonstrates consistent improvements as the lookback horizon increases from 48 to 720, and consistently outperforms state-of-the-art forecasters.
-
Regarding robustness across random seeds, please refer to Appendix F, where OLinear exhibits smaller standard deviations and narrower 99% confidence intervals compared to TimeMixer++ and iTransformer.
-
We would like to emphasize that MSE and MAE metrics typically degrade in long-term forecasting as the prediction horizon increases. As shown in Table 15 (Page 28), this phenomenon is observed in state-of-the-art time series forecasters, such as TimeMixer++ (ICLR 2025), iTransformer (ICLR 2024), FilterNet (NeurIPS 2024). This phenomenon is reasonable because future series inherently contain greater uncertainty as the horizon extends, much like how predicting the temperature two weeks ahead is less accurate than predicting tomorrow’s temperature.
Q2: On novelty and comparison with SAMformer
-
SAMformer (ICML 2024) introduces the Sharpness-Aware Minimization (SAM) mechanism—originally proposed in [1] (ICLR 2021)—into time series forecasting. While channel-wise attention was first introduced by iTransformer, SAMformer adopts the same practice. Its key contribution is a new training objective that smooths the loss landscape of Transformers: $\min_{\mathbf{w}} \max_{\|\boldsymbol{\epsilon}\|_2 \le \rho} \mathcal{L}(\mathbf{w} + \boldsymbol{\epsilon})$, where $\rho$ is a hyper-parameter. This can be implemented by the SAM optimizer [1].
-
Instead of modifying the loss function, we focus on the architecture design and propose NormLin. Compared with vanilla self-attention, NormLin offers four advantages: lower computational cost, higher-rank attention matrices, smoother gradient flow, and an enlarged optimization space (see Lines 182-198 in the paper). NormLin "can be plugged into existing forecasters to boost performance with minimal engineering effort" (by Reviewer 5fFn). Moreover, our linear-based model, OLinear, outperforms state-of-the-art Transformer-based forecasters—including SAMformer. We will cite and discuss SAMformer in our final draft.
-
We compare MSE and standard deviation between OLinear and SAMformer. The SAMformer numbers are taken from its original paper, which uses a look-back window of 512 and thus enjoys extra performance gains. Even under this handicap, OLinear achieves lower MSE and smaller standard deviations in most cases.
| Dataset | Hor. | OLinear (Look-back: 96) | SAMformer (Look-back: 512) |
|---|---|---|---|
| ETTm2 | 96 | 0.169±1e-4 | 0.181±0.005 |
| 192 | 0.232±1e-4 | 0.233±0.002 | |
| 336 | 0.291±2e-4 | 0.285±0.001 | |
| 720 | 0.389±4e-4 | 0.375±0.001 | |
| ECL | 96 | 0.131±4e-4 | 0.155±0.002 |
| 192 | 0.150±1e-3 | 0.168±0.001 | |
| 336 | 0.165±1e-3 | 0.183±0.000 | |
| 720 | 0.191±2e-3 | 0.219±0.000 | |
| Exchange | 96 | 0.082±3e-4 | 0.161±0.007 |
| 192 | 0.171±2e-3 | 0.246±0.009 | |
| 336 | 0.331±3e-3 | 0.368±0.006 | |
| 720 | 0.837±0.018 | 1.003±0.018 | |
| Weather | 96 | 0.153±1e-3 | 0.197±0.001 |
| 192 | 0.200±2e-3 | 0.235±0.000 | |
| 336 | 0.258±3e-3 | 0.276±0.001 | |
| 720 | 0.337±4e-3 | 0.334±0.000 |
- To ensure a fair comparison, we uniformly set the lookback length to 96 and report the following MAEs. The average results are reported across 3 random seeds for SAMformer, using its recommended learning rates and hyperparameter values.
- On average, OLinear outperforms SAMformer by a substantial margin of 12.1%.
- Notably, the proposed NormLin layer improves the performance of SAMformer by 10.6% and 5.8% on the Traffic and ECL datasets, respectively.
| Dataset | Hor. | OLinear | SAMformer | SAMformer+NormLin |
|---|---|---|---|---|
| ETTh1 | 96 | 0.382 | 0.396 | 0.392 |
| 192 | 0.414 | 0.425 | 0.419 | |
| 336 | 0.438 | 0.444 | 0.439 | |
| 720 | 0.462 | 0.469 | 0.459 | |
| ETTm2 | 96 | 0.249 | 0.268 | 0.262 |
| 192 | 0.290 | 0.305 | 0.303 | |
| 336 | 0.328 | 0.343 | 0.340 | |
| 720 | 0.387 | 0.399 | 0.397 | |
| ECL | 96 | 0.221 | 0.285 | 0.267 |
| 192 | 0.238 | 0.288 | 0.269 | |
| 336 | 0.254 | 0.301 | 0.284 | |
| 720 | 0.279 | 0.335 | 0.318 | |
| Traffic | 96 | 0.226 | 0.404 | 0.355 |
| 192 | 0.241 | 0.367 | 0.335 | |
| 336 | 0.250 | 0.379 | 0.339 | |
| 720 | 0.270 | 0.401 | 0.358 | |
| Weather | 96 | 0.190 | 0.231 | 0.208 |
| 192 | 0.235 | 0.269 | 0.251 | |
| 336 | 0.280 | 0.304 | 0.290 | |
| 720 | 0.333 | 0.350 | 0.339 | |
| Exchange | 96 | 0.200 | 0.203 | 0.210 |
| 192 | 0.293 | 0.303 | 0.302 | |
| 336 | 0.414 | 0.426 | 0.416 | |
| 720 | 0.688 | 0.712 | 0.689 |
Q3: On quantifying temporal decorrelation
- We report the off-diagonal Frobenius norm of the correlation matrices (size: $96 \times 96$) for the original series, and for the series after OrthoTrans or FFT, using a window size of 96. The results below show that OrthoTrans effectively removes temporal correlations and performs better than the FFT operation in decorrelation.
| Dataset | Original | OrthoTrans | FFT |
|---|---|---|---|
| ETTh1 | 61.36 | 1.46 | 2.58 |
| ETTm2 | 83.04 | 1.26 | 5.22 |
| ECL | 45.95 | 1.78 | 2.26 |
| Traffic | 34.80 | 1.79 | 3.29 |
| Weather | 61.70 | 2.86 | 19.61 |
- We also argue that examining correlations in the frequency domain of the feature series after OrthoTrans is of limited relevance: 1) OLinear does not involve any FFT operations. 2) Whether there exists an orthogonal transformation that simultaneously mitigates both temporal- and frequency-domain correlations is a nontrivial question. Further exploration could be a future direction of this work. We will emphasize that the decorrelation is performed temporally in the final draft.
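For reference, a minimal sketch (synthetic data and illustrative names, not the authors' script) of how the off-diagonal Frobenius norm before and after such an orthogonal transform can be measured:

```python
import numpy as np

def offdiag_fro(x: np.ndarray) -> float:
    """Frobenius norm of the off-diagonal part of the T x T correlation
    matrix of x, where x has shape (num_windows, T)."""
    c = np.corrcoef(x, rowvar=False)
    return float(np.linalg.norm(c - np.diag(np.diag(c))))

rng = np.random.default_rng(0)
T, n = 96, 4096
# Synthetic temporally correlated windows (cumulative-sum noise), length T = 96
windows = np.cumsum(rng.standard_normal((n, T)), axis=1)
windows = (windows - windows.mean(0)) / windows.std(0)      # z-score each time step

# OrthoTrans-style basis: eigenvectors of the temporal correlation matrix
corr = np.corrcoef(windows, rowvar=False)
_, Q = np.linalg.eigh(corr)
decorrelated = windows @ Q

print("off-diagonal Frobenius norm before:", offdiag_fro(windows))
print("off-diagonal Frobenius norm after :", offdiag_fro(decorrelated))
```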
Q4: On quadratic cost of NormLin
-
In our view, $O(D^2)$ is the minimal cost required to model correlations among $D$ channels, as even a single linear layer over the channels incurs this complexity.
-
A straightforward way to reduce this cost is to introduce a bottleneck block: the $D$ channels are first projected down to a fixed width $d$ (e.g., 64), the NormLin module is applied in this fixed-dimensional space, and then the representation is projected back to $D$ channels. This reduces the computational complexity from $O(D^2)$ to $O(Dd)$. As shown below, this variant tends to degrade performance, particularly on datasets with a large number of channels, such as ECL and Traffic.
| Dataset | Hor. | Original (MAE) | Bottleneck (MAE, fixed width=64) |
|---|---|---|---|
| ETTh1 | 96 | 0.382 | 0.382 |
| 192 | 0.414 | 0.415 | |
| 336 | 0.438 | 0.443 | |
| 720 | 0.462 | 0.468 | |
| ECL | 96 | 0.221 | 0.241 |
| 192 | 0.238 | 0.256 | |
| 336 | 0.254 | 0.270 | |
| 720 | 0.279 | 0.291 | |
| Traffic | 96 | 0.226 | 0.263 |
| 192 | 0.241 | 0.278 | |
| 336 | 0.250 | 0.283 | |
| 720 | 0.270 | 0.298 | |
| Weather | 96 | 0.190 | 0.190 |
| 192 | 0.235 | 0.239 | |
| 336 | 0.280 | 0.281 | |
| 720 | 0.333 | 0.337 | |
| PEMS-03 | 12 | 0.159 | 0.160 |
| 24 | 0.179 | 0.182 | |
| 48 | 0.210 | 0.224 | |
| 96 | 0.247 | 0.264 |
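For clarity, a minimal PyTorch-style sketch of the bottleneck variant described above, assuming channel tokens of shape (batch, channels, hidden) as in channel-wise attention; module and variable names are illustrative, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckNormLin(nn.Module):
    """Project D channels to a fixed bottleneck width d, apply a NormLin-style
    mixing (Softplus + row-wise L1 normalization) in the d-dim space, then
    project back to D channels; complexity drops from O(D^2) to O(D*d)."""
    def __init__(self, num_channels: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(num_channels, bottleneck, bias=False)
        self.weight = nn.Parameter(torch.randn(bottleneck, bottleneck) * 0.02)
        self.up = nn.Linear(bottleneck, num_channels, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_channels, hidden_dim)
        z = self.down(x.transpose(1, 2)).transpose(1, 2)    # (batch, d, hidden)
        w = F.softplus(self.weight)
        w = w / w.sum(dim=-1, keepdim=True)                 # row-wise L1 norm
        z = torch.einsum('ij,bjh->bih', w, z)               # mix bottleneck tokens
        return self.up(z.transpose(1, 2)).transpose(1, 2)   # back to (batch, D, hidden)

x = torch.randn(8, 321, 128)                                # e.g., ECL has 321 channels
print(BottleneckNormLin(321, 64)(x).shape)                  # torch.Size([8, 321, 128])
```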
Q5: On hyperparameters
-
Because there are substantial differences among datasets, it is common practice (e.g., iTransformer and SAMformer) to fine-tune certain hyperparameters to maximize model performance.
-
In the paper, the hyperparameter value ranges are provided in Appendix D. Note that the embedding size is fixed at 16 for all experiments. We also evaluate the hyperparameter sensitivity of OLinear in Appendix I.6. As shown in Figure 17 and Table 30, OLinear exhibits robust performance across these hyperparameters.
-
Furthermore, we freeze a single architecture (one fixed learning rate, model dimension, and number of blocks, with the embedding size of 16) and report the average results across four prediction lengths. Compared to the results reported in the paper, the performance degradation is within 1%.
| Dataset | In paper (MSE) | In paper (MAE) | Fixed Para. (MSE) | Fixed Para. (MAE) |
|---|---|---|---|---|
| ETTh1 | 0.424 | 0.424 | 0.426 | 0.425 |
| ETTh2 | 0.367 | 0.388 | 0.369 | 0.390 |
| ETTm1 | 0.374 | 0.377 | 0.377 | 0.381 |
| ETTm2 | 0.270 | 0.313 | 0.271 | 0.313 |
| ECL | 0.159 | 0.248 | 0.159 | 0.248 |
| Traffic | 0.451 | 0.247 | 0.451 | 0.247 |
| Weather | 0.237 | 0.260 | 0.240 | 0.262 |
We hope these responses can fully address your concerns. Thank you again for your comments!
Reference
[1] Sharpness-aware minimization for efficiently improving generalization. ICLR 2021.
Thank you for the well-structured rebuttal. Your additional experiments and clarifications resolve most of my earlier concerns. Below are my remarks.
Q1: Significance testing
The inclusion of Student’s t-tests across seven random seeds strengthens the statistical analysis and provides a solid estimate of the robustness of your model.
Q2: Novelty and comparison with existing baselines
The comparison with SAMformer and the reported average 12.1% MAE gain positions OLinear as the stronger baseline. However, what would further elevate this section is a theoretical discussion that contrasts the subspaces produced by OrthoTrans with those implied by a model like SAMformer’s rank-regularised attention. It is still unclear to me whether OrthoTrans provides representational capacity beyond what a simple attention + linear layer equipped with rank regularization, or even LoRA-style methods, can already achieve.
Q3: Quantifying decorrelation
The updated table of off-diagonal Frobenius norms is very helpful. It shows that OrthoTrans efficiently slashes time-domain correlations on every dataset. Given your statement that OrthoTrans is intended only for temporal decorrelation and that extending it to the frequency domain is a separate, non-trivial research question, the current analysis feels sufficient for this paper’s scope. However, a brief sentence in the main text clarifying this design choice would head off potential misunderstandings.
Q4: Quadratic cost
The bottleneck experiment shows the trade-off between complexity and accuracy: reducing the model size lowers the computational cost but also leads to some loss in accuracy on specific datasets. Showing how the performance (MAE or MSE) and latency change as the bottleneck size varies would help readers decide if the full quadratic version of the method is worth it for their own use cases. I suggest adding such an experiment to the paper to make this trade-off more explicit.
Q5: Hyper-parameter robustness
Freezing the learning rate, model dimension, number of blocks, and embedding size while losing 1 % accuracy attests to the method’s stability.
General Comment
The rebuttal effectively strengthens the paper. Further clarifying the main theoretical point raised in Q2 would help improve the overall evaluation.
Point 3: Quadratic cost
- Thanks for your suggestion! We compare the MAE performance and resource footprint across various bottleneck sizes. As shown below, a bottleneck size of 16 appears to be a sweet spot for the performance–resource trade-off. This choice has a performance gap of about 4% compared to the full quadratic version, but with less resource utilization. We will definitely include a new section in the final manuscript to make this trade-off more explicit.
| Dataset | Hor. | Ori. | Bottleneck (8) | Bottleneck (16) | Bottleneck (64) | Bottleneck (128) |
|---|---|---|---|---|---|---|
| ETTh1 | 96 | 0.382 | 0.382 | 0.381 | 0.382 | 0.383 |
| 192 | 0.414 | 0.415 | 0.412 | 0.415 | 0.416 | |
| 336 | 0.438 | 0.438 | 0.440 | 0.443 | 0.442 | |
| 720 | 0.462 | 0.467 | 0.467 | 0.468 | 0.461 | |
| ECL | 96 | 0.221 | 0.236 | 0.236 | 0.241 | 0.240 |
| 192 | 0.238 | 0.255 | 0.250 | 0.256 | 0.254 | |
| 336 | 0.254 | 0.264 | 0.271 | 0.270 | 0.267 | |
| 720 | 0.279 | 0.302 | 0.291 | 0.291 | 0.295 | |
| Traffic | 96 | 0.226 | 0.255 | 0.255 | 0.263 | 0.280 |
| 192 | 0.241 | 0.273 | 0.289 | 0.278 | 0.277 | |
| 336 | 0.250 | 0.270 | 0.275 | 0.283 | 0.281 | |
| 720 | 0.270 | 0.299 | 0.296 | 0.298 | 0.301 | |
| Weather | 96 | 0.190 | 0.192 | 0.188 | 0.190 | 0.190 |
| 192 | 0.235 | 0.240 | 0.237 | 0.239 | 0.241 | |
| 336 | 0.280 | 0.284 | 0.287 | 0.281 | 0.282 | |
| 720 | 0.333 | 0.338 | 0.336 | 0.337 | 0.338 | |
| PEMS03 | 12 | 0.159 | 0.162 | 0.161 | 0.160 | 0.162 |
| 24 | 0.179 | 0.186 | 0.183 | 0.182 | 0.181 | |
| 48 | 0.210 | 0.223 | 0.224 | 0.224 | 0.224 | |
| 96 | 0.247 | 0.275 | 0.266 | 0.264 | 0.271 |
| Dataset | Metric | Ori. | Bottleneck (8) | Bottleneck (16) | Bottleneck (64) | Bottleneck (128) |
|---|---|---|---|---|---|---|
| Traffic | Params(M) | 6.17 | 4.71 | 4.74 | 4.92 | 5.16 |
| FLOPs(G) | 4.91 | 4.16 | 4.18 | 4.27 | 4.39 | |
| T.T. (s/iter) | 0.02 | 0.015 | 0.015 | 0.016 | 0.017 | |
| T.M.(GB) | 1.01 | 0.98 | 0.99 | 1.00 | 1.01 | |
| I.T.(ms/iter) | 5.71 | 5.60 | 5.63 | 5.71 | 5.80 | |
| I.M. (GB) | 0.43 | 0.39 | 0.39 | 0.40 | 0.41 | |
| ECL | Params(M) | 4.79 | 4.59 | 4.60 | 4.67 | 4.78 |
| FLOPs(G) | 1.65 | 1.55 | 1.56 | 1.59 | 1.65 | |
| T.T. (ms/iter) | 7.75 | 7.45 | 7.58 | 7.61 | 7.80 | |
| T.M.(GB) | 0.45 | 0.44 | 0.45 | 0.45 | 0.46 | |
| I.T.(ms/iter) | 2.11 | 2.04 | 2.05 | 2.08 | 2.09 | |
| I.M. (GB) | 0.17 | 0.16 | 0.16 | 0.16 | 0.17 | |
| PEMS03 | Params(M) | 4.84 | 4.60 | 4.61 | 4.69 | 4.80 |
| FLOPs(G) | 1.85 | 1.73 | 1.74 | 1.77 | 1.83 | |
| T.T. (ms/iter) | 8.14 | 8.05 | 8.06 | 8.12 | 8.16 | |
| T.M.(GB) | 0.48 | 0.48 | 0.48 | 0.49 | 0.50 | |
| I.T.(ms/iter) | 2.23 | 2.18 | 2.18 | 2.21 | 2.24 | |
| I.M. (GB) | 0.20 | 0.18 | 0.18 | 0.18 | 0.19 | |
| Weather | Params(M) | 4.52 | 4.52 | 4.52 | 4.54 | 4.57 |
| FLOPs(G) | 0.10 | 0.10 | 0.10 | 0.11 | 0.12 | |
| T.T. (ms/iter) | 7.47 | 7.57 | 7.59 | 7.62 | 7.70 | |
| T.M.(GB) | 0.21 | 0.22 | 0.22 | 0.23 | 0.23 | |
| I.T.(ms/iter) | 1.18 | 1.27 | 1.28 | 1.31 | 1.35 | |
| I.M. (MB) | 39.26 | 38.86 | 39.10 | 40.67 | 44.86 | |
| ETTh1 | Params(M) | 4.52 | 4.52 | 4.52 | 4.53 | 4.56 |
| FLOPs(M) | 33.74 | 33.87 | 34.18 | 38.81 | 52.31 | |
| T.T. (ms/iter) | 8.09 | 8.43 | 8.57 | 8.89 | 9.01 | |
| T.M.(GB) | 0.20 | 0.20 | 0.20 | 0.21 | 0.22 | |
| I.T.(ms/iter) | 1.19 | 1.28 | 1.31 | 1.35 | 1.38 | |
| I.M. (MB) | 156.41 | 156.44 | 156.69 | 159.12 | 160.39 |
Thanks again for your constructive comments. We are very happy to answer any further questions.
Thank you for the detailed rebuttal: the expanded discussion around Degrees-of-Freedom is very useful, and the quadratic-cost analysis clearly illustrates the efficiency trade-offs.
LoRA ablation & DoF: Clarifying the narrative
The provided NormLin_LoRA results (d = 8/16) are particularly insightful, showing stable performance despite a substantial nominal rank reduction. This important finding suggests that the observed performance gains do not primarily arise from increased rank or Degrees-of-Freedom alone.
Given this evidence, it would further strengthen the paper to explicitly clarify in the main text what specific aspect(s) of NormLin are most responsible for its improved performance. For example, do the gains primarily stem from the normalization mechanism itself, enhanced gradient smoothness during optimization, or possibly another underlying factor?
Providing this clarification in the main narrative would ensure that readers do not mistakenly attribute the method’s success solely to higher rank or increased parameter count.
Quadratic cost
The bottleneck experiment fully addresses my earlier concerns: a small figure illustrating the trend could enhance readability, but the provided numeric table already clearly conveys the intended message.
Overall
My primary remaining concern relates to the current narrative around rank and Degrees-of-Freedom: the LoRA experiments convincingly show that rank reduction alone does not fully explain the observed improvements.
Thanks again to the authors for addressing my points carefully and conducting these valuable additional experiments.
Point 1 (continued): Theoretical discussion of NormLin and self-attention
- To further demonstrate the effect of DoF on forecasting performance, we introduce NormLin_LoRA in the spirit of low-rank approximation: the full weight matrix is replaced by the product of two rank-$d$ factors with $d \ll D$. According to the first table, this parameterization has fewer DoF than the original NormLin. We report the following MAEs under the SAMformer architecture. As expected, NormLin outperforms the LoRA-style variants.
| Dataset | Hor. | SAMformer+NormLin | SAMformer+NormLin_LoRA (d:8) | SAMformer+NormLin_LoRA (d:16) |
|---|---|---|---|---|
| ETTh1 | 96 | 0.393 | 0.395 | 0.394 |
| 192 | 0.422 | 0.425 | 0.424 | |
| 336 | 0.444 | 0.446 | 0.446 | |
| 720 | 0.487 | 0.489 | 0.489 | |
| ETTm2 | 96 | 0.262 | 0.264 | 0.264 |
| 192 | 0.303 | 0.305 | 0.304 | |
| 336 | 0.340 | 0.342 | 0.342 | |
| 720 | 0.397 | 0.399 | 0.399 | |
| ECL | 96 | 0.267 | 0.270 | 0.269 |
| 192 | 0.269 | 0.272 | 0.269 | |
| 336 | 0.284 | 0.288 | 0.286 | |
| 720 | 0.318 | 0.322 | 0.320 | |
| Traffic | 96 | 0.355 | 0.357 | 0.356 |
| 192 | 0.335 | 0.337 | 0.336 | |
| 336 | 0.339 | 0.341 | 0.340 | |
| 720 | 0.358 | 0.360 | 0.359 | |
| Weather | 96 | 0.208 | 0.219 | 0.210 |
| 192 | 0.251 | 0.260 | 0.253 | |
| 336 | 0.290 | 0.295 | 0.292 | |
| 720 | 0.339 | 0.343 | 0.345 | |
| Exchange | 96 | 0.221 | 0.224 | 0.226 |
| 192 | 0.302 | 0.304 | 0.306 | |
| 336 | 0.416 | 0.419 | 0.416 | |
| 720 | 0.689 | 0.693 | 0.692 |
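A minimal sketch of the NormLin_LoRA variant as described above, under the assumption that the low-rank product is formed before the Softplus and row-wise L1 normalization (names are illustrative, not the released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormLinLoRA(nn.Module):
    """NormLin weight parameterized as a rank-d product A @ B^T, followed by
    Softplus + row-wise L1 normalization (a LoRA-style low-rank variant)."""
    def __init__(self, num_channels: int, rank: int = 8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(num_channels, rank) * 0.02)
        self.B = nn.Parameter(torch.randn(num_channels, rank) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_channels, hidden_dim); tokens are channels
        w = F.softplus(self.A @ self.B.t())        # D x D; rank <= d before Softplus
        w = w / w.sum(dim=-1, keepdim=True)        # row-wise L1 normalization
        return torch.einsum('ij,bjh->bih', w, x)

x = torch.randn(4, 21, 64)                         # e.g., Weather has 21 channels
print(NormLinLoRA(21, rank=8)(x).shape)            # torch.Size([4, 21, 64])
```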
- We also report the SVs, ranks, and entropy of the weight matrices for the LoRA variants under the SAMformer architecture. As shown below, the original NormLin exhibits higher SVs, ranks, and entropy than its LoRA counterparts.
| Dataset | Min Attn. SV | Median Attn. SV | Attn. Rank | Attention Entropy | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NormLin | LoRA (d8) | LoRA (d16) | NormLin | LoRA (d8) | LoRA (d16) | NormLin | LoRA (d8) | LoRA (d16) | NormLin | LoRA (d8) | LoRA (d16) | |
| ETTh1 | 0.009 | 0.010 | 0.013 | 0.311 | 0.301 | 0.280 | 7 | 7 | 7 | 1.546 | 1.293 | 1.248 |
| ETTm2 | 0.001 | 0.001 | 0.001 | 0.473 | 0.414 | 0.431 | 7 | 7 | 7 | 1.277 | 1.118 | 1.075 |
| ECL | 0.000 | 0.000 | 0.000 | 0.018 | 0.000 | 0.000 | 311 | 34 | 98 | 5.701 | 5.641 | 5.514 |
| Traffic | 0.000 | 0.000 | 0.000 | 0.013 | 0.000 | 0.000 | 828 | 48 | 189 | 6.657 | 6.605 | 6.494 |
| Weather | 0.005 | 0.000 | 0.002 | 0.106 | 0.033 | 0.124 | 21 | 17 | 21 | 2.828 | 2.570 | 2.291 |
| Exchange | 0.019 | 0.001 | 0.005 | 0.108 | 0.123 | 0.078 | 8 | 7 | 8 | 1.876 | 1.231 | 1.347 |
- We also compare the NormLin layer with its LoRA variant under the OLinear architecture. As shown below, the reduced DoF of the LoRA variant results in degraded performance.
| Dataset | Hor. | OLinear (MSE) | OLinear (MAE) | OLinear+LoRA (d64) (MSE) | OLinear+LoRA (d64) (MAE) |
|---|---|---|---|---|---|
| ETTh1 | 96 | 0.360 | 0.382 | 0.363 | 0.384 |
| 192 | 0.416 | 0.414 | 0.415 | 0.415 | |
| 336 | 0.457 | 0.438 | 0.461 | 0.441 | |
| 720 | 0.463 | 0.462 | 0.469 | 0.466 | |
| ETTm2 | 96 | 0.169 | 0.249 | 0.171 | 0.250 |
| 192 | 0.232 | 0.290 | 0.233 | 0.292 | |
| 336 | 0.291 | 0.328 | 0.292 | 0.329 | |
| 720 | 0.389 | 0.387 | 0.390 | 0.389 | |
| ECL | 96 | 0.131 | 0.221 | 0.135 | 0.225 |
| 192 | 0.150 | 0.238 | 0.153 | 0.242 | |
| 336 | 0.165 | 0.254 | 0.170 | 0.259 | |
| 720 | 0.191 | 0.279 | 0.199 | 0.286 | |
| Traffic | 96 | 0.398 | 0.226 | 0.410 | 0.229 |
| 192 | 0.439 | 0.241 | 0.431 | 0.243 | |
| 336 | 0.464 | 0.250 | 0.462 | 0.254 | |
| 720 | 0.502 | 0.270 | 0.509 | 0.283 | |
| Weather | 96 | 0.153 | 0.190 | 0.156 | 0.195 |
| 192 | 0.200 | 0.235 | 0.203 | 0.238 | |
| 336 | 0.258 | 0.280 | 0.263 | 0.283 | |
| 720 | 0.337 | 0.333 | 0.342 | 0.336 | |
| Exchange | 96 | 0.082 | 0.200 | 0.083 | 0.200 |
| 192 | 0.171 | 0.293 | 0.174 | 0.296 | |
| 336 | 0.331 | 0.414 | 0.330 | 0.414 | |
| 720 | 0.837 | 0.688 | 0.867 | 0.703 |
- We will incorporate this discussion in our final manuscript to further enrich our work.
Point 2: Further clarification on temporal correlation
- Thanks for your suggestion! We will definitely clarify our design of OrthoTrans by emphasizing that the decorrelation is performed temporally in the Introduction and Section 4.2.
We could not be happier to receive such heartwarming remarks. We deeply appreciate that you found our initial responses resolved most of your concerns. Below, we provide our further responses:
Point 1: Theoretical discussion of NormLin and self-attention
-
Thanks for your suggestion! We fully agree that further theoretical analysis will elevate our work. We would like to conduct the analysis from the perspective of degrees of freedom (DoF) of the attention matrices. The number of DoF in a matrix/vector is the number of elements needed to completely specify it. The DoF satisfy the following properties [1]:
- Strictly monotonic functions (e.g., Softplus, exp) that operate element-wise do not impose cross-element constraints, and thus do not change the DoF.
- For a vector $\mathbf{v} \in \mathbb{R}^{n}$, its DoF is $n$. When imposing L1 normalization, i.e., $\|\mathbf{v}\|_1 = 1$, the DoF is reduced by 1. Similarly, for a matrix $\mathbf{W} \in \mathbb{R}^{D \times D}$, row-wise L1 normalization reduces its DoF from $D^2$ to $D^2 - D$.
-
Based on these, we have the following DoF equations for NormLin and self-attention:
| Metric | ||||
|---|---|---|---|---|
| DoF | ||||
| Best Rank | ||||
| Worst Rank | 0 | 1 | 0 | 1 |
-
In the above table, we omit scalar parameters for clarity, as they do not affect DoF or ranks. is a piecewise function with a maximum value of , defined as if and otherwise. The subtraction of in can be understood from the fact that, for any invertible matrix , and satisfy . In SAMformer, the default hidden dimension d is set to 16, which is smaller than the channel count of most common datasets.
-
We can see that the NormLin layer benefits from a larger DoF of the weight matrix (third column) compared to that of the attention matrix in self-attention (last column), since the hidden dimension $d$ is smaller than the channel count in practice. This indicates a larger search space of attention/weight matrices and potentially better representational capacity [2].
-
Empirically, we report the minimum singular values (SVs), median SVs, ranks, and entropies of the attention and weight matrices for SAMformer and SAMformer+NormLin (denoted as NormLin in the following table). The lookback and prediction lengths are set to 96. Since the attention matrix in SAMformer is input-dependent, we report the mean, minimum, and maximum values (denoted as S_mean, S_min, S_max, respectively) across the test sets for SAMformer. The matrix rank is computed as the number of SVs greater than a small threshold. The SAM optimizer is used with the recommended hyperparameters in the SAMformer paper.
| Dataset | Min Attn. SV | Median Attn. SV | Attn. Rank | Attention Entropy | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NormLin | S_mean | S_min | S_max | NormLin | S_mean | S_min | S_max | NormLin | S_mean | S_min | S_max | NormLin | S_mean | S_min | S_max | |
| ETTh1 | 0.009 | 0.002 | 0.000 | 0.042 | 0.311 | 0.318 | 0.032 | 0.776 | 7 | 6.620 | 5 | 7 | 1.546 | 1.473 | 0.940 | 1.843 |
| ETTm2 | 0.001 | 0.013 | 0.000 | 0.326 | 0.473 | 0.418 | 0.011 | 0.997 | 7 | 6.780 | 4 | 7 | 1.277 | 0.823 | 0.148 | 1.375 |
| ECL | 0.000 | 0.000 | 0.000 | 0.000 | 0.018 | 0.000 | 0.000 | 0.000 | 311 | 83.825 | 68 | 100 | 5.701 | 4.887 | 4.336 | 5.487 |
| Traffic | 0.000 | 0.000 | 0.000 | 0.000 | 0.013 | 0.000 | 0.000 | 0.000 | 828 | 140.979 | 93 | 197 | 6.657 | 5.536 | 3.384 | 6.405 |
| Weather | 0.005 | 0.000 | 0.000 | 0.000 | 0.106 | 0.015 | 0.002 | 0.081 | 21 | 14.626 | 12 | 17 | 2.828 | 2.596 | 1.218 | 2.987 |
| Exchange | 0.019 | 0.003 | 0.000 | 0.020 | 0.108 | 0.053 | 0.005 | 0.197 | 8 | 7.676 | 6 | 8 | 1.876 | 1.822 | 1.080 | 2.060 |
- We can observe that, compared with the vanilla SAMformer, the plug-in NormLin layer generally produces larger SVs, higher ranks, and, at the same time, better prevents attention entropy collapse. These results indicate better representational capacity than simple attention with rank regularization. In addition, this rank improvement phenomenon is consistent with the findings in Appendix E.2 of our paper.
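A small sketch of how such matrix statistics can be computed; the rank tolerance and the row-entropy definition below are our assumptions, not necessarily the exact ones used above:

```python
import numpy as np

def matrix_stats(w: np.ndarray, tol: float = 1e-3):
    """Singular values, numerical rank, and mean row entropy of an
    attention/weight matrix w (assumed non-negative, e.g., row-stochastic)."""
    s = np.linalg.svd(w, compute_uv=False)
    rank = int((s > tol).sum())                        # number of SVs above tolerance
    p = np.clip(w, 1e-12, None)
    p = p / p.sum(axis=-1, keepdims=True)              # treat each row as a distribution
    entropy = float(-(p * np.log(p)).sum(axis=-1).mean())
    return s.min(), np.median(s), rank, entropy

w = np.abs(np.random.randn(321, 321))
w = w / w.sum(axis=-1, keepdims=True)                  # row-wise L1-normalized example
print(matrix_stats(w))
```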
References:
[1] Strang, Linear Algebra and Its Applications, International Thomson Publishing, 3rd edition.
[2] Attention is not all you need: pure attention loses rank doubly exponentially with depth. ICML 2021.
Thank you for your timely response and for raising your score : ) We are glad that all your initial concerns have been effectively addressed. Your comments are important for enhancing the quality of our manuscript, and we will definitely incorporate these revisions in our camera-ready version - thank you so much!
(b) The weight matrix is trained to approximate the correlation matrix among channels.
- In Figure 5 of the paper, we observe that the learned NormLin weights exhibit strong similarity to the channel correlation matrix. Motivated by this observation, we pre-compute the correlation matrix among channels on the training set and directly use it as a fixed weight matrix, resulting in the following variant:
We refer to OLinear with this variant as OLinear-C.
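A minimal sketch of this fixed-weight construction; whether the pre-computed matrix is additionally passed through the Softplus and row-wise L1 normalization is our assumption here, and the released code may differ:

```python
import numpy as np
import torch
import torch.nn.functional as F

# train_series: (num_steps, num_channels) training split of a dataset
train_series = np.random.randn(10000, 7)                   # placeholder data

# Pre-compute the channel correlation matrix on the train set
corr = np.corrcoef(train_series, rowvar=False)              # D x D

# Use it as a frozen NormLin weight (assumed: same Softplus + row-wise L1 norm)
w = F.softplus(torch.tensor(corr, dtype=torch.float32))
w = w / w.sum(dim=-1, keepdim=True)
w.requires_grad_(False)                                      # fixed, not trained
print(w.shape)                                               # torch.Size([7, 7])
```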
- We then compare the average performance of OLinear-C with OLinear across multiple prediction lengths. ETT denotes the average results across the ETTh1, ETTh2, ETTm1 and ETTm2 datasets. As shown below, the average performance gap between OLinear and OLinear-C on these datasets is negligible (only 0.2%)!
| Model | ECL (MSE) | ECL (MAE) | Traffic (MSE) | Traffic (MAE) | ETT (MSE) | ETT (MAE) | Solar (MSE) | Solar (MAE) | PEMS (MSE) | PEMS (MAE) | CarSales (MSE) | CarSales (MAE) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OLinear | 0.159 | 0.248 | 0.451 | 0.247 | 0.359 | 0.376 | 0.215 | 0.217 | 0.094 | 0.187 | 0.330 | 0.305 |
| OLinear-C | 0.161 | 0.249 | 0.451 | 0.247 | 0.359 | 0.376 | 0.215 | 0.217 | 0.094 | 0.187 | 0.330 | 0.305 |
-
If the ideal weight matrix is the channel correlation matrix, this partly explains the performance advantage of NormLin over self-attention.
- Because attention matrices in self-attention are input-dependent, it is harder for them to converge to the input-independent multivariate correlation matrix.
- In contrast, NormLin uses an input-independent weight matrix, making this approximation easier.
-
The channel correlation matrix is usually high-rank. How does NormLin_LoRA, whose weight matrix is built from a low-rank product, still perform well, given that this product has rank at most $d$? The reason is that the subsequent Softplus and row-wise normalization transform each row differently and thus can increase the matrix rank [1], enabling a better approximation of the correlation matrix.
-
We will accordingly tone down the analysis of weight-matrix rank (Lines 187-193 ) and DoF (Lines 196-198) in the final paragraph of Section 4.3, and relocate the OLinear-C exposition from Section 5.2 to Section 4.3, to better clarify the mechanism underlying the NormLin layer.
Quadratic cost
Thanks for your suggestion! We will definitely add a figure showing performance versus bottleneck size in the relevant section to improve readability.
We are eager to hear your feedback. We sincerely hope that our new comments can address your primary remaining concerns. We’d deeply appreciate it if you could let us know whether your concerns have been addressed -- thank you so much!
Thanks for your time,
Authors of #10716
References:
[1] DenseHMM: Learning Hidden Markov Models by Learning Dense Representations. NeurIPS 2020.
[2] Feature selection, L1 vs. L2 regularization, and rotational invariance. ICML 2004.
Thank you for the comprehensive response. You have effectively addressed my initial concerns and outlined planned enhancements for the camera-ready version. In light of these updates, I will raise my overall rating. No further questions.
Dear Reviewer jUXb,
We highly appreciate your timely and constructive responses, and we are glad that you found our expanded discussion around Degrees-of-Freedom very useful, and the quadratic-cost analysis clearly illustrates the efficiency trade-offs.
Yes, we agree with the reviewer that, based on our LoRA experiments, the observed performance gains of NormLin do not primarily arise from increased rank or Degrees-of-Freedom alone. To further understand the reasons for the method's success, we added more experiments and in-depth analysis, showing that: (a) the L1 normalization contributes more to NormLin's performance gains than the Softplus function, and (b) the weight matrix is trained to approximate the correlation matrix among channels.
(a) The L1 normalization contributes more to NormLin's performance gains than the Softplus function.
What component affects NormLin's performance most?
-
NormLin consists of two components:
- The transformation function (Softplus);
- Normalization strategy (L1 norm).
-
We add ablation studies on 5 transformation functions (Softplus, Identity, Softmax, Sigmoid, ReLU) and 3 normalization strategies (L1, L2, None), where Identity denotes no transformation and None denotes no normalization. The average results across four prediction lengths are presented below.
| Trans. Fun. | Norm | ECL (MSE) | ECL (MAE) | Traffic (MSE) | Traffic (MAE) | Solar (MSE) | Solar (MAE) | PEMS03 (MSE) | PEMS03 (MAE) | Weather (MSE) | Weather (MAE) | ILI (S2) (MSE) | ILI (S2) (MAE) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Softplus | L1 | 0.159 | 0.248 | 0.451 | 0.247 | 0.215 | 0.217 | 0.095 | 0.199 | 0.237 | 0.260 | 1.764 | 0.802 |
| Identity | L1 | 0.167 | 0.255 | 0.499 | 0.252 | 0.285 | 0.224 | 0.096 | 0.199 | 0.243 | 0.264 | 1.844 | 0.818 |
| Softmax | L1 | 0.163 | 0.251 | 0.452 | 0.247 | 0.215 | 0.217 | 0.096 | 0.199 | 0.236 | 0.259 | 1.802 | 0.811 |
| Sigmoid | L1 | 0.161 | 0.250 | 0.450 | 0.247 | 0.215 | 0.217 | 0.096 | 0.199 | 0.239 | 0.260 | 1.767 | 0.802 |
| ReLU | L1 | 0.163 | 0.251 | 0.454 | 0.247 | 0.215 | 0.217 | 0.097 | 0.199 | 0.239 | 0.262 | 1.759 | 0.802 |
| Softplus | L2 | 0.206 | 0.298 | 0.647 | 0.328 | 0.293 | 0.273 | 0.163 | 0.254 | 0.238 | 0.260 | 1.874 | 0.827 |
| Identity | L2 | 0.171 | 0.259 | 0.518 | 0.259 | 0.224 | 0.222 | 0.102 | 0.205 | 0.241 | 0.262 | 1.838 | 0.817 |
| Softmax | L2 | 0.201 | 0.294 | 0.648 | 0.328 | 0.287 | 0.267 | 0.164 | 0.254 | 0.237 | 0.260 | 1.843 | 0.820 |
| Sigmoid | L2 | 0.214 | 0.302 | 0.647 | 0.328 | 0.293 | 0.272 | 0.164 | 0.255 | 0.238 | 0.261 | 1.854 | 0.823 |
| ReLU | L2 | 0.201 | 0.294 | 0.648 | 0.328 | 0.285 | 0.266 | 0.163 | 0.254 | 0.237 | 0.260 | 1.840 | 0.821 |
| Softplus | None | 0.211 | 0.301 | 0.652 | 0.331 | 0.295 | 0.274 | 0.163 | 0.254 | 0.259 | 0.274 | 1.925 | 0.846 |
| Identity | None | 0.183 | 0.280 | 0.609 | 0.333 | 0.232 | 0.231 | 0.130 | 0.231 | 0.242 | 0.264 | 1.888 | 0.830 |
| Sigmoid | None | 0.212 | 0.301 | 0.647 | 0.328 | 0.296 | 0.275 | 0.170 | 0.258 | 0.262 | 0.278 | 1.890 | 0.827 |
| ReLU | None | 0.199 | 0.293 | 0.648 | 0.328 | 0.293 | 0.273 | 0.168 | 0.256 | 0.238 | 0.260 | 1.806 | 0.817 |
-
From above, we found that the L1 normalization contributes more to NormLin's performance gains than the Softplus function. Specifically,
- (i) the L1 norm outperforms L2 norm and no normalization by 10% and 12%, respectively;
- (ii) under L1 normalization, Softplus outperforms Identity, Softmax, Sigmoid, and ReLU by 5%, 1%, 0.2%, and 0.3%, respectively.
-
The reasons behind the performance advantage of L1 norm could be intricate. It has become a natural choice since it induces a probabilistic distribution over each row of the weight matrix.
- Without L1 normalization, each row effectively includes a learnable scaling factor, whether using L2 normalization or no normalization at all. However, this scaling is rendered meaningless by the subsequent LayerNorm operation, which eliminates any learned magnitude differences.
- In contrast, L1 normalization emphasizes the relative distribution of weights within each row rather than their absolute scale, which may lead to improved performance.
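A minimal sketch of the ablated layer described above, with the transformation function and row normalization made configurable (illustrative, not the exact released module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TRANSFORMS = {
    "softplus": F.softplus, "identity": lambda w: w, "sigmoid": torch.sigmoid,
    "relu": F.relu, "softmax": lambda w: torch.softmax(w, dim=-1),
}

class NormLinAblate(nn.Module):
    """NormLin-style channel mixing with a configurable transformation
    function and row-wise normalization strategy (l1 / l2 / none)."""
    def __init__(self, num_channels: int, trans: str = "softplus", norm: str = "l1"):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_channels, num_channels) * 0.02)
        self.trans, self.norm = trans, norm

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_channels, hidden_dim)
        w = TRANSFORMS[self.trans](self.weight)
        if self.norm == "l1":                       # row-wise L1 normalization
            w = w / (w.abs().sum(dim=-1, keepdim=True) + 1e-8)
        elif self.norm == "l2":                     # row-wise L2 normalization
            w = w / (w.norm(dim=-1, keepdim=True) + 1e-8)
        return torch.einsum('ij,bjh->bih', w, x)

x = torch.randn(4, 21, 64)
print(NormLinAblate(21, "softplus", "l1")(x).shape)   # torch.Size([4, 21, 64])
```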
The paper introduces OLinear, a linear-based model for multivariate time series forecasting with two key components:
-
OrthoTrans: A data-adaptive transformation that diagonalizes the temporal Pearson correlation matrix, decorrelating features and simplifying forecasting.
-
NormLin: A lightweight linear layer with row-wise normalization that outperforms self-attention in accuracy and efficiency.
Experiments on 24 datasets demonstrate state-of-the-art performance, particularly for long-term forecasting, while maintaining low computational overhead.
Strengths and Weaknesses
Strengths
- Theorem 1 formally connects the temporal-correlation structure of the data with the orthogonal transformation used by OrthoTrans, giving the model a clear theoretical underpinning and enhancing interpretability.
- NormLin replaces multi-head self-attention with a row-normalized linear layer, cutting both FLOPs and memory footprint by roughly 50 % while preserving—or improving—forecasting accuracy.
- OLinear achieves state-of-the-art results on 140 forecasting tasks drawn from 24 public datasets, demonstrating robustness across domains and long-horizon scenarios.
- Both OrthoTrans and NormLin are modular, training-free components that can be plugged into existing forecasters to boost performance with minimal engineering effort.
- The experimental section is notably comprehensive, featuring extensive ablations, scalability tests, and comparisons against a wide range of baselines.
Weaknesses
- While training-time speed and memory are reported, the paper does not quantify inference-time latency or serving cost—metrics that are crucial in streaming or high-frequency settings; deployment-cost analysis is therefore incomplete.
- Important modern linear or correlation-based baselines (e.g., RLinear, TSMixer, StemGNN, DEPTS) are omitted, leaving open whether OLinear’s gains hold against the strongest current alternatives.
- The claim that NormLin is a “universal token dependency learner” appears overstated: its superiority is demonstrated only on time-series data, with no evidence provided for other modalities or tasks.
Questions
- Why does NormLin show better gradient dynamics than attention? Is there any theoretical support?
- What is the relationship between OrthoTrans and PCA?
- Theorem 1 assumes Gaussianity. How sensitive is OLinear to deviations from this assumption in real-world data?
- How does performance degrade with noisy/missing data or non-stationary correlations?
- What is the inference latency for ultra-long sequences (e.g., 1M time steps)? Also, related references for long-term forecasting, such as Pyraformer (ICLR 2022), should be discussed.
Limitations
Yes
Final Justification
Thanks to the authors for providing a detailed response, especially the additional supporting data. I keep my original rating and think this paper is acceptable.
Formatting Concerns
No
Thank you for your positive comments on our work. We sincerely appreciate the reviewer’s recognition of our work as theoretically well-founded, experimentally comprehensive, and achieving state-of-the-art performance. Here are our responses to your concerns:
W1: On the deployment cost
- We report the training and inference resource footprint of OLinear and baseline forecasters in Table 37 (Page 48), including inference time and GPU memory consumption. OLinear achieves higher inference efficiency than state-of-the-art Transformer-based forecasters (e.g., TimeMixer++, iTransformer, and CARD), benefiting from the simplified NormLin module. Moreover, Table 38 shows the efficiency differences before and after applying the NormLin module to Transformer-based forecasters.
W2: On more baselines
- We compare OLinear with RLinear and TSMixer using the MSE metric (see below). The results of RLinear and TSMixer are taken from iTransformer and SAMformer, respectively. On average, OLinear outperforms these two forecasters by 17.3% and 25.6%, respectively.
| Dataset | Hor. | OLinear | RLinear | TSMixer |
|---|---|---|---|---|
| ETTm2 | 96 | 0.169 | 0.182 | 0.211 |
| 192 | 0.232 | 0.246 | 0.252 | |
| 336 | 0.291 | 0.307 | 0.303 | |
| 720 | 0.389 | 0.407 | 0.390 | |
| ECL | 96 | 0.131 | 0.201 | 0.173 |
| 192 | 0.150 | 0.201 | 0.204 | |
| 336 | 0.165 | 0.215 | 0.217 | |
| 720 | 0.191 | 0.257 | 0.242 | |
| Exchange | 96 | 0.082 | 0.093 | 0.343 |
| 192 | 0.171 | 0.184 | 0.342 | |
| 336 | 0.331 | 0.351 | 0.484 | |
| 720 | 0.837 | 0.886 | 1.204 | |
| Traffic | 96 | 0.398 | 0.649 | 0.409 |
| 192 | 0.439 | 0.601 | 0.637 | |
| 336 | 0.464 | 0.609 | 0.747 | |
| 720 | 0.502 | 0.647 | 0.688 | |
| Weather | 96 | 0.153 | 0.192 | 0.214 |
| 192 | 0.200 | 0.240 | 0.231 | |
| 336 | 0.258 | 0.292 | 0.279 | |
| 720 | 0.337 | 0.364 | 0.343 |
- In Section 5.1, OLinear is compared with 11 carefully selected and widely acknowledged state-of-the-art forecasting models. We believe these results demonstrate the performance superiority of OLinear. For completeness, we will also include discussions on RLinear, TSMixer, StemGNN, and DEPTS in the final draft.
W3: On the claim of “universal token dependency learner”
- On Line 289, we explicitly restrict the claim “universal token dependency learner” to the field of “time series forecasting.” However, on Lines 278 and 312, this field restriction was unintentionally omitted. We will revise these two sentences in the final draft to ensure accuracy. Thank you for pointing this out!
Q1: On the gradient dynamics
-
In Appendix B, we analyze the Jacobian matrices of the non-linear transformations in the self-attention mechanism and the NormLin layer. For self-attention, the row-wise softmax $\mathbf{a} = \operatorname{softmax}(\mathbf{z})$ has Jacobian $\frac{\partial \mathbf{a}}{\partial \mathbf{z}} = \operatorname{diag}(\mathbf{a}) - \mathbf{a}\mathbf{a}^{\top}$, where $\operatorname{diag}(\mathbf{a})$ is the diagonal matrix with $\mathbf{a}$ as its diagonal.
-
For the NormLin layer, the row-wise L1 normalization $\mathbf{y} = \mathbf{u}/\|\mathbf{u}\|_1$ with $\mathbf{u} = \operatorname{Softplus}(\mathbf{z}) > 0$ has Jacobian $\frac{\partial \mathbf{y}}{\partial \mathbf{u}} = \frac{1}{\|\mathbf{u}\|_1}\left(\mathbf{I} - \mathbf{y}\mathbf{1}^{\top}\right)$.
Dear Reviewer 5fFn,
We would like to thank you once again for your efforts and constructive comments. We would be happy to address any further questions you may have.
All the best,
Authors
The paper introduces OLinear, a computationally efficient alternative to transformer-based models for multivariate time series forecasting. The architecture combines two key components that can also be used as plugins: OrthoTrans, which applies orthogonal transformations to decorrelate input features, and NormLin, a linear layer that aims to reproduce channel-wise attention. OLinear consists of a Cross-Series Learner (CSL) using NormLin layers for multivariate correlations and an Intra-Series Learner (ISL) with MLPs for individual series processing. Results show OLinear achieves competitive or superior performance while reducing computations.
Strengths and Weaknesses
Strengths:
- The two-component design (OrthoTrans, NormLin for cross-variable interactions) creates a clean separation of concerns where temporal decorrelation is handled by preprocessing and multivariate correlations by learned linear layers.
- These two modules can work as plugins for existing transformer architectures, with validated improvements on iTransformer (+5.1%) and PatchTST (+10.1%), which is really interesting
- The empirical benchmark is exhaustive, covering 24 datasets and 140 configurations, including zero-shot and few-shot scenarios and seems to show strong results. The ablation study is extensive.
- Rank analysis demonstrates good expressiveness of the model.
Weaknesses:
- The proposed way to evaluate the correlation matrix is confusing and lacks proper formalization. I guess you want to estimate the autocorrelation of a block of T timesteps (context length) but what does this mean formally? The mathematical framework should be more rigorously defined as the concept is non-trivial.
- In addition, when reading the description of the approach, two intuitive baselines came to my mind: (a) instead of an eigenvector decomposition of the correlation matrix, we could directly perform a data rotation by PCA (i.e., working with the centered data matrix); (b) the correlation matrix could alternatively be estimated through a sliding-window approach by computing cross-correlations between timestamps across different windows. It would be valuable to perform additional experiments comparing with these approaches in order to gain more insight into the chosen methodology.
- Averaging correlations across N variables dilutes variable-specific temporal structures without theoretical justification. Why did not you apply decorrelations for each channel separately?
- Decorrelation effectiveness is not shown: no metric demonstrates $\mathbf{Q}^{\top}\,\mathrm{CorrMat}\,\mathbf{Q} \approx \boldsymbol{\Lambda}$, nor is a residual correlation analysis provided (on a validation set, for example).
- When looking at the source code, it appears that there are some implementation errors. The authors consider the Pearson correlation, but in `dataset/Generate_corrmat.ipynb`, we have `cov_matrix = cov_matrix / diag_vec`, which seems to divide only by σ_i, not σ_i × σ_j, producing an invalid asymmetric correlation matrix.
- Some other architecture choices are not well-explained nor well-motivated (e.g., the CSL block, the dimension extension). Please see my questions below.
- Reducing the number of FLOPs is beneficial, but the complexity (memory for example) remains quadratic with respect to D, which can still result in substantial computational overhead for high-dimensional multivariate series.
Questions
-
On the asymmetric normalization: the `cov_matrix / diag_vec` normalization produces an asymmetric matrix that isn't a true correlation matrix. Is this an implementation error or a deliberate choice? If intentional, what is the theoretical justification? Why is this implementation in a notebook? Maybe I didn't find the right file that saves the matrices.
-
On the decorrelation architecture: Have you explored architectures where each variable has its own transformation matrix to preserve variable-specific temporal structures? What information is lost by averaging correlations across all variables? Did you try the ablation study with PCA? What is the intuition? Did you try an approach with a learnable transformation matrix?
-
What is the intuition behind the dimension extension? To what extent does it differ from the linear encoder? What does it bring? Why is decorrelation applied after the dimension extension?
-
In the CSL: what is the purpose of the first linear layer, and what is the intuition behind it? From my understanding, you want to reproduce the architecture of self-attention blocks, but then I don't see why the first layer is needed.
-
In Table 6, could you elaborate on what the "Temporal NormLin" baseline exactly is? Do you keep the intra-series learner? Could you be more explicit about this architecture?
-
Could you also include a comparison with other lightweight multivariate methods such as SAMformer [1] and TSMixer [2]?
References:
- [1] (Ilbert et al., 2024) SAMFormer: Unlocking the potential of transformers in time series forecasting with sharpness-aware minimization and channel-wise attention.
- [2] (Chen et al., 2023) TSMixer: An all-mlp architecture for time series forecasting.
Limitations
- As said earlier, the memory complexity remains quadratic with respect to D.
Final Justification
I am raising my score as the authors have clarified most of my concerns including:
- Clarification on the definition of the temporal correlation matrix and the way it is calculated,
- Empirical justification that the approach effectively decorrelates the channels,
- Extension of the experiments by incorporating more baselines,
- Other concerns related to ablation study.
Formatting Concerns
No apparent formatting concerns.
Thank you for your invaluable review. We sincerely appreciate the reviewer’s recognition that our plug-and-play OrthoTrans and NormLin modules are "really interesting" and the experiments are "exhaustive" and "extensive". Below are our responses to the concerns:
Q1: On the computation of the temporal correlation matrix and the 'asymmetric' normalization
- We first clarify our process of computing the temporal correlation matrix. For clarity, we use a univariate series. Let $\mathbf{x} \in \mathbb{R}^{M}$ denote the training series. We then generate $T$ lagged sub-series (each of length $M - T$):
$$ \mathbf{s}_{i} = \mathbf{x}[i : M - T + i],\quad i = 0, 1, \dots, T-1. $$
The $(i, j)$-th entry of the temporal correlation matrix is the Pearson correlation between $\mathbf{s}_i$ and $\mathbf{s}_j$.
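For concreteness, a minimal NumPy sketch of this construction (illustrative names, not the released preprocessing script):

```python
import numpy as np

def temporal_corr_matrix(x: np.ndarray, T: int) -> np.ndarray:
    """Pearson correlation matrix (T x T) between the T lagged sub-series
    s_i = x[i : M - T + i] of a univariate training series x of length M."""
    M = len(x)
    lagged = np.stack([x[i: M - T + i] for i in range(T)], axis=0)  # (T, M-T)
    return np.corrcoef(lagged)                                       # (T, T)

x = np.cumsum(np.random.randn(5000))        # placeholder training series
corr = temporal_corr_matrix(x, T=96)
eigvals, Q = np.linalg.eigh(corr)           # orthogonal basis used for the transform
```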
We would like to thank you again for your constructive comments!
-
To fully address your concerns about the quadratic cost of the NormLin layer, we introduce a bottleneck block to reduce computational complexity:
- the channels are first projected to a fixed width (denoted as bottleneck size),
- the NormLin module is applied in this lower-dimensional space,
- and the representation is finally projected back to channels.
-
In this way, the complexity of the NormLin layer is reduced from $O(D^2)$ to $O(Dd)$, where $d$ denotes the bottleneck size. We report the MAEs and resource footprints under various bottleneck sizes below.
| Dataset | Hor. | Ori. | Bottleneck (8) | Bottleneck (16) | Bottleneck (64) | Bottleneck (128) |
|---|---|---|---|---|---|---|
| ETTh1 | 96 | 0.382 | 0.382 | 0.381 | 0.382 | 0.383 |
| 192 | 0.414 | 0.415 | 0.412 | 0.415 | 0.416 | |
| 336 | 0.438 | 0.438 | 0.440 | 0.443 | 0.442 | |
| 720 | 0.462 | 0.467 | 0.467 | 0.468 | 0.461 | |
| ECL | 96 | 0.221 | 0.236 | 0.236 | 0.241 | 0.240 |
| 192 | 0.238 | 0.255 | 0.250 | 0.256 | 0.254 | |
| 336 | 0.254 | 0.264 | 0.271 | 0.270 | 0.267 | |
| 720 | 0.279 | 0.302 | 0.291 | 0.291 | 0.295 | |
| Traffic | 96 | 0.226 | 0.255 | 0.255 | 0.263 | 0.280 |
| 192 | 0.241 | 0.273 | 0.289 | 0.278 | 0.277 | |
| 336 | 0.250 | 0.270 | 0.275 | 0.283 | 0.281 | |
| 720 | 0.270 | 0.299 | 0.296 | 0.298 | 0.301 | |
| Weather | 96 | 0.190 | 0.192 | 0.188 | 0.190 | 0.190 |
| 192 | 0.235 | 0.240 | 0.237 | 0.239 | 0.241 | |
| 336 | 0.280 | 0.284 | 0.287 | 0.281 | 0.282 | |
| 720 | 0.333 | 0.338 | 0.336 | 0.337 | 0.338 | |
| PEMS03 | 12 | 0.159 | 0.162 | 0.161 | 0.160 | 0.162 |
| 24 | 0.179 | 0.186 | 0.183 | 0.182 | 0.181 | |
| 48 | 0.210 | 0.223 | 0.224 | 0.224 | 0.224 | |
| 96 | 0.247 | 0.275 | 0.266 | 0.264 | 0.271 |
| Dataset | Metric | Ori. | Bottleneck (8) | Bottleneck (16) | Bottleneck (64) | Bottleneck (128) |
|---|---|---|---|---|---|---|
| Traffic | Params(M) | 6.17 | 4.71 | 4.74 | 4.92 | 5.16 |
| FLOPs(G) | 4.91 | 4.16 | 4.18 | 4.27 | 4.39 | |
| T.T. (s/iter) | 0.02 | 0.015 | 0.015 | 0.016 | 0.017 | |
| T.M.(GB) | 1.01 | 0.98 | 0.99 | 1.00 | 1.01 | |
| I.T.(ms/iter) | 5.71 | 5.60 | 5.63 | 5.71 | 5.80 | |
| I.M. (GB) | 0.43 | 0.39 | 0.39 | 0.40 | 0.41 | |
| ECL | Params(M) | 4.79 | 4.59 | 4.60 | 4.67 | 4.78 |
| FLOPs(G) | 1.65 | 1.55 | 1.56 | 1.59 | 1.65 | |
| T.T. (ms/iter) | 7.75 | 7.45 | 7.58 | 7.61 | 7.80 | |
| T.M.(GB) | 0.45 | 0.44 | 0.45 | 0.45 | 0.46 | |
| I.T.(ms/iter) | 2.11 | 2.04 | 2.05 | 2.08 | 2.09 | |
| I.M. (GB) | 0.17 | 0.16 | 0.16 | 0.16 | 0.17 | |
| PEMS03 | Params(M) | 4.84 | 4.60 | 4.61 | 4.69 | 4.80 |
| FLOPs(G) | 1.85 | 1.73 | 1.74 | 1.77 | 1.83 | |
| T.T. (ms/iter) | 8.14 | 8.05 | 8.06 | 8.12 | 8.16 | |
| T.M.(GB) | 0.48 | 0.48 | 0.48 | 0.49 | 0.50 | |
| I.T.(ms/iter) | 2.23 | 2.18 | 2.18 | 2.21 | 2.24 | |
| I.M. (GB) | 0.20 | 0.18 | 0.18 | 0.18 | 0.19 |
-
From above, we find that:
- (1) the bottlenecked NormLin variants incur approximately a 4% performance degradation,
- but (2) consume fewer resources, especially on datasets with a large number of channels.
-
This provides readers with additional flexibility to decide whether the full quadratic version of NormLin is worthwhile for their specific use cases.
We sincerely hope that our new comments can address your concerns. We’d deeply appreciate it if you could let us know whether your concerns have been addressed.
Thanks for your time,
Authors
Dear Reviewer CEMa,
As the discussion deadline approaches, we are wondering whether our responses have properly addressed your concerns? Your feedback would be extremely helpful to us. If you have further comments or questions, we hope for the opportunity to respond to them.
Many thanks,
10716 Authors
Thank you for the comprehensive rebuttal that has clarified some of my concerns.
Q1: While the justification for `cov_matrix / diag_vec` is now clear, my concern about the theoretical foundation remains. You've addressed the orthogonality of $\mathbf{Q}$ (which is indeed guaranteed by construction), but my primary question was about the effectiveness of the decorrelation itself: can you show that $\mathbf{Q}^{\top}\,\mathrm{CorrMat}\,\mathbf{Q}$ is actually diagonal?
More fundamentally, I question whether your sliding-window approach actually estimates the autocorrelation of your input context. Your method computes correlations between overlapping subsequences $\mathbf{s}_i$ and $\mathbf{s}_j$, each with its own empirical mean (in practice these could indeed be almost equal). However, true autocorrelation estimation requires:
$$\gamma(k) = \mathbb{E}[(X_t - \mu)(X_{t+k} - \mu)],$$
where $\mu$ is a common reference point (typically the global mean). Your approach instead computes
$$\mathrm{Corr}(\mathbf{s}_i, \mathbf{s}_j),$$
where $\mathbf{s}_i$ has mean $\bar{s}_i$ and $\mathbf{s}_j$ has mean $\bar{s}_j$.
This is fundamentally different from autocorrelation estimation. For your input context of length $T$, the standard approach would estimate the autocorrelation using the entire available time series with a consistent global mean, not through correlations between windowed subsequences with varying local means.
Could you provide theoretical justification for why your windowing-based correlation matrix captures the same temporal dependencies as classical autocorrelation estimation? And more importantly, can you empirically show that this decorrelation actually works on unseen test data?
Q2: Your justification for channel averaging (CA) over channel-individual (CI) matrices is reasonable given the computational and performance trade-offs. However, regarding the PCA comparison, that wasn't my point. Standard PCA would decompose the series into non-overlapping windows of length $T$ and center them. You would then have a data matrix with $T$ columns, decompose it to obtain the orthogonal matrix, and proceed with the next steps as you do. This is different from your approach, which applies $T$ overlapping sliding windows and computes the Pearson correlation between them.
Q3: The dimension extension explanation is experimentally adequate.
Q5: The clarification on how NormLin is applied in the temporal dimension is now clear.
Q6: The comparison with recent lightweight baselines (SamFormer, SimpleTM, TQNet, TimePro, TimeBase) is valuable and shows competitive performance. The bottleneck analysis effectively addresses the computational complexity concerns while providing a practical trade-off between performance and efficiency.
We would like to thank you again for your constructive comments! To fully address your concerns, we compare OLinear with additional lightweight baselines, including several recently accepted works: SimpleTM (ICLR 2025), TQNet (ICML 2025), TimePro (ICML 2025), and TimeBase (ICML 2025). We report the MAEs below, highlighting the best results in bold. The lookback length is uniformly set to 96. Remarkably, OLinear outperforms SimpleTM, TQNet, TimePro, and TimeBase by 5%, 5%, 7%, and 16% on average, respectively, establishing OLinear as a strong new baseline for lightweight time series forecasting.
| Model | Hor. | OLinear | SimpleTM (ICLR 2025) | TQNet (ICML 2025) | TimePro (ICML 2025) | TimeBase (ICML 2025) |
|---|---|---|---|---|---|---|
| ETTm1 | 96 | 0.334 | 0.361 | 0.353 | 0.364 | 0.388 |
| 192 | 0.363 | 0.380 | 0.378 | 0.383 | 0.409 | |
| 336 | 0.385 | 0.404 | 0.401 | 0.409 | 0.421 | |
| 720 | 0.426 | 0.438 | 0.440 | 0.446 | 0.461 | |
| Avg | 0.377 | 0.396 | 0.393 | 0.400 | 0.420 | |
| ETTm2 | 96 | 0.249 | 0.257 | 0.256 | 0.260 | 0.271 |
| 192 | 0.290 | 0.299 | 0.298 | 0.303 | 0.309 | |
| 336 | 0.328 | 0.338 | 0.340 | 0.342 | 0.346 | |
| 720 | 0.387 | 0.395 | 0.396 | 0.399 | 0.401 | |
| Avg | 0.313 | 0.322 | 0.323 | 0.326 | 0.332 | |
| ETTh1 | 96 | 0.382 | 0.392 | 0.393 | 0.398 | 0.392 |
| 192 | 0.414 | 0.421 | 0.426 | 0.429 | 0.423 | |
| 336 | 0.438 | 0.438 | 0.446 | 0.450 | 0.443 | |
| 720 | 0.462 | 0.462 | 0.470 | 0.474 | 0.458 | |
| Avg | 0.424 | 0.428 | 0.434 | 0.438 | 0.429 | |
| ETTh2 | 96 | 0.329 | 0.338 | 0.343 | 0.345 | 0.376 |
| 192 | 0.379 | 0.387 | 0.393 | 0.394 | 0.405 | |
| 336 | 0.415 | 0.401 | 0.427 | 0.431 | 0.440 | |
| 720 | 0.431 | 0.436 | 0.446 | 0.445 | 0.477 | |
| Avg | 0.388 | 0.391 | 0.402 | 0.403 | 0.424 | |
| ECL | 96 | 0.221 | 0.235 | 0.229 | 0.234 | 0.279 |
| 192 | 0.238 | 0.247 | 0.247 | 0.249 | 0.281 | |
| 336 | 0.254 | 0.267 | 0.264 | 0.267 | 0.295 | |
| 720 | 0.279 | 0.293 | 0.294 | 0.299 | 0.327 | |
| Avg | 0.248 | 0.260 | 0.259 | 0.262 | 0.295 | |
| Traffic | 96 | 0.226 | 0.274 | 0.261 | 0.269 | 0.384 |
| 192 | 0.241 | 0.280 | 0.271 | 0.276 | 0.362 | |
| 336 | 0.250 | 0.290 | 0.277 | 0.287 | 0.365 | |
| 720 | 0.270 | 0.309 | 0.295 | 0.312 | 0.386 | |
| Avg | 0.247 | 0.289 | 0.276 | 0.286 | 0.374 | |
| Weather | 96 | 0.190 | 0.207 | 0.200 | 0.207 | 0.215 |
| 192 | 0.235 | 0.248 | 0.245 | 0.254 | 0.256 | |
| 336 | 0.280 | 0.290 | 0.287 | 0.296 | 0.297 | |
| 720 | 0.333 | 0.341 | 0.342 | 0.346 | 0.348 | |
| Avg | 0.260 | 0.271 | 0.269 | 0.276 | 0.279 | |
| Solar | 96 | 0.191 | 0.232 | 0.233 | 0.237 | 0.363 |
| 192 | 0.213 | 0.247 | 0.257 | 0.263 | 0.404 | |
| 336 | 0.229 | 0.257 | 0.263 | 0.281 | 0.398 | |
| 720 | 0.236 | 0.252 | 0.270 | 0.285 | 0.388 | |
| Avg | 0.217 | 0.247 | 0.256 | 0.266 | 0.388 |
Thanks again for your constructive comments. We are very happy to answer any further questions.
Dear Reviewer CEMa,
We highly appreciate your constructive comments, and we are glad that you found our initial responses addressed your concerns about recent lightweight baselines, quadratic computational complexity, and implementation details. We address your questions point by point in the following.
Q1 (Part 1): Can you show that the correlation matrix of the transformed series is actually diagonal?
- Thank you for raising this point. Theoretically, suppose the temporal correlation matrix $\mathbf{Corr}_t$ admits the eigendecomposition $\mathbf{Corr}_t = \mathbf{Q}^\top \mathbf{\Lambda}\, \mathbf{Q}$, where $\mathbf{Q}$ is an orthogonal matrix and $\mathbf{\Lambda}$ is a diagonal matrix. If the input vector $\mathbf{x}$ is normalized (zero mean and unit variance), then the following holds: $\mathrm{Corr}(\mathbf{Q}\mathbf{x}) = \mathbf{Q}\,\mathbb{E}[\mathbf{x}\mathbf{x}^\top]\,\mathbf{Q}^\top = \mathbf{Q}\,\mathbf{Corr}_t\,\mathbf{Q}^\top = \mathbf{\Lambda}$, which is a diagonal matrix.
- Empirically, we quantify the effectiveness of OrthoTrans by reporting the Frobenius norm of the off-diagonal elements in the correlation matrices for the original series, and for the series transformed by OrthoTrans, DFT, or Haar wavelets, using a window size of 96. For DFT, correlations are computed using only the real part of the transformed signal, while for Haar wavelets, only the high-frequency components are used. The results below show that OrthoTrans effectively mitigates temporal correlations and outperforms both DFT and Haar wavelets in decorrelation on both the training and test sets (a minimal measurement sketch is provided after the table).
| Dataset | Original (train) | OrthoTrans (train) | DFT (train) | Wavelet (train) | Original (test) | OrthoTrans (test) | DFT (test) | Wavelet (test) |
|---|---|---|---|---|---|---|---|---|
| ETTh1 | 61.36 | 1.46 | 2.58 | 4.93 | 47.86 | 1.89 | 2.81 | 5.35 |
| ETTm2 | 83.04 | 1.26 | 5.22 | 2.13 | 65.74 | 2.05 | 6.46 | 3.22 |
| ECL | 45.95 | 1.78 | 2.26 | 11.20 | 45.69 | 2.03 | 2.57 | 11.26 |
| Traffic | 34.80 | 1.79 | 3.29 | 6.92 | 34.77 | 2.16 | 3.46 | 7.01 |
| Weather | 61.70 | 2.86 | 19.61 | 4.33 | 65.46 | 3.97 | 22.10 | 4.31 |
| Avg | 57.37 | 1.83 | 6.59 | 5.90 | 51.90 | 2.42 | 7.48 | 6.23 |
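As a concrete illustration of this measurement, the following NumPy sketch (a synthetic single-variate series with placeholder lengths and the same window size of 96; an illustration rather than our released code) computes the off-diagonal Frobenius norm of the temporal correlation matrix before and after the orthogonal transform:

```python
import numpy as np

def off_diag_fro_norm(corr: np.ndarray) -> float:
    """Frobenius norm of the off-diagonal entries of a correlation matrix."""
    return float(np.linalg.norm(corr - np.diag(np.diag(corr)), "fro"))

rng = np.random.default_rng(0)
T, L = 2000, 96                                      # placeholder series length and window size
x = rng.standard_normal(T)
x = (x - x.mean()) / x.std()                         # normalize the series

W = np.lib.stride_tricks.sliding_window_view(x, L)   # (T - L + 1, L) stride-1 windows
corr = np.corrcoef(W, rowvar=False)                  # L x L temporal correlation matrix
_, Q = np.linalg.eigh(corr)                          # orthogonal eigenvector basis

corr_trans = np.corrcoef(W @ Q, rowvar=False)        # correlation after the transform
print(off_diag_fro_norm(corr), off_diag_fro_norm(corr_trans))
```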
Q1 (Part 2): On global mean (std) and local mean (std)
- We would like to show that the local mean and standard deviation (std) of each sub-series are very close to the global mean and std, since each sub-series is only $L - 1$ elements shorter than the global series (of length $T$). We report the average absolute difference between the local means and the global mean $\mu$, and between the local stds and the global std $\sigma$ (a small computation sketch is provided after the table).
| Dataset | Mean difference | Std difference |
|---|---|---|
| ETTm1 | 1.4E-03 | 8.0E-04 |
| Solar | 1.5E-03 | 9.0E-04 |
| ECL | 1.4E-03 | 2.5E-03 |
| Traffic | 3.5E-03 | 2.8E-03 |
| Weather | 1.6E-03 | 1.3E-03 |
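A small NumPy sketch of this check is given below (synthetic series and placeholder lengths; the exact normalization behind the table entries may differ slightly):

```python
import numpy as np

rng = np.random.default_rng(0)
T, L = 2000, 96                                            # placeholder lengths
x = rng.standard_normal(T)

# Lag-shifted sub-series: column i is (x_i, ..., x_{i+T-L}), length T - L + 1.
subseries = np.lib.stride_tricks.sliding_window_view(x, L).T

global_mean, global_std = x.mean(), x.std()
mean_diff = np.abs(subseries.mean(axis=1) - global_mean).mean()
std_diff = np.abs(subseries.std(axis=1) - global_std).mean()
print(mean_diff, std_diff)                                 # both are expected to be tiny
```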
Q1 (Part 3): On classic autocorrelation estimation (ACF)
- We respectfully argue that for a univariate series, computing the ACF inherently involves comparing two subseries of the original signal, as it measures the similarity between the signal and its lagged versions.
- Assume that the series $x_1, \dots, x_T$ has been normalized (i.e., zero mean and unit variance). The ACF at lag $k$ can then be regarded as the covariance between two sub-series: $\rho(k) = \mathrm{Cov}(a_k, b_k)$, where $a_k = (x_1, \dots, x_{T-k})$ and $b_k = (x_{k+1}, \dots, x_T)$. We assume that the means of $a_k$ and $b_k$ are 0. Both $a_k$ and $b_k$ have length $T - k$.
- In OrthoTrans, the $(i, j)$-th entry of the temporal correlation matrix is $\mathrm{Corr}(s_i, s_j)$, where $s_i = (x_i, x_{i+1}, \dots, x_{i+T-L})$ and $s_j = (x_j, x_{j+1}, \dots, x_{j+T-L})$.
- By comparing $s_i$ and $s_j$ (with lag $k = j - i$) with $a_k$ and $b_k$, we can observe that $s_i$ and $s_j$ are truncated versions of $a_k$ and $b_k$ (with fewer entries), respectively. In other words, our sliding window approach is a (slightly) truncated version of the classical ACF; a numerical sketch of this comparison is given below.
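The following NumPy sketch illustrates the comparison above on a synthetic AR(1) series (placeholder lengths, lag, and AR coefficient): it checks that the column-wise Pearson correlation at lag $k$ is numerically close to the classical ACF estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
T, L, k, phi = 2000, 96, 5, 0.8                       # placeholder lengths, lag, AR(1) coefficient

# Synthetic AR(1) series, then normalized as assumed above.
eps = rng.standard_normal(T)
x = np.empty(T)
x[0] = eps[0]
for t in range(1, T):
    x[t] = phi * x[t - 1] + eps[t]
x = (x - x.mean()) / x.std()

# Classical ACF at lag k: covariance between the two full-length sub-series a_k and b_k.
a_k, b_k = x[: T - k], x[k:]
acf_k = np.mean(a_k * b_k)

# Sliding-window counterpart: Pearson correlation between columns i and j with j - i = k.
W = np.lib.stride_tricks.sliding_window_view(x, L)    # (T - L + 1, L)
win_corr = np.corrcoef(W[:, 0], W[:, k])[0, 1]

print(acf_k, win_corr)                                # expected to be close for T >> L
```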
Q1 (Part 4): Performance comparison of OrthoTrans and ACF
- We compute the classical autocorrelation $\rho(k)$ for lags $k = 0, \dots, L - 1$ and obtain the vector $(\rho(0), \rho(1), \dots, \rho(L-1))$. Based on this vector, we construct the symmetric Toeplitz correlation matrix and compute the corresponding $\mathbf{Q}$ matrix. We denote this method as ACF; a minimal construction sketch is given after the table below.
- We directly compare the correlation matrices and $\mathbf{Q}$ matrices obtained from OrthoTrans and ACF using the Frobenius norm $\|\cdot\|_F$ of their difference. As shown below, the differences are negligible.
| Dataset | CorrMat_Diff_Fnorm | Q_Diff_Fnorm |
|---|---|---|
| ETTh1 | 1.5E-04 | 5.4E-04 |
| ETTm2 | 1.0E-06 | 4.6E-04 |
| ECL | 8.0E-06 | 6.2E-04 |
| Traffic | 1.1E-05 | 4.4E-04 |
| Weather | 4.0E-06 | 2.0E-03 |
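A minimal construction sketch (NumPy/SciPy; the simple ACF estimator, synthetic series, and placeholder lengths are our own illustrative choices):

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(0)
T, L = 2000, 96                                        # placeholder lengths
x = rng.standard_normal(T)
x = (x - x.mean()) / x.std()

# Simple lag-k ACF estimates for k = 0, ..., L-1.
acf = np.array([np.mean(x[: T - k] * x[k:]) for k in range(L)])

corr_acf = toeplitz(acf)                               # symmetric Toeplitz correlation matrix
_, Q_acf = np.linalg.eigh(corr_acf)                    # corresponding orthogonal Q matrix
print(corr_acf.shape, np.allclose(Q_acf @ Q_acf.T, np.eye(L), atol=1e-6))
```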
- We report the off-diagonal Frobenius norm of the correlation matrices (size: $96 \times 96$) for the original series, and for the series transformed by OrthoTrans and ACF. The following results show that OrthoTrans and ACF achieve comparable and strong decorrelation performance on both training and test sets.
| Dataset | Original (train) | Ours (train) | ACF (train) | Original (test) | Ours (test) | ACF (test) |
|---|---|---|---|---|---|---|
| ETTh1 | 61.36 | 1.46 | 1.45 | 47.86 | 1.89 | 1.87 |
| ETTm2 | 83.04 | 1.26 | 1.27 | 65.74 | 2.05 | 2.06 |
| ECL | 45.95 | 1.78 | 1.80 | 45.69 | 2.03 | 2.05 |
| Traffic | 34.80 | 1.79 | 1.82 | 34.77 | 2.16 | 2.18 |
| Weather | 61.70 | 2.86 | 2.86 | 65.46 | 3.97 | 3.96 |
| Avg | 57.37 | 1.83 | 1.84 | 51.90 | 2.42 | 2.42 |
- We further compare the forecasting performance of OrthoTrans and ACF. As shown below, OrthoTrans performs comparably to ACF.
| Dataset | Hor. | OrthoTrans MSE | OrthoTrans MAE | ACF MSE | ACF MAE |
|---|---|---|---|---|---|
| ETTh1 | 96 | 0.360 | 0.382 | 0.361 | 0.383 |
| 192 | 0.416 | 0.414 | 0.415 | 0.414 | |
| 336 | 0.457 | 0.438 | 0.457 | 0.438 | |
| 720 | 0.463 | 0.462 | 0.464 | 0.463 | |
| Avg | 0.424 | 0.424 | 0.424 | 0.424 | |
| ETTm2 | 96 | 0.169 | 0.249 | 0.169 | 0.249 |
| 192 | 0.232 | 0.290 | 0.232 | 0.290 | |
| 336 | 0.291 | 0.328 | 0.291 | 0.329 | |
| 720 | 0.389 | 0.387 | 0.389 | 0.387 | |
| Avg | 0.270 | 0.313 | 0.270 | 0.314 | |
| ECL | 96 | 0.131 | 0.221 | 0.131 | 0.221 |
| 192 | 0.150 | 0.238 | 0.152 | 0.240 | |
| 336 | 0.165 | 0.254 | 0.167 | 0.255 | |
| 720 | 0.191 | 0.279 | 0.190 | 0.280 | |
| Avg | 0.159 | 0.248 | 0.160 | 0.249 | |
| Traffic | 96 | 0.398 | 0.226 | 0.404 | 0.226 |
| 192 | 0.439 | 0.241 | 0.434 | 0.241 | |
| 336 | 0.464 | 0.250 | 0.458 | 0.250 | |
| 720 | 0.502 | 0.270 | 0.508 | 0.270 | |
| Avg | 0.451 | 0.247 | 0.451 | 0.247 | |
| Weather | 96 | 0.153 | 0.190 | 0.151 | 0.190 |
| 192 | 0.200 | 0.235 | 0.203 | 0.239 | |
| 336 | 0.258 | 0.280 | 0.259 | 0.281 | |
| 720 | 0.337 | 0.333 | 0.338 | 0.333 | |
| Avg | 0.237 | 0.260 | 0.238 | 0.260 |
Q2 (Part 1): On OrthoTrans and PCA
- Thank you for your clear suggestion! We refer to the suggested variant as PCA. It first segments the series into non-overlapping windows of length $L$, stacking them into a matrix whose rows are the windows. The correlations are then computed among the columns of this matrix.
- In essence, OrthoTrans and the PCA variant differ in their unfolding strategies: although both use a patch size of $L$ (96 in our experiments), OrthoTrans employs sliding windows with stride 1, whereas PCA uses a non-overlapping segmentation with stride $L$.
- We further introduce a variant, OrthoTrans-C, for comparison: in OrthoTrans-C, we do not use the CA strategy. Given the multivariate training series (variates × time steps), we apply PyTorch's `unfold` function along the time dimension (the second dimension) with patch size $L$ and stride 1, then flatten the first two dimensions to obtain a matrix with $L$ columns, and compute the correlation matrix across its columns. A minimal sketch contrasting the three unfolding strategies is given below.
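To make the three unfolding strategies concrete, here is a minimal PyTorch sketch (synthetic data; `T`, `L`, and the variate count `N` are placeholders, and this is an illustration rather than our released implementation):

```python
import torch

def corr_and_q(windows: torch.Tensor):
    """windows: (num_windows, L). Returns the L x L Pearson correlation matrix
    of its columns and the orthogonal eigenvector matrix Q."""
    corr = torch.corrcoef(windows.T)               # columns are treated as variables
    _, q = torch.linalg.eigh(corr)                 # corr = q @ diag(eigvals) @ q.T
    return corr, q

T, L, N = 1000, 96, 7                              # placeholder lengths and variate count
x = torch.randn(T)                                 # one variate of the training series

# OrthoTrans: sliding windows with stride 1.
corr_ortho, q_ortho = corr_and_q(x.unfold(0, L, 1))        # (T - L + 1, L)

# PCA variant: non-overlapping segmentation (stride = L).
corr_pca, q_pca = corr_and_q(x.unfold(0, L, L))            # (T // L, L)

# OrthoTrans-C: stack stride-1 windows from all variates before computing Corr.
x_multi = torch.randn(N, T)
windows_c = x_multi.unfold(1, L, 1).reshape(-1, L)         # (N * (T - L + 1), L)
corr_c, q_c = corr_and_q(windows_c)
```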
Q2 (Part 2): Decorrelation performance of OrthoTrans, PCA and OrthoTrans-C
- We compare the decorrelation effectiveness of OrthoTrans, PCA and OrthoTrans-C using the off-diagonal Frobenius norm with a window size of 96. In general, OrthoTrans and OrthoTrans-C achieve better decorrelation performance than PCA.
| Dataset | Original (train) | OrthoTrans (train) | PCA (train) | OrthoTrans-C (train) | Original (test) | OrthoTrans (test) | PCA (test) | OrthoTrans-C (test) |
|---|---|---|---|---|---|---|---|---|
| ETTh1 | 61.36 | 1.46 | 10.77 | 1.43 | 47.86 | 1.89 | 11.66 | 1.75 |
| ETTm2 | 83.04 | 1.26 | 7.98 | 1.01 | 65.74 | 2.05 | 9.37 | 1.55 |
| ECL | 45.95 | 1.78 | 11.87 | 2.24 | 45.69 | 2.03 | 12.00 | 2.71 |
| Traffic | 34.80 | 1.79 | 8.94 | 1.81 | 34.77 | 2.16 | 9.11 | 2.17 |
| Weather | 61.70 | 2.86 | 8.78 | 2.94 | 65.46 | 3.97 | 10.47 | 3.78 |
| Avg | 57.37 | 1.83 | 9.67 | 1.89 | 51.90 | 2.42 | 10.52 | 2.39 |
- To further study the relationship between decorrelation performance and unfolding stride in PCA, we report the following results on the test sets of different datasets, which clearly show that smaller strides lead to better decorrelation performance.
| Dataset | Stride: 1 | Stride: 16 | Stride: 32 | Stride: 64 | Stride: 96 |
|---|---|---|---|---|---|
| ETTh1 | 1.89 | 5.92 | 6.60 | 7.36 | 11.66 |
| ETTm2 | 2.05 | 4.40 | 5.95 | 6.71 | 9.37 |
| ECL | 2.03 | 6.41 | 6.58 | 6.76 | 12.00 |
| Traffic | 2.16 | 4.45 | 4.53 | 4.68 | 9.11 |
| Weather | 3.97 | 7.91 | 8.70 | 9.83 | 10.47 |
| Avg | 2.42 | 5.82 | 6.47 | 7.07 | 10.52 |
Q2 (Part 3): Forecasting performance of OrthoTrans, PCA and OrthoTrans-C
- We apply these decorrelation methods to OLinear and observe that they achieve comparable performance. The results also verify the robustness of OLinear to different $\mathbf{Q}$ matrices.
| Dataset | Hor. | OrthoTrans MSE | OrthoTrans MAE | PCA MSE | PCA MAE | OrthoTrans-C MSE | OrthoTrans-C MAE |
|---|---|---|---|---|---|---|---|
| ETTh1 | 96 | 0.360 | 0.382 | 0.361 | 0.382 | 0.361 | 0.383 |
| 192 | 0.416 | 0.414 | 0.415 | 0.414 | 0.415 | 0.414 | |
| 336 | 0.457 | 0.438 | 0.457 | 0.439 | 0.458 | 0.439 | |
| 720 | 0.463 | 0.462 | 0.463 | 0.462 | 0.464 | 0.463 | |
| Avg | 0.424 | 0.424 | 0.424 | 0.424 | 0.424 | 0.424 | |
| ETTm2 | 96 | 0.169 | 0.249 | 0.169 | 0.249 | 0.169 | 0.249 |
| 192 | 0.232 | 0.290 | 0.233 | 0.290 | 0.232 | 0.290 | |
| 336 | 0.291 | 0.328 | 0.291 | 0.328 | 0.291 | 0.328 | |
| 720 | 0.389 | 0.387 | 0.389 | 0.387 | 0.389 | 0.387 | |
| Avg | 0.270 | 0.313 | 0.270 | 0.314 | 0.270 | 0.314 | |
| ECL | 96 | 0.131 | 0.221 | 0.131 | 0.221 | 0.131 | 0.221 |
| 192 | 0.150 | 0.238 | 0.152 | 0.240 | 0.153 | 0.241 | |
| 336 | 0.165 | 0.254 | 0.167 | 0.256 | 0.167 | 0.255 | |
| 720 | 0.191 | 0.279 | 0.195 | 0.285 | 0.191 | 0.278 | |
| Avg | 0.159 | 0.248 | 0.161 | 0.250 | 0.160 | 0.249 | |
| Traffic | 96 | 0.398 | 0.226 | 0.402 | 0.226 | 0.402 | 0.226 |
| 192 | 0.439 | 0.241 | 0.435 | 0.240 | 0.437 | 0.240 | |
| 336 | 0.464 | 0.250 | 0.462 | 0.250 | 0.463 | 0.250 | |
| 720 | 0.502 | 0.270 | 0.505 | 0.272 | 0.509 | 0.270 | |
| Avg | 0.451 | 0.247 | 0.451 | 0.247 | 0.453 | 0.246 | |
| Weather | 96 | 0.153 | 0.190 | 0.152 | 0.189 | 0.152 | 0.189 |
| 192 | 0.200 | 0.235 | 0.201 | 0.236 | 0.202 | 0.238 | |
| 336 | 0.258 | 0.280 | 0.260 | 0.282 | 0.260 | 0.283 | |
| 720 | 0.337 | 0.333 | 0.339 | 0.333 | 0.335 | 0.332 | |
| Avg | 0.237 | 0.260 | 0.238 | 0.260 | 0.237 | 0.260 |
- We also examine the robustness of OrthoTrans, PCA, and OrthoTrans-C when the $\mathbf{Q}$ matrices are computed using only the first 10% of the training set. We observe that OLinear performs robustly in this setting, consistent with the finding in Appendix I.8.
| Dataset | Hor. | OrthoTrans MSE | OrthoTrans MAE | PCA MSE | PCA MAE | OrthoTrans-C MSE | OrthoTrans-C MAE |
|---|---|---|---|---|---|---|---|
| ECL | 96 | 0.131 | 0.221 | 0.131 | 0.221 | 0.131 | 0.221 |
| 192 | 0.152 | 0.240 | 0.152 | 0.240 | 0.154 | 0.242 | |
| 336 | 0.166 | 0.255 | 0.168 | 0.256 | 0.165 | 0.254 | |
| 720 | 0.189 | 0.278 | 0.186 | 0.278 | 0.186 | 0.278 | |
| Avg | 0.159 | 0.248 | 0.159 | 0.249 | 0.159 | 0.248 | |
| Traffic | 96 | 0.402 | 0.226 | 0.400 | 0.226 | 0.404 | 0.226 |
| 192 | 0.431 | 0.240 | 0.434 | 0.240 | 0.438 | 0.241 | |
| 336 | 0.465 | 0.250 | 0.463 | 0.250 | 0.464 | 0.250 | |
| 720 | 0.512 | 0.271 | 0.502 | 0.272 | 0.500 | 0.272 | |
| Avg | 0.453 | 0.247 | 0.450 | 0.247 | 0.452 | 0.247 | |
| Weather | 96 | 0.149 | 0.187 | 0.152 | 0.189 | 0.152 | 0.189 |
| 192 | 0.203 | 0.239 | 0.200 | 0.237 | 0.202 | 0.238 | |
| 336 | 0.259 | 0.281 | 0.262 | 0.282 | 0.264 | 0.285 | |
| 720 | 0.339 | 0.332 | 0.338 | 0.334 | 0.347 | 0.338 | |
| Avg | 0.237 | 0.260 | 0.238 | 0.260 | 0.241 | 0.262 |
- The newly added experiments show that OrthoTrans, ACF, PCA, and OrthoTrans-C perform similarly when applied to OLinear, demonstrating its robustness to different decorrelation methods. We will include a discussion of these variants in a new appendix section and incorporate the results presented here to further strengthen our work. Thank you again for your constructive comments!
Q3-6: On the dimension extension, temporal NormLin, lightweight baselines, and computational complexity
- We appreciate the reviewer’s constructive comments and are glad that our responses have addressed the concerns.
We are eager to hear your feedback. We sincerely hope that our new comments can address your remaining concerns. We’d deeply appreciate it if you could let us know whether your concerns have been addressed. Thank you so much!
Thanks for your time,
Authors of #10716
Thank you for your detailed response. You have treated most of my concerns, so I have decided to raise my score. However, I still would raise some important issues that I would like to ask the authors to solve.
- Please be more mathematically rigorous. If I understand correctly, you assume that the covariance matrix is close to the correlation matrix (what does this matrix denote exactly? Is it the correlation matrix for a fixed channel?). In this case, you need to use the approximation symbol $\approx$ in all related transitions.
- If ACF and your strategy give the same results, then I would suggest that the authors stick to ACF, because it is more intuitive and familiar to the broad reader. To me, it was very confusing to follow the paper, especially the notation (by the way, it is confusing to write it in that form and call it "the temporal correlation matrix").
- When you compute the Frobenius norm of the off-diagonal elements, the values are not that meaningful, as they depend on the size of the matrix. Maybe dividing by the number of off-diagonal elements would be more intuitive in order to have a value within [0, 1].
Please incorporate all the modifications we have discussed during the rebuttal. This will greatly improve the quality of the paper.
Best regards, Reviewer CEMa
Dear Reviewer CEMa,
We highly appreciate your constructive responses and great efforts in our multi-round discussion. We are encouraged that you found our responses have addressed most of your concerns! We would like to respond to your remaining questions as follows.
Q1: If I understand correctly, you assume that the covariance matrix is close to the correlation matrix (what does this matrix denote exactly? Is it the correlation matrix for a fixed channel?). In this case, you need to use the approximation symbol $\approx$ in all related transitions.
- Thank you for raising this point. We sincerely apologize for any confusion caused by our mathematical notations, and would like to offer the following clarifications.
- We use $\mathbf{Corr}$ and $\mathbf{Cov}$ to denote the correlation and covariance matrices, respectively, throughout the paper. The subscript $t$ (or $v$, Line 262) indicates the temporal (or variate) dimension. Specifically, $\mathbf{Corr}_t$ denotes the correlation matrix among the time-lagged subseries, averaged across all variates (Line 146). For a specific variate $n$, we denote this as $\mathbf{Corr}_t^{(n)}$ (Line 140).
- For a normalized series, the covariance and correlation matrices (almost) coincide, based on the fact that the subseries share (almost) the same unit variance. However, since we explicitly define the matrix as $\mathbf{Corr}_t$, rather than $\mathbf{Cov}_t$, on Line 149 in the paper, the use of the approximation symbol in the equation on Line 153 seems unnecessary.
- We are sorry for making a mistake in our previous response by writing "suppose the temporal correlation matrix admits the eigendecomposition $\mathbf{Corr}_t = \mathbf{Q}^\top \mathbf{\Lambda}\, \mathbf{Q}$".
- The correct form is $\mathbf{Corr}_t = \mathbf{Q}\, \mathbf{\Lambda}\, \mathbf{Q}^\top$, not $\mathbf{Corr}_t = \mathbf{Q}^\top \mathbf{\Lambda}\, \mathbf{Q}$. We sincerely apologize for the confusion this has caused. We have thoroughly examined the paper, and are relieved to observe that these two notations are consistently and correctly used in the manuscript.
Q2: If ACF and your strategy give the same results, then I would suggest the authors to stick to ACF, because it is more intuitive and familiar to the broad reader. To me, it was very confusing to follow the paper, especially with notation (by the way, it is confusing to write and call it "the temporal correlation matrix").
- Thank you for the suggestion! We will stick to the ACF formulation in Section 4.2 of our final version, as it may be more familiar to a broad range of readers.
- Please kindly note that the ACF here is only used for computing $\mathbf{Q}$ in OrthoTrans, so this revision does not affect the core contributions of our paper.
- We have carefully reviewed the manuscript and confirmed that the two notations are not mistakenly interchanged in our final version.
Q3: On the metric to quantify decorrelation
- Thank you for the great suggestion! We divide the off-diagonal Frobenius norm by the number of off-diagonal elements, and report this new metric below (a computation sketch is provided after the results).
- In our previous decorrelation experiments, for DFT we used only the real part, and for Haar wavelets we used only the high-frequency components. We now adopt a fairer scheme: for DFT, we concatenate the real and imaginary parts; for Haar wavelets, we concatenate both the low- and high-frequency components.
| Dataset | Original (train) | OrthoTrans (train) | DFT (train) | Wavelet (train) | Original (test) | OrthoTrans (test) | DFT (test) | Wavelet (test) |
|---|---|---|---|---|---|---|---|---|
| ETTh1 | 6.7E-03 | 1.6E-04 | 5.8E-04 | 3.7E-03 | 5.2E-03 | 2.1E-04 | 6.0E-04 | 3.1E-03 |
| ETTm2 | 9.1E-03 | 1.4E-04 | 1.1E-03 | 4.6E-03 | 7.2E-03 | 2.2E-04 | 1.3E-03 | 3.7E-03 |
| ECL | 5.0E-03 | 1.9E-04 | 4.7E-04 | 3.6E-03 | 5.0E-03 | 2.2E-04 | 4.9E-04 | 3.6E-03 |
| Traffic | 3.8E-03 | 1.9E-04 | 5.5E-04 | 2.6E-03 | 3.8E-03 | 2.3E-04 | 5.7E-04 | 2.6E-03 |
| Weather | 6.7E-03 | 3.1E-04 | 4.1E-03 | 3.6E-03 | 7.2E-03 | 4.0E-04 | 4.6E-03 | 3.7E-03 |
| Avg | 6.3E-03 | 2.0E-04 | 1.4E-03 | 3.6E-03 | 5.7E-03 | 2.6E-04 | 1.5E-03 | 3.3E-03 |
- Under this new metric, OrthoTrans reduces the correlation in the original series by 96%, and outperforms the DFT and Haar wavelet transforms by 84% and 93%, respectively.
- Following your suggestion, we will include the quantification of OrthoTrans's temporal decorrelation in a new appendix section and incorporate the results presented here to further strengthen our work.
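For transparency, the normalized metric and the fairer DFT/wavelet feature construction can be sketched as follows (NumPy; the synthetic series, the one-level Haar step, and the dropping of zero-variance DFT columns are illustrative assumptions on our side):

```python
import numpy as np

def norm_off_diag(corr: np.ndarray) -> float:
    """Off-diagonal Frobenius norm divided by the number of off-diagonal entries."""
    d = corr.shape[0]
    return float(np.linalg.norm(corr - np.diag(np.diag(corr)), "fro")) / (d * (d - 1))

def haar_features(W: np.ndarray) -> np.ndarray:
    """One-level Haar transform of each window: low- and high-frequency halves concatenated."""
    lo = (W[:, 0::2] + W[:, 1::2]) / np.sqrt(2)
    hi = (W[:, 0::2] - W[:, 1::2]) / np.sqrt(2)
    return np.concatenate([lo, hi], axis=1)

rng = np.random.default_rng(0)
x, L = rng.standard_normal(2000), 96
W = np.lib.stride_tricks.sliding_window_view(x, L)

F = np.fft.rfft(W, axis=1)                                         # per-window DFT
dft_features = np.concatenate([F.real, F.imag[:, 1:-1]], axis=1)   # drop all-zero imaginary columns

for name, feat in [("Original", W), ("DFT", dft_features), ("Haar", haar_features(W))]:
    print(name, norm_off_diag(np.corrcoef(feat, rowvar=False)))
```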
We really appreciate your valuable suggestions, which have undoubtedly contributed to improving the quality of our paper. Please let us know whether we have properly addressed your questions; we would be more than happy to discuss further!
Thanks for your time,
Authors of #10716
This paper tackles entangled temporal dependencies in multivariate time series forecasting through two main contributions: a data-adaptive orthogonal transformation based on eigenvalue decomposition of Pearson correlation matrices, and a computationally efficient normalized linear layer (NormLin) that serves as an alternative to self-attention mechanisms. The proposed method leverages dataset-specific statistical properties to decorrelate temporal features before applying linear forecasting models. The comprehensive evaluation across multiple datasets and tasks in this paper demonstrates consistent performance gains over strong baselines, with both components showing modularity as plug-in enhancements for existing architectures.
优缺点分析
Strengths:
This paper presents a well-executed combination of theoretical innovation and empirical rigor. The core contribution, using eigenvalue decomposition of temporal Pearson correlation matrices for orthogonal transformation, provides an elegant theoretical foundation with clear justification. The work demonstrates state-of-the-art performance across multiple datasets and tasks, with both components serving as effective plug-in modules that consistently improve various baseline models. I am particularly impressed by the comprehensive experimental evaluation, which includes extensive ablation studies and evaluation across various tasks with different prediction horizons.
Weaknesses:
See questions below
问题
- How does the method perform on highly non-stationary time series where the correlation structure changes significantly over time? Have you considered adaptive or online updates to the orthogonal matrices?
- You mention that OrthoTrans improves the rank of attention matrices in Transformer models. Can you provide more detailed analysis or proof of this phenomenon? What is the relationship between decorrelation and representation diversity?
- Have you compared against other adaptive bases beyond the eigenvalue decomposition, such as those learned through neural networks or other unsupervised methods?
- The orthogonal matrices Qi and Qo are computed from the training data. The paper only briefly mentions robustness to limited training data in the appendix but doesn't thoroughly investigate how performance degrades with smaller training sets or distribution shifts.
局限性
Yes
最终评判理由
The authors have provided comprehensive response to my questions. I choose to maintain my current score.
格式问题
No
Thank you for your constructive response. We sincerely appreciate the reviewer’s recognition of OLinear as "a well-executed combination of theoretical innovation and empirical rigor". Below are our responses to the concerns:
Q1: On non-stationary time series
- Financial data (e.g., stock prices and exchange rates) are widely recognized as non-stationary time series. As shown in Tables 2 and 3 of the paper, OLinear consistently achieves state-of-the-art performance on the real-world non-stationary time series datasets: Exchange, NASDAQ, SP500, and DowJones.
- Regarding online updates of the Q matrices, we further conduct experiments in which the Q matrices are replaced by those computed on the validation set. After retraining with the updated Q matrices (using the same training set), the model exhibits similar forecasting performance (as shown below). This implies that the Q matrices obtained by OrthoTrans from the training set generalize well to the full dataset. A deeper exploration of online learning could be a future direction of this work.
| Dataset | Hor. | Results in paper (MSE) | Results in paper (MAE) | Val_Q + retrain (MSE) | Val_Q + retrain (MAE) |
|---|---|---|---|---|---|
| Exchange | 96 | 0.082 | 0.200 | 0.082 | 0.200 |
| 192 | 0.171 | 0.293 | 0.171 | 0.293 | |
| 336 | 0.331 | 0.414 | 0.331 | 0.414 | |
| 720 | 0.837 | 0.688 | 0.836 | 0.688 | |
| Avg | 0.355 | 0.399 | 0.355 | 0.399 | |
| NASDAQ | 3 | 0.036 | 0.092 | 0.036 | 0.092 |
| 6 | 0.049 | 0.117 | 0.049 | 0.117 | |
| 9 | 0.062 | 0.137 | 0.062 | 0.137 | |
| 12 | 0.073 | 0.154 | 0.073 | 0.154 | |
| Avg | 0.055 | 0.125 | 0.055 | 0.125 | |
| SP500 | 3 | 0.035 | 0.126 | 0.035 | 0.127 |
| 6 | 0.053 | 0.158 | 0.054 | 0.158 | |
| 9 | 0.070 | 0.181 | 0.070 | 0.181 | |
| 12 | 0.088 | 0.204 | 0.088 | 0.204 | |
| Avg | 0.061 | 0.167 | 0.061 | 0.167 |
Q2: On the rank improvement of attention matrices
- The observation that OrthoTrans improves the rank of attention matrices is interesting and partly explains why OrthoTrans, as a plug-in, consistently enhances Transformer-based forecasters (see Table 5 of the paper). A potential reason behind this phenomenon is that OrthoTrans can mitigate the intrinsic low-rank property of time series data. The empirical results are presented below.

| Dataset | Identity | OrthoTrans | DFT | Wavelet |
|---|---|---|---|---|
| ECL | 0.162 | 0.446 | 0.332 | 0.381 |
| Traffic | 0.002 | 0.568 | 0.177 | 0.049 |
| Weather | 0.012 | 0.572 | 0.352 | 0.292 |
| Solar | 0.007 | 0.481 | 0.032 | 0.023 |
| PEMS03 | 0.241 | 0.450 | 0.356 | 0.395 |
- Following TimeBase (ICML 2025), we compute the median singular value of the correlation matrix among non-overlapping time series patches (patch length = 24, number of patches = 30); see Figure 1 in TimeBase for more details. The results are presented above. As shown, OrthoTrans yields larger singular values than the other transformation bases (with "Identity" denoting no transformation). In other words, OrthoTrans alleviates low-rank tendencies [1,2] and enhances data diversity. A minimal sketch of this measurement is given below.
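The statistic itself can be sketched as follows (NumPy; synthetic data with the patch settings stated above; applying OrthoTrans, DFT, or wavelet transforms before patching follows the same pattern and is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
patch_len, n_patches = 24, 30                        # settings stated above
x = rng.standard_normal(patch_len * n_patches)       # placeholder series segment

patches = x.reshape(n_patches, patch_len)            # non-overlapping patches
corr = np.corrcoef(patches)                          # n_patches x n_patches correlation matrix
median_sv = np.median(np.linalg.svd(corr, compute_uv=False))
print(median_sv)
```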
Q3: On adaptive bases
- We have considered learnable Q matrices for potential performance improvements. The most straightforward approach is adding a learnable matrix $\Delta$ to $\mathbf{Q}$ (Plus Delta). More advanced alternatives include using a learnable vector $\mathbf{v}$ with a Householder transformation to preserve orthogonality, $\mathbf{Q}' = \mathbf{Q}\left(\mathbf{I} - 2\,\mathbf{v}\mathbf{v}^\top / \|\mathbf{v}\|_2^2\right)$, or applying the Cayley transform, $\mathbf{Q}' = \mathbf{Q}\,(\mathbf{I} - \mathbf{S})(\mathbf{I} + \mathbf{S})^{-1}$, where $\mathbf{S} = \mathbf{A} - \mathbf{A}^\top$ is a learnable skew-symmetric matrix. We also explored an input-adaptive version (Adapt_Q), in which the applied matrix is modulated by the input series through a learnable parameter. The results are reported below; a sketch of the Cayley-based parameterization is given after the discussion that follows the table.
| Dataset | Horizon | OLinear MSE | OLinear MAE | Plus Delta MSE | Plus Delta MAE | Householder MSE | Householder MAE | Cayley MSE | Cayley MAE | Adapt_Q MSE | Adapt_Q MAE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ECL | 96 | 0.131 | 0.221 | 0.131 | 0.221 | 0.131 | 0.221 | 0.130 | 0.221 | 0.131 | 0.221 |
| 192 | 0.150 | 0.238 | 0.152 | 0.240 | 0.153 | 0.241 | 0.153 | 0.241 | 0.153 | 0.241 | |
| 336 | 0.165 | 0.254 | 0.166 | 0.256 | 0.166 | 0.255 | 0.165 | 0.255 | 0.164 | 0.254 | |
| 720 | 0.191 | 0.279 | 0.188 | 0.277 | 0.193 | 0.282 | 0.188 | 0.278 | 0.192 | 0.281 | |
| Avg | 0.159 | 0.248 | 0.159 | 0.248 | 0.161 | 0.250 | 0.159 | 0.249 | 0.160 | 0.249 | |
| Traffic | 96 | 0.398 | 0.226 | 0.397 | 0.227 | 0.400 | 0.226 | 0.401 | 0.226 | 0.399 | 0.226 |
| 192 | 0.439 | 0.241 | 0.436 | 0.241 | 0.435 | 0.240 | 0.437 | 0.241 | 0.438 | 0.240 | |
| 336 | 0.464 | 0.250 | 0.460 | 0.250 | 0.461 | 0.250 | 0.466 | 0.250 | 0.460 | 0.250 | |
| 720 | 0.502 | 0.270 | 0.510 | 0.271 | 0.500 | 0.271 | 0.505 | 0.271 | 0.507 | 0.271 | |
| Avg | 0.451 | 0.247 | 0.451 | 0.247 | 0.449 | 0.247 | 0.452 | 0.247 | 0.451 | 0.247 | |
| Weather | 96 | 0.153 | 0.190 | 0.149 | 0.187 | 0.152 | 0.189 | 0.151 | 0.189 | 0.151 | 0.189 |
| 192 | 0.200 | 0.235 | 0.200 | 0.236 | 0.204 | 0.239 | 0.205 | 0.240 | 0.203 | 0.238 | |
| 336 | 0.258 | 0.280 | 0.266 | 0.285 | 0.263 | 0.284 | 0.263 | 0.284 | 0.260 | 0.282 | |
| 720 | 0.337 | 0.333 | 0.334 | 0.331 | 0.339 | 0.334 | 0.343 | 0.337 | 0.337 | 0.332 | |
| Avg | 0.237 | 0.260 | 0.237 | 0.260 | 0.239 | 0.261 | 0.240 | 0.262 | 0.238 | 0.260 |
- However, these attempts do not yield performance gains. These results highlight the importance of temporal decorrelation: modifications to the Q matrices reintroduce correlations into the output series (since Q is theoretically the optimal orthogonal matrix for temporal decorrelation). This further demonstrates the superiority of the proposed OrthoTrans scheme.
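For reference, the Cayley-based variant can be parameterized as sketched below (PyTorch; the module interface and the small initialization scale are illustrative assumptions, not the exact implementation behind the table above):

```python
import torch
import torch.nn as nn

class CayleyPerturbedQ(nn.Module):
    """Learnable orthogonal perturbation of a fixed Q via the Cayley transform."""

    def __init__(self, Q: torch.Tensor):
        super().__init__()
        L = Q.shape[0]
        self.register_buffer("Q", Q)                      # pre-computed OrthoTrans matrix (fixed)
        self.A = nn.Parameter(1e-3 * torch.randn(L, L))   # small init: Cayley factor starts near I

    def forward(self) -> torch.Tensor:
        S = self.A - self.A.T                             # skew-symmetric matrix
        I = torch.eye(S.shape[0], device=S.device, dtype=S.dtype)
        cayley = torch.linalg.solve(I + S, I - S)         # (I + S)^{-1}(I - S), equal to (I - S)(I + S)^{-1}, is orthogonal
        return self.Q @ cayley                            # product of orthogonal matrices stays orthogonal
```

Because `A` is initialized near zero, the Cayley factor starts close to the identity, so training begins from the pre-computed $\mathbf{Q}$; as reported above, however, such learnable perturbations did not improve over the fixed matrices.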
Q4: On smaller training sets or distribution shifts
- We report performance with smaller training sets in Appendix I.4. As a linear-based forecaster, OLinear does not heavily rely on large training data and adapts well when the training size is reduced from 100% to 5% (see Figure 16 on Page 32 of the paper).
- Distribution shifts are challenging for any deep forecasting model. Tables 2 and 3 in the paper show that OLinear performs competitively on real-world non-stationary datasets. In Table 28, OLinear also demonstrates strong zero-shot and generalization capabilities.
We hope these responses can fully address your concerns. Thank you again for your detailed feedback!
[1] TimeBase: The Power of Minimalism in Efficient Long-term Time Series Forecasting. ICML 2025.
[2] Robust recovery of subspace structures by low-rank representation. TPAMI 2012.
Dear Reviewer rQjb,
We would like to thank you once again for your efforts and constructive comments. We would be happy to address any further questions you may have.
All the best,
Authors
Thanks to the authors for the detailed response and experimental results, which are great additions to the paper. I have no further questions, and I will maintain my score.
Dear Reviewer rQjb,
We really appreciate your recognition of our work and your kind words, and we are happy to hear that you consider our responses are great additions to the paper and your concerns have been addressed. Again, thank you for your valuable suggestions which have undoubtedly contributed to improving the quality of our paper.
Many thanks,
The authors of #10716
We sincerely thank all the reviewers for their insightful reviews and valuable comments, which are instructive for us to improve our paper.
In this work, we propose OLinear, a linear-based model for time series forecasting in the orthogonally transformed domain. It mainly consists of the (plug-and-play) OrthoTrans and NormLin modules, and "demonstrates state-of-the-art performance" (Reviewer rQjb) across "24 datasets and 140 configurations" (Reviewer CEMa). Reviewer jUXb states that, as a dataset-adaptive approach, OrthoTrans could "inspire further work". When used as a replacement for self-attention, the NormLin module cuts FLOPs "by roughly 50%" (Reviewer 5fFn), "with validated improvements on iTransformer (+5.1%) and PatchTST (+10.1%), which is really interesting" (Reviewer CEMa). Reviewer rQjb further notes that "this paper presents a well-executed combination of theoretical innovation and empirical rigor".
In addition, we curate and publicly release six new multivariate time series datasets: SP500, DowJones, CarSales, Power, Website, and Unemp, to facilitate the development of the time series community.
The reviewers raised insightful and constructive comments. We are devoting all our efforts to addressing the concerns by providing sufficient evidence and the requested results. Below is a summary of the major responses in our rebuttal:
- Further analysis. Reviewer rQjb suggests further analysis of how OrthoTrans improves the rank of attention matrices in Transformer models. In response, we provide the perspective that OrthoTrans can mitigate the intrinsic low-rank property of time series data. Empirical results show that OrthoTrans yields larger singular values of the correlation matrix among non-overlapping time series patches than those obtained with other transformation bases. This phenomenon partly explains the performance advantages of OrthoTrans over DFT and wavelet bases.
- More baselines. As requested by Reviewers CEMa, 5fFn and jUXb, we compare OLinear with more baselines, including SimpleTM (ICLR 2025), TQNet (ICML 2025), TimePro (ICML 2025), TimeBase (ICML 2025), SAMformer, RLinear, TSMixer, and Pyraformer. OLinear consistently outperforms these baselines on a wide range of datasets. Specifically, OLinear outperforms SAMformer by a substantial margin of 12.1%. Notably, replacing self-attention in SAMformer with the NormLin layer improves MAE performance by 10.6% and 5.8% on the Traffic and ECL datasets, respectively.
- Elaboration on implementation details. As requested by Reviewer CEMa, we provide detailed explanations of our implementation and use experiments to justify our design choices, including the computation of temporal correlation matrices, the design of the NormLin module, and the details of the "temporal NormLin" baseline. For the channel average (CA) scheme used to compute the temporal correlation matrix, we conduct ablation studies demonstrating the performance advantages of CA over the channel-individual (CI) counterpart.
- Quantification of temporal decorrelation. Following Reviewer jUXb's suggestions, we report the Frobenius norm of the off-diagonal elements to validate OrthoTrans's effectiveness in temporal decorrelation. We also provide an intuitive solution to reduce the computational load of the NormLin layer. A statistical significance test using Student's t-test between OLinear and iTransformer shows that OLinear has a clear performance advantage over the classic iTransformer. We also demonstrate the robustness of OLinear under unified hyperparameter settings.
- Learnable Q matrices. Reviewers rQjb and CEMa suggest using learnable Q matrices for potential improvements. To this end, we design several approaches (Plus Delta, Householder, Cayley, and Adapt_Q) to implement learnable Q matrices and compare them with the fixed ones. However, these variants do not outperform the original OrthoTrans. These results highlight the importance of temporal decorrelation and the challenges of learning matrices that perform better than the pre-computed Q matrices.
We are grateful for the reviewers’ valuable suggestions. We would be very happy to answer any further questions.
This paper introduces OLinear, a new linear multivariate time series forecasting model, where the "O" indicates that the model operates in an orthogonally transformed domain. It includes two key components: OrthoTrans, a data-adaptive orthogonal transformation that uses temporal Pearson correlation matrices, and NormLin, a lightweight, normalized linear layer that effectively replaces more complex self-attention mechanisms, leading to gains in both accuracy and efficiency. The empirical study shows good performance.
All reviewers gave positive ratings and indicated that the rebuttal was helpful in clarifying points that were unclear in the original submission. The authors promised to incorporate the rebuttal material into the final version to improve the paper. Overall, the paper presents an interesting linear model for time series forecasting.