PaperHub
Overall score: 6.8 / 10 · Poster · 4 reviewers
Ratings: 5, 4, 4, 4 (min 4, max 5, std 0.4)
Confidence: 4.0
Novelty: 3.0 · Quality: 2.8 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

OLinear: A Linear Model for Time Series Forecasting in Orthogonally Transformed Domain

OpenReview · PDF
Submitted: 2025-05-08 · Updated: 2025-10-29
TL;DR

OLinear is a linear-based time series forecasting model that outperforms state-of-the-art Transformer-based methods with significantly higher efficiency.

Abstract

This paper presents $\mathbf{OLinear}$, a $\mathbf{linear}$-based multivariate time series forecasting model that operates in an $\mathbf{o}$rthogonally transformed domain. Recent forecasting models typically adopt the temporal forecast (TF) paradigm, which directly encodes and decodes time series in the time domain. However, the entangled step-wise dependencies in series data can hinder the performance of TF. To address this, some forecasters conduct encoding and decoding in the transformed domain using fixed, dataset-independent bases (e.g., sine and cosine signals in the Fourier transform). In contrast, we propose $\mathbf{OrthoTrans}$, a data-adaptive transformation based on an orthogonal matrix that diagonalizes the series' temporal Pearson correlation matrix. This approach enables more effective encoding and decoding in the decorrelated feature domain and can serve as a plug-in module to enhance existing forecasters. To enhance the representation learning for multivariate time series, we introduce a customized linear layer, $\mathbf{NormLin}$, which employs a normalized weight matrix to capture multivariate dependencies. Empirically, the NormLin module shows a surprising performance advantage over multi-head self-attention, while requiring nearly half the FLOPs. Extensive experiments on 24 benchmarks and 140 forecasting tasks demonstrate that OLinear consistently achieves state-of-the-art performance with high efficiency. Notably, as a plug-in replacement for self-attention, the NormLin module consistently enhances Transformer-based forecasters. The code and datasets are available at https://github.com/jackyue1994/OLinear.
Keywords
Time series forecasting; Linear-based model; Transformer

Reviews and Discussion

Review (Rating: 5)

The paper introduces OLinear, a multivariate time-series forecaster that operates in an orthogonally transformed domain rather than the raw time domain. For each input window, the authors compute the temporal Pearson-correlation matrix, perform an eigen-decomposition, and project the signals onto the resulting orthogonal basis. They then replace self-attention with NormLin. Extensive experiments indicate improvements over strong baselines, and the accompanying code is available.

Strengths and Weaknesses

Strengths

  1. The authors evaluate OLinear on 24 publicly available datasets spanning electricity, traffic, weather and financial domains, and consider multiple forecast horizons for each dataset. Reporting 99 % confidence intervals indicates an effort to convey result variability rather than single-run numbers.

  2. The paper is well structured: problem statement, method, ablations and main tables flow logically. Figures are labelled and referenced properly, and notation is introduced before use, which makes the technical content easy to follow.

  3. Code and data-processing scripts are released, allowing reviewers and future researchers to reproduce the results or to integrate the proposed method into their own pipelines with minimal effort.

  4. The plug-and-play design lowers the entry barrier for practitioners who already have Transformer-based models and wish to test the proposed ideas without redesigning their entire architecture.

  5. While Fourier and wavelet bases dominate current transform-domain forecasters, constructing an orthogonal basis from the data’s own temporal correlation matrix has been less explored. This adaptive approach could inspire further work.

Weaknesses

  1. Results seem to degrade when the look-back window exceeds 96, casting doubt on long-horizon robustness. For example, with 15-minute data, 96 steps cover only 24 h, yet the model may need to predict up to 720 steps (≈ 1 week) ahead in long-term forecasting, raising concerns about interpretability.

  2. Many of the hyper-parameters, such as model dimension, embedding size, number of blocks and learning rate, are dataset-dependent.

  3. The assumption of decorrelation after OrthoTrans is not demonstrated. Two series can be decorrelated in time yet correlated in frequency; an orthogonal transform does not necessarily remove such cross-frequency dependencies. Moreover, decorrelating in the time domain removes second-order dependencies but says nothing about higher-order or non-linear dependencies.

  4. The novelty remains unclear. The manuscript does not cite the previous existing work [1]. That work already establishes that forecasting Transformers suffer from excessive sharpness, and that this can be mitigated by (i) using a different optimiser and (ii) reducing the architecture to a single channel-wise attention head—whose complexity is exactly $O(N^2)$, the same as NormLin with $h=1$ (see Table 1). These two changes were shown to raise the attention-matrix rank and therefore boost the model’s expressivity. The study re-uses those insights in a purely linear setting, which by construction loses the non-linear capacity of a Transformer, yet none of this prior art is acknowledged. Consequently, the claims regarding higher rank, multivariate correlations, and expressivity are not novel in the context of multivariate time-series forecasting.

  5. NormLin trades away non-linearity while retaining the same $O(N^2)$ cost.

References

[1] SAMformer: Unlocking the Potential of Transformers in Time Series Forecasting.

Questions

  1. Could you run a statistical significance test across ≥ 5 random seeds for every dataset–horizon pair and report the resulting p-values?

  2. The existing work already demonstrates that a specific optimiser combined with a single channel-wise attention head (complexity $O(N^2)$) raises attention-matrix rank and boosts accuracy while using a channel attention mechanism to capture multivariate correlations. Please clarify the novelty and compare the baseline with this existing work. In particular, how does the model compare to a shallow Transformer trained with strong regularization like SAM and a channel-wise attention mechanism?

  3. The claim that OrthoTrans removes step-wise correlations needs quantitative confirmation in both domains. Could you provide a metric that shows (i) a drop in the off-diagonal magnitude of the time-domain covariance matrix and (ii) that no new dependencies appear in the frequency domain—for instance, by comparing the Frobenius norm of the off-diagonal entries in the covariance matrix and in the cross-spectral density before and after applying OrthoTrans?

  4. NormLin removes non-linearity yet keeps quadratic cost. Is there a way to reduce this cost?

  5. In the experiments, you use different learning rate, model dimension, embedding size and number of blocks for each dataset. Could you please freeze one single architecture, run it on all the datasets and compare this performance with the best architecture for each dataset?

Limitations

While the authors include a brief limitations paragraph in the Appendix, a fuller discussion in the main paper, covering details such as compute cost and dataset-specific hyper-parameters, is needed.

Final Justification

The authors have addressed my initial concerns. They have also committed to revise the narrative on weight-matrix rank in the camera-ready.

Formatting Concerns

No concern

Author Response

Thank you for your comments. We sincerely appreciate the reviewer’s recognition of our "well-structured" presentation, the plug-and-play modules that "lower the entry barrier", and an adaptive design that could "inspire further work". Below are our responses:

Q1: On the p-value

  • We conduct a significance test with Student's t-test (using 7 random seeds) between OLinear and iTransformer across various datasets and prediction horizons. As shown below, the better model is in bold when the improvement is statistically significant at the 0.05 level (p-value < 0.05). In other words, OLinear exhibits a clear performance advantage over iTransformer.

| Setting | OLinear (MAE) | iTrans. (MAE) | p-value (OLinear vs. iTrans.) |
|---|---|---|---|
| ECL: 96 | **0.221±4e-4** | 0.240±4e-4 | 5.12E-11 |
| ECL: 192 | **0.238±1e-3** | 0.253±2e-3 | 3.06E-06 |
| ECL: 336 | **0.254±1e-3** | 0.269±1e-3 | 9.96E-07 |
| ECL: 720 | **0.279±2e-3** | 0.317±7e-3 | 5.59E-05 |
| Traffic: 96 | **0.226±2e-4** | 0.268±1e-3 | 1.61E-12 |
| Traffic: 192 | **0.241±4e-4** | 0.276±1e-3 | 9.48E-10 |
| Traffic: 336 | **0.250±3e-4** | 0.283±4e-4 | 1.87E-13 |
| Traffic: 720 | **0.270±4e-4** | 0.302±4e-4 | 6.15E-13 |
| Weather: 96 | **0.190±1e-3** | 0.214±3e-4 | 1.29E-08 |
| Weather: 192 | **0.235±2e-3** | 0.254±1e-3 | 1.04E-06 |
| Weather: 336 | **0.280±2e-3** | 0.296±1e-3 | 1.93E-06 |
| Weather: 720 | **0.333±2e-3** | 0.349±4e-4 | 5.68E-06 |
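
For illustration, a minimal sketch of such a per-setting significance test is given below; the seed-level MAE arrays are hypothetical placeholders, not the authors' actual run results.

```python
# Hedged sketch: two-sample Student's t-test over per-seed MAEs for one dataset/horizon setting.
# The arrays below are hypothetical placeholders; real values would come from 7 training runs.
import numpy as np
from scipy.stats import ttest_ind

olinear_mae = np.array([0.2205, 0.2210, 0.2208, 0.2212, 0.2207, 0.2211, 0.2209])  # hypothetical
itrans_mae = np.array([0.2398, 0.2402, 0.2401, 0.2399, 0.2404, 0.2400, 0.2396])   # hypothetical

t_stat, p_value = ttest_ind(olinear_mae, itrans_mae)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")  # improvement is significant if p < 0.05
```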
  • We report the performance with increasing lookback lengths in Appendix I.3. OLinear demonstrates consistent improvements as the lookback horizon increases from 48 to 720, and consistently outperforms state-of-the-art forecasters.

  • Regarding robustness across random seeds, please refer to Appendix F, where OLinear exhibits smaller standard deviations and narrower 99% confidence intervals compared to TimeMixer++ and iTransformer.

  • We would like to emphasize that MSE and MAE metrics typically degrade in long-term forecasting as the prediction horizon increases. As shown in Table 15 (Page 28), this phenomenon is observed in state-of-the-art time series forecasters, such as TimeMixer++ (ICLR 2025), iTransformer (ICLR 2024), FilterNet (NeurIPS 2024). This phenomenon is reasonable because future series inherently contain greater uncertainty as the horizon extends, much like how predicting the temperature two weeks ahead is less accurate than predicting tomorrow’s temperature.

Q2: On novelty and comparison with SAMformer

  • SAMformer (ICML 2024) introduces the Sharpness-Aware Minimization (SAM) mechanism—originally proposed in [1] (ICLR 2021)—into time series forecasting. While channel-wise attention was first introduced by iTransformer, SAMformer adopts the same practice. Its key contribution is a new loss function that smooths the loss landscape of Transformers: $$\mathcal{L}^{SAM}_{train}(\omega) = \underset{\left\| \epsilon \right\| < \rho}{\max}\, \mathcal{L}_{train}(\omega + \epsilon),$$ where $\rho > 0$ is a hyper-parameter. This can be implemented with the SAM optimizer [1].
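
For illustration only, a minimal sketch of one such SAM update step is shown below; `model`, `loss_fn`, and `base_opt` are assumed to exist, and the code is not taken from the SAMformer repository.

```python
# Hedged sketch of a single SAM update (Foret et al., 2021): perturb weights by
# eps = rho * g / ||g||, recompute gradients there, restore weights, then take the base step.
import torch

def sam_step(model, loss_fn, base_opt, x, y, rho=0.05):
    loss_fn(model(x), y).backward()                       # gradients at the current weights
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2)
    perturbs = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            eps = rho * p.grad / (grad_norm + 1e-12)      # ascent direction
            p.add_(eps)
            perturbs.append((p, eps))
    model.zero_grad()
    loss_fn(model(x), y).backward()                       # gradients at the perturbed weights
    with torch.no_grad():
        for p, eps in perturbs:
            p.sub_(eps)                                   # restore the original weights
    base_opt.step()                                       # descent with the SAM gradient
    base_opt.zero_grad()
```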

  • Instead of modifying the loss function, we focus on the architecture design and propose NormLin. Compared with vanilla self-attention, NormLin offers four advantages: lower computational cost, higher-rank attention matrices, smoother gradient flow, and an enlarged optimization space (see Lines 182-198 in the paper). NormLin "can be plugged into existing forecasters to boost performance with minimal engineering effort" (by Reviewer 5fFn). Moreover, our linear-based model, OLinear, outperforms state-of-the-art Transformer-based forecasters—including SAMformer. We will cite and discuss SAMformer in our final draft.

  • We compare MSE and standard deviation between OLinear and SAMformer. The SAMformer numbers are taken from its original paper, which uses a look-back window of 512 and thus enjoys extra performance gains. Even under this handicap, OLinear achieves lower MSE and smaller standard deviations in most cases.

| Dataset | Hor. | OLinear (Look-back: 96) | SAMformer (Look-back: 512) |
|---|---|---|---|
| ETTm2 | 96 | 0.169±1e-4 | 0.181±0.005 |
| | 192 | 0.232±1e-4 | 0.233±0.002 |
| | 336 | 0.291±2e-4 | 0.285±0.001 |
| | 720 | 0.389±4e-4 | 0.375±0.001 |
| ECL | 96 | 0.131±4e-4 | 0.155±0.002 |
| | 192 | 0.150±1e-3 | 0.168±0.001 |
| | 336 | 0.165±1e-3 | 0.183±0.000 |
| | 720 | 0.191±2e-3 | 0.219±0.000 |
| Exchange | 96 | 0.082±3e-4 | 0.161±0.007 |
| | 192 | 0.171±2e-3 | 0.246±0.009 |
| | 336 | 0.331±3e-3 | 0.368±0.006 |
| | 720 | 0.837±0.018 | 1.003±0.018 |
| Weather | 96 | 0.153±1e-3 | 0.197±0.001 |
| | 192 | 0.200±2e-3 | 0.235±0.000 |
| | 336 | 0.258±3e-3 | 0.276±0.001 |
| | 720 | 0.337±4e-3 | 0.334±0.000 |

  • To ensure a fair comparison, we uniformly set the lookback length to 96 and report the following MAEs. The average results are reported across 3 random seeds for SAMformer, using the recommended learning rates and $\rho$ values.
    • On average, OLinear outperforms SAMformer by a substantial margin of 12.1%.
    • Notably, the proposed NormLin layer improves the performance of SAMformer by 10.6% and 5.8% on the Traffic and ECL datasets, respectively.

| Dataset | Hor. | OLinear | SAMformer | SAMformer+NormLin |
|---|---|---|---|---|
| ETTh1 | 96 | 0.382 | 0.396 | 0.392 |
| | 192 | 0.414 | 0.425 | 0.419 |
| | 336 | 0.438 | 0.444 | 0.439 |
| | 720 | 0.462 | 0.469 | 0.459 |
| ETTm2 | 96 | 0.249 | 0.268 | 0.262 |
| | 192 | 0.290 | 0.305 | 0.303 |
| | 336 | 0.328 | 0.343 | 0.340 |
| | 720 | 0.387 | 0.399 | 0.397 |
| ECL | 96 | 0.221 | 0.285 | 0.267 |
| | 192 | 0.238 | 0.288 | 0.269 |
| | 336 | 0.254 | 0.301 | 0.284 |
| | 720 | 0.279 | 0.335 | 0.318 |
| Traffic | 96 | 0.226 | 0.404 | 0.355 |
| | 192 | 0.241 | 0.367 | 0.335 |
| | 336 | 0.250 | 0.379 | 0.339 |
| | 720 | 0.270 | 0.401 | 0.358 |
| Weather | 96 | 0.190 | 0.231 | 0.208 |
| | 192 | 0.235 | 0.269 | 0.251 |
| | 336 | 0.280 | 0.304 | 0.290 |
| | 720 | 0.333 | 0.350 | 0.339 |
| Exchange | 96 | 0.200 | 0.203 | 0.210 |
| | 192 | 0.293 | 0.303 | 0.302 |
| | 336 | 0.414 | 0.426 | 0.416 |
| | 720 | 0.688 | 0.712 | 0.689 |

Q3: On quantifying temporal decorrelation

  • We report the off-diagonal Frobenius norm of the correlation matrices (size: $96\times 96$) for the original series, and for the series after OrthoTrans or FFT, using a window size of 96. The results below show that OrthoTrans effectively removes temporal correlations and performs better than the FFT operation in decorrelation.

| Dataset | Original | OrthoTrans | FFT |
|---|---|---|---|
| ETTh1 | 61.36 | 1.46 | 2.58 |
| ETTm2 | 83.04 | 1.26 | 5.22 |
| ECL | 45.95 | 1.78 | 2.26 |
| Traffic | 34.80 | 1.79 | 3.29 |
| Weather | 61.70 | 2.86 | 19.61 |
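
An illustrative way to compute such a decorrelation metric (our own sketch, not the authors' evaluation script) is shown below; `windows` is a placeholder for the length-96 training windows.

```python
# Hedged sketch: off-diagonal Frobenius norm of the temporal correlation matrix,
# before and after projecting each window onto the orthogonal eigenbasis Q of OrthoTrans.
import numpy as np

def offdiag_fro(windows):
    """windows: (num_windows, T); returns ||offdiag(C)||_F for the T x T correlation matrix C."""
    c = np.corrcoef(windows, rowvar=False)
    return np.linalg.norm(c - np.diag(np.diag(c)), "fro")

windows = np.random.randn(1000, 96)          # placeholder for real length-96 training windows
c_train = np.corrcoef(windows, rowvar=False)
_, q = np.linalg.eigh(c_train)               # orthogonal basis that diagonalizes the correlation

print("original:  ", offdiag_fro(windows))
print("OrthoTrans:", offdiag_fro(windows @ q))   # should drop sharply on real, correlated data
```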
  • We also argue that examining correlations in the frequency domain of the feature series after OrthoTrans is of limited relevance: 1) OLinear does not involve any FFT operations. 2) Whether there exists an orthogonal transformation that simultaneously mitigates both temporal- and frequency-domain correlations is a nontrivial question. Further exploration could be a future direction of this work. We will emphasize that the decorrelation is performed temporally in the final draft.

Q4: On quadratic cost of NormLin

  • In our view, $\mathcal{O}(N^2)$ is the minimal cost required to model $N$-channel correlations, as even a single linear layer incurs this complexity.

  • A straightforward way to reduce this cost is to introduce a bottleneck block: the channels are first projected down to a fixed width (e.g., 64), the NormLin module is applied in the fixed-dimensional space, and then the representation is projected back to $N$ channels. This reduces the computational complexity to $\mathcal{O}(N)$. As shown below, this variant tends to degrade performance, particularly on datasets with a large number of channels, such as ECL and Traffic.

| Dataset | Hor. | Original (MAE) | Bottleneck (MAE, fixed width=64) |
|---|---|---|---|
| ETTh1 | 96 | 0.382 | 0.382 |
| | 192 | 0.414 | 0.415 |
| | 336 | 0.438 | 0.443 |
| | 720 | 0.462 | 0.468 |
| ECL | 96 | 0.221 | 0.241 |
| | 192 | 0.238 | 0.256 |
| | 336 | 0.254 | 0.270 |
| | 720 | 0.279 | 0.291 |
| Traffic | 96 | 0.226 | 0.263 |
| | 192 | 0.241 | 0.278 |
| | 336 | 0.250 | 0.283 |
| | 720 | 0.270 | 0.298 |
| Weather | 96 | 0.190 | 0.190 |
| | 192 | 0.235 | 0.239 |
| | 336 | 0.280 | 0.281 |
| | 720 | 0.333 | 0.337 |
| PEMS-03 | 12 | 0.159 | 0.160 |
| | 24 | 0.179 | 0.182 |
| | 48 | 0.210 | 0.224 |
| | 96 | 0.247 | 0.264 |
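
A minimal sketch of the bottleneck variant described above is given below; the module and parameter names are ours for illustration and do not come from the released code.

```python
# Hedged sketch: project N channels to a fixed width, mix there with NormLin, project back,
# reducing the channel-mixing cost from O(N^2) to O(N) in the channel count N.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormLin(nn.Module):
    """Row-normalized linear mixing across channels: y = NormL1(Softplus(W)) @ x."""
    def __init__(self, n: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n, n) * 0.02)

    def forward(self, x):                               # x: (batch, n, d_model)
        w = F.softplus(self.weight)
        w = w / w.sum(dim=-1, keepdim=True)             # row-wise L1 normalization
        return w @ x

class BottleneckNormLin(nn.Module):
    def __init__(self, n_channels: int, width: int = 64):
        super().__init__()
        self.down = nn.Linear(n_channels, width)        # mixes along the channel axis
        self.mix = NormLin(width)
        self.up = nn.Linear(width, n_channels)

    def forward(self, x):                               # x: (batch, n_channels, d_model)
        h = self.down(x.transpose(1, 2)).transpose(1, 2)     # -> (batch, width, d_model)
        h = self.mix(h)
        return self.up(h.transpose(1, 2)).transpose(1, 2)    # -> (batch, n_channels, d_model)
```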

Q5: On hyperparameters

  • Because there are substantial differences among datasets, it is common practice (e.g., iTransformer and SAMformer) to fine-tune certain hyperparameters to maximize model performance.

  • In the paper, the hyperparameter value ranges are provided in Appendix D. Note that the embedding size $d$ is fixed at 16 for all experiments. We also evaluate the hyperparameter sensitivity of OLinear in Appendix I.6. As shown in Figure 17 and Table 30, OLinear exhibits robust performance across these hyperparameters.

  • Furthermore, we freeze a single architecture (learning rate $lr = 5\times10^{-4}$, model dimension $D = 512$, number of blocks $L = 3$, embedding size $d = 16$) and report the average results across four prediction lengths. Compared to the results reported in the paper, the performance degradation is within 1%.

| Dataset | In paper (MSE) | In paper (MAE) | Fixed Para. (MSE) | Fixed Para. (MAE) |
|---|---|---|---|---|
| ETTh1 | 0.424 | 0.424 | 0.426 | 0.425 |
| ETTh2 | 0.367 | 0.388 | 0.369 | 0.390 |
| ETTm1 | 0.374 | 0.377 | 0.377 | 0.381 |
| ETTm2 | 0.270 | 0.313 | 0.271 | 0.313 |
| ECL | 0.159 | 0.248 | 0.159 | 0.248 |
| Traffic | 0.451 | 0.247 | 0.451 | 0.247 |
| Weather | 0.237 | 0.260 | 0.240 | 0.262 |

We hope these responses can fully address your concerns. Thank you again for your comments!

Reference

[1] Sharpness-aware minimization for efficiently improving generalization. ICLR 2021.

Comment

Thank you for the well-structured rebuttal. Your additional experiments and clarifications resolve most of my earlier concerns. Below are my remarks.

Q1: Significance testing

The inclusion of Student’s t-tests across seven random seeds strengthens the statistical analysis and provides a solid estimate of the robustness of your model.

Q2: Novelty and comparison with existing baselines

The comparison with SAMformer and the reported average 12.1% MAE gain position OLinear as the stronger baseline. However, what would further elevate this section is a theoretical discussion that contrasts the subspaces produced by OrthoTrans with those implied by a model like SAMformer's rank-regularised attention. It is still unclear to me whether OrthoTrans provides representational capacity beyond what a simple attention + linear layer equipped with rank regularization, or even LoRA-style methods, can already achieve.

Q3: Quantifying decorrelation

The updated table of off-diagonal Frobenius norms is very helpful. It shows that OrthoTrans efficiently slashes time-domain correlations on every dataset. Given your statement that OrthoTrans is intended only for temporal decorrelation and that extending it to the frequency domain is a separate, non-trivial research question, the current analysis feels sufficient for this paper’s scope. However, a brief sentence in the main text clarifying this design choice would head off potential misunderstandings.

Q4: Quadratic cost

The bottleneck experiment shows the trade-off between complexity and accuracy: reducing the model size lowers the computational cost but also leads to some loss in accuracy on specific datasets. Showing how the performance (MAE or MSE) and latency change as the bottleneck size varies would help readers decide if the full quadratic version of the method is worth it for their own use cases. I suggest adding such an experiment to the paper to make this trade-off more explicit.

Q5: Hyper-parameter robustness

Freezing the learning rate, model dimension, number of blocks, and embedding size while losing ≤ 1% accuracy attests to the method’s stability.

General Comment

The rebuttal effectively strengthens the paper. Further clarifying the main theoretical point raised in Q2 would help improve the overall evaluation.

Comment

Point 3: Quadratic cost

  • Thanks for your suggestion! We compare the MAE performance and resource footprint across various bottleneck sizes. As shown below, a bottleneck size of 16 appears to be a sweet spot for the performance–resource trade-off. This choice has a performance gap of about 4% compared to the full quadratic version, but with less resource utilization. We will definitely include a new section in the final manuscript to make this trade-off more explicit.

| Dataset | Hor. | Ori. | Bottleneck (8) | Bottleneck (16) | Bottleneck (64) | Bottleneck (128) |
|---|---|---|---|---|---|---|
| ETTh1 | 96 | 0.382 | 0.382 | 0.381 | 0.382 | 0.383 |
| | 192 | 0.414 | 0.415 | 0.412 | 0.415 | 0.416 |
| | 336 | 0.438 | 0.438 | 0.440 | 0.443 | 0.442 |
| | 720 | 0.462 | 0.467 | 0.467 | 0.468 | 0.461 |
| ECL | 96 | 0.221 | 0.236 | 0.236 | 0.241 | 0.240 |
| | 192 | 0.238 | 0.255 | 0.250 | 0.256 | 0.254 |
| | 336 | 0.254 | 0.264 | 0.271 | 0.270 | 0.267 |
| | 720 | 0.279 | 0.302 | 0.291 | 0.291 | 0.295 |
| Traffic | 96 | 0.226 | 0.255 | 0.255 | 0.263 | 0.280 |
| | 192 | 0.241 | 0.273 | 0.289 | 0.278 | 0.277 |
| | 336 | 0.250 | 0.270 | 0.275 | 0.283 | 0.281 |
| | 720 | 0.270 | 0.299 | 0.296 | 0.298 | 0.301 |
| Weather | 96 | 0.190 | 0.192 | 0.188 | 0.190 | 0.190 |
| | 192 | 0.235 | 0.240 | 0.237 | 0.239 | 0.241 |
| | 336 | 0.280 | 0.284 | 0.287 | 0.281 | 0.282 |
| | 720 | 0.333 | 0.338 | 0.336 | 0.337 | 0.338 |
| PEMS03 | 12 | 0.159 | 0.162 | 0.161 | 0.160 | 0.162 |
| | 24 | 0.179 | 0.186 | 0.183 | 0.182 | 0.181 |
| | 48 | 0.210 | 0.223 | 0.224 | 0.224 | 0.224 |
| | 96 | 0.247 | 0.275 | 0.266 | 0.264 | 0.271 |

| Dataset | Metric | Ori. | Bottleneck (8) | Bottleneck (16) | Bottleneck (64) | Bottleneck (128) |
|---|---|---|---|---|---|---|
| Traffic | Params (M) | 6.17 | 4.71 | 4.74 | 4.92 | 5.16 |
| | FLOPs (G) | 4.91 | 4.16 | 4.18 | 4.27 | 4.39 |
| | T.T. (s/iter) | 0.02 | 0.015 | 0.015 | 0.016 | 0.017 |
| | T.M. (GB) | 1.01 | 0.98 | 0.99 | 1.00 | 1.01 |
| | I.T. (ms/iter) | 5.71 | 5.60 | 5.63 | 5.71 | 5.80 |
| | I.M. (GB) | 0.43 | 0.39 | 0.39 | 0.40 | 0.41 |
| ECL | Params (M) | 4.79 | 4.59 | 4.60 | 4.67 | 4.78 |
| | FLOPs (G) | 1.65 | 1.55 | 1.56 | 1.59 | 1.65 |
| | T.T. (ms/iter) | 7.75 | 7.45 | 7.58 | 7.61 | 7.80 |
| | T.M. (GB) | 0.45 | 0.44 | 0.45 | 0.45 | 0.46 |
| | I.T. (ms/iter) | 2.11 | 2.04 | 2.05 | 2.08 | 2.09 |
| | I.M. (GB) | 0.17 | 0.16 | 0.16 | 0.16 | 0.17 |
| PEMS03 | Params (M) | 4.84 | 4.60 | 4.61 | 4.69 | 4.80 |
| | FLOPs (G) | 1.85 | 1.73 | 1.74 | 1.77 | 1.83 |
| | T.T. (s/iter) | 8.14 | 8.05 | 8.06 | 8.12 | 8.16 |
| | T.M. (GB) | 0.48 | 0.48 | 0.48 | 0.49 | 0.50 |
| | I.T. (ms/iter) | 2.23 | 2.18 | 2.18 | 2.21 | 2.24 |
| | I.M. (GB) | 0.20 | 0.18 | 0.18 | 0.18 | 0.19 |
| Weather | Params (M) | 4.52 | 4.52 | 4.52 | 4.54 | 4.57 |
| | FLOPs (G) | 0.10 | 0.10 | 0.10 | 0.11 | 0.12 |
| | T.T. (ms/iter) | 7.47 | 7.57 | 7.59 | 7.62 | 7.70 |
| | T.M. (GB) | 0.21 | 0.22 | 0.22 | 0.23 | 0.23 |
| | I.T. (ms/iter) | 1.18 | 1.27 | 1.28 | 1.31 | 1.35 |
| | I.M. (MB) | 39.26 | 38.86 | 39.10 | 40.67 | 44.86 |
| ETTh1 | Params (M) | 4.52 | 4.52 | 4.52 | 4.53 | 4.56 |
| | FLOPs (M) | 33.74 | 33.87 | 34.18 | 38.81 | 52.31 |
| | T.T. (ms/iter) | 8.09 | 8.43 | 8.57 | 8.89 | 9.01 |
| | T.M. (GB) | 0.20 | 0.20 | 0.20 | 0.21 | 0.22 |
| | I.T. (ms/iter) | 1.19 | 1.28 | 1.31 | 1.35 | 1.38 |
| | I.M. (MB) | 156.41 | 156.44 | 156.69 | 159.12 | 160.39 |

(T.T./T.M. denote training time/memory; I.T./I.M. denote inference time/memory.)

Thanks again for your constructive comments. We are very happy to answer any further questions.

Comment

Thank you for the detailed rebuttal: the expanded discussion around Degrees-of-Freedom is very useful, and the quadratic-cost analysis clearly illustrates the efficiency trade-offs.

LoRA ablation & DoF: Clarifying the narrative

The provided NormLin_LoRA results (d = 8/16) are particularly insightful, showing stable performance despite a substantial nominal rank reduction. This important finding suggests that the observed performance gains do not primarily arise from increased rank or Degrees-of-Freedom alone.

Given this evidence, it would further strengthen the paper to explicitly clarify in the main text what specific aspect(s) of NormLin are most responsible for its improved performance. For example, do the gains primarily stem from the normalization mechanism itself, enhanced gradient smoothness during optimization, or possibly another underlying factor?

Providing this clarification in the main narrative would ensure that readers do not mistakenly attribute the method’s success solely to higher rank or increased parameter count.

Quadratic cost

The bottleneck experiment fully addresses my earlier concerns: a small figure illustrating the trend could enhance readability, but the provided numeric table already clearly conveys the intended message.

Overall

My primary remaining concern relates to the current narrative around rank and Degrees-of-Freedom: the LoRA experiments convincingly show that rank reduction alone does not fully explain the observed improvements.

Thanks again to the authors for addressing my points carefully and conducting these valuable additional experiments.

Comment

Point 1 (continued): Theoretical discussion of NormLin and self-attention

  • To further demonstrate the DoF effects on forecasting performance, we introduce NormLin_LoRA, following the idea of low-rank approximation: $$\mathrm{NormLin}_{\mathrm{LoRA}} = \mathrm{Norm}_{\mathrm{L1}}(\mathrm{Softplus}(\mathbf{A}\mathbf{B}^{\top})),$$ where $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{N \times d}$. According to the first table, $\mathrm{NormLin}_{\mathrm{LoRA}}$ has fewer DoF than the original NormLin. We report the following MAEs under the architecture of SAMformer. As expected, NormLin outperforms the LoRA-style variants.

| Dataset | Hor. | SAMformer+NormLin | SAMformer+NormLin_LoRA (d: 8) | SAMformer+NormLin_LoRA (d: 16) |
|---|---|---|---|---|
| ETTh1 | 96 | 0.393 | 0.395 | 0.394 |
| | 192 | 0.422 | 0.425 | 0.424 |
| | 336 | 0.444 | 0.446 | 0.446 |
| | 720 | 0.487 | 0.489 | 0.489 |
| ETTm2 | 96 | 0.262 | 0.264 | 0.264 |
| | 192 | 0.303 | 0.305 | 0.304 |
| | 336 | 0.340 | 0.342 | 0.342 |
| | 720 | 0.397 | 0.399 | 0.399 |
| ECL | 96 | 0.267 | 0.270 | 0.269 |
| | 192 | 0.269 | 0.272 | 0.269 |
| | 336 | 0.284 | 0.288 | 0.286 |
| | 720 | 0.318 | 0.322 | 0.320 |
| Traffic | 96 | 0.355 | 0.357 | 0.356 |
| | 192 | 0.335 | 0.337 | 0.336 |
| | 336 | 0.339 | 0.341 | 0.340 |
| | 720 | 0.358 | 0.360 | 0.359 |
| Weather | 96 | 0.208 | 0.219 | 0.210 |
| | 192 | 0.251 | 0.260 | 0.253 |
| | 336 | 0.290 | 0.295 | 0.292 |
| | 720 | 0.339 | 0.343 | 0.345 |
| Exchange | 96 | 0.221 | 0.224 | 0.226 |
| | 192 | 0.302 | 0.304 | 0.306 |
| | 336 | 0.416 | 0.419 | 0.416 |
| | 720 | 0.689 | 0.693 | 0.692 |
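
For concreteness, a minimal sketch of the NormLin_LoRA parameterization defined above is shown below; the module and parameter names are ours for illustration.

```python
# Hedged sketch: NormLin with a low-rank weight NormL1(Softplus(A @ B^T)), A, B in R^{N x d}.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormLinLoRA(nn.Module):
    def __init__(self, n_channels: int, rank: int = 8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(n_channels, rank) * 0.02)
        self.B = nn.Parameter(torch.randn(n_channels, rank) * 0.02)

    def forward(self, x):                               # x: (batch, n_channels, d_model)
        w = F.softplus(self.A @ self.B.T)               # N x N, low-rank before the nonlinearity
        w = w / w.sum(dim=-1, keepdim=True)             # row-wise L1 normalization
        return w @ x
```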
  • We also report the SVs, ranks, and entropy of the weight matrices for the LoRA variants under the SAMformer architecture. As shown below, the original NormLin exhibits higher SVs, ranks, and entropy than its LoRA counterparts.

| Dataset | Min Attn. SV (NormLin / LoRA d8 / LoRA d16) | Median Attn. SV (NormLin / LoRA d8 / LoRA d16) | Attn. Rank (NormLin / LoRA d8 / LoRA d16) | Attention Entropy (NormLin / LoRA d8 / LoRA d16) |
|---|---|---|---|---|
| ETTh1 | 0.009 / 0.010 / 0.013 | 0.311 / 0.301 / 0.280 | 7 / 7 / 7 | 1.546 / 1.293 / 1.248 |
| ETTm2 | 0.001 / 0.001 / 0.001 | 0.473 / 0.414 / 0.431 | 7 / 7 / 7 | 1.277 / 1.118 / 1.075 |
| ECL | 0.000 / 0.000 / 0.000 | 0.018 / 0.000 / 0.000 | 311 / 34 / 98 | 5.701 / 5.641 / 5.514 |
| Traffic | 0.000 / 0.000 / 0.000 | 0.013 / 0.000 / 0.000 | 828 / 48 / 189 | 6.657 / 6.605 / 6.494 |
| Weather | 0.005 / 0.000 / 0.002 | 0.106 / 0.033 / 0.124 | 21 / 17 / 21 | 2.828 / 2.570 / 2.291 |
| Exchange | 0.019 / 0.001 / 0.005 | 0.108 / 0.123 / 0.078 | 8 / 7 / 8 | 1.876 / 1.231 / 1.347 |

  • We also compare the NormLin layer with its LoRA variant under the OLinear architecture. As shown below, the reduced DoF of the LoRA variant results in degraded performance.

| Dataset | Hor. | OLinear (MSE) | OLinear (MAE) | OLinear+LoRA (d64) (MSE) | OLinear+LoRA (d64) (MAE) |
|---|---|---|---|---|---|
| ETTh1 | 96 | 0.360 | 0.382 | 0.363 | 0.384 |
| | 192 | 0.416 | 0.414 | 0.415 | 0.415 |
| | 336 | 0.457 | 0.438 | 0.461 | 0.441 |
| | 720 | 0.463 | 0.462 | 0.469 | 0.466 |
| ETTm2 | 96 | 0.169 | 0.249 | 0.171 | 0.250 |
| | 192 | 0.232 | 0.290 | 0.233 | 0.292 |
| | 336 | 0.291 | 0.328 | 0.292 | 0.329 |
| | 720 | 0.389 | 0.387 | 0.390 | 0.389 |
| ECL | 96 | 0.131 | 0.221 | 0.135 | 0.225 |
| | 192 | 0.150 | 0.238 | 0.153 | 0.242 |
| | 336 | 0.165 | 0.254 | 0.170 | 0.259 |
| | 720 | 0.191 | 0.279 | 0.199 | 0.286 |
| Traffic | 96 | 0.398 | 0.226 | 0.410 | 0.229 |
| | 192 | 0.439 | 0.241 | 0.431 | 0.243 |
| | 336 | 0.464 | 0.250 | 0.462 | 0.254 |
| | 720 | 0.502 | 0.270 | 0.509 | 0.283 |
| Weather | 96 | 0.153 | 0.190 | 0.156 | 0.195 |
| | 192 | 0.200 | 0.235 | 0.203 | 0.238 |
| | 336 | 0.258 | 0.280 | 0.263 | 0.283 |
| | 720 | 0.337 | 0.333 | 0.342 | 0.336 |
| Exchange | 96 | 0.082 | 0.200 | 0.083 | 0.200 |
| | 192 | 0.171 | 0.293 | 0.174 | 0.296 |
| | 336 | 0.331 | 0.414 | 0.330 | 0.414 |
| | 720 | 0.837 | 0.688 | 0.867 | 0.703 |

  • We will incorporate this discussion in our final manuscript to further enrich our work.

Point 2: Further clarification on temporal correlation

  • Thanks for your suggestion! We will definitely clarify our design of OrthoTrans by emphasizing that the decorrelation is performed temporally in the Introduction and Section 4.2.
Comment

We could not be happier to receive such heartwarming remarks. We deeply appreciate that you found our initial responses resolved most of your concerns. Below, we provide our further responses:

Point 1: Theoretical discussion of NormLin and self-attention

  • Thanks for your suggestion! We fully agree that further theoretical analysis will elevate our work. We would like to conduct the analysis from the perspective of degrees of freedom (DoF) of the attention matrices. The number of DoF in a matrix/vector is the number of elements needed to completely specify it. The DoF satisfy the following properties [1]:

    • Strictly monotonic functions (e.g., Softplus, exp) that operate element-wise do not impose cross-element constraints, and thus do not change the DoF.
    • For a vector $\mathbf{x} \in \mathbb{R}^N$, its DoF is $N$. When imposing the L1 normalization, i.e., $\mathrm{Norm}_{\mathrm{L1}}(\mathbf{x})$, the DoF is reduced by 1. Similarly, for a matrix $\mathbf{X} \in \mathbb{R}^{N\times N}$, row-wise L1 normalization reduces its DoF from $N^2$ to $N^2 - N$.
  • Based on these, we have the following DoF equations for NormLin and self-attention:

| Metric | $\mathbf{W} \in \mathbb{R}^{N \times N}$ | $\mathrm{Norm}_{\mathrm{L1}}(\mathrm{Softplus}(\mathbf{W}))$ | $\mathbf{Q}\mathbf{K}^{\top}$ ($\mathbf{Q}, \mathbf{K} \in \mathbb{R}^{N \times d}$) | $\mathrm{Softmax}(\mathbf{Q}\mathbf{K}^{\top})$ |
|---|---|---|---|---|
| DoF | $N^2$ | $N^2-N$ | $f(N,d)$ | $f(N,d)-N$ |
| Best rank | $N$ | $N$ | $\min(N,d)$ | $N$ |
| Worst rank | 0 | 1 | 0 | 1 |

  • In the above table, we omit scalar parameters for clarity, as they do not affect DoF or ranks. $f(N,d)$ is a piecewise function with a maximum value of $N^2$, defined as $f(N,d)=2Nd-d^2$ if $d \le N$ and $f(N,d)=N^2$ otherwise. The subtraction of $d^2$ in $f(N,d)$ can be understood from the fact that, for any invertible matrix $\mathbf{P} \in \mathbb{R}^{d\times d}$, $\mathbf{Q}'=\mathbf{Q}\mathbf{P}$ and $\mathbf{K}'=\mathbf{K}(\mathbf{P}^{-1})^{\top}$ satisfy $\mathbf{Q}'\mathbf{K}'^{\top}=\mathbf{Q}\mathbf{K}^{\top}$. In SAMformer, the default hidden dimension $d$ is set to 16, which is smaller than the channel count $N$ of most common datasets.

  • We can see that the NormLin layer benefits from a larger DoF of the weight matrix (third column) compared to that of the attention matrix in self-attention (last column), since $N^2-N \ge f(N,d)-N$. For example, with $N = 321$ (ECL) and $d = 16$, $N^2-N = 102720$ while $f(N,d)-N = 9695$. This indicates a larger search space of attention/weight matrices and potentially better representational capacity [2].

  • Empirically, we report the minimum singular values (SVs), median SVs, ranks, and entropies of the attention and weight matrices for SAMformer and SAMformer+NormLin (denoted as NormLin in the following table). The lookback and prediction lengths are set to 96. Since the attention matrix in SAMformer is input-dependent, we report the mean, minimum, and maximum values (denoted as S_mean, S_min, S_max, respectively) across the test sets for SAMformer. The matrix rank is computed as the number of SVs greater than $10^{-3}$. The SAM optimizer is used with the recommended hyperparameters in the SAMformer paper.

| Dataset | Min Attn. SV (NormLin / S_mean / S_min / S_max) | Median Attn. SV (NormLin / S_mean / S_min / S_max) | Attn. Rank (NormLin / S_mean / S_min / S_max) | Attention Entropy (NormLin / S_mean / S_min / S_max) |
|---|---|---|---|---|
| ETTh1 | 0.009 / 0.002 / 0.000 / 0.042 | 0.311 / 0.318 / 0.032 / 0.776 | 7 / 6.620 / 5 / 7 | 1.546 / 1.473 / 0.940 / 1.843 |
| ETTm2 | 0.001 / 0.013 / 0.000 / 0.326 | 0.473 / 0.418 / 0.011 / 0.997 | 7 / 6.780 / 4 / 7 | 1.277 / 0.823 / 0.148 / 1.375 |
| ECL | 0.000 / 0.000 / 0.000 / 0.000 | 0.018 / 0.000 / 0.000 / 0.000 | 311 / 83.825 / 68 / 100 | 5.701 / 4.887 / 4.336 / 5.487 |
| Traffic | 0.000 / 0.000 / 0.000 / 0.000 | 0.013 / 0.000 / 0.000 / 0.000 | 828 / 140.979 / 93 / 197 | 6.657 / 5.536 / 3.384 / 6.405 |
| Weather | 0.005 / 0.000 / 0.000 / 0.000 | 0.106 / 0.015 / 0.002 / 0.081 | 21 / 14.626 / 12 / 17 | 2.828 / 2.596 / 1.218 / 2.987 |
| Exchange | 0.019 / 0.003 / 0.000 / 0.020 | 0.108 / 0.053 / 0.005 / 0.197 | 8 / 7.676 / 6 / 8 | 1.876 / 1.822 / 1.080 / 2.060 |

  • We can observe that, compared with the vanilla SAMformer, the plug-in NormLin layer generally produces larger SVs, higher ranks, and, at the same time, better prevents attention entropy collapse. These results indicate better representational capacity than simple attention with rank regularization. In addition, this rank improvement phenomenon is consistent with the findings in Appendix E.2 of our paper.
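
As a rough illustration of how such metrics could be computed (our assumptions; in particular, "attention entropy" is taken here as the mean row-wise Shannon entropy of the row-stochastic matrix):

```python
# Hedged sketch: singular values, numerical rank (SVs > 1e-3), and mean row-wise entropy
# of a row-stochastic weight/attention matrix W (rows are non-negative and sum to one).
import torch

def matrix_stats(w: torch.Tensor, tol: float = 1e-3):
    sv = torch.linalg.svdvals(w)
    rank = int((sv > tol).sum())
    p = w.clamp_min(1e-12)
    row_entropy = -(p * p.log()).sum(dim=-1)
    return sv.min().item(), sv.median().item(), rank, row_entropy.mean().item()
```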

References:

[1] Strang, Linear Algebra and Its Applications, International Thomson Publishing, 3rd edition.

[2] Attention is not all you need: pure attention loses rank doubly exponentially with depth. ICML 2021.

Comment

Thank you for your timely response and for raising your score : ) We are glad that all your initial concerns have been effectively addressed. Your comments are important for enhancing the quality of our manuscript, and we will definitely incorporate these revisions in our camera-ready version - thank you so much!

Comment

(b) The weight matrix is trained to approximate the correlation matrix among channels $\mathrm{CorrMat}_v$.

  • In Figure 5 of the paper, we observe that the learned NormLin weights exhibit strong similarity to $\mathrm{Softmax}(\mathrm{CorrMat}_v)$. Motivated by this observation, we pre-compute the correlation matrix among channels $\mathrm{CorrMat}_v$ on the training set and directly use $\mathrm{Softmax}(\mathrm{CorrMat}_v)$ as a fixed weight matrix, resulting in a variant named $\mathrm{NormLin}_c$:
$$\mathrm{NormLin}_c(\mathbf{x}) = \mathrm{Softmax}(\mathrm{CorrMat}_v)\,\mathbf{x}.$$

We refer to OLinear with this variant as OLinear-C.
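
A minimal sketch of this fixed-weight variant (naming is ours, for illustration):

```python
# Hedged sketch of OLinear-C's channel mixing: the NormLin weight is replaced by a fixed
# Softmax over the channel-wise correlation matrix pre-computed on the training set.
import torch
import torch.nn as nn

class NormLinC(nn.Module):
    def __init__(self, corr_mat_v: torch.Tensor):       # corr_mat_v: (N, N), from the train split
        super().__init__()
        self.register_buffer("weight", torch.softmax(corr_mat_v, dim=-1))  # fixed, not trained

    def forward(self, x):                                # x: (batch, N, d_model)
        return self.weight @ x
```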

  • We then compare the average performance of OLinear-C with OLinear across multiple prediction lengths. ETT denotes the average results across the ETTh1, ETTh2, ETTm1 and ETTm2 datasets. As shown below, the average performance gap between OLinear and OLinear-C on these datasets is negligible (only 0.2%)!

| Dataset | ECL (MSE/MAE) | Traffic (MSE/MAE) | ETT (MSE/MAE) | Solar (MSE/MAE) | PEMS (MSE/MAE) | CarSales (MSE/MAE) |
|---|---|---|---|---|---|---|
| OLinear | 0.159 / 0.248 | 0.451 / 0.247 | 0.359 / 0.376 | 0.215 / 0.217 | 0.094 / 0.187 | 0.330 / 0.305 |
| OLinear-C | 0.161 / 0.249 | 0.451 / 0.247 | 0.359 / 0.376 | 0.215 / 0.217 | 0.094 / 0.187 | 0.330 / 0.305 |

  • If the ideal weight matrix is $\mathrm{Softmax}(\mathrm{CorrMat}_v)$, this partly explains the performance advantage of NormLin over self-attention.

    • Because attention matrices in self-attention are input-dependent, it is harder for them to converge to the input-independent multivariate correlation matrix.
    • In contrast, NormLin uses an input-independent weight matrix, $\mathrm{Norm}_{\mathrm{L1}}(\mathrm{Softplus}(\mathbf{W}))$, making this approximation easier.
  • The matrix $\mathrm{Softmax}(\mathrm{CorrMat}_v)$ is usually high-rank. How does NormLin_LoRA, whose weight matrix is $\mathrm{Norm}_{\mathrm{L1}}(\mathrm{Softplus}(\mathbf{A}\mathbf{B}^{\top}))$, also perform well, given that $\mathbf{A}\mathbf{B}^{\top}$ has a low rank of $d$? The reason is that $\mathrm{Norm}_{\mathrm{L1}}(\mathrm{Softplus}(\cdot))$ rescales and rotates each row differently and thus could increase the matrix rank [1], enabling a better approximation of $\mathrm{Softmax}(\mathrm{CorrMat}_v)$.

  • We will accordingly tone down the analysis of weight-matrix rank (Lines 187-193) and DoF (Lines 196-198) in the final paragraph of Section 4.3, and relocate the OLinear-C exposition from Section 5.2 to Section 4.3, to better clarify the mechanism underlying the NormLin layer.

Quadratic cost

Thanks for your suggestion! We will definitely add a figure showing performance versus bottleneck size in the relevant section to improve readability.


We are eager to hear your feedback. We sincerely hope that our new comments can address your primary remaining concerns. We’d deeply appreciate it if you could let us know whether your concerns have been addressed -- thank you so much!

Thanks for your time,

Authors of #10716


References:

[1] DenseHMM: Learning Hidden Markov Models by Learning Dense Representations. NeurIPS 2020.

[2] Feature selection, L1 vs. L2 regularization, and rotational invariance. ICML 2004.

Comment

Thank you for the comprehensive response. You have effectively addressed my initial concerns and outlined planned enhancements for the camera-ready version. In light of these updates, I will raise my overall rating. No further questions.

Comment

Dear Reviewer jUXb,

We highly appreciate your timely and constructive responses, and we are glad that you found our expanded discussion around Degrees-of-Freedom very useful and that the quadratic-cost analysis clearly illustrates the efficiency trade-offs.

Yes, we agree with the reviewer that, from our LoRA experiments, the observed performance gains of NormLin do not primarily arise from increased rank or Degrees-of-Freedom alone. To further understand the reasons for the method's success, we added more experiments and in-depth analysis, showing that: (a) the L1 normalization contributes more to NormLin's performance gains than the Softplus function, and (b) the weight matrix is trained to approximate the correlation matrix among channels $\mathrm{CorrMat}_v$.

(a) The L1 normalization contributes more to NormLin's performance gains than the Softplus function.

What component affects NormLin's performance most?

  • NormLin consists of two components:

    • The transformation function (Softplus);
    • Normalization strategy (L1 norm).
  • We add ablation studies on 5 transformation functions (Softplus, Identity, Softmax, Sigmoid, ReLU) and 3 normalization strategies (L1, L2, None), where Identity denotes no transformation and None denotes no normalization. The average results across four prediction lengths are presented below.

| Trans. Fun. | Norm | ECL (MSE/MAE) | Traffic (MSE/MAE) | Solar (MSE/MAE) | PEMS03 (MSE/MAE) | Weather (MSE/MAE) | ILI (S2) (MSE/MAE) |
|---|---|---|---|---|---|---|---|
| Softplus | L1 | 0.159 / 0.248 | 0.451 / 0.247 | 0.215 / 0.217 | 0.095 / 0.199 | 0.237 / 0.260 | 1.764 / 0.802 |
| Identity | L1 | 0.167 / 0.255 | 0.499 / 0.252 | 0.285 / 0.224 | 0.096 / 0.199 | 0.243 / 0.264 | 1.844 / 0.818 |
| Softmax | L1 | 0.163 / 0.251 | 0.452 / 0.247 | 0.215 / 0.217 | 0.096 / 0.199 | 0.236 / 0.259 | 1.802 / 0.811 |
| Sigmoid | L1 | 0.161 / 0.250 | 0.450 / 0.247 | 0.215 / 0.217 | 0.096 / 0.199 | 0.239 / 0.260 | 1.767 / 0.802 |
| ReLU | L1 | 0.163 / 0.251 | 0.454 / 0.247 | 0.215 / 0.217 | 0.097 / 0.199 | 0.239 / 0.262 | 1.759 / 0.802 |
| Softplus | L2 | 0.206 / 0.298 | 0.647 / 0.328 | 0.293 / 0.273 | 0.163 / 0.254 | 0.238 / 0.260 | 1.874 / 0.827 |
| Identity | L2 | 0.171 / 0.259 | 0.518 / 0.259 | 0.224 / 0.222 | 0.102 / 0.205 | 0.241 / 0.262 | 1.838 / 0.817 |
| Softmax | L2 | 0.201 / 0.294 | 0.648 / 0.328 | 0.287 / 0.267 | 0.164 / 0.254 | 0.237 / 0.260 | 1.843 / 0.820 |
| Sigmoid | L2 | 0.214 / 0.302 | 0.647 / 0.328 | 0.293 / 0.272 | 0.164 / 0.255 | 0.238 / 0.261 | 1.854 / 0.823 |
| ReLU | L2 | 0.201 / 0.294 | 0.648 / 0.328 | 0.285 / 0.266 | 0.163 / 0.254 | 0.237 / 0.260 | 1.840 / 0.821 |
| Softplus | None | 0.211 / 0.301 | 0.652 / 0.331 | 0.295 / 0.274 | 0.163 / 0.254 | 0.259 / 0.274 | 1.925 / 0.846 |
| Identity | None | 0.183 / 0.280 | 0.609 / 0.333 | 0.232 / 0.231 | 0.130 / 0.231 | 0.242 / 0.264 | 1.888 / 0.830 |
| Sigmoid | None | 0.212 / 0.301 | 0.647 / 0.328 | 0.296 / 0.275 | 0.170 / 0.258 | 0.262 / 0.278 | 1.890 / 0.827 |
| ReLU | None | 0.199 / 0.293 | 0.648 / 0.328 | 0.293 / 0.273 | 0.168 / 0.256 | 0.238 / 0.260 | 1.806 / 0.817 |

  • From above, we found that the L1 normalization contributes more to NormLin's performance gains than the Softplus function. Specifically,

    • (i) the L1 norm outperforms L2 norm and no normalization by 10% and 12%, respectively;
    • (ii) under L1 normalization, Softplus outperforms Identity, Softmax, Sigmoid, and ReLU by 5%, 1%, 0.2%, and 0.3%, respectively.
  • The reasons behind the performance advantage of the L1 norm could be intricate. It is a natural choice, as it induces a probability distribution over each row of the weight matrix.

    • Without L1 normalization, each row $\mathbf{W}_{i,:}$ effectively includes a learnable scaling parameter $c_i$, whether using L2 normalization or no normalization at all. However, this scaling is rendered meaningless by the subsequent LayerNorm operation, which eliminates any learned magnitude differences.
    • In contrast, L1 normalization emphasizes the relative distribution of weights within each row rather than their absolute scale, which may lead to improved performance.
Review (Rating: 4)

The paper introduces OLinear, a linear-based model for multivariate time series forecasting with two key components:

  • OrthoTrans: A data-adaptive transformation that diagonalizes the temporal Pearson correlation matrix, decorrelating features and simplifying forecasting.

  • NormLin: A lightweight linear layer with row-wise normalization that outperforms self-attention in accuracy and efficiency.

Experiments on 24 datasets demonstrate state-of-the-art performance, particularly for long-term forecasting, while maintaining low computational overhead.

Strengths and Weaknesses

Strengths

  • Theorem 1 formally connects the temporal-correlation structure of the data with the orthogonal transformation used by OrthoTrans, giving the model a clear theoretical underpinning and enhancing interpretability.
  • NormLin replaces multi-head self-attention with a row-normalized linear layer, cutting both FLOPs and memory footprint by roughly 50 % while preserving—or improving—forecasting accuracy.
  • OLinear achieves state-of-the-art results on 140 forecasting tasks drawn from 24 public datasets, demonstrating robustness across domains and long-horizon scenarios.
  • Both OrthoTrans and NormLin are modular, training-free components that can be plugged into existing forecasters to boost performance with minimal engineering effort.
  • The experimental section is notably comprehensive, featuring extensive ablations, scalability tests, and comparisons against a wide range of baselines.

Weaknesses

  • While training-time speed and memory are reported, the paper does not quantify inference-time latency or serving cost—metrics that are crucial in streaming or high-frequency settings; deployment-cost analysis is therefore incomplete.
  • Important modern linear or correlation-based baselines (e.g., RLinear, TSMixer, StemGNN, DEPTS) are omitted, leaving open whether OLinear’s gains hold against the strongest current alternatives.
  • The claim that NormLin is a “universal token dependency learner” appears overstated: its superiority is demonstrated only on time-series data, with no evidence provided for other modalities or tasks.

Questions

  • Why does NormLin show better gradient dynamics than attention? Is there any theoretical support?
  • What is the relationship between OrthoTrans and PCA?
  • Theorem 1 assumes Gaussianity. How sensitive is OLinear to deviations from this assumption in real-world data?
  • How does performance degrade with noisy/missing data or non-stationary correlations?
  • What is the inference latency for ultra-long sequences (1M)? Also, some related references for long-term forecasting, such as Pyraformer (ICLR 2022), should be discussed.

Limitations

Yes

Final Justification

Thanks to the authors for providing a detailed response, especially the extra supporting data. I keep my original rating and think this paper is acceptable.

Formatting Concerns

No

Author Response

Thank you for your positive comments on our work. We sincerely appreciate the reviewer’s recognition of our work as theoretically well-founded, experimentally comprehensive, and achieving state-of-the-art performance. Here are our responses to your concerns:

W1: On the deployment cost

  • We report the training and inference resource footprint of OLinear and baseline forecasters in Table 37 (Page 48), including inference time and GPU memory consumption. OLinear achieves higher inference efficiency than state-of-the-art Transformer-based forecasters (e.g., TimeMixer++, iTransformer, and CARD), benefiting from the simplified NormLin module. Moreover, Table 38 shows the efficiency differences before and after applying the NormLin module to Transformer-based forecasters.

W2: On more baselines

  • We compare OLinear with RLinear and TSMixer using the MSE metric (see below). The results of RLinear and TSMixer are taken from iTransformer and SAMformer, respectively. On average, OLinear outperforms these two forecasters by 17.3% and 25.6%, respectively.

| Dataset | Hor. | OLinear | RLinear | TSMixer |
|---|---|---|---|---|
| ETTm2 | 96 | 0.169 | 0.182 | 0.211 |
| | 192 | 0.232 | 0.246 | 0.252 |
| | 336 | 0.291 | 0.307 | 0.303 |
| | 720 | 0.389 | 0.407 | 0.390 |
| ECL | 96 | 0.131 | 0.201 | 0.173 |
| | 192 | 0.150 | 0.201 | 0.204 |
| | 336 | 0.165 | 0.215 | 0.217 |
| | 720 | 0.191 | 0.257 | 0.242 |
| Exchange | 96 | 0.082 | 0.093 | 0.343 |
| | 192 | 0.171 | 0.184 | 0.342 |
| | 336 | 0.331 | 0.351 | 0.484 |
| | 720 | 0.837 | 0.886 | 1.204 |
| Traffic | 96 | 0.398 | 0.649 | 0.409 |
| | 192 | 0.439 | 0.601 | 0.637 |
| | 336 | 0.464 | 0.609 | 0.747 |
| | 720 | 0.502 | 0.647 | 0.688 |
| Weather | 96 | 0.153 | 0.192 | 0.214 |
| | 192 | 0.200 | 0.240 | 0.231 |
| | 336 | 0.258 | 0.292 | 0.279 |
| | 720 | 0.337 | 0.364 | 0.343 |

  • In Section 5.1, OLinear is compared with 11 carefully selected and widely acknowledged state-of-the-art forecasting models. We believe these results demonstrate the performance superiority of OLinear. For completeness, we will also include discussions on RLinear, TSMixer, StemGNN, and DEPTS in the final draft.

W3: On the claim of “universal token dependency learner”

  • On Line 289, we explicitly restrict the claim “universal token dependency learner” to the field of “time series forecasting.” However, on Lines 278 and 312, this field restriction was unintentionally omitted. We will revise these two sentences in the final draft to ensure accuracy. Thank you for pointing this out!

Q1: On the gradient dynamics

  • In Appendix B, we analyze the Jacobian matrix of the non-linear transformations in the self-attention mechanism and the NormLin layer. For self-attention, we have $$\frac{\partial \mathbf{c}}{\partial \mathbf{a}} = \mathrm{Diag}(\mathbf{c}) - \mathbf{c}\mathbf{c}^{\mathsf{T}},$$ where $\mathbf{c} \triangleq \mathrm{Softmax}(\mathbf{a}) \in \mathbb{R}^N$, and $\mathrm{Diag}(\mathbf{c})$ is the diagonal matrix with $\mathbf{c}$ as its diagonal.

  • For the NormLin layer, it holds that

$$\frac{\partial \mathbf{c}}{\partial \mathbf{a}} = \frac{1}{\left\| \mathbf{b} \right\|_1} \left( \mathrm{Diag}\left( \tilde{\mathbf{b}} \right) - \bar{\mathbf{b}}\, \tilde{\mathbf{b}}^{\mathsf{T}} \right),$$

    where $\mathbf{c} \triangleq \mathrm{Norm}_{\mathrm{L1}}\left( \mathrm{Softplus}\left( \mathbf{a} \right) \right) \in \mathbb{R}^N$, $\mathbf{b} \triangleq \mathrm{Softplus}(\mathbf{a})$, $\bar{\mathbf{b}} = \frac{\mathbf{b}}{\left\| \mathbf{b} \right\|_1}$ is the normalized $\mathbf{b}$, and $\tilde{\mathbf{b}} \triangleq \mathrm{Sigmoid}(\mathbf{a})$. The detailed derivation process is presented in Appendix B. The two Jacobian matrices share similar structures, while the Jacobian matrix of NormLin has an additional learnable scaling factor $\frac{1}{\left\| \mathbf{b} \right\|_1}$, providing greater flexibility.

  • Figure 6 (Page 15) illustrates the Jacobian matrices of the self-attention mechanism and the NormLin layer under the same input $\mathbf{a}$, highlighting that NormLin tends to produce stronger gradient values.

Q2: On the relationship between OrthoTrans and PCA

  • Essentially, OrthoTrans is PCA with all features retained. It is adapted to the time series modality through its computation of $\mathrm{CorrMat}_{t}$. One contribution of our work is showing that decorrelating along the time dimension boosts forecasting performance. As shown in Table 5, OrthoTrans, as a plug-in, improves the performance of iTransformer, PatchTST, and RLinear.

  • Further analysis shows that OrthoTrans can mitigate the intrinsic low-rank property of time series data. In our reply to Q2 of Reviewer rQjb, we compared the median singular values of correlation matrices of time patches with and without OrthoTrans. The results show that OrthoTrans produces larger singular values, thereby increasing data diversity. This partly explains the performance gains brought by OrthoTrans.

Q3: On Theorem 1

  • Theorem 1 shows that temporally decorrelated inputs facilitate forecasting, providing OLinear with a clear theoretical underpinning. The widely used Gaussian assumption is applied in Theorem 1. We acknowledge that real-world datasets can be more complex and difficult to model mathematically. Despite this, OLinear performs robustly across real-world datasets. We conduct extensive experiments, covering 24 datasets and 140 configurations, and OLinear consistently exhibits state-of-the-art performance.

Q4: On non-stationary and noisy datasets

  • Financial data are commonly regarded as non-stationary. We report average MAEs on the Exchange, NASDAQ, SP500, and DowJones datasets. The results below show that OLinear remains robust on these datasets.

| Dataset | OLinear | DLinear | iTransformer | PatchTST |
|---|---|---|---|---|
| Exchange | **0.399** | 0.414 | 0.403 | 0.404 |
| NASDAQ (S1) | **0.125** | 0.170 | 0.137 | 0.132 |
| SP500 (S1) | **0.167** | 0.205 | 0.193 | 0.192 |
| DowJones (S2) | **0.848** | 0.857 | 0.869 | 0.862 |

  • Regarding noisy/missing data, we evaluate OLinear on traffic and weather datasets. (The widely used Weather dataset contains missing data, which are filled with -9999.) OLinear demonstrates clear performance advantages under these challenging conditions.

| Dataset | OLinear | DLinear | iTransformer | PatchTST |
|---|---|---|---|---|
| Traffic | **0.247** | 0.383 | 0.282 | 0.343 |
| PEMS03 | **0.199** | 0.375 | 0.221 | 0.291 |
| PEMS07 | **0.164** | 0.395 | 0.204 | 0.303 |
| Weather | **0.260** | 0.317 | 0.279 | 0.281 |

Q5: On inference latency and related work

  • For ultra-long sequences (1M), the inference latency varies with the number of channels. Using the data in Table 37 (Page 48), we estimate that for a 1M-long series with 21 channels, the inference latency is about 0.77 seconds, which is suitable for streaming scenarios.

| # Channels | Infer. Time for 1M-long Series (seconds) |
|---|---|
| 21 | 0.77 |
| 137 | 0.78 |
| 321 | 1.37 |
| 862 | 3.72 |

  • We further compare OLinear with Pyraformer and report the average results across four prediction lengths. In the final draft, we will include discussions of more baselines, including TSMixer, StemGNN, DEPTS, and Pyraformer.

| Dataset | OLinear (MSE) | OLinear (MAE) | Pyraformer (MSE) | Pyraformer (MAE) |
|---|---|---|---|---|
| ETTm1 | **0.374** | **0.377** | 0.691 | 0.607 |
| ETTm2 | **0.270** | **0.313** | 1.498 | 0.869 |
| ETTh1 | **0.424** | **0.424** | 0.827 | 0.703 |
| ETTh2 | **0.367** | **0.388** | 0.826 | 0.703 |
| ECL | **0.159** | **0.248** | 0.379 | 0.445 |
| Traffic | **0.451** | **0.247** | 0.878 | 0.469 |
| Weather | **0.237** | **0.260** | 0.946 | 0.717 |
| Exchange | **0.355** | **0.399** | 1.913 | 1.159 |

We hope these responses can fully address your concerns. Thank you again for your detailed feedback!
Comment

Dear Reviewer 5fFn,

We would like to thank you once again for your efforts and constructive comments. We would be happy to address any further questions you may have.

All the best,

Authors

Review (Rating: 4)

The paper introduces OLinear, a computationally efficient alternative to transformer-based models for multivariate time series forecasting. The architecture combines two key components that can also be used as plugins: OrthoTrans, which applies orthogonal transformations to decorrelate input features, and NormLin, a linear layer that aims to reproduce channel-wise attention. OLinear consists of a Cross-Series Learner (CSL) using NormLin layers for multivariate correlations and an Intra-Series Learner (ISL) with MLPs for individual series processing. Results show OLinear achieves competitive or superior performance while reducing computations.

优缺点分析

Strengths:

  • The two-component design (OrthoTrans, NormLin for cross-variable interactions) creates a clean separation of concerns where temporal decorrelation is handled by preprocessing and multivariate correlations by learned linear layers.
  • These two modules can work as plugins for existing transformer architectures, with validated improvements on iTransformer (+5.1%) and PatchTST (+10.1%), which is really interesting
  • The empirical benchmark is exhaustive, covering 24 datasets and 140 configurations, including zero-shot and few-shot scenarios and seems to show strong results. The ablation study is extensive.
  • Rank analysis demonstrates good expressiveness of the model.

Weaknesses:

  • The proposed way to evaluate the correlation matrix is confusing and lacks proper formalization. I guess you want to estimate the autocorrelation of a block of T timesteps (context length) but what does this mean formally? The mathematical framework should be more rigorously defined as the concept is non-trivial.
  • In addition, when reading the description of the approach, two intuitive baselines came to my mind: (a) instead of eigenvector decomposition of the correlation matrix, we can directly perform data rotation by PCA (i.e., working with $XX^{\top}$); (b) the correlation matrix can alternatively be estimated through a sliding-window approach by computing cross-correlation between the timestamps across different windows. It would be valuable to perform additional experiments comparing with these approaches in order to gain more insight into the chosen methodology.
  • Averaging correlations across N variables dilutes variable-specific temporal structures without theoretical justification. Why did you not apply decorrelation to each channel separately?
  • Decorrelation effectiveness is not shown; no metrics show $Q^{\top}\,\mathrm{CorrMat}\,Q \approx \Lambda$ or a residual correlation analysis on a validation set, for example.
  • When looking at the source code, it appears that there are some implementation errors. The authors consider the Pearson correlation, but in `dataset/Generate_corrmat.ipynb`, we have `cov_matrix = cov_matrix / diag_vec`, which seems to divide only by σ_i, not σ_i × σ_j, producing an invalid asymmetric correlation matrix.
  • Some other architecture choices are not well-explained nor well-motivated (e.g., the CSL block, the dimension extension). Please see my questions below.
  • Reducing the number of FLOPs is beneficial, but the complexity (memory for example) remains quadratic with respect to D, which can still result in substantial computational overhead for high-dimensional multivariate series.

Questions

  • On the asymmetric normalization: The cov_matrix / diag_vec normalization produces an asymmetric matrix that isn't a true correlation matrix. Is this an implementation error or deliberate choice? If intentional, what's the theoretical justification? Why is this implementation in a notebook? Maybe I didn't find the right file that saves the matrices.

  • On the decorrelation architecture: Have you explored architectures where each variable has its own $Q^{(j)}$ matrix to preserve variable-specific temporal structures? What information is lost by averaging correlations across all variables? Did you try the ablation studies with PCA on $XX^{\top}$? What is the intuition? Did you try an approach with learnable $Q^{(j)}$?

  • What is the intuition behind the dimension extension? To what extent does it differ from the linear encoder? What does it bring? Why decorrelating after the dimension extension?

  • In the CSL: What is the purpose of the first linear layer, and what is the intuition behind it? From my understanding, you want to reproduce the architecture of self-attention blocks, but then I don't see why we need the first layer. CSL: $\tilde{H} = \text{LayerNorm}\left(\tilde{H} + \text{Linear}\left(\text{NormLin}\left(\text{Linear}\left(\tilde{H}\right)\right)\right)\right)$

  • In Table 6, could you elaborate on what "Temporal NormLin" baseline exactly is? Do you keep the intra-series learner? Could you be more explicit about this architecture?

  • Could you also have a comparison with other lightweight multivariate methods such as SAMFormer [1] and TSMixer [2]?

References:

  • [1] (Ilbert et al., 2024) SAMFormer: Unlocking the potential of transformers in time series forecasting with sharpness-aware minimization and channel-wise attention.
  • [2] (Chen et al., 2023) TSMixer: An all-mlp architecture for time series forecasting.

Limitations

  • As said earlier, the memory complexity remains quadratic with respect to D.

Final Justification

I am raising my score as the authors have clarified most of my concerns including:

  • Clarification on the definition of the temporal correlation matrix and the way it is calculated,
  • Empirical justification that the approach effectively decorrelates the channels,
  • Extension of the experiments by incorporating more baselines,
  • Other concerns related to ablation study.

Formatting Concerns

No apparent formatting concerns.

Author Response

Thank you for your invaluable review. We sincerely appreciate the reviewer’s recognition that our plug-and-play OrthoTrans and NormLin modules are "really interesting" and the experiments are "exhaustive" and "extensive". Below are our responses to the concerns:

Q1: On the computation of $\mathrm{CorrMat}_{t}$ and the 'asymmetric' normalization

  • We first clarify our process of computing the temporal correlation matrix $\mathrm{CorrMat}_{t}$. For clarity, we use a univariate series. Let $\mathbf{x}^{\text{train}} \in \mathbb{R}^{M}$ denote the training series. We then generate $T$ lagged sub-series (each of length $M-T+1$): $$\mathbf{s}_{i} = \mathbf{x}[i : M - T + i],\quad i = 0, 1, \dots, T-1.$$
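
For illustration, a sketch of this construction (our own, not the released `Generate_corrmat.ipynb`):

```python
# Hedged sketch: build the T lagged sub-series s_i, form their Pearson correlation matrix
# CorrMat_t, and take its orthogonal eigenbasis Q, so that Q^T @ CorrMat_t @ Q is (near) diagonal.
import numpy as np

def corrmat_and_q(x_train: np.ndarray, T: int):
    M = len(x_train)
    lagged = np.stack([x_train[i : M - T + 1 + i] for i in range(T)], axis=1)  # (M-T+1, T)
    corr_mat_t = np.corrcoef(lagged, rowvar=False)                              # (T, T)
    _, Q = np.linalg.eigh(corr_mat_t)                                           # orthogonal Q
    return corr_mat_t, Q
```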
  • The reason for using the `cov_matrix / diag_vec` normalization in our code is that the sub-series $\mathbf{s}_{i}, i=0,1,\cdots,T-1$ share almost identical standard deviations because they consist of the same $M-T+1$ data points. Since $M\gg T$, this overlapping part constitutes the majority of these sub-series. We further report the metric $\frac{\sigma_{max}}{\sigma_{min}}-1$, where $\sigma_{max}$ and $\sigma_{min}$ are the maximum and minimum standard deviations of the sub-series, respectively.

| Dataset | std_max/std_min − 1 |
|---|---|
| ECL | 2E-03 |
| Traffic | 3E-03 |
| Weather | 1E-03 |
| Solar | 2E-04 |
| ETTm1 | 6E-04 |

  • Because the standard deviations of these sub-series are (almost) the same, i.e., $\sigma_{i} \approx \sigma_{j}$ for $0 \leq i,j \leq T-1$, and `diag_vec[i]` $= \sigma_{i}^2$, we can simply use `cov_matrix / diag_vec` to obtain the Pearson correlation matrix. Another advantage of this approach is that the resulting $\mathrm{CorrMat}_{t}$ always has diagonal elements equal to 1, as required for correlation matrices.

  • We compare forecasting performance using Q matrices computed from `cov_matrix / diag_vec` (**Mode 1**) and `cov_matrix` $/\ \sigma \sigma^{\top}$ (**Mode 2**), and find that the results are identical:

| Dataset | Mode 1 (MSE) | Mode 1 (MAE) | Mode 2 (MSE) | Mode 2 (MAE) |
|---|---|---|---|---|
| ECL (Avg) | 0.159 | 0.248 | 0.159 | 0.248 |
| Traffic (Avg) | 0.451 | 0.247 | 0.451 | 0.247 |
| Weather (Avg) | 0.237 | 0.260 | 0.237 | 0.260 |

  • We further verify the orthogonality of the Q matrices computed from Mode 1, using the metric $\| \mathbf{Q}\mathbf{Q}^{\top} - \mathbf{I} \|_F$ (where $\| \cdot \|_F$ denotes the Frobenius norm):

| Dataset | 96 | 192 | 336 | 720 |
|---|---|---|---|---|
| ECL | 1.580E-06 | 3.349E-06 | 4.237E-06 | 7.706E-06 |
| Traffic | 1.697E-06 | 3.283E-06 | 4.171E-06 | 7.747E-06 |
| Weather | 1.777E-06 | 3.311E-06 | 4.178E-06 | 7.704E-06 |
| Solar | 1.743E-06 | 3.405E-06 | 4.371E-06 | 7.876E-06 |
| ETTm1 | 1.755E-06 | 3.377E-06 | 4.213E-06 | 7.606E-06 |

  • Furthermore, as shown in Appendix I.8, OLinear is relatively robust even when the Q matrices are computed with limited training data.

  • Since the Q matrices are **pre-computed** and used throughout training and inference, we generate them once in a Jupyter Notebook, save them as `.npy` files, and load them during model initialization.

Q2: On the decorrelation architecture

  • Our choice of channel averaging (CA) over channel individual (CI) for computing the temporal correlation matrix is **based on the empirical results**. As shown below, CA consistently outperforms CI in forecasting performance. A possible reason is that temporal correlations for individual channels may be sensitive and time-varying, and CA can mitigate this effect. Furthermore, CI Q matrices are much larger than their CA counterparts. For example, for the Traffic dataset, the CI Q matrix (862 × 720 × 720) is 862 times larger than the CA Q matrix (720 × 720). In addition, the classic work TimesNet also adopts CA to determine the top-k frequencies.
| **Dataset** | **Horizon** | **CA** | | **CI** | |
|-------------|-------------|--------|--------|--------|--------|
| | | MSE | MAE | MSE | MAE |
| ECL | 96 | 0.131 | 0.221 | 0.141 | 0.232 |
| | 192 | 0.150 | 0.238 | 0.164 | 0.255 |
| | 336 | 0.165 | 0.254 | 0.176 | 0.268 |
| | 720 | 0.191 | 0.279 | 0.204 | 0.292 |
| | Avg | **0.159** | **0.248** | 0.171 | 0.262 |
| Solar | 96 | 0.179 | 0.191 | 0.182 | 0.197 |
| | 192 | 0.209 | 0.213 | 0.214 | 0.219 |
| | 336 | 0.231 | 0.229 | 0.237 | 0.237 |
| | 720 | 0.241 | 0.236 | 0.252 | 0.244 |
| | Avg | **0.215** | **0.217** | 0.221 | 0.224 |
| Weather | 96 | 0.153 | 0.190 | 0.153 | 0.191 |
| | 192 | 0.200 | 0.235 | 0.205 | 0.241 |
| | 336 | 0.258 | 0.280 | 0.263 | 0.285 |
| | 720 | 0.337 | 0.333 | 0.350 | 0.342 |
| | Avg | **0.237** | **0.260** | 0.243 | 0.264 |

- OrthoTrans is technically equivalent to PCA without dimensionality reduction. Since an orthogonal transformation is also a conformal transformation, OrthoTrans **rotates** the data into a new feature space where correlations are removed. Moreover, according to PCA theory, noise is suppressed in the primary components.

- Regarding learnable Q matrices, we kindly refer to our response to Reviewer rQjb's Q3. In short, we tried several updating schemes but observed no performance gain. Theoretically, Q is already the optimal orthogonal matrix for temporal decorrelation; altering it risks reintroducing correlations. These experimental results further confirm the effectiveness of OrthoTrans.

> Q3: On the dimension extension

- Unlike a linear encoder ($\mathbf{W} \mathbf{x} +\mathbf{b}$), the dimension extension module is an **outer product** of $\mathbf{x}$ with a learnable vector $\phi_d \in \mathbb{R}^d$, which can be easily implemented using the $*$ operator in PyTorch. Further implementation details can be found in Lines 84–91 of `model/OLinear.py` in our code repository.

- We report performance with various embedding sizes $d$ in Table 30 on Page 42. The results show that this module ($d = 16$) improves forecasting performance compared with its removal ($d = 1$). (In our implementation the module is skipped when $d = 1$.) Because OrthoTrans is a linear operation (matrix multiplication), its **relative order with the dimension-extension module has no impact on the final result**.

> Q4: On the linear layers in the CSL

- The two layers in the CSL, referred to as the pre- and post-linear layers according to their execution sequence, are inspired by the classic self-attention mechanism. As shown in **Appendix J.2**, incorporating both pre- and post-linear layers yields an average performance improvement of 6% (see **Table 34**, Page 46) compared to the variant without them, highlighting their effectiveness in refining inputs for multivariate correlation modeling (pre-linear) and downstream series representation learning (post-linear).

> Q5: On the temporal NormLin

- In **Table 6**, we present ablation studies in which NormLin and standard linear layers are either replaced or removed. The *temporal NormLin* baseline indicates that the ISL is **retained** while its two linear layers (**Eq. 5** in the paper) are **replaced with two NormLin layers**, which operate on the temporal dimension. In this case, the weight matrices in the two NormLin layers are $\mathbf{W}_1 \in \mathbb{R}^{d_{model}\times d_{ff}}$ and $\mathbf{W}_2 \in \mathbb{R}^{d_{ff}\times d_{model}}$, rather than the weight matrix $\mathbf{W} \in \mathbb{R}^{N \times N}$ in Eqs. 3 and 4. Please refer to line 299 in `layers/Transformer_EncDec.py` of our code repository for details.
- Table 6 shows that our design, which applies NormLin along the variate dimension and standard linear layers along the temporal dimension, consistently achieves the best performance.

> Q6: Comparison with SAMformer and TSMixer

- We provide MSEs and standard deviations to compare OLinear with SAMformer and TSMixer. Results of SAMformer and TSMixer are taken from the SAMformer paper. Note that SAMformer and TSMixer use a lookback window of 512 and thus enjoy extra performance gains. Despite this, OLinear achieves lower MSE and smaller standard deviations in most cases.

- More results are presented in our response to Reviewer jUXb's Q2. When the lookback length is uniformly set to 96, OLinear outperforms SAMformer **by a substantial margin of 12.1%**.

- We will incorporate a discussion of these two classic forecasters in the final draft.

| **Dataset** | **Hor.** | **OLinear (Lookback: 96)** | **SAMformer (Lookback: 512)** | **TSMixer (Lookback: 512)** |
|-------------|------|-------------|---------------|-------------|
| ETTm2 | 96 | **0.169±1e-4** | 0.181±0.005 | 0.211±0.014 |
| | 192 | **0.232±1e-4** | 0.233±0.002 | 0.252±0.005 |
| | 336 | 0.291±2e-4 | **0.285±0.001** | 0.303±0.004 |
| | 720 | 0.389±4e-4 | **0.375±0.001** | 0.390±0.003 |
| ECL | 96 | **0.131±4e-4** | 0.155±0.002 | 0.173±0.004 |
| | 192 | **0.150±1e-3** | 0.168±0.001 | 0.204±0.027 |
| | 336 | **0.165±1e-3** | 0.183±0.000 | 0.217±0.018 |
| | 720 | **0.191±2e-3** | 0.219±0.000 | 0.242±0.015 |
| Exchange | 96 | **0.082±3e-4** | 0.161±0.007 | 0.343±0.082 |
| | 192 | **0.171±2e-3** | 0.246±0.009 | 0.342±0.031 |
| | 336 | **0.331±3e-3** | 0.368±0.006 | 0.484±0.062 |
| | 720 | **0.837±0.018** | 1.003±0.018 | 1.204±0.028 |
| Weather | 96 | **0.153±1e-3** | 0.197±0.001 | 0.214±0.004 |
| | 192 | **0.200±2e-3** | 0.235±0.000 | 0.231±0.003 |
| | 336 | **0.258±3e-3** | 0.276±0.001 | 0.279±0.007 |
| | 720 | 0.337±4e-3 | **0.334±0.000** | 0.343±0.024 |

**We hope these responses can fully address your concerns. Thank you again for your detailed feedback!**
评论

We would like to thank you again for your constructive comments!

  • To fully address your concerns about the quadratic cost of the NormLin layer, we introduce a bottleneck block to reduce computational complexity:

    • the channels are first projected to a fixed width (denoted as bottleneck size),
    • the NormLin module is applied in this lower-dimensional space,
    • and the representation is finally projected back to $N$ channels.
  • In this way, the complexity of the NormLin layer is reduced from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$; a minimal sketch of this design is given below, followed by the MAEs and resource footprints under various bottleneck sizes.
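As a rough illustration of the bottleneck wrapper (not the repository implementation; the class name, argument names, and the choice of plain `nn.Linear` projections are our assumptions), consider:

```python
import torch
import torch.nn as nn

class BottleneckNormLin(nn.Module):
    """Sketch: project N channels to a fixed bottleneck width, apply the
    (externally supplied) NormLin module there, then project back to N."""

    def __init__(self, n_channels: int, bottleneck: int, norm_lin: nn.Module):
        super().__init__()
        self.down = nn.Linear(n_channels, bottleneck)   # N -> bottleneck
        self.norm_lin = norm_lin                        # NormLin in the reduced channel space
        self.up = nn.Linear(bottleneck, n_channels)     # bottleneck -> N

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, N, d_model]; channel mixing acts along dim 1,
        # so we transpose around the channel projections.
        z = self.down(x.transpose(1, 2)).transpose(1, 2)    # [batch, bottleneck, d_model]
        z = self.norm_lin(z)
        return self.up(z.transpose(1, 2)).transpose(1, 2)   # [batch, N, d_model]
```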

| Dataset | Hor. | Ori. | Bottleneck (8) | Bottleneck (16) | Bottleneck (64) | Bottleneck (128) |
|---|---|---|---|---|---|---|
| ETTh1 | 96 | 0.382 | 0.382 | 0.381 | 0.382 | 0.383 |
| | 192 | 0.414 | 0.415 | 0.412 | 0.415 | 0.416 |
| | 336 | 0.438 | 0.438 | 0.440 | 0.443 | 0.442 |
| | 720 | 0.462 | 0.467 | 0.467 | 0.468 | 0.461 |
| ECL | 96 | 0.221 | 0.236 | 0.236 | 0.241 | 0.240 |
| | 192 | 0.238 | 0.255 | 0.250 | 0.256 | 0.254 |
| | 336 | 0.254 | 0.264 | 0.271 | 0.270 | 0.267 |
| | 720 | 0.279 | 0.302 | 0.291 | 0.291 | 0.295 |
| Traffic | 96 | 0.226 | 0.255 | 0.255 | 0.263 | 0.280 |
| | 192 | 0.241 | 0.273 | 0.289 | 0.278 | 0.277 |
| | 336 | 0.250 | 0.270 | 0.275 | 0.283 | 0.281 |
| | 720 | 0.270 | 0.299 | 0.296 | 0.298 | 0.301 |
| Weather | 96 | 0.190 | 0.192 | 0.188 | 0.190 | 0.190 |
| | 192 | 0.235 | 0.240 | 0.237 | 0.239 | 0.241 |
| | 336 | 0.280 | 0.284 | 0.287 | 0.281 | 0.282 |
| | 720 | 0.333 | 0.338 | 0.336 | 0.337 | 0.338 |
| PEMS03 | 12 | 0.159 | 0.162 | 0.161 | 0.160 | 0.162 |
| | 24 | 0.179 | 0.186 | 0.183 | 0.182 | 0.181 |
| | 48 | 0.210 | 0.223 | 0.224 | 0.224 | 0.224 |
| | 96 | 0.247 | 0.275 | 0.266 | 0.264 | 0.271 |

| Dataset | Metric | Ori. | Bottleneck (8) | Bottleneck (16) | Bottleneck (64) | Bottleneck (128) |
|---|---|---|---|---|---|---|
| Traffic | Params (M) | 6.17 | 4.71 | 4.74 | 4.92 | 5.16 |
| | FLOPs (G) | 4.91 | 4.16 | 4.18 | 4.27 | 4.39 |
| | T.T. (s/iter) | 0.02 | 0.015 | 0.015 | 0.016 | 0.017 |
| | T.M. (GB) | 1.01 | 0.98 | 0.99 | 1.00 | 1.01 |
| | I.T. (ms/iter) | 5.71 | 5.60 | 5.63 | 5.71 | 5.80 |
| | I.M. (GB) | 0.43 | 0.39 | 0.39 | 0.40 | 0.41 |
| ECL | Params (M) | 4.79 | 4.59 | 4.60 | 4.67 | 4.78 |
| | FLOPs (G) | 1.65 | 1.55 | 1.56 | 1.59 | 1.65 |
| | T.T. (ms/iter) | 7.75 | 7.45 | 7.58 | 7.61 | 7.80 |
| | T.M. (GB) | 0.45 | 0.44 | 0.45 | 0.45 | 0.46 |
| | I.T. (ms/iter) | 2.11 | 2.04 | 2.05 | 2.08 | 2.09 |
| | I.M. (GB) | 0.17 | 0.16 | 0.16 | 0.16 | 0.17 |
| PEMS03 | Params (M) | 4.84 | 4.60 | 4.61 | 4.69 | 4.80 |
| | FLOPs (G) | 1.85 | 1.73 | 1.74 | 1.77 | 1.83 |
| | T.T. (s/iter) | 8.14 | 8.05 | 8.06 | 8.12 | 8.16 |
| | T.M. (GB) | 0.48 | 0.48 | 0.48 | 0.49 | 0.50 |
| | I.T. (ms/iter) | 2.23 | 2.18 | 2.18 | 2.21 | 2.24 |
| | I.M. (GB) | 0.20 | 0.18 | 0.18 | 0.18 | 0.19 |
  • From above, we find that:

    • (1) the bottlenecked NormLin variants incur approximately a 4% performance degradation,
    • but (2) consume fewer resources, especially on datasets with a large number of channels.
  • This provides readers with additional flexibility to decide whether the full quadratic version of NormLin is worthwhile for their specific use cases.


We sincerely hope that our new comments can address your concerns. We’d deeply appreciate it if you could let us know whether your concerns have been addressed.

Thanks for your time,

Authors

评论

Dear Reviewer CEMa,

As the discussion deadline approaches, we are wondering whether our responses have properly addressed your concerns? Your feedback would be extremely helpful to us. If you have further comments or questions, we hope for the opportunity to respond to them.

Many thanks,

Authors of #10716

评论

Thank you for the comprehensive rebuttal that has clarified some of my concerns.

Q1: While the justification for cov_matrix / diag_vec is now clear, my concern about the theoretical foundation remains. You've addressed the orthogonality of $Q$ (which is indeed guaranteed by construction), but my primary question was about the effectiveness of the decorrelation itself: can you show that $\text{CovMat}(Q_i^T x)$ is actually diagonal?

More fundamentally, I question whether your sliding window approach actually estimates the autocorrelation of your input context. Your method computes correlations between overlapping subsequences $s_i$ and $s_j$, each with their own empirical means (in practice they could be almost equal indeed). However, true autocorrelation estimation requires:

$$\gamma(k) = \mathbb{E}[(X_t - \mu)(X_{t+k} - \mu)],$$

where $\mu$ is a common reference point (typically the global mean). Your approach instead computes:

$$\text{Corr}(s_i, s_j),$$

where $s_i$ has mean $\mu_i$ and $s_j$ has mean $\mu_j$.

This is fundamentally different from autocorrelation estimation. For your input context of length $T$, the standard approach would estimate autocorrelation using the entire available time series with a consistent global mean, not through correlations between windowed subsequences with varying local means.

Could you provide theoretical justification for why your windowing-based correlation matrix captures the same temporal dependencies as classical autocorrelation estimation? And more importantly, can you empirically show that this decorrelation actually works on unseen test data?

Q2: Your justification for channel averaging (CA) over channel individual (CI) is reasonable given the computational and performance trade-offs. However, regarding the PCA comparison, that wasn't my point. Standard PCA would decompose the series into $N$ non-overlapping windows of length $T$ and center them. You would have a matrix $W$ of size $(N,T)$, and you would decompose $W^T W$ to obtain the $Q$ matrix and proceed with the next steps as you do. This is different from your approach that applies $T$ overlapping sliding windows of length $M-T$ and computes the Pearson correlation between them.

Q3: The dimension extension explanation is experimentally adequate.

Q5: The clarification on how NormLin is applied in the temporal dimension is now clear.

Q6: The comparison with recent lightweight baselines (SamFormer, SimpleTM, TQNet, TimePro, TimeBase) is valuable and shows competitive performance. The bottleneck analysis effectively addresses the computational complexity concerns while providing a practical trade-off between performance and efficiency.

评论

We would like to thank you again for your constructive comments! To fully address your concerns, we compare OLinear with additional lightweight baselines, including several recently accepted works: SimpleTM (ICLR 2025), TQNet (ICML 2025), TimePro (ICML 2025), and TimeBase (ICML 2025). We report the MAEs below, highlighting the best results in bold. The lookback length is uniformly set to 96. Remarkably, OLinear outperforms SimpleTM, TQNet, TimePro, and TimeBase by 5%, 5%, 7%, and 16% on average, respectively, establishing OLinear as a strong new baseline for lightweight time series forecasting.

| Dataset | Hor. | OLinear | SimpleTM (ICLR 2025) | TQNet (ICML 2025) | TimePro (ICML 2025) | TimeBase (ICML 2025) |
|---|---|---|---|---|---|---|
| ETTm1 | 96 | 0.334 | 0.361 | 0.353 | 0.364 | 0.388 |
| | 192 | 0.363 | 0.380 | 0.378 | 0.383 | 0.409 |
| | 336 | 0.385 | 0.404 | 0.401 | 0.409 | 0.421 |
| | 720 | 0.426 | 0.438 | 0.440 | 0.446 | 0.461 |
| | Avg | 0.377 | 0.396 | 0.393 | 0.400 | 0.420 |
| ETTm2 | 96 | 0.249 | 0.257 | 0.256 | 0.260 | 0.271 |
| | 192 | 0.290 | 0.299 | 0.298 | 0.303 | 0.309 |
| | 336 | 0.328 | 0.338 | 0.340 | 0.342 | 0.346 |
| | 720 | 0.387 | 0.395 | 0.396 | 0.399 | 0.401 |
| | Avg | 0.313 | 0.322 | 0.323 | 0.326 | 0.332 |
| ETTh1 | 96 | 0.382 | 0.392 | 0.393 | 0.398 | 0.392 |
| | 192 | 0.414 | 0.421 | 0.426 | 0.429 | 0.423 |
| | 336 | 0.438 | 0.438 | 0.446 | 0.450 | 0.443 |
| | 720 | 0.462 | 0.462 | 0.470 | 0.474 | 0.458 |
| | Avg | 0.424 | 0.428 | 0.434 | 0.438 | 0.429 |
| ETTh2 | 96 | 0.329 | 0.338 | 0.343 | 0.345 | 0.376 |
| | 192 | 0.379 | 0.387 | 0.393 | 0.394 | 0.405 |
| | 336 | 0.415 | 0.401 | 0.427 | 0.431 | 0.440 |
| | 720 | 0.431 | 0.436 | 0.446 | 0.445 | 0.477 |
| | Avg | 0.388 | 0.391 | 0.402 | 0.403 | 0.424 |
| ECL | 96 | 0.221 | 0.235 | 0.229 | 0.234 | 0.279 |
| | 192 | 0.238 | 0.247 | 0.247 | 0.249 | 0.281 |
| | 336 | 0.254 | 0.267 | 0.264 | 0.267 | 0.295 |
| | 720 | 0.279 | 0.293 | 0.294 | 0.299 | 0.327 |
| | Avg | 0.248 | 0.260 | 0.259 | 0.262 | 0.295 |
| Traffic | 96 | 0.226 | 0.274 | 0.261 | 0.269 | 0.384 |
| | 192 | 0.241 | 0.280 | 0.271 | 0.276 | 0.362 |
| | 336 | 0.250 | 0.290 | 0.277 | 0.287 | 0.365 |
| | 720 | 0.270 | 0.309 | 0.295 | 0.312 | 0.386 |
| | Avg | 0.247 | 0.289 | 0.276 | 0.286 | 0.374 |
| Weather | 96 | 0.190 | 0.207 | 0.200 | 0.207 | 0.215 |
| | 192 | 0.235 | 0.248 | 0.245 | 0.254 | 0.256 |
| | 336 | 0.280 | 0.290 | 0.287 | 0.296 | 0.297 |
| | 720 | 0.333 | 0.341 | 0.342 | 0.346 | 0.348 |
| | Avg | 0.260 | 0.271 | 0.269 | 0.276 | 0.279 |
| Solar | 96 | 0.191 | 0.232 | 0.233 | 0.237 | 0.363 |
| | 192 | 0.213 | 0.247 | 0.257 | 0.263 | 0.404 |
| | 336 | 0.229 | 0.257 | 0.263 | 0.281 | 0.398 |
| | 720 | 0.236 | 0.252 | 0.270 | 0.285 | 0.388 |
| | Avg | 0.217 | 0.247 | 0.256 | 0.266 | 0.388 |

Thanks again for your constructive comments. We are very happy to answer any further questions.

评论

Dear Reviewer CEMa,

We highly appreciate your constructive comments, and we are glad that you found our initial responses have addressed your concerns on recent lightweight baselines, quadratic computational complexity and implementation details. We address your questions point-to-point in the following.

Q1 (Part 1): Can you show that $\mathrm{CovMat}(\mathbf{Q}_i^T\mathbf{x})$ is actually diagonal?

  • Thank you for raising this point. Theoretically, suppose the temporal correlation matrix $\mathrm{CovMat}_t$ admits the eigendecomposition $\mathrm{CovMat}_t = \mathbf{Q}_i \mathbf{\Lambda} \mathbf{Q}_i^T$, where $\mathbf{Q}_i$ is an orthogonal matrix and $\mathbf{\Lambda}$ is a diagonal matrix. If the vector $\mathbf{x}$ is normalized (zero mean and unit variance), then the following holds:

$$\mathrm{CovMat}(\mathbf{Q}_i^T\mathbf{x})=\mathbb{E}[\mathbf{Q}_i^T \mathbf{x} \mathbf{x}^T \mathbf{Q}_i]=\mathbf{Q}_i^T \mathbb{E}[\mathbf{x} \mathbf{x}^T] \mathbf{Q}_i=\mathbf{Q}_i^T \mathrm{CorrMat}_t \mathbf{Q}_i = \mathbf{Q}_i^T \mathbf{Q}_i \mathbf{\Lambda} \mathbf{Q}_i^T \mathbf{Q}_i =\mathbf{\Lambda},$$

which is a diagonal matrix.
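This diagonalization argument can also be checked numerically; the snippet below is illustrative only (synthetic data and our own variable names, not the paper's code):

```python
import numpy as np

# Stand-in for CorrMat_t: a correlation matrix estimated from synthetic data.
rng = np.random.default_rng(0)
T = 96
corr_mat = np.corrcoef(rng.standard_normal((T, 1000)))

eigvals, Q = np.linalg.eigh(corr_mat)       # corr_mat = Q diag(eigvals) Q^T
print(np.abs(Q @ Q.T - np.eye(T)).max())    # ~0: Q is orthogonal by construction

transformed = Q.T @ corr_mat @ Q            # covariance of Q^T x for normalized x
off_diag = transformed - np.diag(np.diag(transformed))
print(np.abs(off_diag).max())               # ~0: off-diagonal entries vanish
```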

  • Empirically, we quantify the effectiveness of OrthoTrans by reporting the Frobenius norm of the off-diagonal elements in the correlation matrices for the original series, and for the series transformed by OrthoTrans, DFT, or Haar wavelets, using a window size of 96. For DFT, correlations are computed using only the real part of the transformed signal, while for Haar wavelets, only the high-frequency components are used. The results below show that OrthoTrans effectively mitigates temporal correlations and outperforms both DFT and Haar wavelets in decorrelation on both the training and test sets. (A sketch of this metric follows the table.)

| Dataset | Training set | | | | Test set | | | |
|---|---|---|---|---|---|---|---|---|
| | Original | OrthoTrans | DFT | Wavelet | Original | OrthoTrans | DFT | Wavelet |
| ETTh1 | 61.36 | 1.46 | 2.58 | 4.93 | 47.86 | 1.89 | 2.81 | 5.35 |
| ETTm2 | 83.04 | 1.26 | 5.22 | 2.13 | 65.74 | 2.05 | 6.46 | 3.22 |
| ECL | 45.95 | 1.78 | 2.26 | 11.20 | 45.69 | 2.03 | 2.57 | 11.26 |
| Traffic | 34.80 | 1.79 | 3.29 | 6.92 | 34.77 | 2.16 | 3.46 | 7.01 |
| Weather | 61.70 | 2.86 | 19.61 | 4.33 | 65.46 | 3.97 | 22.10 | 4.31 |
| Avg | 57.37 | 1.83 | 6.59 | 5.90 | 51.90 | 2.42 | 7.48 | 6.23 |
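A minimal sketch of how such an off-diagonal metric can be computed (one plausible implementation under our own assumptions; the `normalize` option anticipates the per-element variant discussed later in this thread):

```python
import numpy as np

def off_diag_norm(series: np.ndarray, window: int = 96, normalize: bool = False) -> float:
    """Frobenius norm of the off-diagonal part of the window-wise correlation matrix.

    series: 1-D (possibly transformed) series; windows of length `window`, stride 1.
    If `normalize`, divide by the number of off-diagonal elements.
    """
    M = series.shape[0]
    windows = np.stack([series[i: i + window] for i in range(M - window + 1)], axis=1)
    corr = np.corrcoef(windows)                         # window x window correlation
    off_diag = corr - np.diag(np.diag(corr))
    value = float(np.linalg.norm(off_diag, ord="fro"))
    return value / (window * (window - 1)) if normalize else value
```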

Q1 (Part 2): On global mean (std) and local mean (std)

  • We would like to show that the local mean and standard deviation (std) of the sub-series are very close to the global mean and std, since each sub-series contains only $T-1$ fewer elements than the global series (of length $M \gg T$). We report the following metrics: $\mu_{err}=\max_{i} \left| \frac{\mathrm{mean}(\mathbf{s}_i)}{\mu}-1 \right|$ and $\mathrm{std}_{err}=\max_{i} \left| \frac{\mathrm{std}(\mathbf{s}_i)}{\mathrm{STD}}-1 \right|$, where $\mu$ and $\mathrm{STD}$ denote the global mean and std, respectively.

| Dataset | $\mu_{err}$ | $\mathrm{std}_{err}$ |
|---|---|---|
| ETTm1 | 1.4E-03 | 8.0E-04 |
| Solar | 1.5E-03 | 9.0E-04 |
| ECL | 1.4E-03 | 2.5E-03 |
| Traffic | 3.5E-03 | 2.8E-03 |
| Weather | 1.6E-03 | 1.3E-03 |

Q1 (Part 3): On classic autocorrelation estimation (ACF)

  • We respectfully argue that for a univariate series, computing the ACF inherently involves comparing two subseries of the original signal, as it measures the similarity between the signal and its lagged versions.

  • Assume that the series $\mathbf{x} \in \mathbb{R}^M$ has been normalized (i.e., zero mean and unit variance). The ACF can be regarded as the covariance between two sub-series:

$$\gamma_k = \mathrm{E}\left[\mathbf{x}_t \mathbf{x}_{t+k}\right]=\frac{1}{M-k} \sum_{i=0}^{M-k-1} \mathbf{x}_i \mathbf{x}_{i+k} = \mathrm{Cov}(\mathbf{s}_{1,k}',\mathbf{s}_{2,k}'),$$

where $\mathbf{s}_{1,k}'=\left[\mathbf{x}_0, \mathbf{x}_1, \cdots,\mathbf{x}_{M-k-1}\right]$ and $\mathbf{s}_{2,k}'=\left[\mathbf{x}_k, \mathbf{x}_{k+1}, \cdots,\mathbf{x}_{M-1}\right]$. We assume that the means of $\mathbf{s}_{1,k}'$ and $\mathbf{s}_{2,k}'$ are 0. Both $\mathbf{s}_{1,k}'$ and $\mathbf{s}_{2,k}'$ have length $M-k$.

  • In OrthoTrans, $\gamma_k = \mathrm{Cov}(\mathbf{s}_i, \mathbf{s}_{i+k})$, where $\mathbf{s}_{i}=\left[\mathbf{x}_i, \mathbf{x}_{i+1}, \cdots,\mathbf{x}_{M-T+i}\right]$ and $\mathbf{s}_{i+k}=\left[\mathbf{x}_{i+k}, \mathbf{x}_{i+k+1}, \cdots,\mathbf{x}_{M-T+i+k}\right]$.

  • By comparing $\mathbf{s}_{1,k}'$, $\mathbf{s}_{2,k}'$ with $\mathbf{s}_{i}$, $\mathbf{s}_{i+k}$, we can observe that $\mathbf{s}_{i}$ and $\mathbf{s}_{i+k}$ are truncated versions of $\mathbf{s}_{1,k}'$ and $\mathbf{s}_{2,k}'$ (with $T-k-1$ fewer entries), respectively. In other words, our sliding window approach is a (slightly) truncated version of the classical ACF.
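This near-equivalence can be illustrated numerically; the small check below uses synthetic data and our own variable names, and is not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
M, T, k = 10_000, 96, 5
x = rng.standard_normal(M).cumsum()
x = (x - x.mean()) / x.std()                 # normalize: zero mean, unit variance

# Classical ACF estimate at lag k (global mean already removed above)
acf_k = np.mean(x[: M - k] * x[k:])

# Sliding-window estimate used in OrthoTrans: correlation between s_0 and s_k
s_0, s_k = x[0: M - T + 1], x[k: M - T + 1 + k]
sliding_k = np.corrcoef(s_0, s_k)[0, 1]

print(acf_k, sliding_k)                      # nearly identical when M >> T
```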

评论

Q1 (Part 4): Performance comparison of OrthoTrans and ACF

  • We compute $\gamma_k = \mathrm{E}\left[\mathbf{x}_t \mathbf{x}_{t+k}\right], k=1,2,\cdots, T-1$, and obtain the vector $[1, \gamma_1, \gamma_2, \cdots, \gamma_{T-1}]$. Based on this vector, we construct the symmetric Toeplitz correlation matrix and compute the corresponding Q matrix. We denote this method as ACF.

  • We directly compare the differences between the correlation matrices and $\mathbf{Q}$ matrices obtained from OrthoTrans and ACF using the metric $\mathrm{Diff}_{\mathrm{F\text{-}norm}}(\mathbf{A},\mathbf{B}) = \left\| \mathbf{A}-\mathbf{B} \right\|_F / T^2$, where $\left\| \cdot \right\|_F$ denotes the Frobenius norm. As shown below, the differences are negligible.

| Dataset | CorrMat_Diff_Fnorm | Q_Diff_Fnorm |
|---|---|---|
| ETTh1 | 1.5E-04 | 5.4E-04 |
| ETTm2 | 1.0E-06 | 4.6E-04 |
| ECL | 8.0E-06 | 6.2E-04 |
| Traffic | 1.1E-05 | 4.4E-04 |
| Weather | 4.0E-06 | 2.0E-03 |
  • We report the off-diagonal Frobenius norm of the correlation matrices (size: $96 \times 96$) for the original series, and for the series transformed by OrthoTrans and ACF. The following results show that OrthoTrans and ACF achieve comparable and strong decorrelation performance on both training and test sets.

| Dataset | Training set | | | Test set | | |
|---|---|---|---|---|---|---|
| | Original | Ours | ACF | Original | Ours | ACF |
| ETTh1 | 61.36 | 1.46 | 1.45 | 47.86 | 1.89 | 1.87 |
| ETTm2 | 83.04 | 1.26 | 1.27 | 65.74 | 2.05 | 2.06 |
| ECL | 45.95 | 1.78 | 1.80 | 45.69 | 2.03 | 2.05 |
| Traffic | 34.80 | 1.79 | 1.82 | 34.77 | 2.16 | 2.18 |
| Weather | 61.70 | 2.86 | 2.86 | 65.46 | 3.97 | 3.96 |
| Avg | 57.37 | 1.83 | 1.84 | 51.90 | 2.42 | 2.42 |
  • We further compare the forecasting performance of OrthoTrans and ACF. As shown below, OrthoTrans performs comparably to ACF.
| Dataset | Hor. | OrthoTrans | | ACF | |
|---|---|---|---|---|---|
| | | MSE | MAE | MSE | MAE |
| ETTh1 | 96 | 0.360 | 0.382 | 0.361 | 0.383 |
| | 192 | 0.416 | 0.414 | 0.415 | 0.414 |
| | 336 | 0.457 | 0.438 | 0.457 | 0.438 |
| | 720 | 0.463 | 0.462 | 0.464 | 0.463 |
| | Avg | 0.424 | 0.424 | 0.424 | 0.424 |
| ETTm2 | 96 | 0.169 | 0.249 | 0.169 | 0.249 |
| | 192 | 0.232 | 0.290 | 0.232 | 0.290 |
| | 336 | 0.291 | 0.328 | 0.291 | 0.329 |
| | 720 | 0.389 | 0.387 | 0.389 | 0.387 |
| | Avg | 0.270 | 0.313 | 0.270 | 0.314 |
| ECL | 96 | 0.131 | 0.221 | 0.131 | 0.221 |
| | 192 | 0.150 | 0.238 | 0.152 | 0.240 |
| | 336 | 0.165 | 0.254 | 0.167 | 0.255 |
| | 720 | 0.191 | 0.279 | 0.190 | 0.280 |
| | Avg | 0.159 | 0.248 | 0.160 | 0.249 |
| Traffic | 96 | 0.398 | 0.226 | 0.404 | 0.226 |
| | 192 | 0.439 | 0.241 | 0.434 | 0.241 |
| | 336 | 0.464 | 0.250 | 0.458 | 0.250 |
| | 720 | 0.502 | 0.270 | 0.508 | 0.270 |
| | Avg | 0.451 | 0.247 | 0.451 | 0.247 |
| Weather | 96 | 0.153 | 0.190 | 0.151 | 0.190 |
| | 192 | 0.200 | 0.235 | 0.203 | 0.239 |
| | 336 | 0.258 | 0.280 | 0.259 | 0.281 |
| | 720 | 0.337 | 0.333 | 0.338 | 0.333 |
| | Avg | 0.237 | 0.260 | 0.238 | 0.260 |

Q2 (Part 1): On OrthoTrans and PCA

  • Thank you for your clear suggestion! We refer to the suggested variant as PCA. It first segments the series $\mathbf{x}\in \mathbb{R}^{M}$ into $N'$ non-overlapping windows of length $T$, forming the matrix $\mathbf{W}\in \mathbb{R}^{N' \times T}$. Then the correlations are computed among the columns of $\mathbf{W}$.

  • In essence, OrthoTrans and the PCA variant differ in their unfolding strategies: although both use a patch size of $T$, OrthoTrans employs sliding windows with stride 1, whereas PCA uses a non-overlapping segmentation with stride $T$.

  • We further introduce a variant, OrthoTrans-C, for comparison:

    • In OrthoTrans-C, we do not use the CA strategy. Given the multivariate training series $\mathbf{X}^{train} \in \mathbb{R}^{N \times M}$, we apply PyTorch's unfold function along the time dimension (the second dimension) with patch size $T$ and stride 1, resulting in the matrix $\mathbf{X}^{train}_{fold} \in \mathbb{R}^{N \times M' \times T}$. Then we flatten the first two dimensions of $\mathbf{X}^{train}_{fold}$ to obtain a matrix of size $(N \cdot M') \times T$, and compute the correlation matrix across its columns (see the sketch below).
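A minimal PyTorch sketch of the OrthoTrans-C construction described above (illustrative; the variable names and synthetic shapes are ours):

```python
import torch

# X_train: multivariate training series of shape [N, M]; T is the window length.
N, M, T = 7, 5000, 96
X_train = torch.randn(N, M)

# Unfold along the time dimension with patch size T and stride 1: [N, M', T], M' = M - T + 1
X_fold = X_train.unfold(dimension=1, size=T, step=1)

# Flatten the first two dimensions to [(N * M'), T] and compute the T x T correlation matrix
flat = X_fold.reshape(-1, T)
corr_mat = torch.corrcoef(flat.T)           # correlations across the T columns

# Q is the orthogonal eigenbasis of the correlation matrix
eigvals, Q = torch.linalg.eigh(corr_mat)
```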
评论

Q2 (Part 2): Decorrelation performance of OrthoTrans, PCA and OrthoTrans-C

  • We compare the decorrelation effectiveness of OrthoTrans, PCA and OrthoTrans-C using the off-diagonal Frobenius norm with a window size of 96. In general, OrthoTrans and OrthoTrans-C achieve better decorrelation performance than PCA.
| Dataset | Training set | | | | Test set | | | |
|---|---|---|---|---|---|---|---|---|
| | Original | OrthoTrans | PCA | OrthoTrans-C | Original | OrthoTrans | PCA | OrthoTrans-C |
| ETTh1 | 61.36 | 1.46 | 10.77 | 1.43 | 47.86 | 1.89 | 11.66 | 1.75 |
| ETTm2 | 83.04 | 1.26 | 7.98 | 1.01 | 65.74 | 2.05 | 9.37 | 1.55 |
| ECL | 45.95 | 1.78 | 11.87 | 2.24 | 45.69 | 2.03 | 12.00 | 2.71 |
| Traffic | 34.80 | 1.79 | 8.94 | 1.81 | 34.77 | 2.16 | 9.11 | 2.17 |
| Weather | 61.70 | 2.86 | 8.78 | 2.94 | 65.46 | 3.97 | 10.47 | 3.78 |
| Avg | 57.37 | 1.83 | 9.67 | 1.89 | 51.90 | 2.42 | 10.52 | 2.39 |
  • To further study the relationship between decorrelation performance and the unfolding stride in PCA, we report the following results on the test sets of different datasets, which clearly show that smaller strides lead to better decorrelation performance.

| Dataset | Stride: 1 | Stride: 16 | Stride: 32 | Stride: 64 | Stride: 96 |
|---|---|---|---|---|---|
| ETTh1 | 1.89 | 5.92 | 6.60 | 7.36 | 11.66 |
| ETTm2 | 2.05 | 4.40 | 5.95 | 6.71 | 9.37 |
| ECL | 2.03 | 6.41 | 6.58 | 6.76 | 12.00 |
| Traffic | 2.16 | 4.45 | 4.53 | 4.68 | 9.11 |
| Weather | 3.97 | 7.91 | 8.70 | 9.83 | 10.47 |
| Avg | 2.42 | 5.82 | 6.47 | 7.07 | 10.52 |

Q2 (Part 3): Forecasting performance of OrthoTrans, PCA and OrthoTrans-C

  • We apply these decorrelation methods to OLinear and observe that they achieve comparable performance. The results also verify the robustness of OLinear to different $\mathbf{Q}$ matrices.

| Dataset | Hor. | OrthoTrans | | PCA | | OrthoTrans-C | |
|---|---|---|---|---|---|---|---|
| | | MSE | MAE | MSE | MAE | MSE | MAE |
| ETTh1 | 96 | 0.360 | 0.382 | 0.361 | 0.382 | 0.361 | 0.383 |
| | 192 | 0.416 | 0.414 | 0.415 | 0.414 | 0.415 | 0.414 |
| | 336 | 0.457 | 0.438 | 0.457 | 0.439 | 0.458 | 0.439 |
| | 720 | 0.463 | 0.462 | 0.463 | 0.462 | 0.464 | 0.463 |
| | Avg | 0.424 | 0.424 | 0.424 | 0.424 | 0.424 | 0.424 |
| ETTm2 | 96 | 0.169 | 0.249 | 0.169 | 0.249 | 0.169 | 0.249 |
| | 192 | 0.232 | 0.290 | 0.233 | 0.290 | 0.232 | 0.290 |
| | 336 | 0.291 | 0.328 | 0.291 | 0.328 | 0.291 | 0.328 |
| | 720 | 0.389 | 0.387 | 0.389 | 0.387 | 0.389 | 0.387 |
| | Avg | 0.270 | 0.313 | 0.270 | 0.314 | 0.270 | 0.314 |
| ECL | 96 | 0.131 | 0.221 | 0.131 | 0.221 | 0.131 | 0.221 |
| | 192 | 0.150 | 0.238 | 0.152 | 0.240 | 0.153 | 0.241 |
| | 336 | 0.165 | 0.254 | 0.167 | 0.256 | 0.167 | 0.255 |
| | 720 | 0.191 | 0.279 | 0.195 | 0.285 | 0.191 | 0.278 |
| | Avg | 0.159 | 0.248 | 0.161 | 0.250 | 0.160 | 0.249 |
| Traffic | 96 | 0.398 | 0.226 | 0.402 | 0.226 | 0.402 | 0.226 |
| | 192 | 0.439 | 0.241 | 0.435 | 0.240 | 0.437 | 0.240 |
| | 336 | 0.464 | 0.250 | 0.462 | 0.250 | 0.463 | 0.250 |
| | 720 | 0.502 | 0.270 | 0.505 | 0.272 | 0.509 | 0.270 |
| | Avg | 0.451 | 0.247 | 0.451 | 0.247 | 0.453 | 0.246 |
| Weather | 96 | 0.153 | 0.190 | 0.152 | 0.189 | 0.152 | 0.189 |
| | 192 | 0.200 | 0.235 | 0.201 | 0.236 | 0.202 | 0.238 |
| | 336 | 0.258 | 0.280 | 0.260 | 0.282 | 0.260 | 0.283 |
| | 720 | 0.337 | 0.333 | 0.339 | 0.333 | 0.335 | 0.332 |
| | Avg | 0.237 | 0.260 | 0.238 | 0.260 | 0.237 | 0.260 |
  • We also examine the robustness of OrthoTrans, PCA, and OrthoTrans-C when the $\mathbf{Q}$ matrices are computed using only the first 10% of the training set. We observe that OLinear performs robustly in this setting, consistent with the finding in Appendix I.8.

| Dataset | Hor. | OrthoTrans | | PCA | | OrthoTrans-C | |
|---|---|---|---|---|---|---|---|
| | | MSE | MAE | MSE | MAE | MSE | MAE |
| ECL | 96 | 0.131 | 0.221 | 0.131 | 0.221 | 0.131 | 0.221 |
| | 192 | 0.152 | 0.240 | 0.152 | 0.240 | 0.154 | 0.242 |
| | 336 | 0.166 | 0.255 | 0.168 | 0.256 | 0.165 | 0.254 |
| | 720 | 0.189 | 0.278 | 0.186 | 0.278 | 0.186 | 0.278 |
| | Avg | 0.159 | 0.248 | 0.159 | 0.249 | 0.159 | 0.248 |
| Traffic | 96 | 0.402 | 0.226 | 0.400 | 0.226 | 0.404 | 0.226 |
| | 192 | 0.431 | 0.240 | 0.434 | 0.240 | 0.438 | 0.241 |
| | 336 | 0.465 | 0.250 | 0.463 | 0.250 | 0.464 | 0.250 |
| | 720 | 0.512 | 0.271 | 0.502 | 0.272 | 0.500 | 0.272 |
| | Avg | 0.453 | 0.247 | 0.450 | 0.247 | 0.452 | 0.247 |
| Weather | 96 | 0.149 | 0.187 | 0.152 | 0.189 | 0.152 | 0.189 |
| | 192 | 0.203 | 0.239 | 0.200 | 0.237 | 0.202 | 0.238 |
| | 336 | 0.259 | 0.281 | 0.262 | 0.282 | 0.264 | 0.285 |
| | 720 | 0.339 | 0.332 | 0.338 | 0.334 | 0.347 | 0.338 |
| | Avg | 0.237 | 0.260 | 0.238 | 0.260 | 0.241 | 0.262 |
评论
  • The newly added experiments show that OrthoTrans, ACF, PCA, and OrthoTrans-C perform similarly when applied to OLinear, demonstrating its robustness to different decorrelation methods. We will include a discussion of these variants in a new appendix section and incorporate the results presented here to further strengthen our work. Thank you again for your constructive comments!

Q3-6: On the dimension extension, temporal NormLin, lightweight baselines, and computational complexity

  • We appreciate the reviewer’s constructive comments and are glad that our responses have addressed the concerns.

We are eager to hear your feedback. We sincerely hope that our new comments can address your remaining concerns. We’d deeply appreciate it if you could let us know whether your concerns have been addressed. Thank you so much!

Thanks for your time,

Authors of #10716

评论

Thank you for your detailed response. You have treated most of my concerns, so I have decided to raise my score. However, I still would raise some important issues that I would like to ask the authors to solve.

  • Please be more mathematically rigorous. If I understand correctly, you assume that $CorrMat_t$ is close to $CovMat_t$ (what is $CorrMat_t$? Is it the correlation matrix for a fixed channel?). In this case, you need to use $\approx$ in all transitions like "$CorrMat_t = Q_i \Lambda Q_i^\top$".
  • If ACF and your strategy give the same results, then I would suggest the authors to stick to ACF, because it is more intuitive and familiar to the broad reader. To me, it was very confusing to follow the paper, especially with the $CovMat_t$ notation (by the way, it is confusing to write $CovMat_t$ and call it "the temporal correlation matrix").
  • When you compute the Frobenius norm of the off-diagonal elements, the values are not that meaningful as it depends on the size of the matrix. Maybe dividing by the number of off-diagonal elements would be more intuitive in order to have a value within [0,1].

Please incorporate all the modifications we have discussed during the rebuttal. This will greatly improve the quality of the paper.

Best regards, Reviewer CEMa

评论

Dear Reviewer CEMa,

We highly appreciate your constructive responses and great efforts in our multi-round discussion. We are encouraged that you found our responses have addressed most of your concerns! We would like to respond to your remaining questions as follows.

Q1: If I understand correctly, you assume that $\mathrm{CorrMat}_t$ is close to $\mathrm{CovMat}_t$ (what is $\mathrm{CorrMat}_t$? Is it the correlation matrix for a fixed channel?). In this case, you need to use $\approx$ in all transitions like "$\mathrm{CorrMat}_t=\mathbf{Q}_i \mathbf{\Lambda} \mathbf{Q}_i^T$".

  • Thank you for raising this point. We sincerely apologize for any confusion caused by our mathematical notations, and would like to offer the clarifications as follows.

  • We use $\mathrm{CorrMat}$ and $\mathrm{CovMat}$ to denote the correlation and covariance matrices, respectively, throughout the paper. The subscript $t$ (or $v$, Line 262) indicates the temporal (or variate) dimension. Specifically, $\mathrm{CorrMat}_t$ denotes the correlation matrix among the time-lagged subseries, averaged across all variates (Line 146). For a specific variate $j$, we denote this as $\mathrm{CorrMat}_t^j$ (Line 140).

  • For a normalized series, it holds that $\mathrm{CorrMat}_t \approx \mathrm{CovMat}_t$, based on the fact that the subseries share (almost) the same unit variance. However, since we explicitly define $\mathrm{CorrMat}_t=\mathbf{Q}_i \mathbf{\Lambda} \mathbf{Q}_i^T$, rather than $\mathrm{CovMat}_t=\mathbf{Q}_i \mathbf{\Lambda} \mathbf{Q}_i^T$, on Line 149 of the paper, the use of the approximation symbol $\approx$ in the equation on Line 153 seems unnecessary.

  • We are sorry for the mistake in our previous response, where we wrote "suppose the temporal correlation matrix $\mathrm{CovMat}_t$ admits the eigendecomposition $\mathrm{CovMat}_t = \mathbf{Q}_i \mathbf{\Lambda} \mathbf{Q}_i^T$".

  • The correct notation should be $\mathrm{CorrMat}_t$ instead of $\mathrm{CovMat}_t$. We sincerely apologize for the confusion this has caused. We have thoroughly examined the paper, and are relieved to observe that these two notations are consistently and correctly used in the manuscript.

Q2: If ACF and your strategy give the same results, then I would suggest the authors to stick to ACF, because it is more intuitive and familiar to the broad reader. To me, it was very confusing to follow the paper, especially with the $\mathrm{CovMat}_t$ notation (by the way, it is confusing to write $\mathrm{CovMat}_t$ and call it "the temporal correlation matrix").

  • Thank you for the suggestion! We will stick to the ACF formulation in Section 4.2 of our final version, which may be more familiar to a broad range of readers.

  • Please kindly note that the ACF here is only used for computing $\mathrm{CorrMat}_t$ in OrthoTrans, so this revision does not affect the core contributions of our paper.

  • We have carefully reviewed the manuscript and confirmed that $\mathrm{CorrMat}_t$ is not mistakenly written as $\mathrm{CovMat}_t$ in our final version.

Q3: On the metric to quantify decorrelation

  • Thank you for the great suggestion! We divide the off-diagonal Frobenius norm by the number of off-diagonal elements, and report this new metric below.
    • In our previous experiments of quantifying decorrelation, for DFT we used only the real part, and for Haar wavelets we used only the high-frequency components. We now adopt a fairer scheme: for DFT, we concatenate the real and imaginary parts; for Haar wavelets, we concatenate both the low- and high-frequency components.
| Dataset | Training Set | | | | Test Set | | | |
|---|---|---|---|---|---|---|---|---|
| | Original | OrthoTrans | DFT | Wavelet | Original | OrthoTrans | DFT | Wavelet |
| ETTh1 | 6.7E-03 | 1.6E-04 | 5.8E-04 | 3.7E-03 | 5.2E-03 | 2.1E-04 | 6.0E-04 | 3.1E-03 |
| ETTm2 | 9.1E-03 | 1.4E-04 | 1.1E-03 | 4.6E-03 | 7.2E-03 | 2.2E-04 | 1.3E-03 | 3.7E-03 |
| ECL | 5.0E-03 | 1.9E-04 | 4.7E-04 | 3.6E-03 | 5.0E-03 | 2.2E-04 | 4.9E-04 | 3.6E-03 |
| Traffic | 3.8E-03 | 1.9E-04 | 5.5E-04 | 2.6E-03 | 3.8E-03 | 2.3E-04 | 5.7E-04 | 2.6E-03 |
| Weather | 6.7E-03 | 3.1E-04 | 4.1E-03 | 3.6E-03 | 7.2E-03 | 4.0E-04 | 4.6E-03 | 3.7E-03 |
| Avg | 6.3E-03 | 2.0E-04 | 1.4E-03 | 3.6E-03 | 5.7E-03 | 2.6E-04 | 1.5E-03 | 3.3E-03 |
  • Under this new metric, OrthoTrans reduces the correlation in the original series by 96%, and outperforms DFT and Haar wavelet transforms by 84% and 93%, respectively.

  • Following your suggestion, we will include the quantitation of temporal decorrelation of OrthoTrans in a new appendix section and incorporate the results presented here to further strengthen our work.


We really appreciate your valuable suggestions which have undoubtedly contributed to improving the quality of our paper. Please let us know if we have properly addressed your questions and we are more than happy to discuss more!

Thanks for your time,

Authors of #10716

审稿意见
4

This paper tackles entangled temporal dependencies in multivariate time series forecasting through two main contributions: a data-adaptive orthogonal transformation based on eigenvalue decomposition of Pearson correlation matrices, and a computationally efficient normalized linear layer (NormLin) that serves as an alternative to self-attention mechanisms. The proposed method leverages dataset-specific statistical properties to decorrelate temporal features before applying linear forecasting models. The comprehensive evaluation across multiple datasets and tasks in this paper demonstrates consistent performance gains over strong baselines, with both components showing modularity as plug-in enhancements for existing architectures.

优缺点分析

Strengths:

This paper presents a well-executed combination of theoretical innovation and empirical rigor. The core contribution, using eigenvalue decomposition of temporal Pearson correlation matrices for orthogonal transformation, provides an elegant theoretical foundation with clear justification. The work demonstrates state-of-the-art performance across multiple datasets and tasks, with both components serving as effective plug-in modules that consistently improve various baseline models. I am particularly impressed by the comprehensive experimental evaluation, which includes extensive ablation studies and evaluation across various tasks with different prediction horizons.

Weaknesses:

See questions below

问题

  1. How does the method perform on highly non-stationary time series where the correlation structure changes significantly over time? Have you considered adaptive or online updates to the orthogonal matrices?

  2. You mention that OrthoTrans improves the rank of attention matrices in Transformer models. Can you provide more detailed analysis or proof of this phenomenon? What is the relationship between decorrelation and representation diversity?

  3. Have you compared against other adaptive bases beyond the eigenvalue decomposition, such as those learned through neural networks or other unsupervised methods?

  4. The orthogonal matrices Qi and Qo are computed from the training data. The paper only briefly mentions robustness to limited training data in the appendix but doesn't thoroughly investigate how performance degrades with smaller training sets or distribution shifts.

局限性

Yes

最终评判理由

The authors have provided a comprehensive response to my questions. I choose to maintain my current score.

格式问题

No

作者回复

Thank you for your constructive response. We sincerely appreciate the reviewer’s recognition of OLinear as "a well-executed combination of theoretical innovation and empirical rigor". Below are our responses to the concerns:

Q1: On non-stationary time series

  • Financial data (e.g., stock prices and exchange rates) are widely recognized as non-stationary time series. As shown in Tables 2 and 3 of the paper, OLinear consistently achieves state-of-the-art performance on the real-world non-stationary time series datasets: Exchange, NASDAQ, SP500, and DowJones.
  • Regarding online updates of the Q matrices, we further conduct experiments where the Q matrices are replaced by those computed on the validation set. After retraining with the updated Q matrices (using the same training set), the model exhibits similar forecasting performance (as shown below). This phenomenon implies that the Q matrices obtained by OrthoTrans from the training set generalize well to the full dataset. Further discussion on online learning could be a future direction of this work.
| Dataset | Hor. | Data in paper | | Val_Q and Retrain | |
|---|---|---|---|---|---|
| | | MSE | MAE | MSE | MAE |
| Exchange | 96 | 0.082 | 0.200 | 0.082 | 0.200 |
| | 192 | 0.171 | 0.293 | 0.171 | 0.293 |
| | 336 | 0.331 | 0.414 | 0.331 | 0.414 |
| | 720 | 0.837 | 0.688 | 0.836 | 0.688 |
| | Avg | 0.355 | 0.399 | 0.355 | 0.399 |
| NASDAQ | 3 | 0.036 | 0.092 | 0.036 | 0.092 |
| | 6 | 0.049 | 0.117 | 0.049 | 0.117 |
| | 9 | 0.062 | 0.137 | 0.062 | 0.137 |
| | 12 | 0.073 | 0.154 | 0.073 | 0.154 |
| | Avg | 0.055 | 0.125 | 0.055 | 0.125 |
| SP500 | 3 | 0.035 | 0.126 | 0.035 | 0.127 |
| | 6 | 0.053 | 0.158 | 0.054 | 0.158 |
| | 9 | 0.070 | 0.181 | 0.070 | 0.181 |
| | 12 | 0.088 | 0.204 | 0.088 | 0.204 |
| | Avg | 0.061 | 0.167 | 0.061 | 0.167 |

Q2: On the rank improvement of attention matrices

  • The observation that OrthoTrans improves the rank of attention matrices is interesting and partly explains why OrthoTrans, as a plug-in, consistently enhances Transformer-based forecasters (see Table 5 of the paper). A potential reason behind this phenomenon is that OrthoTrans can mitigate the intrinsic low-rank property of time series data. The empirical results are presented below.

| Dataset | Identity | OrthoTrans | DFT | Wavelet |
|---|---|---|---|---|
| ECL | 0.162 | 0.446 | 0.332 | 0.381 |
| Traffic | 0.002 | 0.568 | 0.177 | 0.049 |
| Weather | 0.012 | 0.572 | 0.352 | 0.292 |
| Solar | 0.007 | 0.481 | 0.032 | 0.023 |
| PEMS03 | 0.241 | 0.450 | 0.356 | 0.395 |

  • Following TimeBase (ICML 2025), we compute the median singular values of the correlation matrix among non-overlapping time series patches (patch length = 24, number of patches = 30); see Figure 1 in TimeBase for more details. The results are presented above. As shown, OrthoTrans yields larger singular values than other transformation bases (with “Identity” denoting no transformation). In other words, OrthoTrans alleviates low-rank tendencies [1,2] and enhances data diversity.
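A rough sketch of this diagnostic, based on our reading of the TimeBase-style procedure (the function name and implementation details are our assumptions):

```python
import numpy as np

def median_singular_value(series: np.ndarray, patch_len: int = 24, n_patches: int = 30) -> float:
    """Median singular value of the correlation matrix among non-overlapping patches."""
    x = series[: patch_len * n_patches].reshape(n_patches, patch_len)
    corr = np.corrcoef(x)                       # n_patches x n_patches correlation
    s = np.linalg.svd(corr, compute_uv=False)   # singular values
    return float(np.median(s))
```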

Q3: On adaptive bases

  • We have considered learnable Q matrices for potential performance improvements. The most straightforward approach is adding a learnable matrix to Q. More advanced alternatives include using a learnable vector $\mathbf{v}$ with a Householder transformation to ensure orthogonality:
$$\mathbf{Q}'=\mathbf{Q} \left( \mathbf{I} -2\frac{\mathbf{v}\mathbf{v}^T}{\left\| \mathbf{v} \right\|^2_2 } \right),$$
or applying the Cayley transform:
$$\mathbf{Q}'= \left( \mathbf{I}+\mathbf{A} +\mathbf{V} \right)^{-1} \left( \mathbf{I}-\mathbf{A} -\mathbf{V} \right),$$
where $\mathbf{V}$ is a learnable skew-symmetric matrix and $\mathbf{A}=\left( \mathbf{I} -\mathbf{Q} \right) \left( \mathbf{I} + \mathbf{Q} \right)^{-1}$. We also explored an input-adaptive version:
$$\mathbf{Q}'=\mathbf{Q}+ \alpha \cdot \mathrm{Linear_1}(\mathbf{x}) \mathrm{Linear_2}(\mathbf{x})^\mathrm{T},$$
where $\mathbf{x}$ is the input series and $\alpha$ is a learnable parameter. The results are reported below; a minimal code sketch of the Householder variant follows the discussion.
| Dataset | Horizon | OLinear | | Plus Delta | | Householder | | Cayley | | Adapt_Q | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE |
| ECL | 96 | 0.131 | 0.221 | 0.131 | 0.221 | 0.131 | 0.221 | 0.130 | 0.221 | 0.131 | 0.221 |
| | 192 | 0.150 | 0.238 | 0.152 | 0.240 | 0.153 | 0.241 | 0.153 | 0.241 | 0.153 | 0.241 |
| | 336 | 0.165 | 0.254 | 0.166 | 0.256 | 0.166 | 0.255 | 0.165 | 0.255 | 0.164 | 0.254 |
| | 720 | 0.191 | 0.279 | 0.188 | 0.277 | 0.193 | 0.282 | 0.188 | 0.278 | 0.192 | 0.281 |
| | Avg | 0.159 | 0.248 | 0.159 | 0.248 | 0.161 | 0.250 | 0.159 | 0.249 | 0.160 | 0.249 |
| Traffic | 96 | 0.398 | 0.226 | 0.397 | 0.227 | 0.400 | 0.226 | 0.401 | 0.226 | 0.399 | 0.226 |
| | 192 | 0.439 | 0.241 | 0.436 | 0.241 | 0.435 | 0.240 | 0.437 | 0.241 | 0.438 | 0.240 |
| | 336 | 0.464 | 0.250 | 0.460 | 0.250 | 0.461 | 0.250 | 0.466 | 0.250 | 0.460 | 0.250 |
| | 720 | 0.502 | 0.270 | 0.510 | 0.271 | 0.500 | 0.271 | 0.505 | 0.271 | 0.507 | 0.271 |
| | Avg | 0.451 | 0.247 | 0.451 | 0.247 | 0.449 | 0.247 | 0.452 | 0.247 | 0.451 | 0.247 |
| Weather | 96 | 0.153 | 0.190 | 0.149 | 0.187 | 0.152 | 0.189 | 0.151 | 0.189 | 0.151 | 0.189 |
| | 192 | 0.200 | 0.235 | 0.200 | 0.236 | 0.204 | 0.239 | 0.205 | 0.240 | 0.203 | 0.238 |
| | 336 | 0.258 | 0.280 | 0.266 | 0.285 | 0.263 | 0.284 | 0.263 | 0.284 | 0.260 | 0.282 |
| | 720 | 0.337 | 0.333 | 0.334 | 0.331 | 0.339 | 0.334 | 0.343 | 0.337 | 0.337 | 0.332 |
| | Avg | 0.237 | 0.260 | 0.237 | 0.260 | 0.239 | 0.261 | 0.240 | 0.262 | 0.238 | 0.260 |
  • However, these attempts do not yield performance gains. These results highlight the importance of temporal decorrelation: modifications to the Q matrices reintroduce correlations into the output series (since Q is theoretically the optimal orthogonal matrix for temporal decorrelation). This further demonstrates the superiority of the proposed OrthoTrans scheme.
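For illustration, a minimal PyTorch sketch of the Householder-style learnable update described above (the class name, initialization, and argument choices are our assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class HouseholderQ(nn.Module):
    """Sketch: learnable orthogonal update Q' = Q (I - 2 v v^T / ||v||^2)."""

    def __init__(self, Q: torch.Tensor):
        super().__init__()
        self.register_buffer("Q", Q)                       # pre-computed orthogonal matrix (T x T)
        self.v = nn.Parameter(torch.randn(Q.shape[0]))     # learnable Householder vector

    def forward(self) -> torch.Tensor:
        v = self.v.unsqueeze(1)                                  # (T, 1)
        eye = torch.eye(self.Q.shape[0], device=self.Q.device)
        H = eye - 2.0 * (v @ v.T) / (v.T @ v)                    # Householder reflection
        return self.Q @ H                                        # Q' remains orthogonal
```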

Q4: On smaller training sets or distribution shifts

  • We report performance with smaller training sets in Appendix I.4. As a linear-based forecaster, OLinear does not heavily rely on large training data and adapts well when the training size is reduced from 100% to 5% (see Figure 16 on Page 32 of the paper).

  • Distribution shifts are challenging for any deep forecasting model. Tables 2 and 3 in the paper show that OLinear performs competitively on real-world non-stationary datasets. In Table 28, OLinear also demonstrates strong zero-shot and generalization capabilities.

We hope these responses can fully address your concerns. Thank you again for your detailed feedback!

[1] TimeBase: The Power of Minimalism in Efficient Long-term Time Series Forecasting. ICML 2025.

[2] Robust recovery of subspace structures by low-rank representation. TPAMI 2012.

评论

Dear Reviewer rQjb,

We would like to thank you once again for your efforts and constructive comments. We would be happy to address any further questions you may have.

All the best,

Authors

评论

Thanks to the authors for the detailed response and experimental results, which are great additions to the paper. I have no further questions, and I will maintain my score.

评论

Dear Reviewer rQjb,

We really appreciate your recognition of our work and your kind words, and we are happy to hear that you consider our responses are great additions to the paper and your concerns have been addressed. Again, thank you for your valuable suggestions which have undoubtedly contributed to improving the quality of our paper.

Many thanks,

The authors of #10716

评论

We sincerely thank all the reviewers for their insightful reviews and valuable comments, which are instructive for us to improve our paper.

In this work, we propose OLinear, a linear-based model for time series forecasting in the orthogonally transformed domain. It mainly consists of the (plug-and-play) OrthoTrans and NormLin modules, and "demonstrates state-of-the-art performance" (Reviewer rQjb) across "24 datasets and 140 configurations" (Reviewer CEMa). Reviewer jUXb states that, as a dataset-adaptive approach, OrthoTrans could "inspire further work". When used as a replacement for self-attention, the NormLin module cuts FLOPs "by roughly 50%" (Reviewer 5fFn), "with validated improvements on iTransformer (+5.1%) and PatchTST (+10.1%), which is really interesting" (Reviewer CEMa). Reviewer rQjb further notes that "this paper presents a well-executed combination of theoretical innovation and empirical rigor".

In addition, we curate and publicly release six new multivariate time series datasets: SP500, DowJones, CarSales, Power, Website, and Unemp, to facilitate the development of the time series community.

The reviewers raised insightful and constructive comments. We are devoting all our efforts to addressing the concerns by providing sufficient evidence and the requested results. Below is a summary of the major responses in our rebuttal:

  • Further analysis. Reviewer rQjb suggests further analysis of how OrthoTrans improves the rank of attention matrices in Transformer models. In response, we provide the perspective that OrthoTrans can mitigate the intrinsic low-rank property of time series data. Empirical results show that OrthoTrans yields larger singular values of the correlation matrix among non-overlapping time series patches than those obtained with other transformation bases. This phenomenon partly explains the performance advantages of OrthoTrans over DFT and wavelet bases.

  • More baselines. As requested by Reviewers CEMa, 5fFn and jUXb, we compare OLinear with more baselines, including SimpleTM (ICLR 2025), TQNet (ICML 2025), TimePro (ICML 2025), and TimeBase (ICML 2025), SAMformer, RLinear, TSMixer, and Pyraformer. OLinear consistently outperforms these baselines on a wide range of datasets. Specifically, OLinear outperforms SAMformer by a substantial margin of 12.1%. Notably, replacing self-attention in SAMformer with the NormLin layer improves MAE performance by 10.6% and 5.8% on the Traffic and ECL datasets, respectively.

  • Elaboration on implementation details. As requested by Reviewer CEMa, we provide detailed explanations of our implementation and use experiments to justify our design choices, including the computation of temporal correlation matrices, the design of the NormLin module, and the details of the "temporal NormLin" baseline. For the channel average (CA) scheme used to compute $\mathrm{CorrMat}_t$, we conduct ablation studies demonstrating the performance advantages of CA over the channel-individual (CI) counterpart.

  • Quantitation of temporal decorrelation. Following Reviewer jUXb's suggestions, we report the Frobenius norm of the off-diagonal elements to validate OrthoTrans's effectiveness in temporal decorrelation. We also provide an intuitive solution to reduce the computational load of the NormLin layer. A statistical significance test using Student’s t-test between OLinear and iTransformer shows that OLinear has a clear performance advantage over the classic iTransformer. We also demonstrate the robustness of OLinear using unified hyperparameter settings.

  • Learnable Q matrices. Reviewers rQjb and CEMa suggest using learnable Q matrices for potential improvements. To this end, we design several approaches (Plus Delta, Householder, Cayley, and Adapt_Q) to implement learnable Q matrices and compare them with the fixed ones. However, these variants do not outperform the original OrthoTrans. These results highlight the importance of temporal decorrelation and the challenges of learning matrices that perform better than the pre-computed Q matrices.

We are grateful for the reviewers’ valuable suggestions. We would be very happy to answer any further questions.

最终决定

This paper introduces OLinear, a new linear multivariate time series forecasting model, where the O indicates that the model operates in an orthogonally transformed domain. It includes two key components: OrthoTrans, a data-adaptive orthogonal transformation that uses temporal Pearson correlation matrices; NormLin: a lightweight, normalized linear layer that effectively replaces more complex self-attention mechanisms, leading to gains in both accuracy and efficiency. The empirical study shows good performance.

All reviewers gave positive ratings and indicated that the rebuttal comments were useful in clarifying unclear points in the original submission. The authors promised to incorporate the rebuttal comments into the final version to improve the paper. Overall, the paper presents an interesting linear model for time series forecasting.