PaperHub

Overall rating: 6.1 / 10 (Poster; 4 reviewers; min 3, max 4, std 0.4)
Individual ratings: 3, 4, 3, 3

ICML 2025

Patch-wise Structural Loss for Time Series Forecasting

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-07-24
TL;DR

We propose PS loss, a novel loss function that measures local structural similarity of time series for more accurate forecasting.

Abstract

Keywords
Time Series Forecasting · Loss Function · Deep Learning

Reviews and Discussion

Review (Rating: 3)

Traditional loss functions, such as Mean Squared Error, often miss structural dependencies in time series forecasting. This paper proposes a Patch-wise Structural Loss to improve accuracy by focusing on patch-level structural alignment. It uses Fourier-based Adaptive Patching to divide the series and incorporates local statistical features—correlation, variance, and mean—with dynamic gradient weighting. Testing shows enhanced forecasting performance across multiple datasets and models.

Questions for the Authors

Question 1: Why use adaptive patching? What are the drawbacks of fixed-length patching?

Claims and Evidence

There are no obvious problematic claims in the paper.

Methods and Evaluation Criteria

The design and presentation of the method both make sense. The evaluation of the method also aligns with the established standards in the field.

Theoretical Claims

No questions about the theoretical claims.

Experimental Design and Analysis

The design of the main experiments is comprehensive, as PS loss is applied to various model architectures and achieves relatively good results. The ablation experiments are thorough, and their design and analysis help me better understand the method.

Supplementary Material

I primarily examined the experimental section in the appendix, focusing on the results presented in Appendices D to H.

Relation to Prior Work

  1. Most current time series works, such as PatchTST[1] and iTransformer[2], use MSE loss as the optimization objective. This paper, however, highlights the limitations of using MSE loss for optimization.
  2. The paper employs the Pearson Correlation Coefficient[3] to characterize correlation loss and the Kullback–Leibler (KL) divergence to characterize variance loss.
  3. Inspired by previous works on balancing multi-task losses, such as [4-5], the paper proposes Gradient-based Dynamic Weighting to achieve balanced optimization.

References:

  • [1] A Time Series is Worth 64 Words: Long-term Forecasting with Transformers
  • [2] iTransformer: Inverted Transformers Are Effective for Time Series Forecasting
  • [3] Pearson correlation coefficient
  • [4] Multi-task learning as multiobjective optimization.
  • [5] SparseTSF: Modeling long-term time series forecasting with 1k parameters.

Missing Essential References

No essential references are missing from the discussion.

Other Strengths and Weaknesses

Strengths:

  1. The paper is well written and easy to understand.
  2. The paper includes comprehensive evaluations. The diverse ablation study helps to understand the proposed approach.

Weaknesses:

  1. The shortcomings of MSE presented in the paper concern not only its limitations as an optimization objective but also its limitations as a metric. However, to align with prior work in the field, the paper still uses MSE and MAE as the main metrics. Figure 3 demonstrates how PS loss contributes to the final prediction results, but I believe the authors could provide more quantitative metrics to further illustrate this point.

Other Comments or Suggestions

No more comments or suggestions.

Author Response

Thank you very much for your valuable feedback. Below are our responses to your concerns and suggestions.

[W1] Additional quantitative metrics for evaluating PS loss performance

To provide a more comprehensive evaluation, we incorporated additional metrics: Dynamic Time Warping (DTW), Time Distortion Index (TDI), and Pearson Correlation Coefficient (PCC), to assess the performance of PS loss. A detailed explanation of these metrics is provided below:

| Metric | Definition | Interpretation |
| --- | --- | --- |
| Dynamic Time Warping (DTW) | Measures the minimum cumulative distance between two sequences after applying an optimal non-linear alignment. | Lower DTW indicates that the prediction closely matches the ground truth after optimal alignment. |
| Time Distortion Index (TDI) | Quantifies the amount of temporal warping or distortion required to achieve the optimal alignment obtained by DTW. | A lower TDI indicates fewer temporal adjustments for optimal alignment, while a higher TDI signifies greater distortion. |
| Pearson Correlation Coefficient (PCC) | Measures the linear relationship between two sequences. | Higher PCC indicates better preservation of the sequence's overall trend and structure. |
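
As a concrete reference for these definitions, here is a minimal sketch of how the three metrics can be computed. This is illustrative only, not the authors' evaluation code; TDI in particular has several published variants, and the version below (mean deviation of the optimal DTW warping path from the diagonal) is an assumption.

```python
import numpy as np

def dtw_with_path(a, b):
    """DTW distance plus the optimal warping path, via O(n*m) dynamic programming."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # Backtrack from (n, m) to recover the optimal alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]

def tdi(path):
    """Mean deviation of the warping path from the diagonal (one common TDI variant)."""
    return sum(abs(i - j) for i, j in path) / len(path)

def pcc(a, b):
    """Pearson correlation coefficient between two sequences."""
    return float(np.corrcoef(a, b)[0, 1])
```

For identical sequences the DTW distance and TDI are both zero and the PCC is one, consistent with the interpretations above.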

On the iTransformer model, PS loss consistently improves all three shape-aware metrics, indicating better structural alignment. On the ETTh2 dataset, the iTransformer trained with MSE achieves lower DTW scores, reflecting smaller numerical differences after optimal alignment. However, the higher TDI in this case suggests that the alignment requires more extensive temporal warping, which indicates greater structural distortion compared to the forecasts generated using PS loss.

| Dataset | DTW (MSE) | DTW (+PS) | TDI (MSE) | TDI (+PS) | PCC (MSE) | PCC (+PS) |
| --- | --- | --- | --- | --- | --- | --- |
| ETTh1 | 7.355 | 7.324 | 7.888 | 6.959 | 0.514 | 0.530 |
| ETTh2 | 6.891 | 7.016 | 24.723 | 22.705 | 0.299 | 0.342 |
| ETTm1 | 6.568 | 6.435 | 12.451 | 11.381 | 0.538 | 0.557 |
| ETTm2 | 5.913 | 5.611 | 26.969 | 22.495 | 0.325 | 0.387 |
| Weather | 5.410 | 5.409 | 41.440 | 40.343 | 0.324 | 0.352 |

Due to space limitations, we report the average metric values across all forecasting lengths. Please refer to Table 6 for the full results.

[Q1] Drawbacks of fixed-length patching

Fixed-length patching requires grid search over a predefined set of patch lengths, which introduces computational overhead and lacks adaptability across datasets.

| Fixed patch length | 3 | 6 | 12 | 24 | 48 | 96 |
| --- | --- | --- | --- | --- | --- | --- |
| 96 | 0.379 | 0.378 | 0.379 | 0.383 | 0.383 | 0.386 |
| 192 | 0.430 | 0.429 | 0.428 | 0.431 | 0.431 | 0.433 |
| 336 | 0.473 | 0.473 | 0.474 | 0.480 | 0.480 | 0.483 |
| 720 | 0.496 | 0.499 | 0.480 | 0.493 | 0.493 | 0.508 |
| Avg. | 0.444 | 0.445 | 0.440 | 0.446 | 0.450 | 0.453 |

In our ETTh1 experiments, the best-performing fixed patch size was found to be $P = 12$, which matches the patch length estimated by our adaptive patching strategy using the dominant period $p$. This demonstrates that our method can automatically identify an appropriate patch length without manual tuning.
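
As a rough illustration of the idea, a dominant-period patch length could be estimated as follows. This is a hedged reconstruction, not the paper's exact FAP procedure (Equation 3 is not reproduced in this thread); the cap at the threshold $\delta$ follows the authors' description of $\delta$ elsewhere in their replies.

```python
import numpy as np

def adaptive_patch_length(x, delta=24):
    """Sketch of a Fourier-based patch-length estimate (illustrative assumption).

    The dominant FFT period p of the target series sets the patch length,
    capped at the threshold delta. The paper's actual FAP may differ in detail.
    """
    x = np.asarray(x, dtype=float)
    T = len(x)
    spec = np.abs(np.fft.rfft(x - x.mean()))  # amplitude spectrum, mean removed
    spec[0] = 0.0                             # ignore any residual zero-frequency term
    k = int(np.argmax(spec))                  # dominant frequency index (cycles per window)
    if k == 0:
        return delta                          # no dominant cycle detected
    p = T // k                                # dominant period in time steps
    return min(p, delta)
```

For an hourly series with a clean daily cycle (period 24), this returns 24; when the dominant period exceeds the threshold, the cap applies.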

Review (Rating: 4)

This paper proposes the Patch-wise Structural (PS) loss function for time series forecasting. The PS loss improves the alignment of local statistical properties (correlation, variance, and mean), addressing the limitations of traditional point-wise loss functions like MSE. By incorporating patch-level analysis, PS loss enhances the ability to model complex temporal structures. Extensive experiments on 7 real-world time series datasets demonstrate that PS loss significantly outperforms traditional methods, improving forecasting accuracy across various models, including LLM-based forecasting models.

Update after rebuttal

I support acceptance.

Questions for the Authors

  1. Regarding the ablation study: could the authors explore whether PS loss can replace MSE loss entirely while achieving similar results in terms of forecasting accuracy?
  2. Since PS loss involves local analysis, how does it affect training time, especially for large-scale datasets such as ECL? Would training time significantly increase for large datasets?
  3. In the zero-shot forecasting experiment, do longer forecasting horizons benefit more from PS loss, or is its impact more significant in short-term forecasting?
  4. How does PS loss compare with other loss functions such as FreDF (Learning to Forecast in Frequency Domain)? Does the combination of PS loss with these other functions enhance forecasting performance?

Claims and Evidence

The claims made in the paper are well-supported by experimental evidence. The experiments in the submission and appendix demonstrate that the proposed method is effective and robust across different models and architectures. The source code and detailed experimental procedures further enhance the study's reproducibility.

Methods and Evaluation Criteria

The methods and evaluation criteria used in this paper are well-suited for the problem of time series forecasting. The authors choose relevant benchmark datasets and employ suitable evaluation metrics to assess model performance. The integration of PS loss with MSE is clearly explained, and the experimental setup is sound. The use of both quantitative and qualitative results strengthens the validity of the conclusions.

Theoretical Claims

I have checked the proofs and the theoretical claims behind PS loss; they are robust, and I did not find any issues.

Experimental Design and Analysis

The experimental design is solid and thorough, and the results support the claims made in the paper. However, there are several issues that need to be discussed:

  1. The paper does not provide a detailed comparison between the GDW strategy and grid search for loss coefficient selection. It would be helpful if the authors could offer a deeper analysis of the GDW strategy’s effectiveness.
  2. The authors should provide visualizations or analyses showing how the different loss weights evolve during training, to better understand how the GDW strategy influences model optimization.

Supplementary Material

I have reviewed the supplementary material, including the experimental setup and the source code, and found no errors. The supplementary material is consistent with the content of the main paper and provides additional clarity on the implementation and experimental procedures.

Relation to Prior Work

The proposed PS loss builds on existing time series forecasting loss functions but introduces a more flexible and localized approach. The introduction of patch-wise structural alignment is a novel contribution that distinguishes this work from existing methods. This innovation positions PS loss as a valuable contribution to the field of time series forecasting and makes it relevant to both academic research and practical applications.

Missing Essential References

The authors have effectively cited and discussed relevant work in time series forecasting, loss functions, and the patching mechanism in time series forecasting in the paper.

Other Strengths and Weaknesses

Strengths:

  1. The paper presents a clear and novel contribution to time series forecasting by addressing the limitations of traditional loss functions through the seamless integration of PS loss, which provides a more precise method for structural alignment and yields more practical predictions.
  2. The gradient-based dynamic weighting strategy is a novel contribution that enhances the effectiveness of PS loss by adjusting the weight of each component based on gradient magnitudes, improving robustness without the computational cost of grid search.

Weaknesses:

  1. The paper includes a sensitivity analysis of the hyperparameters $\lambda$ and $\delta$, but it lacks a detailed explanation of how they vary in different scenarios. It is recommended that the authors clarify the parameter settings.

Other Comments or Suggestions

Minor inconsistency in notation: In Figure 2, $\alpha$ is used for $L_{Corr}$, $\beta$ for $L_{Mean}$, and $\gamma$ for $L_{Var}$, but these notations are inconsistent with the rest of the manuscript and formulas. It would be beneficial to unify the notation.

Author Response

Thank you very much for your valuable feedback. Below are our responses to your concerns and suggestions.

[E1] Comparison between GDW and grid-search

To evaluate GDW against a traditional grid search for selecting loss weights, we conducted experiments on the ETTh1 dataset using iTransformer. Both methods use the same overall PS loss weight $\lambda$. For grid search, the coefficients $\alpha$, $\beta$, and $\gamma$ were chosen from {0.3, 0.5, 0.7, 1.0}, totaling 64 runs per prediction length. We report the best and average performance from grid search for comparison:

| Horizon | GDW MSE | GDW MAE | Grid Search (Best) MSE | Grid Search (Best) MAE | Grid Search (Average) MSE | Grid Search (Average) MAE |
| --- | --- | --- | --- | --- | --- | --- |
| 96 | 0.379 | 0.396 | 0.380 | 0.396 | 0.385 | 0.398 |
| 192 | 0.428 | 0.424 | 0.428 | 0.424 | 0.432 | 0.426 |
| 336 | 0.474 | 0.453 | 0.473 | 0.450 | 0.483 | 0.456 |
| 720 | 0.480 | 0.478 | 0.483 | 0.479 | 0.513 | 0.494 |
| Avg | 0.440 | 0.438 | 0.441 | 0.437 | 0.453 | 0.444 |

GDW achieves performance comparable to the best grid-searched results, while avoiding exhaustive tuning. Moreover, it reflects the intuition that the weights of correlation, variance, and mean should evolve dynamically during training to maintain balanced attention across all three loss terms, which static coefficients cannot assure.
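
The exact GDW update rule is not reproduced in this thread; as a rough sketch of the balancing principle (each component's weight set inversely proportional to its current gradient magnitude, so no single term dominates), one could write:

```python
import numpy as np

def gdw_weights(grad_norms, eps=1e-8):
    """Toy gradient-based dynamic weighting (illustrative assumption).

    Each loss component's weight is set inversely proportional to its current
    gradient norm, then normalized so the weights average to 1. The paper's
    actual GDW rule may differ.
    """
    g = np.asarray(grad_norms, dtype=float)
    w = 1.0 / (g + eps)
    return w * len(g) / w.sum()
```

With gradient norms (1.0, 2.0, 4.0), the weighted gradient magnitudes come out equal, which is the balancing effect GDW aims for.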

[E2] Visualization of loss weights generated by GDW

We visualize the evolution of the weights generated by GDW in Figure 1 and make the following observations:

  • Weight Range: The weights for correlation, variance, and mean have different ranges, reflecting the inherent variation in their gradient magnitudes, which highlights the need for adaptive balancing.
  • Weight Evolution: Correlation weight tends to decrease, while variance and mean weights increase. This does not imply shifting focus, but rather ensures equilibrium among components, allowing structural alignment to be preserved during optimization.

This confirms that GDW adaptively balances multiple objectives throughout training, improving stability and convergence.

[W1] Hyperparameter settings

The PS loss weight $\lambda$ is selected from {0.1, 0.3, 0.5, 0.7, 1.0, 2.0, 3.0, 5.0, 10.0}. The patch length threshold $\delta$ is chosen from {24, 48}.

[Q1] PS loss as a standalone objective

Results show that PS loss alone yields comparable accuracy to MSE+PS, demonstrating its effectiveness as a standalone optimization objective (Table 8).

| Dataset | iTransformer (MSE+PS) | iTransformer (PS Only) | TimeMixer (MSE+PS) | TimeMixer (PS Only) |
| --- | --- | --- | --- | --- |
| ETTh1 | 0.440 | 0.439 | 0.437 | 0.429 |
| ETTh2 | 0.375 | 0.380 | 0.369 | 0.364 |
| ETTm1 | 0.396 | 0.396 | 0.375 | 0.377 |
| ETTm2 | 0.281 | 0.282 | 0.270 | 0.274 |
| Weather | 0.253 | 0.253 | 0.243 | 0.242 |

[Q2] PS loss complexity on large datasets

We report the empirical runtime cost by measuring the average seconds per epoch during training using iTransformer across three datasets: ETTh1 (small), Weather (medium), and ECL (large):

| Dataset | MSE (s/epoch) | +PS (s/epoch) | Increase (s) |
| --- | --- | --- | --- |
| ETTh1 | 1.96 | 2.66 | 0.71 |
| Weather | 10.63 | 14.04 | 3.40 |
| ECL | 25.02 | 30.20 | 5.18 |

Despite the added cost, the runtime increase is modest and justified by performance gains.

[Q3] Zero-shot performance across forecasting lengths

We extended zero-shot experiments based on iTransformer to forecast lengths of 96, 336, and 720, in addition to 192 (reported in the paper). PS loss improved accuracy in 33 out of 36 settings, confirming its robustness across both short- and long-term horizons. See Table 9 for details.

| Transfer | MSE (96) | +PS (96) | MSE (336) | +PS (336) | MSE (720) | +PS (720) |
| --- | --- | --- | --- | --- | --- | --- |
| ETTh1 → ETTh2/m1/m2 | 0.499 | 0.446 | 0.537 | 0.575 | 0.579 | 0.566 |
| ETTh2 → ETTh1/m1/m2 | 0.626 | 0.540 | 0.641 | 0.614 | 0.686 | 0.645 |
| ETTm1 → ETTh1/h2/m2 | 0.420 | 0.385 | 0.514 | 0.472 | 0.550 | 0.506 |
| ETTm2 → ETTh1/h2/m1 | 0.622 | 0.485 | 0.792 | 0.541 | 0.870 | 0.554 |
| Imp | | -15.58% | | -11.39% | | -15.42% |

[Q4] Combination of PS loss and FreDF loss

FreDF focuses on frequency-domain alignment to mitigate label autocorrelation, while PS loss emphasizes patch-wise structural alignment in the time domain. Their goals are complementary. We evaluated MSE+FreDF, MSE+PS, and MSE+PS+FreDF using iTransformer as backbone.

| Dataset | MSE+FreDF MSE | MSE+FreDF MAE | MSE+PS MSE | MSE+PS MAE | MSE+PS+FreDF MSE | MSE+PS+FreDF MAE |
| --- | --- | --- | --- | --- | --- | --- |
| ETTh1 | 0.443 | 0.437 | 0.440 | 0.438 | 0.436 | 0.433 |
| ETTm1 | 0.404 | 0.406 | 0.396 | 0.397 | 0.396 | 0.397 |

Results show that combining the two losses yields either improved or comparable performance, supporting their compatibility. Table 10 provides full results.

Reviewer Comment

The authors have provided clear and satisfactory responses to my concerns. Their clarification of the generalization issue and the motivation behind modeling patch-level alignment is convincing. The novelty of the proposed framework is better justified, and the new experiments and metrics further support the method's effectiveness. I am now more confident in the contribution of this work and support its acceptance.

Author Comment

Thank you very much for your thoughtful review of our work. We sincerely appreciate your valuable feedback and your confidence in our contribution. Thank you once again for your insightful suggestions and continued support!

Review (Rating: 3)

The authors propose a novel patch-wise structural (ps) loss, which is designed to enhance structural alignment by comparing time series at the patch level. By leveraging local statistical properties, e.g., correlation, variance, and mean, PS loss captures nuanced structural discrepancies overlooked by traditional point-wise loss. Experiments demonstrate that the PS loss can improve the performance of the state-of-the-art models across diverse real-world datasets.

Update after rebuttal

I support acceptance.

Questions for the Authors

  1. As the proposed loss function is specifically designed for time series forecasting, the comparative experiments should not only focus on long-term time series forecasting but also encompass other experimental settings, e.g., short-term time series forecasting and ultra-long-term time series forecasting, as mentioned by existing methods [1, 2].

  2. Since some latest methods [1, 3] also improve model performance by introducing constraints or loss functions, as a loss specifically designed for time series, the authors should elaborate on the differences between their proposed loss function and these existing designs in the related work Section. In addition, more comparative experiments should be conducted to validate the effectiveness of the proposed loss functions against these advanced loss functions.

  3. In Section 4.6, the authors demonstrate that the PS loss can improve generalization to unseen datasets. Given that LLMs have also been proven to exhibit strong generalization capabilities under zero-shot settings [4, 5, 6], it is suggested to study whether the PS loss can further enhance the performance of LLMs under zero-shot settings. The authors should validate this through additional experiments.

  4. The design of the PS loss appears to be somewhat complex. The authors are suggested to include a time complexity analysis.

  5. To provide an intuitive understanding of the performance improvements brought by the proposed loss, the authors should explicitly list the performance gains in terms of percentage improvements.

References:

  • [1] Shang Z, Chen L, Wu B, et al. Ada-MSHyper: adaptive multi-scale hypergraph transformer for time series forecasting. NeurIPS, 2024.
  • [2] Jia Y, Lin Y, Hao X, et al. WITRAN: Water-wave information transmission and recurrent acceleration network for long-range time series forecasting. NeurIPS, 2023.
  • [3] Ye W, Deng S, Zou Q, et al. Frequency adaptive normalization for non-stationary time series forecasting. NeurIPS, 2024.
  • [4] Zhou T, Niu P, Sun L, et al. One fits all: Power general time series analysis by pretrained LM. NeurIPS, 2024.
  • [5] Liu Y, Qin G, Huang X, et al. AutoTimes: Autoregressive time series forecasters via large language models. NeurIPS, 2024.
  • [6] Jin M, Wang S, Ma L, et al. Time-LLM: Time series forecasting by reprogramming large language models. ICLR, 2024.

Claims and Evidence

The effectiveness of different module designs has been validated by the experimental results.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria are appropriate for time series forecasting.

Theoretical Claims

I have checked the correctness of the proofs for the theoretical claims.

Experimental Design and Analysis

The manuscript has some weaknesses in the experiments. Please see Questions for details.

Supplementary Material

I have reviewed all contents in the supplementary material.

Relation to Prior Work

The introduction of the patch-wise structural loss is beneficial for time series forecasting.

Missing Essential References

Some important baselines, e.g., Ada-MSHyper [1] and FAN [3], need to be compared. Please see Questions 1 and 2 for details.

[1] Shang Z, Chen L, Wu B, et al. Ada-MSHyper: adaptive multi-scale hypergraph transformer for time series forecasting. NeurIPS, 2024.

[3] Ye W, Deng S, Zou Q, et al. Frequency Adaptive Normalization For Non-stationary Time Series Forecasting. NeurIPS, 2024.

Other Strengths and Weaknesses

1. The paper presents a notable innovation by exploring the integration of patch-wise structural loss into time series forecasting, a direction scarcely addressed by existing methodologies.

2. The organization of this paper is clear and the paper is well written.

Other Comments or Suggestions

No

Author Response

Thank you very much for your valuable feedback. Below are our responses to your concerns and suggestions.

[Q1] PS loss on ultra-long-term and short-term forecasting

We evaluated PS loss on ultra-long-term (T = {1080, 1440, 1800, 2160}) and short-term (T = {12, 24, 48}) forecasting tasks using iTransformer and DLinear. We report averaged MSE results. Please refer to Table 1 and Table 2 for full results.

  • Ultra-long-term: MSE reduced by 7.38% (iTransformer) and 11.01% (DLinear).

| Dataset | iTransformer MSE | iTransformer +PS | DLinear MSE | DLinear +PS |
| --- | --- | --- | --- | --- |
| ETTh1 | 0.753 | 0.693 | 0.696 | 0.628 |
| ETTh2 | 0.545 | 0.494 | 1.241 | 1.127 |
| ETTm1 | 0.577 | 0.536 | 0.487 | 0.474 |
| ETTm2 | 0.480 | 0.466 | 0.557 | 0.463 |
| Imp. | | -7.38% | | -11.01% |

  • Short-term: MSE reduced by 3.43% (iTransformer) and 1.60% (DLinear).

| Dataset | iTransformer MSE | iTransformer +PS | DLinear MSE | DLinear +PS |
| --- | --- | --- | --- | --- |
| PEMS03 | 0.110 | 0.107 | 0.239 | 0.235 |
| PEMS04 | 0.105 | 0.101 | 0.283 | 0.279 |
| Imp. | | -3.43% | | -1.60% |

These results demonstrate the effectiveness of PS loss across both short-term and ultra-long-term forecasting tasks.

[Q2] Comparison with Ada-MSHyper and FAN

The contributions of Ada-MSHyper and FAN differ from our PS loss, as they address distinct challenges:

  • Ada-MSHyper introduces a hypergraph transformer with a graph constraint loss to enhance multi-scale interaction modeling through hypergraph learning.
  • FAN proposes a frequency-based adaptive normalization method to address both trend and seasonal non-stationary patterns.
  • PS (Ours) presents a novel loss function that enhances structural alignment between predictions and ground truth via patch-wise statistical metrics.

We also combined PS loss with both methods. Please refer to Table 3 and Table 4 for the full results.

  • Ada-MSHyper + PS: PS loss improves the average performance by 8.28% (MSE) and 4.68% (MAE).

| Dataset | Ada-MSHyper MSE | Ada-MSHyper MAE | Ada-MSHyper+PS MSE | Ada-MSHyper+PS MAE |
| --- | --- | --- | --- | --- |
| ETTh1 | 0.137 | 0.262 | 0.132 | 0.254 |
| ETTh2 | 0.107 | 0.231 | 0.105 | 0.227 |
| Imp. | | | 8.28% | 4.68% |

  • FAN + PS: When using DLinear as the backbone, PS loss further improves the average performance of FAN by 2.07% (MSE) and 2.31% (MAE).

| Dataset | DLinear+FAN MSE | DLinear+FAN MAE | DLinear+FAN+PS MSE | DLinear+FAN+PS MAE |
| --- | --- | --- | --- | --- |
| ETTh1 | 0.444 | 0.485 | 0.439 | 0.479 |
| ETTh2 | 0.137 | 0.262 | 0.132 | 0.254 |
| Imp. | | | 2.07% | 2.31% |

These results demonstrate that while FAN and Ada-MSHyper focus on different aspects, PS loss can still further improve their performance by enhancing the structural alignment of the forecasted series.

[Q3] PS loss on LLM-based models for zero-shot forecasting

We conducted zero-shot forecasting experiments with LLM-based models: OFA, AutoTimes, and Time-LLM. PS loss improved forecasting accuracy with average MSE reductions of 2.07% (OFA), 6.33% (AutoTimes), and 7.29% (Time-LLM). Please refer to Table 5 for the full results.

| Transfer | OFA MSE | OFA +PS | AutoTimes MSE | AutoTimes +PS | Time-LLM MSE | Time-LLM +PS |
| --- | --- | --- | --- | --- | --- | --- |
| ETTh1 → ETTh2/m1/m2 | 0.410 | 0.417 | 0.421 | 0.421 | 0.420 | 0.405 |
| ETTh2 → ETTh1/m1/m2 | 0.461 | 0.454 | 0.568 | 0.499 | 0.506 | 0.450 |
| ETTm1 → ETTh1/h2/m2 | 0.335 | 0.336 | 0.359 | 0.346 | 0.349 | 0.338 |
| ETTm2 → ETTh1/h2/m1 | 0.411 | 0.359 | 0.445 | 0.373 | 0.424 | 0.374 |
| Imp | | -2.07% | | -6.33% | | -7.29% |

[Q4] Time complexity analysis of PS loss

We analyze the time complexity of PS loss with respect to the forecast length $T$, the number of channels $C$, and the hidden dimension $d$. The overall complexity arises from three main components:

  • Fourier-based Adaptive Patching (FAP): The complexity of this component is dominated by the Fast Fourier Transform (FFT), which is $O(T \log T)$ per channel. Since the FFT is applied to each of the $C$ channels, the total time complexity is $O(C \cdot T \log T)$.
  • Patch-wise Structural Loss (PS): The series is split into $N \approx \frac{2T}{P}$ patches, where $P$ is the patch length. Calculating the correlation, variance, and mean over each patch requires $O(P)$ operations. Given $C \cdot N$ patches, the total complexity becomes $O(C \cdot N \cdot P) = O(C \cdot T)$.
  • Gradient-based Dynamic Weighting (GDW): The gradient of each loss component w.r.t. the model output has shape $d \times T$, leading to a complexity of $O(d \cdot T)$.

Therefore, the overall time complexity of PS loss is $O(C \cdot T \log T + C \cdot T + d \cdot T)$.
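
To illustrate why the patch statistics contribute only $O(C \cdot T)$ work, here is a vectorized sketch of patch-wise correlation, variance, and mean penalties. The function name and the exact penalty forms are illustrative assumptions (the paper uses a Pearson-correlation term and a KL-based variance term whose precise definitions are not reproduced here):

```python
import numpy as np

def patch_stats_loss(pred, true, P):
    """Toy patch-wise structural loss: mean gaps in per-patch correlation,
    variance, and mean. Illustrative only; the paper's actual loss differs
    in its variance (KL-based) and weighting details.

    pred, true: arrays of shape (C, T); P: patch length (assumed to divide T).
    """
    C, T = pred.shape
    yp = pred.reshape(C, T // P, P)   # (C, N, P) patches, one pass over the data
    yt = true.reshape(C, T // P, P)

    def corr(a, b):
        # Per-patch Pearson correlation, vectorized over all C*N patches.
        a = a - a.mean(-1, keepdims=True)
        b = b - b.mean(-1, keepdims=True)
        denom = np.sqrt((a * a).sum(-1) * (b * b).sum(-1)) + 1e-8
        return (a * b).sum(-1) / denom

    l_corr = np.mean(1.0 - corr(yp, yt))                 # shape mismatch term
    l_var = np.mean(np.abs(yp.var(-1) - yt.var(-1)))     # spread mismatch term
    l_mean = np.mean(np.abs(yp.mean(-1) - yt.mean(-1)))  # offset mismatch term
    return l_corr + l_var + l_mean
```

Each statistic touches every point exactly once, so the total work is linear in $C \cdot T$, matching the second bullet above.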

[Q5] Performance gains in terms of percentage improvements

We now report percentage improvements throughout the paper and summarize them in the updated tables.

Reviewer Comment

Thank you for your response, especially for conducting additional experiments during the rebuttal process. I will maintain my score of 3 and vote for acceptance.

Author Comment

Thank you very much for your thoughtful review of our work. We are sincerely grateful for your valuable feedback and your recognition of the additional experiments. Thank you once again for your insightful suggestions and support!

Review (Rating: 3)

Most previous time series forecasting models use MSE as the loss function, which treats each time step independently and neglects the structural dependency among steps. To fill this gap, this work proposes the Patch-wise Structural (PS) loss. PS loss first splits the target series into patches, with the patch size determined by FFT. Then correlation, variance, and mean losses are computed within each patch and averaged. A gradient-based dynamic weighting mechanism balances the weights of the three losses during training. Experiments on real-world datasets show that PS loss can boost the forecasting accuracy of both traditional and LLM-based models.

Questions for the Authors

See my questions above.

Claims and Evidence

Claims are well supported.

Methods and Evaluation Criteria

Overall, the methods and evaluation are convincing. I have a few minor questions or concerns as follows:

  1. PS loss utilizes FFT to detect the period in the target series for patching. What if there is no periodic pattern in the target? How does PS loss perform on such datasets (considering most datasets used in experiments have obviously daily periods)?
  2. In gradient-based dynamic weighting, why does mean loss require further adjustment by Equation 12 among three losses? What is the purpose behind this design, and has there been an ablation study conducted on it?
  3. PS loss focuses on structural dependency, but the evaluation metrics are still point-wise MSE and MAE, which may not effectively measure structural consistency. The authors should not be criticized for this, as they are standard metrics. However, I would still like to inquire whether there are other candidate metrics (like DTW) that could better reflect structural information.

Theoretical Claims

Not applicable; no new theoretical claims are proposed.

Experimental Design and Analysis

  1. Considering that FFT brings additional computation cost, how is the efficiency of PS loss w.r.t. different forecasting lengths and channel numbers?
  2. Can PS loss alone, without MSE, be used as the loss function? How does it perform?

Supplementary Material

I have reviewed the entire appendix.

Relation to Prior Work

The primary objective of this paper is to enhance time series forecasting, with outcomes that can be effectively applied across diverse downstream domains.

Missing Essential References

Not applicable, references are generally comprehensive.

Other Strengths and Weaknesses

Overall, this is a commendable work. The topic is significant, as most recent studies have primarily focused on backbone designs, leaving loss functions relatively underexplored. The presentation is clear and easy to follow. Should the authors adequately address my concerns, I would be happy to raise the score.

Other Comments or Suggestions

See my comments and suggestions above.

Author Response

Thank you very much for your valuable feedback. Below are our responses to your concerns and suggestions.

[M1] PS loss performance on non-periodic targets

When there is no clear periodic pattern in the target series, the dominant frequency—i.e., the one with the highest amplitude in the FFT spectrum—does not necessarily correspond to a true periodic component in the data. Instead, it typically falls into one of two categories:

  • Short period ($p < \delta$): This often results from high-frequency components such as local fluctuations or noise. In this case, Equation (3) yields a short patch length, which still allows the model to focus on finer-grained local structure.
  • Long period ($p > \delta$): This typically corresponds to low-frequency background components or weak global trends. In this case, the patch length is capped at $\delta$ to prevent excessively large patches that could hinder fine-grained comparisons.

This design allows PS loss to adapt to both periodic and non-periodic series, using frequency content to guide patch granularity, while $\delta$ prevents overly large patches. On the Exchange dataset, which lacks a clear periodic pattern, PS loss still improved MSE by 6.43% on DLinear, demonstrating its effectiveness.

[M2] Purpose of mean loss refinement

The purpose behind this design is to first focus on aligning the shape of the series, and then gradually increase attention to the value offset. As correlation and variance alignment improve during training (indicated by increasing $c$ and $v$), the model allocates more weight to $L_{mean}$ to refine value-level offsets. An ablation on iTransformer and TimeMixer (ETTh1) confirms its effectiveness.

| Horizon | iTransformer+PS MSE | iTransformer+PS MAE | w/o c&v MSE | w/o c&v MAE | TimeMixer+PS MSE | TimeMixer+PS MAE | w/o c&v MSE | w/o c&v MAE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 96 | 0.379 | 0.396 | 0.380 | 0.396 | 0.366 | 0.392 | 0.368 | 0.391 |
| 192 | 0.428 | 0.424 | 0.429 | 0.425 | 0.421 | 0.421 | 0.418 | 0.420 |
| 336 | 0.474 | 0.453 | 0.480 | 0.458 | 0.489 | 0.453 | 0.498 | 0.457 |
| 720 | 0.480 | 0.478 | 0.505 | 0.492 | 0.474 | 0.463 | 0.480 | 0.464 |
| Avg | 0.440 | 0.438 | 0.448 | 0.443 | 0.438 | 0.432 | 0.441 | 0.433 |

[M3] Additional metrics for structural evaluation

Beyond MSE/MAE, we include additional shape-aware metrics: DTW, TDI, and PCC, using iTransformer for evaluation (Please see Reviewer Pd9N [W1] for metric details). PS loss consistently improves all three metrics, indicating better structural alignment (Table 6).

| Dataset | DTW (MSE) | DTW (+PS) | TDI (MSE) | TDI (+PS) | PCC (MSE) | PCC (+PS) |
| --- | --- | --- | --- | --- | --- | --- |
| ETTh1 | 7.355 | 7.324 | 7.888 | 6.959 | 0.514 | 0.530 |
| ETTh2 | 6.891 | 7.016 | 24.723 | 22.705 | 0.299 | 0.342 |
| ETTm1 | 6.568 | 6.435 | 12.451 | 11.381 | 0.538 | 0.557 |
| ETTm2 | 5.913 | 5.611 | 26.969 | 22.495 | 0.325 | 0.387 |
| Weather | 5.410 | 5.409 | 41.440 | 40.343 | 0.324 | 0.352 |

[E4] Time complexity analysis of PS loss

1. Theoretical time complexity analysis

We analyze the time complexity of PS loss with respect to the forecast length $T$, the number of channels $C$, and the hidden dimension $d$. The overall complexity arises from three main components:

  • Fourier-based Adaptive Patching (FAP): The complexity of this component is dominated by the Fast Fourier Transform (FFT), which is $O(T \log T)$ per channel. Since the FFT is applied to each of the $C$ channels, the total time complexity is $O(C \cdot T \log T)$.
  • Patch-wise Structural Loss (PS): The series is split into $N \approx \frac{2T}{P}$ patches, where $P$ is the patch length. Calculating the correlation, variance, and mean over each patch requires $O(P)$ operations. Given $C \cdot N$ patches, the total complexity becomes $O(C \cdot N \cdot P) = O(C \cdot T)$.
  • Gradient-based Dynamic Weighting (GDW): The gradient of each loss component w.r.t. the model output has shape $d \times T$, leading to a complexity of $O(d \cdot T)$.

Therefore, the overall time complexity of PS loss is $O(C \cdot T \log T + C \cdot T + d \cdot T)$.

2. Actual run time overhead

We report the empirical runtime cost by measuring seconds per epoch using iTransformer across three datasets: ETTh1 (small), Weather (medium), and ECL (large):

| Dataset | MSE (s/epoch) | +PS (s/epoch) | Increase (s) |
| --- | --- | --- | --- |
| ETTh1 | 1.96 | 2.66 | 0.71 |
| Weather | 10.63 | 14.04 | 3.40 |
| ECL | 25.02 | 30.20 | 5.18 |

Despite the added cost, the runtime increase is modest and justified by performance gains.

[E5] PS loss performance without MSE

Results show that PS loss alone yields comparable accuracy to MSE+PS, demonstrating its effectiveness as a standalone optimization objective (Table 8).

| Dataset | iTransformer (MSE+PS) | iTransformer (PS Only) | TimeMixer (MSE+PS) | TimeMixer (PS Only) |
| --- | --- | --- | --- | --- |
| ETTh1 | 0.440 | 0.439 | 0.437 | 0.429 |
| ETTh2 | 0.375 | 0.380 | 0.369 | 0.364 |
| ETTm1 | 0.396 | 0.396 | 0.375 | 0.377 |
| ETTm2 | 0.281 | 0.282 | 0.270 | 0.274 |
| Weather | 0.253 | 0.253 | 0.243 | 0.242 |

Reviewer Comment

Thank you for your response, especially for conducting additional experiments during the rebuttal process. I will maintain my score of 3 and vote for acceptance. Moreover, I suggest that the experiments on complexity and new metrics (including their definitions, calculation methods, etc.), as well as the corresponding analysis, should be added to the final camera-ready version.

Author Comment

Thank you very much for your thoughtful review of our work. We sincerely appreciate your valuable feedback and will ensure the additional experiments and analysis are incorporated into the final version. Thank you once again for your insightful suggestions and support!

Final Decision

All reviewers support acceptance, highlighting significant methodological novelty, comprehensive validation, and clear improvements, while recommending minor revisions and clarifications for final submission.