PaperHub
Average rating: 6.0/10 · Poster · 4 reviewers
Ratings: 8, 8, 5, 3 (lowest 3, highest 8, standard deviation 2.1)
Confidence: 4.3 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 3.3
ICLR 2025

TimeKAN: KAN-based Frequency Decomposition Learning Architecture for Long-term Time Series Forecasting

Submitted: 2024-09-26 · Updated: 2025-02-26

Abstract

Keywords
Kolmogorov-Arnold Network; Time Series Forecasting

Reviews and Discussion

Review (Rating: 8)

This paper explores a method for decomposing mixed frequency components into distinct single-frequency components to improve time series forecasting accuracy. The proposed approach, called TimeKAN, is based on the Kolmogorov-Arnold Network (KAN). TimeKAN's process consists of three key components: (1) Cascaded Frequency Decomposition (CFD) blocks, which use a bottom-up cascading approach to obtain series representations for each frequency band; (2) Multi-order KAN Representation Learning (M-KAN) blocks, which capture and represent specific temporal patterns within each frequency band; and (3) Frequency Mixing blocks, which recombine the separated frequency bands back into the original series format.

The study demonstrates that TimeKAN outperforms several state-of-the-art forecasting methods, including Autoformer, FEDformer, and iTransformer, by achieving lower MSE and MAE across multiple time series datasets such as Weather, ETTh2, and ETTm2.

Strengths

  1. Figure 1 is beautifully designed and provides an intuitive overview of each component in the new TimeKAN method, as well as how they connect.

  2. The study makes effective use of large-scale datasets and performs comparisons with a variety of other methods (including CNN-based and Transformer-based models), demonstrating the advantages of TimeKAN. The model is also tested across different prediction lengths, and for datasets where performance is less optimal, the paper offers thorough explanations and detailed insights.

  3. The analysis delves into several key components of TimeKAN, such as Upsampling, Depthwise Convolution, and Multi-order KANs. I especially appreciated this section, as it not only establishes that TimeKAN outperforms other deep learning methods but also shows that each individual component of TimeKAN is optimally designed.

Weaknesses

  1. Section 3.2 appears somewhat disorganized. While the overall logic is clear, the expression could be refined for clarity. Additionally, more mathematical details and background should be provided, which can be included in the appendix.

  2. If possible, please add more data to Table 5 in Section 4.3. Supplement it with the performance of other methods in Table 1 on parameters (params) and MAC across these six datasets.

Questions

N/A

Comment

Many thanks to Reviewer BGvS for providing thorough and detailed comments.

Q1: "Section 3.2 appears somewhat disorganized. While the overall logic is clear, the expression could be refined for clarity. Additionally, more mathematical details and background should be provided, which can be included in the appendix."

Thank you for your valuable suggestions. We have reorganized and refined the content as follows: The sequence preprocessing part originally included in Section 3.2 Cascaded Frequency Decomposition has been relocated to a newly created section, Section 3.2 Hierarchical Sequence Preprocessing. We have also revised Section 3.3 Cascaded Frequency Decomposition to provide a clearer explanation of the step-by-step process used to obtain each frequency component in a cascaded manner. Furthermore, we have added detailed mathematical discussions on the Kolmogorov-Arnold Network and the Fourier Transform in Appendix A.

Q2: "If possible, please add more data to Table 5 in Section 4.3. Supplement it with the performance of other methods in Table 1 on parameters (params) and MAC across these six datasets."

Thank you for your valuable suggestion. We have added the performance of the remaining methods from Table 1 in terms of parameters (params) and MACs across these six datasets. The complete comparison of parameters and MACs is shown below (Params|MACs).

| Datasets | ETTh1 | ETTh2 | ETTm1 | ETTm2 | Weather | Electricity |
|---|---|---|---|---|---|---|
| TimeMixer | 75.50K\|20.37M | 75.50K\|20.37M | 75.50K\|20.37M | 77.77K\|24.18M | 104.43K\|82.62M | 106.83K\|1.26G |
| iTransformer | 841.57K\|77.46M | 224.22K\|19.86M | 224.22K\|19.86M | 224.22K\|19.86M | 4.83M\|1.16G | 4.83M\|16.29G |
| PatchTST | 3.75M\|5.90G | 10.06M\|17.66G | 3.75M\|5.90G | 10.06M\|17.66G | 6.90M\|35.30G | 6.90M\|539.38G |
| TimesNet | 605.48K\|18.13G | 1.19M\|36.28G | 4.71M\|144G | 1.19M\|36.28G | 1.19M\|36.28G | 150.30M\|4.61T |
| MICN | 25.20M\|71.95G | 25.20M\|71.95G | 25.20M\|71.95G | 25.20M\|71.95G | 111.03K\|295.07M | 6.64M\|19.5G |
| Dlinear | 18.62K\|0.6M | 18.62K\|0.6M | 18.62K\|0.6M | 18.62K\|0.6M | 18.62K\|0.6M | 18.62K\|0.6M |
| FreTS | 3.24M\|101.46M | 3.24M\|101.46M | 3.24M\|101.46M | 3.24M\|101.46M | 3.24M\|101.46M | 3.24M\|101.46M |
| FILM | 12.58M\|2.82G | 12.58M\|2.82G | 12.58M\|2.82G | 12.58M\|2.82G | 12.58M\|8.46G | 12.58M\|8.46G |
| FEDFormer | 23.38M\|24.96G | 23.38M\|24.96G | 23.38M\|24.96G | 23.38M\|24.96G | 23.45M\|25.23G | 24.99M\|30.89G |
| AutoFormer | 10.54M\|22.82G | 10.54M\|22.82G | 10.54M\|22.82G | 10.54M\|22.82G | 10.61M\|23.08G | 12.14M\|28.75G |
| TimeKAN | 12.84K\|7.63M | 15.00K\|8.02M | 14.38K\|7.63M | 38.12K\|16.66M | 20.94K\|29.86M | 23.34K\|456.50M |

As can be seen, except for DLinear, our TimeKAN consistently demonstrates a significant advantage in both parameter count and MACs compared to all other models. DLinear consists of only a single linear layer, which makes it the most lightweight in terms of parameters and MACs. However, DLinear's performance already shows a significant gap compared to state-of-the-art methods, so its advantage in parameters may not be a major consideration for researchers. Therefore, in the main text, we compare the parameters and MACs of the current state-of-the-art methods (TimeKAN, TimeMixer, iTransformer, and PatchTST), and the complete comparison results are provided in Appendix B.2. We hope you can understand this arrangement.
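As a side note for readers who wish to reproduce Params/MACs figures like those above, such numbers are commonly obtained with profiling utilities such as the `thop` package. The sketch below is only an illustration under our own assumptions (the placeholder model and input shape are not the authors' setup):

```python
import torch
from thop import profile, clever_format

# Placeholder model and input window, purely for illustration:
# in practice, `model` would be the forecasting model under test and
# `x` an input of shape [batch, lookback] (or [batch, lookback, channels]).
model = torch.nn.Linear(96, 96)
x = torch.randn(1, 96)

macs, params = profile(model, inputs=(x,))              # raw MAC and parameter counts
print(clever_format([macs, params], "%.2f"))            # e.g. ['9.22K', '9.31K']
```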

Comment

Thank you for your response. I believe my questions have been resolved.

Review (Rating: 8)

TimeKAN is a time series forecasting model that combines frequency decomposition, representation learning, and mixing. It first uses a moving average to separate high and low frequencies, creating multi-level sequences that are embedded into a high-dimensional space. Cascaded Frequency Decomposition (CFD) blocks progressively isolate each frequency band. The Multi-order KAN Representation Learning (M-KAN) blocks use Kolmogorov-Arnold Networks to capture temporal patterns within each frequency band independently. Finally, the Frequency Mixing blocks recombine these decomposed bands to restore the original sequence, which is then used for forecasting through a linear layer.

Strengths

  1. Clarity and Structure: The paper follows a logical flow from problem statement to conclusions, making complex ideas accessible.
  2. Thorough Background: A strong review of related work provides valuable context, situating the contribution within the field.
  3. Detailed Experiments: Comprehensive experiments across multiple datasets support the model’s performance claims, with ablation studies highlighting component effectiveness.
  4. Focused Writing: The paper stays on topic, avoiding unnecessary details and maintaining focus on the core contribution.

Weaknesses

Depth in Explanation: The methodology section could offer more detail on complex components like Kolmogorov-Arnold Networks for greater accessibility.

Questions

What if we split the frequency band into more layers (more than 3, for example)? Will this increase performance?

Comment

We would like to sincerely thank Reviewer 8Zzz for providing a detailed review and insightful suggestions.

Q1: "Depth in Explanation: The methodology section could offer more detail on complex components like Kolmogorov-Arnold Networks for greater accessibility."

Thanks to your valuable suggestions, we have added mathematical details about how Kolmogorov-Arnold Networks perform forward propagation and refined the description of the Chebyshev KAN in Section 3.4; please see the red text in Section 3.4.

Q2: "What if we split the frequency band to more layers (more than 3 for example ). Will it increase performance ?"

In fact, the number of frequency bands is a hyperparameter that we manually set to accommodate the frequency domain characteristics of different datasets. To explore the impact of the number of frequency bands on performance, we set the number of frequency bands to 2, 3, 4, and 5. The effects of different frequency band divisions on performance (MSE|MAE results under 96-to-96 setting) are shown in the table below:

| Number of Frequency Bands | ETTh2 | Weather | Electricity |
|---|---|---|---|
| 2 | 0.292\|0.340 | 0.164\|0.209 | 0.183\|0.270 |
| 3 | 0.290\|0.339 | 0.163\|0.209 | 0.177\|0.268 |
| 4 | 0.290\|0.340 | 0.162\|0.208 | 0.174\|0.266 |
| 5 | 0.295\|0.346 | 0.164\|0.211 | 0.177\|0.273 |

As we can see, in most cases, dividing the frequency bands into 3 or 4 layers yields the best performance. This aligns with our prior intuition: dividing into two bands results in excessive frequency overlap, while dividing into five bands leads to too little information within each band, making it difficult to accurately model the information within that frequency range. We have added the above discussion to Appendix B.5.

Comment

Could you please explain why splitting the band more than three times results in worse performance? Logically, doing so should result in better time and spectral resolution, which would enhance performance.

Comment

Logically, more frequency bands could potentially improve performance by allowing for finer modeling. However, in real-life time series, noise is more prevalent in the high-frequency range. When the number of bands exceeds a certain threshold, some high-frequency bands may contain only useless noise, which harms prediction accuracy.

One solution is to use an uneven frequency division, modeling only those bands with effective content. However, since the frequency distribution varies across different time series datasets, manually dividing the frequency for each specific dataset would be impractical and hinder model generalization. Therefore, we divide the frequency uniformly into multiple bands and limit the number of bands to a reasonable range (e.g., 3-4 bands), ensuring each band contains useful information for prediction while maintaining the model's generalization ability. In this way, we can apply band-specific modules to extract effective frequency data, improving the prediction performance. Our results show that 3 or 4 bands is an optimal threshold, beyond which the impact of noise on prediction is amplified, leading to a decline in performance.

We hope our response addresses your question, and we would be happy to answer any further questions before the rebuttal ends.

Review (Rating: 5)

In this paper, the authors propose a time series forecasting method based on KAN. TimeKAN uses frequency decomposition and KAN to effectively capture temporal correlations of the data. Experiments show the effectiveness of TimeKAN based on real-world datasets.

Strengths

  1. A vivid introduction and related work section to explain the background of time series forecasting and KAN.
  2. A clear figure to illustrate the overall framework of TimeKAN. In the methodology section, all components are detailed.
  3. Good ablation study to test the effectiveness of the different components of the model.

Weaknesses

  1. The author does not explain why TimeKAN does not perform well in the Electricity dataset. It would be helpful if the authors could provide potential reasons or hypotheses for why TimeKAN underperforms on the Electricity dataset specifically.
  2. From Table 4, I do not see a huge increase with KAN compared to MLP models. Generally, if these results are similar, mostly, KAN is much slower than MLP. It would be good to see runtime comparisons between KAN and MLP implementations. Additionally, if KAN is slower than MLP in practice, it would be beneficial for authors to discuss more reasons why we prefer KAN over MLP.
  3. For the look-back window, the authors do not compare TimeKAN with other models. For most models, when the prediction length is fixed, the prediction accuracy will increase as the look-back window increases. It is beneficial to provide a comparative analysis of TimeKAN's performance with varying look-back windows compared to other baseline models. This would provide a more comprehensive evaluation of TimeKAN's capabilities relative to existing methods.
  4. For baseline methods, it is better to choose more frequency-based (such as FreTS) methods since frequency decomposition is a key contribution.

Questions

  1. See weaknesses.
  2. Could you provide a short theoretical analysis about why in some cases in time series forecasting, KAN is better than MLP?
  3. For Table 2, why not include the electricity dataset?
  4. For KAN, you mentioned that the Kolmogorov-Arnold representation theorem states that any multivariate continuous function can be expressed as a combination of univariate functions and addition operations. Could you explain more about how it can capture multivariate correlations?
Comment

Q6: "For Table 2, why not include the electricity dataset?"

Thanks to your valuable reminder, we have added the results of the missing electricity dataset in Table 2, which is shown in the following table:

| Datasets | ETTh1 | ETTh2 | ETTm1 | ETTm2 | Weather | Electricity |
|---|---|---|---|---|---|---|
| Metric | MSE\|MAE | MSE\|MAE | MSE\|MAE | MSE\|MAE | MSE\|MAE | MSE\|MAE |
| Linear Mapping | 0.401\|0.413 | 0.312\|0.362 | 0.328\|0.365 | 0.180\|0.263 | 0.164\|0.211 | 0.184\|0.275 |
| Linear Interpolation | 0.383\|0.398 | 0.296\|0.347 | 0.336\|0.370 | 0.181\|0.263 | 0.165\|0.210 | 0.196\|0.277 |
| Transposed Convolution | 0.377\|0.407 | 0.290\|0.344 | 0.326\|0.366 | 0.178\|0.261 | 0.163\|0.211 | 0.188\|0.274 |
| Frequency Upsampling | 0.367\|0.395 | 0.290\|0.340 | 0.322\|0.361 | 0.174\|0.255 | 0.162\|0.208 | 0.174\|0.266 |

Q7: "For KAN, you mentioned that the Kolmogorov-Arnold representation theorem states that any multivariate continuous function can be expressed as a combination of univariate functions and addition operations. Could you explain more about how it can capture multivariate correlations?"

When using KAN for supervised learning tasks, the correlations among multiple variables are implicitly reflected in the learnable univariate activation functions. Each variable in KAN is regulated by a learnable univariate activation function. If multiple variables exhibit consistent behavior, KAN tends to reduce redundancy by allowing only a small number of activation functions to remain effective, while deactivating those corresponding to redundant variables. Through supervised training, KAN learns and captures the implicit correlations among variables and uses the learnable activation functions to regulate their behavior. If explicit direct relationships among variables are needed, unsupervised learning can be employed to find a non-zero function $g$ such that $g(x_{1}, \cdots, x_{n}) \approx 0$. After training, the explicit relationships between each variable and the remaining variables can be constructed by extracting the expressions of the corresponding learnable activation functions. In this scenario, "univariate" refers to one element in the input sequence and does not represent a variate of the multivariate time series. In TimeKAN, KAN is used to model the latent representation of the time series, i.e., to learn the hidden-dimension representation, and we adopt a variate-independent strategy for multivariate processing, also known as the channel-independent strategy in PatchTST.

Comment

Thank you for your response, which has addressed my concerns to a certain extent. However, after reading other reviews, I share Reviewer XpEF's concerns regarding the validity of KAN itself. Therefore, I chose to maintain my original score, but I will discuss this further with the AC and other reviewers in the next stage to decide whether to change the score.

Comment

Dear Reviewer gWX6,

We would like to sincerely thank you once again for your valuable feedback. We fully understand your concerns about the effectiveness of KAN as a novel neural network. However, with less than 24 hours remaining in the rebuttal period, we would like to kindly remind you that vanilla KAN has already received all positive evaluations from five reviewers on OpenReview, and its theoretical contributions have been recognized by the community. In TimeKAN, we have thoughtfully designed the Multi-order KAN to better align the variable characteristics of KAN with the multi-frequency nature of time series, and we have demonstrated its effectiveness through experimental results. Given this, we genuinely hope that you will reconsider your score. Finally, we would like to express our sincere appreciation for your constructive comments, which have greatly enhanced the quality of our paper.

Best regards,

The Authors

Comment

Q5: "Could you provide a short theoretical analysis about why in some cases in time series forecasting, KAN is better than MLP?"

Time series typically consist of components from multiple frequencies. Suppose we divide the time series into $K$ frequency bands from low to high frequency, where each frequency band $f_k$ contains $N$ frequencies. The complexity $C_{f_k}$ of the $k$-th frequency band $f_k$ can be defined as its spectral entropy, i.e., $C_{f_k} = -\sum_{i=1}^{N} P(f_i) \log P(f_i)$, where $P(f_i)$ is the normalized power spectrum. In most cases, the time series contains a small number of concentrated high-amplitude frequencies in the low-frequency region, resulting in lower complexity, while the distribution in the high-frequency region is relatively uniform, leading to higher complexity. Therefore, the complexity of each frequency band usually satisfies $C_{f_1} < C_{f_2} < \cdots < C_{f_K}$. To perform more accurate time series prediction, we assume that for each frequency band $f_k$, its representation in the latent space is $F_k$, and that there exists an ideal network $M_k$ that perfectly fits $F_k$ in the latent space, such that $M_k(f_k) = F_k$. Thus, the complexity of the network should be proportional to the complexity of the corresponding frequency band, satisfying $C_{M_1} < C_{M_2} < \cdots < C_{M_K}$. Our goal is to find a network $\hat{M}_k$ such that the fitting error $(\hat{M}_k(f_k) - M_k(f_k))$ is minimized for each $f_k$. To achieve this, we first need to minimize the complexity error between $\hat{M}_k(f_k)$ and $M_k(f_k)$. We define the complexity error for the $k$-th frequency band as $\Delta C_k = C_{\hat{M}_k} - C_{M_k}$, where $C_{\hat{M}_k}$ denotes the complexity of the fitted network $\hat{M}_k$. When using an MLP as the fitting network, the complexity $C_{\hat{M}_k}$ is fixed. Since the complexity of the MLP cannot be adjusted, at most one band can satisfy $\Delta C_k = 0$, because $C_{M_k}$ is monotonically increasing with $k$. Therefore, when using a fixed-complexity MLP as the fitting network, errors inevitably accumulate, and the total error satisfies $\sum_{k=1}^{K} \Delta C_k \geq \epsilon$, where $\epsilon$ is a lower bound on the error. In contrast, when using KAN, we can flexibly control the complexity by adjusting the polynomial degree of KAN's internal kernel. In the optimal case, every frequency band can satisfy $\Delta C_k = 0$, resulting in a total error of zero: $\sum_{k=1}^{K} \Delta C_k = 0$. This shows that KAN can achieve a better theoretical performance upper bound when fitting multiple frequency components in the latent space. Furthermore, by accurately fitting the frequency representations, KAN provides a more reliable foundation for precise time series prediction.
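To make the complexity measure above concrete, here is a minimal sketch (our own illustration, not part of the paper; the function name and the equal-width band split are assumptions) that computes the spectral entropy of each frequency band for a univariate series:

```python
import numpy as np

def band_spectral_entropy(x, num_bands=3):
    """Spectral entropy C_{f_k} of each frequency band of a 1-D series.

    The power spectrum is split into `num_bands` equal-width bands
    (low to high frequency) and the entropy of the normalized power
    spectrum is computed within each band.
    """
    power = np.abs(np.fft.rfft(x)) ** 2           # power spectrum
    bands = np.array_split(power, num_bands)       # low -> high frequency bands
    entropies = []
    for band in bands:
        p = band / (band.sum() + 1e-12)             # normalized power spectrum P(f_i)
        entropies.append(-np.sum(p * np.log(p + 1e-12)))
    return entropies

# Toy example: a low-frequency sinusoid plus broadband noise typically yields
# entropy increasing from the lowest band to the highest band.
t = np.arange(512)
x = np.sin(2 * np.pi * t / 64) + 0.3 * np.random.randn(512)
print(band_spectral_entropy(x, num_bands=3))
```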

Comment

Q3: "Lack of comparison between TimeKAN and other methods regarding look-back windows."

Thank you for your valuable suggestion. We have added the performance of TimeMixer, iTransformer, and PatchTST with varying look-back windows for comparison in Figure 2. You can also see the table below (MSE results for a 96-step forecasting horizon on the ETTm2 dataset).

| Models | Input-48 | Input-96 | Input-192 | Input-336 | Input-512 | Input-720 |
|---|---|---|---|---|---|---|
| PatchTST | 0.191 | 0.183 | 0.176 | 0.173 | 0.173 | 0.188 |
| iTransformer | 0.193 | 0.180 | 0.177 | 0.175 | 0.179 | 0.181 |
| TimeMixer | 0.187 | 0.175 | 0.171 | 0.179 | 0.170 | 0.176 |
| TimeKAN | 0.190 | 0.174 | 0.170 | 0.167 | 0.163 | 0.164 |

As can be seen, TimeKAN clearly benefits from longer look-back windows compared to other models and consistently performs best in most cases.

Q4: "For baseline methods, it is better to choose more frequency-based (such as FreTS) methods since frequency decomposition is a key contribution."

Thanks to your valuable suggestions, we have added two representative frequency-based methods, FiLM and FreTS, to Table 1. A comparison between TimeKAN and these two methods is shown in the following table (average MSE|MAE across all forecasting horizons):

| Datasets | ETTh1 | ETTh2 | ETTm1 | ETTm2 | Weather | Electricity |
|---|---|---|---|---|---|---|
| Metric | MSE\|MAE | MSE\|MAE | MSE\|MAE | MSE\|MAE | MSE\|MAE | MSE\|MAE |
| FreTS | 0.491\|0.475 | 0.433\|0.446 | 0.407\|0.417 | 0.339\|0.374 | 0.245\|0.294 | 0.192\|0.282 |
| FiLM | 0.516\|0.483 | 0.402\|0.420 | 0.412\|0.402 | 0.288\|0.328 | 0.271\|0.290 | 0.223\|0.302 |
| TimeKAN | 0.417\|0.427 | 0.383\|0.404 | 0.376\|0.395 | 0.277\|0.322 | 0.242\|0.272 | 0.197\|0.286 |

It can be seen that TimeKAN significantly outperforms FiLM and FreTS on most of the datasets, which shows the effectiveness of our designed frequency Decomposition-Learning-Mixing architecture.

Comment

Many thanks to Reviewer gWX6 for providing thorough and detailed comments.

Q1: "The author does not explain why TimeKAN does not perform well in the Electricity dataset. It would be helpful if the authors could provide potential reasons or hypotheses for why TimeKAN underperforms on the Electricity dataset specifically."

Our TimeKAN is a model based on frequency analysis. We infer that its poor performance on the electricity dataset is due to the overly short look-back window ($T=96$), which cannot provide sufficient frequency information. To verify this, we compare the average number of effective frequency components under a specific look-back window. Specifically, we randomly select a sequence of length $T$ from the electricity dataset and transform it into the frequency domain using FFT. We define effective frequencies as those with amplitudes greater than 0.1 times the maximum amplitude. Then, we take the average number of effective frequencies obtained across all variables to reflect the amount of effective frequency information provided by the sequence. When $T=96$ (the setting in the paper), the average number of effective frequencies is 10.69. When we extend the sequence length to 512, the average number of effective frequencies becomes 19.74. Therefore, the effective frequency information provided by 512 time steps is nearly twice that of 96 time steps. This indicates that $T=96$ loses a substantial amount of effective information. To validate whether using $T=512$ allows us to leverage more frequency information, we extend the look-back window of TimeKAN to 512 on the electricity dataset and compare it with the state-of-the-art methods TimeMixer and MOMENT. The results are shown in the table below (MSE|MAE):

| Models | Predict-96 | Predict-192 | Predict-336 | Predict-720 |
|---|---|---|---|---|
| MOMENT | 0.136\|0.233 | 0.152\|0.247 | 0.167\|0.264 | 0.205\|0.295 |
| TimeMixer | 0.135\|0.231 | 0.149\|0.245 | 0.172\|0.268 | 0.203\|0.295 |
| TimeKAN | 0.133\|0.230 | 0.149\|0.247 | 0.165\|0.261 | 0.203\|0.294 |

Although TimeKAN performs significantly worse than TimeMixer when $T=96$, it achieves the best performance on the electricity dataset when the look-back window is extended to 512. This also demonstrates that TimeKAN can benefit significantly from richer frequency information. For a fair comparison with other methods, we adopted the common 96-step look-back window setting, which actually compromises the advantages of TimeKAN on long sequences. We have added the above discussion to Appendix B.4.
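For reference, a minimal sketch of the effective-frequency counting procedure described above (the function name and the `[T, num_variables]` array layout are our own assumptions; the 0.1 amplitude threshold follows the description):

```python
import numpy as np

def avg_effective_frequencies(series, threshold_ratio=0.1):
    """Average number of 'effective' frequencies across variables.

    series: array of shape [T, num_variables].
    A frequency is counted as effective when its amplitude exceeds
    `threshold_ratio` times the maximum amplitude of that variable's spectrum.
    """
    amplitudes = np.abs(np.fft.rfft(series, axis=0))     # [T//2 + 1, num_variables]
    max_amp = amplitudes.max(axis=0, keepdims=True)
    effective = (amplitudes > threshold_ratio * max_amp).sum(axis=0)
    return effective.mean()

# Usage: compare a 96-step and a 512-step window of the same data.
data = np.random.randn(512, 321)   # placeholder for an Electricity-style window
print(avg_effective_frequencies(data[:96]), avg_effective_frequencies(data))
```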

Q2: "From Table 4, I do not see a huge increase with KAN compared to MLP models. Generally, if these results are similar, mostly, KAN is much slower than MLP. It would be good to see runtime comparisons between KAN and NLP implementations. Additionally, if KAN is slower than MLP in practice, it would be beneficial for authors to discuss more reasons why we prefer KAN over MLP."

One significant reason why the improvement of KAN over MLP is not particularly pronounced lies in the well-designed nature of our Frequency Decomposition-Learning-Mixing architecture, which enables even an MLP to demonstrate competitive performance. In TimeKAN, we use Chebyshev polynomials as the basis functions for the learnable activation functions, denoted as ChebyshevKAN. Assuming each layer of the neural network contains $N$ neurons, the computational cost per layer for ChebyshevKAN with a maximum degree of $K$ is $N^2 K$, while for an MLP the computational cost is $N^2$. Therefore, the computational cost of ChebyshevKAN is $K$ times that of an MLP. Notably, in the comparison presented in Table 4 of our paper, we ensure that the computational costs of the MLP and ChebyshevKAN are similar. Under comparable computational costs, we observe that the inference times (s/iter) of the two methods are not significantly different, as shown in the table below:

| TimeKAN | ETTh1 | ETTm1 | Weather |
|---|---|---|---|
| MLP-based | 0.0124s | 0.0148s | 0.0246s |
| KAN-based | 0.0121s | 0.0159s | 0.0250s |

Therefore, under similar computation consumption, KAN achieves relatively better performance, demonstrating its capability in dynamic frequency representation. With the emergence of more KAN variants, we can anticipate further improvements in the computational efficiency and forecasting performance of TimeKAN.
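To make the cost comparison above tangible, below is a minimal, illustrative sketch of a Chebyshev-KAN layer (our own simplified implementation, not the authors' code; the class and parameter names are assumptions). It shows why the per-layer cost scales with the polynomial degree relative to a plain linear layer:

```python
import torch
import torch.nn as nn

class ChebyKANLayer(nn.Module):
    """Minimal Chebyshev-KAN layer sketch.

    Each input-output pair (i, j) gets a learnable univariate function in the
    Chebyshev basis up to `degree`, so the per-layer cost is roughly
    in_dim * out_dim * degree, i.e. about K times that of an MLP layer.
    """
    def __init__(self, in_dim, out_dim, degree=3):
        super().__init__()
        self.degree = degree
        self.coeffs = nn.Parameter(torch.randn(in_dim, out_dim, degree + 1) * 0.1)

    def forward(self, x):                       # x: [batch, in_dim]
        x = torch.tanh(x)                       # squash into [-1, 1] for the Chebyshev basis
        cheb = [torch.ones_like(x), x]          # T_0(x) = 1, T_1(x) = x
        for _ in range(2, self.degree + 1):
            cheb.append(2 * x * cheb[-1] - cheb[-2])   # T_k = 2x * T_{k-1} - T_{k-2}
        basis = torch.stack(cheb, dim=-1)       # [batch, in_dim, degree + 1]
        return torch.einsum("bik,iok->bo", basis, self.coeffs)

# Layers with different polynomial degrees give different fitting capacity,
# which is the "multi-order" idea applied to different frequency bands.
y = ChebyKANLayer(in_dim=32, out_dim=32, degree=4)(torch.randn(8, 32))
print(y.shape)   # torch.Size([8, 32])
```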

Review (Rating: 3)

The paper presents TimeKAN, a Kolmogorov-Arnold Network (KAN)-based model for long-term time series forecasting, designed to handle complex multi-frequency patterns in real-world data. Traditional models struggle with mixed frequencies, but TimeKAN addresses this with a three-part architecture: Cascaded Frequency Decomposition to separate frequency bands, Multi-order KAN Representation Learning to model each band’s specific patterns using adaptable polynomial orders, and Frequency Mixing to recombine frequencies effectively. Experiments show that TimeKAN achieves superior accuracy and efficiency compared to existing models, making it a robust, lightweight solution for complex TSF tasks.

Strengths

Exploratory Application of KAN to Time Series Forecasting: The paper attempts to introduce Kolmogorov-Arnold Networks (KAN) into time series forecasting, using multi-order polynomial representations to handle the complexities of different frequency components. While KAN has not been widely applied in this area, this effort demonstrates its potential flexibility in data fitting and offers an alternative approach to traditional MLPs.

Comprehensive Experimental Design: The paper includes experiments across various time series datasets, such as weather, electricity, and energy data, covering diverse scenarios. Additionally, it conducts ablation studies to examine the effects of each module. These experiments help to assess TimeKAN’s performance and may provide a reference point for further research.

Weaknesses

Lack of Innovation: The primary contribution of this paper is the integration of the Kolmogorov-Arnold Network (KAN) into time series forecasting, yet the work does not introduce novel methods or substantial breakthroughs in methodology. While the inclusion of KAN is somewhat new, other components, such as frequency decomposition and mixing, are mature techniques, and the paper does not propose innovative applications or enhancements to these. Overall, this work appears more like a combination of existing technologies rather than a genuinely innovative study.

Absence of Comparison with Cutting-Edge Models: The experiments lack direct comparisons with state-of-the-art models, especially those in high demand for time series forecasting, such as large language models (LLMs) and foundation models. Given current research trends, these models have become widely adopted benchmarks. Without such comparisons, the effectiveness of the proposed method remains unclear, especially as the improvements presented are relatively limited compared to advancements in mainstream approaches.

Reliance on KAN, a Model with Limited Validation: The foundation of TimeKAN is the KAN model, which has not yet been widely validated or accepted. Its theoretical correctness and practical effectiveness remain uncertain, which casts doubt on the reliability and generalizability of TimeKAN as a whole. If there are inherent issues with KAN, the predictive performance and stability of TimeKAN could be compromised, making the paper's conclusions less convincing.

Insufficient Analysis of Computational Efficiency: While the paper claims that TimeKAN is more lightweight than existing methods, it lacks an in-depth analysis of its actual computational efficiency, especially compared to more mainstream and optimized time series models. Additionally, there is no quantification of the computational cost associated with KAN’s multi-order polynomial calculations when handling long-sequence data. Given that many time series tasks require efficient real-time computations, focusing solely on parameter reduction does not adequately demonstrate TimeKAN’s advantage in computational efficiency; the absence of data on inference speed and computational cost undermines its practical applicability.

Focus on Single-Task Performance Rather Than Generalized Representation: A dominant trend in time series modeling now follows the approach of large language models (LLMs) to develop foundation models and leverage self-supervised representation learning. This approach enables generalization across various tasks and domains, ultimately aiming for a “one-fit-all” solution. However, this paper remains focused on improving single-task performance in time series forecasting (TSF), which may be of limited value in light of the broader goals of the field. Furthermore, the improvements reported in the experimental results are relatively modest, and without statistical significance testing, it remains unclear if these gains are truly meaningful or could simply be attributed to random variation.

Questions

To my knowledge, KAN itself has not been formally accepted, meaning it has not undergone rigorous peer-reviewed validation. If KAN’s theoretical foundation is later found to be flawed, would this impact the validity of this paper? – If the underlying theory or structure of KAN is later shown to have limitations or inaccuracies, would the overall reliability of TimeKAN be compromised? Has the author considered this risk, and are there alternative solutions in place?

In the experimental section, this paper does not compare TimeKAN with current state-of-the-art models (such as large language models or foundation models), making it difficult to assess its actual performance – Without direct comparisons with these advanced models, can TimeKAN demonstrate a significant advantage? If the authors believe TimeKAN holds particular value in computational efficiency or predictive accuracy, could more data be provided to quantify this advantage?

A major trend in time series research is developing foundation models, inspired by large language models, to generalize across domains and tasks. However, it is unclear if TimeKAN’s current design can support such robust representation learning – Can TimeKAN truly compete with established frameworks like Transformers in terms of generalization and adaptability? If not, have the authors considered alternative approaches to enhance TimeKAN’s structural robustness and flexibility for broader applications?

Comment

Q4-part3: "The absence of data on inference speed and computational cost undermines its practical applicability."

In fact, the computational cost of TimeKAN has already been presented in Table 5 using the MACs metric. Compared with the three other best-performing models (including MLP-based and Transformer-based methods), TimeKAN requires significantly less computational cost. For inference time (s/iter), we have added a comparison of TimeKAN and other models under the same settings. Due to the computational cost of multiple Fourier Transforms, TimeKAN's inference speed is relatively slower, but it remains in the same order of magnitude as all other methods except the time series foundation model Time-FFM, which it significantly outperforms.

| Models | ETTh1 | ETTm1 | Weather |
|---|---|---|---|
| Time-FFM | 0.0216s | 0.0219s | 0.0539s |
| PatchTST | 0.0107s | 0.0116s | 0.0208s |
| TimeMixer | 0.0136s | 0.0155s | 0.0223s |
| iTransformer | 0.0114s | 0.0126s | 0.0185s |
| TimeKAN | 0.0121s | 0.0159s | 0.0250s |

Q5-part1: "Compared to one-fit-all solutions like LLM-based time series methods or time series foundation methods, approaches focusing solely on improving time series forecasting may have limited value. Besides, TimeKAN may not be able to compete with established frameworks like Transformers in terms of generalization and adaptability."

Our TimeKAN is carefully designed for time series forecasting. As shown in the table above, TimeKAN significantly outperforms the current state-of-the-art LLM-based time series method Time-FFM and the time series foundation model MOMENT. Furthermore, as recent studies indicate, LLM-based time series forecasters perform the same as or worse than basic LLM-free ablations, yet require orders of magnitude more compute [3]. This suggests that the effectiveness of current LLMs for the time series forecasting task remains questionable. Besides, the state-of-the-art time series foundation model MOMENT requires pretraining on over 13 million time series, which consumes approximately 404 GPU hours and involves more than 20,000 times the number of parameters of TimeKAN. However, it still fails to outperform TimeKAN on typical time series forecasting tasks. Therefore, whether achieving superior performance in time series forecasting requires building large models following trends in other fields or designing lightweight, task-specific models remains an open question.

Based on the current evidence, large models trained on massive datasets have not surpassed our lightweight TimeKAN in time series forecasting tasks. While such models generalize well across multiple tasks, they have yet to achieve the best results on individual forecasting tasks. In conclusion, we argue that the future of time series forecasting should not be confined to scaling up models. Designing high-performance, lightweight models that incorporate domain knowledge for single tasks presents a promising alternative pathway.

Q5-part2: "The improvements reported in the experimental results are relatively modest, and without statistical significance testing, it remains unclear if these gains are truly meaningful or could simply be attributed to random variation. "

To evaluate the robustness of TimeKAN, we repeated the experiments on three randomly selected seeds and compared it with the second-best model (TimeMixer). We report the mean and standard deviation of the results across the three experiments, as well as the confidence level of TimeKAN's superiority over TimeMixer. The results are averaged over four prediction horizons (96, 192, 336, and 720).

| Dataset | MSE (TimeKAN) | MSE (TimeMixer) | Confidence | MAE (TimeKAN) | MAE (TimeMixer) | Confidence |
|---|---|---|---|---|---|---|
| ETTh1 | 0.422±0.004 | 0.462±0.006 | 99% | 0.430±0.002 | 0.448±0.004 | 99% |
| ETTh2 | 0.387±0.003 | 0.392±0.003 | 99% | 0.408±0.003 | 0.412±0.004 | 90% |
| ETTm1 | 0.378±0.002 | 0.386±0.003 | 99% | 0.396±0.001 | 0.399±0.001 | 99% |
| ETTm2 | 0.278±0.001 | 0.278±0.001 | - | 0.324±0.001 | 0.325±0.001 | 90% |
| Weather | 0.243±0.001 | 0.245±0.001 | 99% | 0.273±0.001 | 0.276±0.001 | 99% |

As shown in the table, in most cases we have over 90% confidence that TimeKAN outperforms the second-best model, which demonstrates the good robustness of TimeKAN.

[3] Tan et al., Are Language Models Actually Useful for Time Series Forecasting?, NeurIPS 2024.
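The response does not state exactly how the confidence levels were computed; below is a hedged sketch of one plausible procedure (a one-sided Welch t-test over per-seed errors), provided purely for illustration and not as the authors' method:

```python
import numpy as np
from scipy import stats

def superiority_confidence(scores_a, scores_b):
    """Confidence that model A's error is lower than model B's.

    One plausible way to derive such confidence levels from a few seeds:
    a one-sided Welch t-test on the per-seed error values (an assumption,
    since the exact procedure is not specified in the response).
    """
    t, p_two_sided = stats.ttest_ind(scores_a, scores_b, equal_var=False)
    p_one_sided = p_two_sided / 2 if t < 0 else 1 - p_two_sided / 2
    return 1 - p_one_sided

# Example with three seeds per model (MSE values; lower is better).
timekan   = np.array([0.420, 0.423, 0.424])
timemixer = np.array([0.458, 0.462, 0.466])
print(f"confidence: {superiority_confidence(timekan, timemixer):.2%}")
```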

Comment

I appreciate that the author has responded to all the questions I raised in a very serious and comprehensive manner. However, since this paper is based on the KAN algorithm, from my personal experience in using the KAN algorithm in the past, I do have some concerns and doubts that are hard to let go regarding its performance and efficiency. At this stage, before the KAN algorithm has successfully passed the strict peer review process and been finally accepted, and also before it has obtained widespread recognition and affirmation in the academic community, I think this paper may not be able to meet the acceptance criteria for the time being. Therefore, I have decided to maintain the current score evaluation.

Comment

Q3: "The foundation of TimeKAN is KAN model, which has limited validation and uncertain theoretical correctness. If KAN is found to have flaws, it could undermine overall reliability of TimeKAN."

In fact, the foundation of TimeKAN is the Decomposition-Learning-Mixing architecture, which decomposes time series into multiple frequency bands represented in the time domain and leverages time-domain processing techniques to learn representations for specific frequencies. We have demonstrated the effectiveness of this framework, showing that even when KAN is replaced with a traditional MLP, competitive performance is still maintained. To better represent the various frequency components, we further developed the Multi-order KAN to enhance the Decomposition-Learning-Mixing architecture. Within this architecture, the performance of the well-designed Multi-order KAN surpasses that of directly using KAN, indicating that a carefully designed Multi-order KAN can more accurately model different frequency components. As better KAN variants emerge in the future, our framework can benefit further. Therefore, from a practical perspective, our current efforts have shown that the design of the Multi-order KAN is an effective component of our framework.

From a theoretical standpoint, although KAN is still a relatively new member of the deep learning family, the Kolmogorov-Arnold representation theorem it relies on has been established for decades. The theorem states that any multivariate continuous function can be represented as a combination of univariate functions and addition operations. In our Multi-order KAN, we primarily utilize univariate functions of varying complexities to form networks with different fitting capabilities, thus representing different frequency components, which is essentially the core idea of the Kolmogorov-Arnold representation theorem. Therefore, we believe that the use of the Multi-order KAN is fundamentally grounded in the well-established Kolmogorov-Arnold representation theorem, making the design of TimeKAN consistent with foundational theory.

Q4-part1: "Lack of in-depth analysis of the actual computational efficiency between TimeKAN and mainstream time series models."

Thank you for your valuable feedback. We have added an analysis of the computational complexity of TimeKAN in Appendix B.1. Given a multivariate time series with $M$ variables and $L$ time steps, the computational complexity of TimeKAN is approximately $\mathcal{O}(ML \log L)$. This indicates that the complexity scales linearly with the number of variables and log-linearly with the number of time steps. The $L \log L$ term arises from the computational complexity of the Fast Fourier Transform. To further reduce the complexity with respect to the number of time steps, we can employ a pre-selected set of Fourier bases for a fast implementation, reducing the complexity to $\mathcal{O}(ML)$. In comparison, the computational complexity of PatchTST is $\mathcal{O}(L^{2})$, iTransformer has a complexity of $\mathcal{O}(M^{2})$, and TimeMixer exhibits a complexity of $\mathcal{O}(ML^{2})$. Therefore, theoretically, TimeKAN achieves a better balance between the number of variables and input length when compared to these state-of-the-art methods. Notably, the actual computational speed may vary due to differences in parameter settings. For instance, while the computational complexity of PatchTST scales quadratically with input length, its computation speed can be significantly improved by selecting larger patch sizes.

Q4-part2: "There is no quantification of the computational cost associated with KAN’s multi-order polynomial calculations when handling long-sequence data. "

In TimeKAN, the role of the Multi-order KAN is to learn representations at different frequencies, i.e., to learn the latent space representation. For a single KAN, assuming the highest degree of the Chebyshev polynomial is $k$ and the hidden dimension is $D$, the computational cost for channel modeling is $M \cdot L \cdot k \cdot D^{2}$, which increases linearly with the sequence length $L$. Therefore, it does not result in excessive computational costs for long sequences.

Comment

Q2: "Lack of comparison between TimeKAN and current state-of-the-art LLM-based time series models as well as time series foundation models in terms of predictive accuracy or computational efficiency."

Thanks for your valuable suggestion. Through these comparisons, our TimeKAN outperforms the state-of-the-art LLM-based time series model and time series foundation model in the majority of cases across six datasets. We have added a state-of-the-art LLM-based time series model, Time-FFM [1], which has the same look-back window setting ($T=96$) as TimeKAN. The additional comparative results (MSE) are shown below:

| Models | Predict | ETTh1 | ETTh2 | ETTm1 | ETTm2 | Weather | Electricity |
|---|---|---|---|---|---|---|---|
| Time-FFM (T=96) | 96 | 0.385 | 0.301 | 0.336 | 0.181 | 0.191 | 0.198 |
| | 192 | 0.439 | 0.378 | 0.378 | 0.247 | 0.236 | 0.199 |
| | 336 | 0.480 | 0.422 | 0.411 | 0.309 | 0.289 | 0.212 |
| | 720 | 0.462 | 0.427 | 0.469 | 0.406 | 0.362 | 0.253 |
| TimeKAN (T=96) | 96 | 0.367 | 0.290 | 0.322 | 0.174 | 0.162 | 0.174 |
| | 192 | 0.414 | 0.375 | 0.357 | 0.239 | 0.207 | 0.182 |
| | 336 | 0.445 | 0.423 | 0.382 | 0.301 | 0.263 | 0.197 |
| | 720 | 0.444 | 0.443 | 0.445 | 0.395 | 0.338 | 0.236 |

Additionally, we also compare our TimeKAN with the state-of-the-art time series foundation model MOMENT [2], which is pretrained on a time series corpus of size 13M. Notably, the look-back window of MOMENT is set to 512. Therefore, we extend the look-back window of TimeKAN to 512 to ensure a fair comparison. The results of the comparison (MSE) are as follows:

| Models | Predict | ETTh1 | ETTh2 | ETTm1 | ETTm2 | Weather | Electricity |
|---|---|---|---|---|---|---|---|
| MOMENT (T=512) | 96 | 0.387 | 0.288 | 0.293 | 0.170 | 0.154 | 0.136 |
| | 192 | 0.410 | 0.349 | 0.326 | 0.227 | 0.197 | 0.152 |
| | 336 | 0.422 | 0.369 | 0.352 | 0.275 | 0.246 | 0.167 |
| | 720 | 0.454 | 0.403 | 0.405 | 0.363 | 0.315 | 0.205 |
| TimeKAN (T=512) | 96 | 0.357 | 0.266 | 0.293 | 0.163 | 0.143 | 0.133 |
| | 192 | 0.401 | 0.138 | 0.323 | 0.224 | 0.186 | 0.149 |
| | 336 | 0.408 | 0.333 | 0.353 | 0.274 | 0.237 | 0.165 |
| | 720 | 0.447 | 0.376 | 0.398 | 0.360 | 0.309 | 0.203 |

As shown in the two tables above, our TimeKAN outperforms the state-of-the-art LLM-based time series model and time series foundation model in the majority of cases across six datasets, demonstrating its effectiveness in time series forecasting. Furthermore, Time-FFM has 90M parameters and MOMENT has 385M parameters, while TimeKAN has an average of only 21K parameters. Therefore, TimeKAN is a simple yet effective model, achieving superior performance with just 0.025% (or even less) of the parameters required by these foundation models.

[1] Liu et al., Time-FFM: Towards LM-Empowered Federated Foundation Model for Time Series Forecasting., NeurIPS 2024.

[2] Goswami et al., MOMENT: A Family of Open Time-series Foundation Models., ICML 2024.

Comment

We would like to sincerely thank Reviewer XpEF for providing the valuable feedback.

Q1: "This work appears more like a combination of existing technologies, such as frequency decomposition and mixing, rather than a genuinely innovative study."

In fact, the core contribution of TimeKAN lies in re-examining the time series forecasting task from the perspective of multi-frequency decomposition learning. To the best of our knowledge, TimeKAN is the first method to decompose time series into multiple frequency bands represented in the time domain and leverage time-domain processing techniques to learn representations for specific frequencies, which effectively handles the challenges arising from complex frequency mixing. As a result, TimeKAN fundamentally differs from prior methods in frequency-domain learning (e.g., FreTS, FEDformer) and decomposition-based learning (e.g., TimeMixer, MICN) in the following ways:

(1) Frequency-Domain Learning: FreTS, FEDformer, and other frequency-domain based methods focus on learning the representation of frequency signals directly in the frequency domain. However, current evidence suggests that methods learning directly in the frequency domain generally perform worse than those learning directly in the time domain. In contrast, we propose using Frequency Upsampling to cascade different frequency components’ representations in the time domain, which allows TimeKAN to directly process different frequency components in the time domain, avoiding the difficulty of learning frequency domain signals directly. To our knowledge, no previous work has adopted this paradigm.

(2) Decomposition-Based Learning: Decomposition-based methods like TimeMixer and MICN mostly decompose sequences into subcomponents from a time-domain perspective, while we decompose time series from a frequency-domain perspective. The frequency-domain perspective enables a more comprehensive analysis of time series, encompassing both periodic long-term variations and sharp short-term fluctuations. By extracting the sequence from the frequency domain and learning in the time domain, our TimeKAN is able to accurately model the subcomponents of the time series.

Furthermore, the incorporation of the Multi-order KAN enhances the framework’s capability to adaptively learn specific frequency components. TimeKAN has demonstrated state-of-the-art performance across multiple long-term time series forecasting tasks while showcasing a lightweight and parameter-efficient architecture.

Comment

Dear Reviewer XpEF, 

We would like to express our sincere gratitude once again for your valuable feedback. We fully understand your concerns about the effectiveness of KAN as a novel neural network, but with less than 24 hours remaining until the rebuttal period concludes, we would like to kindly remind you that the vanilla KAN has already received all positive evaluations from five reviewers on OpenReview, and its theoretical contributions have been acknowledged by the community. This demonstrates the potential of KAN for widespread adoption in the deep learning community. In TimeKAN, we have carefully designed the Multi-order KAN to align the variable characteristics of KAN with the multi-frequency nature of time series, and demonstrated its effectiveness through experiments. Therefore, we sincerely hope that you might reconsider your score. Finally, we would like to reiterate our appreciation for your thoughtful comments, which have significantly improved the quality of our paper. 

Best regards, 

The Authors

Comment

We sincerely appreciate all the reviewers for their thoughtful feedback and constructive suggestions, which have been immensely helpful in guiding us to enhance our paper. We would like to clarify three points in order to address the remaining concerns of some reviewers to the best of our ability:

(1) The core foundation of TimeKAN lies in the Decomposition-Learning-Mixing architecture rather than KAN: it decomposes time series into multiple frequency bands represented in the time domain and leverages time-domain processing techniques to learn representations for specific frequencies. The Multi-order KAN blocks are carefully designed to efficiently adapt to this framework and improve forecasting performance across multiple datasets. TimeKAN has demonstrated state-of-the-art performance across multiple long-term time series forecasting tasks while showcasing a lightweight and parameter-efficient architecture.

(2) We understand the concerns of some reviewers regarding the effectiveness of KAN. However, we emphasize that KAN is not a seamless replacement for MLP and may not work in all cases; it requires careful design to fully realize its potential. As shown in Table 4 of the paper, directly replacing MLP with KAN results in performance that is actually worse than MLP. Therefore, in TimeKAN, we have designed the Multi-order KAN to better match the modeling of different frequency components, thereby maximizing the advantages of KAN’s variable complexity. The results demonstrate that the Multi-order KAN outperforms both the direct use of KAN and the direct use of MLP.

(3) To the best of our knowledge, the vanilla KAN has received positive feedback on OpenReview, with five reviewers consistently agreeing on its contribution to the community, which to some extent demonstrates its effectiveness. Therefore, we hope the reviewers will approach KAN with a more open perspective.

We sincerely hope the reviewers will reconsider their score and look forward to further feedback.

AC Meta-Review

The paper proposes a novel forecasting method that exploits the frequency structure of time series. The method, TimeKAN, has a decomposition component, a representations learning component based on the Kolmogorov-Arnold Network, and a mixing component.

Overall, this paper has a solid technical contribution and convincing experiments. The reviewers also praised the clarity of the paper, the thorough presentation of background work, and the ablation studies.

There were some reservations expressed because the KAN algorithm has not yet been accepted at a conference - however, it has been public on arXiv since April 2024 and the proposed method is sufficiently different from that algorithm, so I do not see it as a problem, particularly since the authors have conducted additional experiments replacing the KAN structure with an MLP, showing the merits of the other components in their architecture (i.e. the method is not exclusively reliant on the success of KAN).

The concerns raised by reviewer XpEF about comparisons to state-of-the-art models have been adequately resolved in the rebuttal through the addition of comparisons to MOMENT and Time-FFM.

Thus, I recommend acceptance of this submission.

Additional Comments from the Reviewer Discussion

As mentioned in the meta-review above, there were some concerns expressed with respect to the KAN-based block in the paper and comparisons to s.o.t.a. which, in my view, were addressed by the authors.

Final Decision

Accept (Poster)