6.0

/10

Poster5 位审稿人

最低4最高7标准差1.3

3.2

置信度

正确性2.6

贡献度2.6

表达2.6

NeurIPS 2024

Scaling Law for Time Series Forecasting

Jingzhe Shi,Qinwei Ma,Huan Ma,Lei Li

OpenReview PDF

提交: 2024-05-12更新: 2024-12-19

TL;DR

Our research proposes a novel theory for scaling laws in time series forecasting, addressing anomalies observed in previous studies and emphasizing dataset size, model complexity, and forecast horizon in deep learning methodologies.

摘要

关键词

Time series forecastingScaling lawTheory

评审与讨论

审稿意见

评分: 7置信度: 32024-06-24

The authors propose a set of scaling laws on time series based relating model size, data size, and the "history dependence" of the time series. They end with a power law which they then try on different datasets.

优点

The topic is interesting and it would be a good thing to study.

缺点

The presentation is really bad and it makes the understanding difficult. The theoretical results have assumptions that are sometimes not properly stated as assumptions, and sometimes hard to justify.

Some of the theoretical preliminaries are either questionable, or badly explained. If this is solved I might look into the actual derivations more in detail:

Quasi-isometric assumption: it seems to imply that the mapping is either the identity or close to it. Imagine you have a language where there are two synonyms; then the only way in which the equation holds is when alpha is very small. Which is fine, but then the value of alpha can be arbitrarly small! They should justify when this is valid.
A similar limitation appears in assumptions 4 and 5, which seem to indicate that the data must follow some markovian structure. And I can definitely find cases where they would not apply: imagine a language with X letters forming Y words of length Z. If Y<<X^Z (which almost always happens) when L<Z you could easily end up with the rank of the covariance matrix for sequences (of length less than Z) being higher than the rank of words. In more practical settings, there are languages (ex: german) where the sense of a word is tied to a word that happens much later in the sentence (in german, verbs with preposition when there is a nebensatz), which would relate to the toy example I mention before, because the rank of the covariance decreases as the length increases.
In line 132 they assume that the data follows Zipf's law. This is a big assumption, and while it could apply to languages, it is hard to argue that it applies to every time series (and see Clauset et al. "Power-law distributions in empirical data"). They should at least mention why they think it applies to the datasets they are using.
In sec 3.2.4 they assume that the model partitions the space uniformly, and claim that this is the worst case. It is not clear why would that be the case, and don't seem to actually analyze other scenarios later on (if they do, refer to it on the same sentence).

I haven't checked this in detail, but it seems to me that with assumptions 1 to 3 the model is either linear or very similar to it, and by 4 and 5 the data might need to be implicitly autoregressive. Taking those two together and assuming that the data has a power law distribution, it seems almost necessary that the loss or model size would follow the laws that they propose.

For the experiments:

From looking at Fig 1, what I could conclude is that different datasets/models follow different distributions.
I think that from the fit is very hard to make any proper justification. MOst of the trends could be fit by some exponential or a simple polynomial, so they should compare to some sensible alternatives, and use some valid statistic criterion (akaike, for example) to compare.
In the same line of thought, fitting on a logscale is often hard, and I don't see how they did the fit. The fit should be more properly explained (and likely done, check http://bactra.org/weblog/491.html for example)
Fig. 2, from looking at it I would expect the double-descent to relate to those results.
I did not see in the theory any indication that the losses could scale negatively (as in some cases of Fig. 2). THis seems to be important, as some data has such trend. If their model is not suposed to overfit, this is a crucial assumption?
In line 85, a less relevant point: "[33] shows that the average intrinsic dimension for a time series sequence should converge to a finite value, which indicates that the intrinsic dimension of a sequence should be proportional to its length." This sentence is wrong as far as I can tell: if the average intrinsic dimension converges to a finite value, it is not proportional to its length. For example, imagine the sequence 01010101010.... irrespective of the length, the intrinsic dimension is fixed.

The notation is also problematic. They use L for both length (line 100) and loss (Eq. in line 144), and sometimes they just write "loss" instead of L.

I pointed out a few problems with the writing. There are more, but at this point it's on the authors to fix them:

Lines 21-22 Neural Nets utilize different model architectures, including FFN-based, Transformer-based and Convolution-based neural nets have been proposed.
Sometimes they refer to "scaling laws" in plural, and sometimes as "the Scaling Law" in singular as if there was a single one.
Line 67: "There have been many works analyzing on different mathematical properties "
Line 1667: if the data samples are "sparse". I assume they mean that there are few data points.

问题

Check weaknesses. Also, how would the authors' results relate to the double-descent curve? This would put the two cases in context. Also, their scaling law seems to be tied to ever-decreasing loses

局限性

They should really address the limitations of their assumptions and discuss when are they valid.

作者回复

2024-08-05

Dear Reviewer 7rY6,

Thank you for a detailed review! We truly appreciate your reviews and questions, especially for our theories, which aim our theory towards an omnipotent theory. We would like to briefly summarize your concerns, and our rebuttal and then expand upon this summary.

1.Theory Assumptions

1.1. Your view: Current Assumptions lead to close-to-linear relationships.

Response:

1.1.1.Our current assumptions is not equivalent to linear, and we can further weaken our assumptions while deriving similar results, (which would be more complicated a process hence we did not choose to do so in our original submission, which is presented later in this rebuttal).

1.1.2. We chose stronger assumptions for simplicity of derivation, which may lead to closer to linear models. We would like to mention that although counter-intuitively, linear models are indeed important models in time series forecasting, represented by a series of important experimental and theoretical works for time series[1,5,6].

We make these assumptions, not to include every possible scenario, but to focus on the most relevant properties related to the research problem.

1.2. Your view: Current assumptions lead to autoregressive behavior.

Response:

1.2.1. Our theory utilizes Conditioned Expectation, which actually still holds even in cases you refer to as Nebensatz. We will use the exact example of Germany to show how it works.

1.2.2. The task of Time Series Forecasting itself has assumed time series follows at least partially autoregressive behaviors, and the success of data-driven methods has validated this assumption.

1.3. Your view: Zip-f is too strong an assumption.

Response:

1.3.1. In our paper we have proven it under CI case experimentally using PCA, presented in Figure 4 and Figure 8 in the original submission.

1.3.2. We further conduct PCA on intermediate vectors with iTransformer, which represents the CD case. We appreciate your advice, hence we use p-value to show fitness. Please refer to Figure 1 in Extra Page PDF.

1.4. Your view: The claim in line 85 (the intrinsic dimension of a sequence is proportional to its length) is wrong.

Our response:

1.4.1. This is an empirical result of previous work[7] for chaotic systems, which can not be denied by manually constructed counter-examples. (E.g. It does not hold for one to claim Zipf-law for vocabularies in NLP being wrong by constructing counter-examples like '\n'.join(['I like {}'.format(n) for n in ALL_NOUNS]).)

1.5. Notions and Typos:

Thank you for pointing these out! We would check them and further polish our paper.

2. Double-descent curve and ever-decreasing loss.

We have never claimed that our theory covers all aspects and all phenomena in time series forecasting tasks. What we are focusing on theoretically is to explain the impact of look-back horizon on scaling behaviors (as well as dataset size and model size), which is not yet well-studied. In Figure 2 of our original submission, we posited that the observed increase in loss could be attributed to overfitting. We further conduct experiments to validate the impact of dataset size, model size and look-back horizon on scaling behaviors for time series forecasting, which was not validated before. (Please refer to global rebuttal).

More details.

1.1.1 Weaker Assumptions

The Quasi-isometric assumption can be replaced with this Inverse Lipschitz assumption: $\\phi^{-1}$ would be $K_I$ -Lipschitz under L2 norm. That is:

\\forall x,y \\in \\mathcal{M}(L), \\| \\phi^{-1}(x)-\\phi^{-1}(y)\\|_2 \\leq K_I\\| x-y\\|_2.

We further make two assumptions:

Causality. We assume existing an optimal model $F[S]$ to predict the next $S$ frames given previous $h\rightarrow\infty$ frames, so that the error only originates from the intrinsic noise.

\\exists F[S]: \\mathcal{M}\rightarrow \\mathcal{M}(S), s.t. \\lim\\limits_{h\\rightarrow \\infty} \\mathbb{P}(y\\mid x_{-h:0}) = (1-\\eta)\\delta(F[S]\(x_{-h:0})) + \\eta \\mathcal{N}(F[S]\(x_{-h:0}), \\Sigma_S)

Where $\\eta$ stands for the noise ratio in the system and we use $\\mathcal{N}(\\mu, \\Sigma)$ to represent a normal distribution with mean $\\mu$ and covariance $\\Sigma$ here.

Uniform Sampling noise: When drawing a single length- $L$ sample, we assume that the sampling noise is uniform in each direction in $\\mathcal{L}$ ,

\mathbb{E}_{x[-H:S]\in \mathcal{O}(H+S)}[\|\phi^{-1}(m(y_i)) - \phi^{-1}(y_o)\|_2] \leq K_I e_m

And from assumption 1 the second term is bounded by $e(S)$ . We can derive that:

L_{\text{ori}} \leq K_I e_m + e(S)

which shows loss in original space is linearly bounded by loss in the intrinsic space. W.l.o.g we may only study loss in the intrinsic space.

Because this rebuttal section has character limitation, we would like to present further deduction in detail later in the discussion period.

1.2.1. How it works for future-dependency like Nebensatz.

Take two phrases as an example:

a.Ich wusste nicht, dass er Hamburger essen wollte.

b.Ich wusste nicht, dass er Bier trinken wollte.

If P(essen | Ich wusste nicht, dass er) = P(trinken |Ich wusste nicht, dass er ) = 1/2, and P(Hamburger | Ich wusste nicht, dass er __ essen wollte) = P(Bier | Ich wusste nicht, dass er __ trinken wollte) = 1.

It is possible in practice that we do not know what comes next had we not seeing 'essen' or 'trinken' no matter how many words before that we see, but this uncertainty has already been modeled into Bayesian Loss for $M(\infty)$ and would be absorbed in the term $\\eta* tr(\\Sigma_S)$ in our bound for $L_{Bayesian}$ , which will be further absorbed into the term $\\eta \\sigma_M^2 S^2 d_I(S)$ in the last formula on Page4 of our original Submission, only adding a constant to the final loss.

We have listed papers cited above in the global rebuttal. They could possibly act as supplement to our reply for your detailed concerns.

评论- Preliminary assesment

2024-08-08

I will check the responses more carefully in the next days and decide if I change my grading, but for now I mention a couple of points.

1.1 Fair, I will check it later.
1.2: It's unclear to me if you are saying that it's fine to use the autorregressive assumption (1.2.2) or that you are not using it (1.2.1). For the record, I am fine with making this assumption, but you should explain in plain words that you're using it.
1.3: This is more of a clarification of the statements. If you are assuming Zipf's law on the data, you should state it clearly as one of your key assumptions. I agree that is important to show that it applies to the data you are using, but that does not mean that it applies to every dataset. So besides verifying it for your data, you should add it to your assumptions in the bullet points.
1.4. Line 85. The problem is not whether the claim is empirical or relates to previous works, but rather that it is not stated properly . If D(L) is the average intrinsic dimension of a sequence of length L and converges to a finite value, this does not imply or suggest that the intrinsic dimension grows with the length.This is only the case if you are refer to the average as lim_{L->inf} D(L)/L, but usually when you say the average of a quantity you would refer over many samples (so intrinsic dimension for many samples of a sequence of a fixed length L).

I still think that the empirical validation is not really supporting the theory, in the sense that it's few points per line. THe main problem is on Fig. 1, where fitting a 3-parameter function to something with 8 points is ok if that's the data you have, but there are very few points for the number of parameters that you have and therefore you really should check that the fit is significant, or if other simple models (with the same number of parameters, not driven by the theory) give a similar fit.

评论- More details for the deduction: 1.1.1. Weaker Assumptions lead to similar results

2024-08-07

We'll present a more accurate set of assumption with weaker assumptions and how to derive similar result with these assumptions. Note that in our original submission we used stronger assumptions. We do not expect our theory to cover every aspect and explain every scenario, but to focus on the most important properties related to our research problem; and the following are a more complicated version, for your reference, in which we use much weaker assumptions to replace the quasi-isometry assumption.

The revised assumptions are as follows:

1.Information-preserving: Intuitively speaking, we should be able to recover the real sequence from its corresponding intrinsic vector with the expected error bounded by a small constant value. Formally we can state this as follows:

Exists a mapping $\phi$ from the original length- $L$ sequence space $\mathcal{O}(L)$ to $\mathcal{M}(L)$ , an inverse mapping $\phi^{-1}:\mathcal{M}(L)\to \mathcal{O}(L)$ and a small constant $e(L)$ related to $L$ so that for any $\mathbb{E}_{x\sim \mathcal{O}(L)}\|x-\phi^{-1}(\phi(x))\|_2^2\leq e(L)$ .

2.Inverse Lipschitz: $\phi^{-1}$ should be $K_I$ -Lipschitz under L2 norm. That is:

\forall x,y \in \mathcal{M}(L), \|\phi^{-1}(x)-\phi^{-1}(y)\|_2 \leq K_I \|x-y\|_2

3.Bounded: Unchanged

4.Isomorphism: Unchanged

5.Linear Truncation: Unchanged

6.Causality: We assume there exists an optimal model $F[S]$ to predict the next $S$ frames given previous $h\rightarrow\infty$ frames, so that the error only originates from the intrinsic noise.

\exists F[S]: \mathcal{M}\to \mathcal{M}(S)\, s.t. \lim\limits_{h\to \infty} \mathbb{P}(y\mid x_{-h:0}) = (1-\eta)\delta(F[S]\(x_{-h:0})) + \eta \mathcal{N}(F[S]\(x_{-h:0}), \Sigma_S),

Where $\eta$ stands for the noise ratio in the system and $\mathcal{N}(\mu, \Sigma)$ represents a normal distribution with mean $\mu$ and covariance $\Sigma$ .

7.Uniform Sampling noise: When drawing a single length- $L$ sample, we assume that the sampling noise is uniform in each direction in $\mathcal{L}$ ,

Assumptions 1 and 2 ensure that if we predict the intrinsic vector accurately, we can predict the original time series well. Thus we may only consider the task to predict a vector in $\mathcal{M}(S)$ given the vector corresponding to its previous $H$ frames in $\mathcal{M}(H)$ , which justifies the task formulation.

A formal deduction is shown as follows:

Proof.

Consider we are predicting $x[0:S]$ from $x[-H:0]$ , let $y_i\in \mathcal{M}(H)$ be the intrinsic vector of $x[-H:0]$ and $y_o\in \mathcal{M}(S)$ be the intrinsic vector of $x[0:S]$ (the true intrinsic vector). If we have a model $m$ so that:

\mathbb{E}_{x[-H:S]\in \mathcal{O}(H+S)}[\|m(y_i)-y_o\|_2] \leq e_m

where $e_m$ represents the expected error in the intrinsic space. Then, from assumption 2 we have:

\mathbb{E}_{x[-H:S]\in \mathcal{O}(H+S)}[\|\phi^{-1}(m(y_i)) - \phi^{-1}(y_o)\|_2] \leq K_I e_m

and from assumption 1 we know that:

L_{\text{ori}} \leq K_I e_m + e(S)

Hence, the loss in the original space should be:

L_{ori} = \mathbb{E}_{x[-H:S]\in \mathcal{O}(H+S)}[\|\phi^{-1}(m(y_i)) - x[0:S]\|_2] \\\\

= \mathbb{E}_{x[-H:S]\in \mathcal{O}\(H+S)} \[\|\phi^{-1}(m(y_i)) - \phi^{-1}(y_o)\|_2]

\text{ } \text{ }\text{ } + \mathbb{E}_{x[0:S]\in\mathcal{O}(S)}[\|\phi^{-1}(y_o)-x[0:S]\|_2]

\leq K_I e_m + e(S)

which shows that the loss in the original space is linearly bounded by the loss in the intrinsic space. W.l.o.g we may only study the loss in the intrinsic space.

评论- Appreciate the new assumptions and the clarifications

2024-08-12

I appreciate the extra work, I will update my score.

评论- Appreciate your patience; Supplement on experiments for polynomial fit

2024-08-12

We truly appreciate your constructive suggestions and patience towards our work!

In previous section we used second-order polynomial of $\log(x)$ because it fits better than the second-order polynomial of $x$ ; we would like to further add the result of $g_4(x)=A+Bx+Cx^2$ , for your reference:

ModernTCN:

AiC, BiC	Traffic	Weather	ETTh1	ETTh2
f	-103.9,-103.4	-87.2,-87.0	-69.5,-69.6	-64.7,-65.3
g1	-95.3,-94.9	-79.1,-79.0	-60.6,-60.7	-45.6,-46.0
g2	-94.6,-63.3	-71.1,-71.0	-59.7,-59.8	-45.4,-45.8
g3	-103.5,-103.4	-87.5,-87.0	-65.1,-65.3	-47.3,-47.9
g4	-93.3,-94,9	-81.3,-83.1	-56.5,-56.7	-43.1,-43.8

iTransformer:

AiC, BiC	Traffic	Weather	ETTh1	ETTh2
f	-71.6,-70.7	-74.8,-74.6	-50.7,-50.9	-56.9,-56.7
g1	-66.7,-66.1	-73.1,-72.9	-45.7,-45.8	-57.9,-57.8
g2	-63.3,-62.6	-71.1,-71.0	-41.7,-41.8	-57.9,-57.7
g3	-72.0,-70.5	-75.8,-74.6	-47.7,-47.9	-56.9,-56.7
g4	-54.9,-56.3	-70.1,-72.0	-38.9,-39.7	-56.4,-56.2

MLP/Linear:

AiC, BiC	Traffic	Weather	ETTh1	ETTh2
f	-91.9,-91.3	-91.5,-91.7	-89.1,-88.8	-62.8,-62.6
g1	-67.1,-66.7	-83.1,-83.2	-67.5,-67.4	-61.1,-60.9
g2	-66.0,-65.6	-82.1,-82.3	-65.6,-65.5	-59.9,-59.8
g3	-84.1,-83.5	-92.1,-92.3	-82.5,-82.3	-63.2,-62.9
g4	-60.1,-61.2	-81.3,-83.4	-89.1,-88.8	-60.1,-59.9

It can be seen from these tables that $g_4$ is worse than $g_3$ on fitting the experiment results, and our formula is more preferred against it.

Thank you again for your detailed reviews!

评论- Clarifications on Key Assumptions and Empirical Validation

2024-08-09

Dear reviewer 7rY6:

Thank you for your reply! We would like to clarify more details in our paper and use better methods to show our experimental results in a clearer way.

1.2 We apologize for any possible lack of clarity in our rebuttal; here are our further response:

1.2.1 Our result does not rely on the strict autoregressive property of time series, as the non-causal relationships could be interpreted as intrinsic noise within our theoretical framework.

1.2.2 We wrote in our rebuttal to clarify that, as you have pointed out, time series may not be strictly autoregressive (in cases like Nebensatz), and our theory is actually describing a similar idea with the 'intrinsic noise' of time series, because of which it would be hard (or even impossible) for models that predict several future frames from several past frames to achieve zero Bayesian Loss.

1.3 We appreciate the feedback on clarity. We did mention the Zip-f assumption in Section 3.2.1 (at the top of page 4), which is within the same section as the other assumptions, though not explicitly listed together: our thought was that we mainly intended to discuss cases where features in intrinsic space follow Zip-f law, which is more of a practical observation compared to previous assumptions; meanwhile other feature degradation patterns may also give similar results (though the formula may be different) w.r.t. impact of horizon. We recognized that we indeed focus on Zip-f case (observed in experiments we conducted), hence we would further clarify it and add it to bullet points in our revised paper.

1.4 Our intention was not to suggest a direct causal relationship between average intrinsic dimension and total intrinsic dimension in the original paper. We agree that this relationship is contingent on specific preconditions cited in the referenced paper. We would like to further clarify this to avoid further misunderstandings in our revised paper.

3.Empirical Validation

We truly appreciate the suggestion of comparing with other possible candidates. For Fig. 1. we have further conducted regression with more possible formula candidates on data points we measured and used AiC, BiC as metrices to compare our theory model with other possible condidates, which better shows the fitness of our proposed theory. We would like to provide further experimental results:

Our theory:

f(x)=A+B/x^\alpha

Possible candidates could be:

g_1(x)=A/x^\alpha

g_2(x)=A+B\log(x)

g_3(x)=A\log(x)^2+B\log(x)+C

(We observed that $g_3$ could also be a relatively good approximation, but on all curves, $g_3$ would give results with $A>0$ , indicating an increase in loss with dataset size (beyond the 'optimal dataset size'), which is not observed in experiments, so it should not be considered a good theory for time series forecasting loss: it is further the case since our theory is either approximately on-pair or better than it.)

Here are the AiC and BiC values for these candidates on our experimental results for different models on different datasets.

ModernTCN:

AiC, BiC	Traffic	Weather	ETTh1	ETTh2
f	-103.9,-103.4	-87.2,-87.0	-69.5,-69.6	-64.7,-65.3
g1	-95.3,-94.9	-79.1,-79.0	-60.6,-60.7	-45.6,-46.0
g2	-94.6,-63.3	-71.1,-71.0	-59.7,-59.8	-45.4,-45.8
g3	-103.5,-103.4	-87.5,-87.0	-65.1,-65.3	-47.3,-47.9

iTransformer:

AiC, BiC	Traffic	Weather	ETTh1	ETTh2
f	-71.6,-70.7	-74.8,-74.6	-50.7,-50.9	-56.9,-56.7
g1	-66.7,-66.1	-73.1,-72.9	-45.7,-45.8	-57.9,-57.8
g2	-63.3,-62.6	-71.1,-71.0	-41.7,-41.8	-57.9,-57.7
g3	-72.0,-70.5	-75.8,-74.6	-47.7,-47.9	-56.9,-56.7

MLP/Linear:

AiC, BiC	Traffic	Weather	ETTh1	ETTh2
f	-91.9,-91.3	-91.5,-91.7	-89.1,-88.8	-62.8,-62.6
g1	-67.1,-66.7	-83.1,-83.2	-67.5,-67.4	-61.1,-60.9
g2	-66.0,-65.6	-82.1,-82.3	-65.6,-65.5	-59.9,-59.8
g3	-84.1, -83.5	-92.1,-92.3	-82.5,-82.3	-63.2,-62.9

On previous tables the best (smallest) one and those with <1 difference in any matrices are marked as bold.

It can be seen that our theory either surpasses the candidates here or is no worse by more than $1$ point. In several cases (e.g., ModernTCN -ETTh2, MLP - Traffic, etc) it beats the 2nd-best one by a large margin. This has better shown the accuracy of our proposed formula.

Thank you for the constructive concerns, and we are looking forward to further discussions!

评论- Appreciated te extra experiments, why not a polynomial fit?

2024-08-12

I appreciate the extra experiments and clarifications, I will update my score.

However, while I agree that the log(x) is a perfectly valid function, it's a bit odd to take it as the default value, as one would usually start with a simple polynomial (as I mentioned in the initial review). Could you try A +Bx + Cx^2? It should be straightforward if you have the code for log(x)

评论- Further clarification on difference to double desent curve

2024-08-10

In our previous reviewer specific rebuttal and in the global rebuttal we have concluded our contribution and novelty, and we would like to further clarify the difference to double descent curve here, for your detailed reference:

2.1.Our claim that longer horizon may lead to worse performance is very different from the double descent curve, which mainly considers the impact of dataset size and model capacity. In fact, in our experiments for look-back horizon in Figure 5, most models (except for linear models whose number of parameters depends largely on look-back horizon) have number of parameters that changes little by look-back horizon, and the decreasing-then-increasing loss is mainly caused by the change in look-back horizon. Meanwhile, as we have mentioned, the time series community has been using 'benefit from longer look-back horizon' as a metric for ‘good models’ for two years[1,2,3,4], so we think that the impact of look-back horizon is not completely understood yet. This impact is actually different from the double descent curve.

2.2. We would like to propose our point of view further that, even if some derived results in some theoretical and experimental work have been understood to some extent, it does not follow that a work providing theoretical and experimental analysis of a phenomenon is meaningless or has little contribution. There are many theoretical analysis papers[10,11,12] proposing a variety of theories to explain the Scaling Law that has been observed [13,14]. We think these works do provide the community with different perspectives on understanding the scaling law, and hence have their novel contributions, even if these proposed theories lead to the well-known scaling law. At the same time, we appreciate the contribution of these works even if some of these theories may not be omnipotent (e.g. cases like overfitting, etc.): these works provide theories that aim not to include every possible scenario, but to focus on the most relevant part of the research problems of Scaling Law, and they have provided novel insight for the community.

Papers cited in the discussion period are presented in our official comment on our global rebuttal, which may act as further reference.

审稿意见

评分: 7置信度: 22024-07-08

This paper introduces scaling laws for DNN analysing MTS data, involving dataset sample size, model size, look-back horizon, dataset covariate size (dimension), and noise/uncertainty in the dataset. It relies on an axiomatization of the intrinsic information space (linearity, isomorphism) and bayesian optimal model (lipschitz-continuity) to derives an approximation of the Loss in different scenarios, and tests aspects of this laws in experiments.

优点

The paper enters the difficult discussion about power-laws in DNN to the time series forecasting field, with a clear approach motivated by mathematical assumptions on what time series intrinsic information should respect. Linearity with respect to horizon especially. The identification of the influence of model overparametrization is explicited. While I am not entirely clear on which approaches are completely new and which other are inspired by scaling law studies of other data types, I believe that the resulting theoretical laws are novel and take into account the particular of TS data.

Experiments cover several important parameters in the laws, and show matching results with the derived scaling laws.

缺点

It is hard to understand where the assumptions on the intrinsic space (3.2.1) come from. Especially lipschitz-continuity and space quasi-isomorphism. Are those properties 1) required for results to hold? 2) observed in practice in meaningful representations of (multivariate) time series data? 3) Are all the described properties completely new, or inspired from other works on Deep Neural Networks?

问题

Since the paper relies on an intrinsic space formulation, did authors study the representation learned by the neural networks? For instance, using PCA and/or tSNE on intermediary vectors. Could we verify that the assumptions made in 3.2.1 are plausible in such a way?

局限性

Several limitations are identified in the paper: need for larger datasets, impact of prediction horizon (fixed in this work), supervised forecasting task specific analysis.

I would also add limitation/concern of mine: the evaluation datasets come from a same MTS benchmarking effort, and current DNN structures where built to optimize their performance on these datasets (among others). It is unknow if this induces bias compared to datasets generally not tested during standard benchmarking. It could happen that our current TS DNN models are overoptimized for the common benchmarking datasets. Optimally, the power scaling law should involve a few datasets not typically evaluated in benchmarking models.

作者回复

2024-08-04

Dear Reviewer oD7J,

Thank you for the detailed and constructive review! We truly appreciate your reviews and questions. Here are our responses:

1. Theory assumptions

Let us give a sketch of our proof, where our assumptions are marked as bold text:

Suppose the true sample is $x[-H:S]$ , i.e. we are going to predict $x[0:S]$ from $x[-H:0]$ . Now we have a model $m$ that predicts the corresponding intrinsic vector $\phi(x[0:S]) \in \mathcal{M}(S)$ with error bounded by $e_i$ , then by assumptions 1, 2 we can verify that the error in the original space is bounded by:

\alpha e_i - e - e' \leq L_{ori}\leq \alpha e_i + e + e'

Hence, analyzing the loss in the intrinsic space would derive the same result, which justifies our task formulation.

By assumption 3, we may partition the intrinsic space uniformly to get a lower bound of the loss.

We can first decompose the total loss into two components, i.e. Bayesian loss and approximation loss (section 3.2.2). Then the Bayesian loss could be written as:

L_{Bayesian} \leq (1-\eta)K_1^2E[var(P^{-1}[\infty,H]\(x))] + \eta \cdot tr\(\Sigma_S)

Where the first term is a projection term, and the second is a noise term which almost makes no effect on our analysis. By assumptions 4 and 5, the relationship of a sequence with its subsequences is similar to linear projections to the subspace with the minimal eigenvalues if the horizon is large, thus the projection term could be bounded with the eigenvalues of the distribution, ensuring the Bayesian loss is bounded. (section 3.2.3)

Considering the approximation loss. If we have sufficient data, the approximation loss should come from two sources, the intrinsic noise in the data and the effect of unseen dimensions in the intrinsic space. The former is assumed to be uniform and thus can be easily calculated, while the latter should be calculated with help from assumptions 4 and 5, similar to the deduction in Bayesian loss. For the scenario where data is scarce, we don't need the assumptions and the approximation loss basically depends on the distance of a test sample to its nearest training sample in the intrinsic space. (section 3.2.4)

Then we can combine the two loss components and analyze the optimal horizon or other properties. (section 3.3)

2.PCA on intermediate vectors

Thank you for the constructive advice! We obtain intermediate vector representations for multi-variate time series datasets and conduct PCA experiments on them, further validating that these time series datasets do follow a Zip-f distribution with respect to their features. Please refer to Figure 1 in our additional page of PDF uploaded to our “global” rebuttal response, in which we show the results. We observe that such a deep-learning method may face under-training issues, thus causing possible uncertainty in evaluating high-rank features with a limited amount of training data, hence this PCA result could complement each other with the PCA result obtained by directly applying PCA on raw input sequence (under Channel-Independent and RevIN settings) provided in Figure 8 in Appendix G in our original submission.

3.TS DNN models are overoptimized for the common benchmarking datasets.

It is indeed possible that some DNNs have structures overfitting several datasets which may not generalize beyond them. Here we mainly choose models that are designed with simplicity (e.g. linear models, iTransformer models which are basically transformers, ModernTCN models which are composed of time-series-optimized convolutional layers) and the datasets are actually quite different from each other in size (e.g. 1000x difference for data sample number) and in feature degradation (e.g. please refer to PCA results for exchange and other datasets), so we think the power law we observe here should be general, at least to a certain extent. However, we do appreciate the important point you have mentioned. This concern may extend beyond the field of time series forecasting, and we will include a description of it in the next version of our paper.

2024-08-12

Thank you for the detailed answer. I believe it addresses my concerns. I hope to see your work published.

2024-08-12

Thank you very much for your thoughtful and constructive feedback. We truly appreciate the time you have invested in reviewing our work and for your suggestions on how to improve the presentation of our work. We hope our work will contribute to the field.

审稿意见

评分: 4置信度: 32024-07-13

This paper proposes a theory for scaling law in time series forecasting that accounts for dataset size, model complexity, and data granularity, with a particular focus on the look-back horizon. This paper empirically evaluates various models across diverse time series forecasting datasets to verify the validity of the scaling law and validate the theoretical framework, especially regarding the influence of the look-back horizon.

优点

this paper is complete with good organization.
several theatrical analyses are provided for the justification.

缺点

the main argument is not novel. I think it is well-known that a larger horizon may not bring up a better performance in time series forecasting. In some cases, it can but the larger horizon would also bring up more noise that can hinder the forecasting,
the contribution is kind of weak. I don't find some quite useful conclusions or contributions for time series forecasting. With the scaling laws the author verified, there is no further improvement of new models on state-of-the-art benchmarks.

问题

See weakness

局限性

See weakness

作者回复

2024-08-04

Dear Reviewer SJaN:

Thank you for the review! We truly appreciate your ideas and questions. Here are our responses:

1.Novelty and contribution related to the 'Main Argument'

1.1. ‘Longer horizon gives worse performance’ is an important fact that could bring insight to the community, and the impact of the look-back horizon is not yet well understood. As presented in the Introduction part in our original submission, in the past two years it has been a common practice for the time series community[1, 2, 3, 4] to try to prove the advantage of the proposed methods by claiming that these methods could benefit from improving look-back horizon; which we show is not necessarily an accurate way to show the superiority of a certain model.

1.2. Our main argument is not only that a longer horizon gives worse performance. Actually, we have provided a theory to explain Scaling Laws in Time Series Forecasting, which pays special attention to the impact of input horizon. This is, actually, very different from proposing only that a long horizon would harm performance, as our theory analyzes the complex impact of different combinations of model width, dataset size and lookback horizon. We also summarize our contribution in the next point we would like to make.

2. Our Theoretical and Experimental Contribution

As stated in our paper, we summarize our contribution as follows:

2.1. We introduce a novel theoretical framework that elucidates scaling behaviors from an intrinsic space perspective, highlighting the critical influence of the look-back horizon on model performance.

2.2. We conduct a comprehensive empirical investigation into the scaling behaviors of dataset size, model size, and look-back horizon across various models and datasets to validate the effectiveness of our claim.

We do agree that proposing novel architectures and approaches achieving state-of-the-art performance is very important for the development of Machine Learning, but we also think that providing insights into important questions, validating hypotheses and proposing related theories could also contribute to the community, which may lead to potential improvement in performance of future models. For example, in our original submission, we have demonstrated in Appendix F why down-sampling can improve performance.

We further clarify our contribution, novelty as well as papers cited in this rebuttal section in the global rebuttal section, which may also act as a reference for your constructive concerns. Please refer to the global rebuttal section for more details.

审稿意见

评分: 5置信度: 42024-07-13

This paper proposes a theory that explains why complex models do not necessarily outperform simpler models even under the presence of larger amount of data, and why longer inputs hurt performance of some models. The authors consider the data size, model complexity, data granularity and the look-back horizon. Together with the theory here proposed the authors present empirical evidence verifying the theoretical insights.

优点

The authors present an analysis of scaling laws for time series forecasting by adding components that are characteristic of time series forecasting, i.e. the fact that a limited amount of information of the past is provided to do inference/forecasts. Whereas in other domains it is usually assumed that the larger the look-back window then the better the estimations, in time series forecasting this is not always the case. Based on this the authors proceed to present a framework based on intrinsic spaces where slides of time series can be represented as vectors - they consider a suitable space for the look-back window and the forecasted values.

Perhaps the most relevant contribution of the authors is the theoretical framework considered, together with numerical verifications.

缺点

The authors present a theoretical framework that is interesting but lacks of precision. The authors constantly derive equations but at the end it is unclear what are the hypothesis required to be sure that the corresponding results hold.
The authors present results in different models, but these results seem to be done on models trained on individual datasets, and hence it is unclear if the results here presented hold for the case of pretrained models, which are pretrained in multiple time series datasets.
In line 193 the authors claim that their results are well approximated by the laws here derived. But this is completely unclear a the derived laws are not plotted together with the empirical ones per model. This leaves the verification task completely to the readers. This further makes it more challenging to check the correctness of the presented results.
Appendix A.1: The proof presented is not a formal proof but a sketch of a proof.
Typos:
- line 113: “ and thus We can scale” [there is a wrong capital letter]
- line 122 “s[h_1] to s_[h_2-1]” there is a wrong positioning of a bracket

问题

How do the presented result hold for the case of pretrained models, or any other setting where models are trained on multiple datasets? Since the recent developments of large models for time series this kind of results would be of high value.

It would be nice if the authors could provide more details on the derivations presented in 3.2.2. Although the main sketch is there is rather unclear how this is done.

局限性

The authors clearly state that the analysis on pretrained models remains for future work. This is a valid statement from the authors as one can assume that novel challenges arrive for this setting, and arguably the notion of scaling law for single-dataset models is not fully answered yet, being this work an interest step towards this topic.

作者回复

2024-08-05

Dear Reviewer Vwbb,

Thank you for a detailed and inspiring review! We truly appreciate your concerns and adivce! Here are our responses:

1. Fitting experimental results with our derived formula

As displayed in Figure 1,2 and 5 in our original submission, the solid lines are the fitted lines of the theoretical formulas. We recognize that the presentation of these graphs may seem that the lines are obtained by connecting data points, hence we will add further captions to these figures to show that these lines are fitted lines in our next version of paper.

2. With respect to Pretrained Models or Large Mixed Datasets

This is a very insightful question given that there have been works focusing on pretrained time series forecasting models with zero-shot forecasting abilities. However, the theory for scaling laws for pretraining and finetuning still remains a relatively unexplored area compared to the theory for scaling laws for single-dataset models, which is still a topic to study these days. The case of training on Mixed Datasets may be closer to our case which considers single-dataset.

Our theory holds without the need for any modification as long as the dataset itself follows a Zip-f distribution in the intrinsic space. This assumption is sort of natural given that Zip-f law is a natural distribution, whereas we give two analyses from Theoretical and Experimental correspondingly on why it holds for mixed, large datasets.

Theoretically, if a large dataset is composed of $s$ l subdatasets of similar size each following the Zip-f law with degradation coefficient $\alpha_1 < \alpha_2 < \ldots < \alpha_s$ , each with size $S_i$ and follows Zip-f law: $\lambda_{ij}= A_i/j^{\alpha_i}$ , where $\lambda_{ij}$ represents the $j$ -th eigenvalue of the $i$ -th dataset. Suppose the intrinsic dimensions are orthogonal with each other (hence the PCA components are orthogonal). A simple assumption could be the new intrinsic space is a direct product of the old intrinsic spaces, hence eigenvalues are the union of all the old eigenvalues. After which, an eigenvalue of value $S$ should be the $idx_{total}$ -largest eigen value, in which:

idx_{total} = \sum_i(idx_{i}) = \sum_i (A_i/S) ^{1/\alpha_i}.

When $S$ is small (or correspondingly, when $idx$ is relatively large) this sum is dominated by the small $\alpha$ s, and in limitation cases this sum is dominated by the $\alpha_1$ term: $idx \approx (A_1/S)^{1/\alpha_1} + C$ , which is approximately a Zip-f distribution.

Experimentally, we use the Mixed Dataset of Traffic, Weather, ETTh1, ETTh2, ETTm1, ETTm2, exchange and ECL to train a Channel-Independent 2-layer 512-dimension MLP and use the Intermediate vector before the decoder layer as a feature vector to do PCA analysis. We found that the result follows a Zip-f distribution for higher-order components (Please refer to Figure 2 in our Extra Page PDF).

(Actually the individual datasets themselves (like Traffic, Weather, etc) are composed of data from different times, areas, etc, so they are actually composed of sub-datasets themselves and they are showing the Zip-f distribution. This also indicates that larger datasets may have similar Zip-f distributions.)

Hence it would not be a bad assumption that Zip-f law still holds for datasets with mixed sub-datasets, and our theory, without major modification, could be applied to these cases.

3.More detailed proof for Section 3.2.2

Yes, we would like to provide a more detailed proof here, which will be updated in the next version of our paper.

\begin{aligned} L&=E_{x\sim\mathcal M(H+S)}[(x[H:H+S]-m(x[0:H]))^2]\\ \end{aligned}

Let $m^*$ denote the optimal Bayesian model, then it should satisfy:

m^*(x[0:H])=E_{x\sim\mathcal M(H+S)}[x[H:H+S]|x[0:H]]

Thus:

L=E_{x\\sim\\mathcal M(H+S)}[(x[H:H+S]-m^*(x[0:H])+m^*(x[0:H]-m(x[0:H]))^2]\\\\

L=L_{sum}+L_{cross}, \text{in which}:

L_{sum} = E_{x\\sim\\mathcal M(H+S)}[(x[H:H+S]-m^*(x[0:H]))^2]+E_{x\\sim\\mathcal M(H)}[(m^*(x[0:H])-m(x[0:H]))^2]

is the sum of squares

and

L_{cross} = 2\*E_{x\\sim\\mathcal M\(H+S)} [\(x\[H:H+S]-m^\*\(x[0:H]))\*\(m^*\(x\[0:H])-m\(x\[0:H]))]

is the cross term.

The optimal model (Bayesian model) should satisfy:

m^\*(x[0:H]) = \sum\limits_{x[H:H+S]} P(x[H:H+S]\mid x[0:H]) x[H:H+S].

Therefore, it follows that: (note the conditional expectation of $x[H:H+S]-m^*(x[0:H])$ given $x[0:H]$ is $0$ , which is from the definition of the Bayesian Model)

L_{cross}= \\sum\\limits_x P(x[H:H+S]\\mid x[0:H])P(x[0:H])[(x[H:H+S]-m^\*(x[0:H]))*(m^\*(x[0:H])-m(x[0:H]))] =0.

Hence, the cross term should be zero, and the loss is a sum of the square terms: one is determined by the capability of the optimal Bayesian model (the Bayesian Error), and the other is about how well the model approximates the Bayesian model (the Approximation Error): $L=L_{cross}+L_{sum}=L_{sum}=L_{Bayesian}+L_{Approx}$ .

4.Typos

Thank you for pointing out these typos! We would check typos carefully and further polish our paper.

2024-08-12

I would like to thank the authors for the time invested in this reply. I would encourage the authors to make the results more self-contained so that hypothesis are better stated, which will allow to future papers to make reference of these results. I would further motivate the authors to mention future extensions to mixture of datasets as they have sketched in this rebuttal - I believe this would be highly appreciated in the community and would ignite further research efforts.

I am increasing my score to 5: Borderline accept.

2024-08-13

Thank you for these constructive concerns and suggestion!

We appreciate the feedback on clarity. We would state our assumptions in a clearer and more uniformed way. (Moreover, we would like to propose weaker assumptions which could lead to almost the same result, as proposed in our rebuttal with Reviewer 7rY6. This will be stated clearly in appendix, while current assumptions in our main paper would be further polished for better clarity). The derivation will also be further polished.

As for mixed datasets, we recognize that these scenarios are indeed very important and are rising trends for Time Series Forecasting community. We would state more about mixed datasets, both theoretically and experimentally, in our modified paper; and we hope our work may provide further insight for the community and promote further related researches.

审稿意见

评分: 7置信度: 42024-07-30

The paper introduces a theoretical framework for scaling laws in time series forecasting, focusing on the impact of dataset size, model complexity, and look-back horizon on model performance.

The authors make two main contributions: They propose a novel theory that explains scaling behaviors in time series forecasting from an intrinsic space perspective, emphasizing the critical role of the look-back horizon. The theory identifies an optimal horizon and demonstrates that beyond this point, performance degrades due to inherent limitations of dataset size. Conducting comprehensive empirical investigations to verify scaling behaviors related to dataset size, model size, and look-back horizon across various models and datasets. These experiments establish a robust scaling law for time series forecasting and validate the proposed theoretical framework.

The paper examines how the optimal look-back horizon changes with different amounts of training data and model sizes. It also explores the relationship between channel-dependent and channel-independent models in the context of the proposed theory. The authors test their theoretical predictions using a range of popular time series forecasting models (including linear models, MLPs, iTransformer, and ModernTCN) on multiple datasets (such as ETTh1/h2, ETTm1/m2, Traffic, Weather, and Exchange). Their experiments demonstrate the existence of scaling laws in time series forecasting and provide evidence supporting their theoretical framework, particularly regarding the influence of the look-back horizon.

优点

The paper demonstrates several notable strengths. Its originality lies in introducing a novel theoretical framework for scaling laws in time series forecasting, particularly addressing the impact of look-back horizon, an aspect previously unexplored in scaling law theories. The quality of the work is evident in its combination of thorough theoretical analysis with extensive empirical validation across multiple models and datasets, including Linear, MLP, iTransformer, and ModernTCN models tested on various datasets such as ETTh1/h2, ETTm1/m2, Traffic, Weather, and Exchange.

The paper maintains clarity through clear mathematical derivations, well-defined assumptions for the theoretical framework, and presentation of experimental results with error bars and statistical analyses. The significance of this work is apparent in its potential to inform future model design and hyperparameter selection in time series forecasting, which could lead to performance improvements across various applications.

缺点

Despite its strengths, the paper has a few areas that could be improved. The authors acknowledge that their experiments are conducted on datasets smaller than some recently proposed large datasets, which may limit the generalizability of their findings. The theoretical framework relies on several simplifying assumptions about the intrinsic space properties, which may not always hold in practice. A more in-depth discussion on the implications of these assumptions could enhance the paper's robustness. Additionally, while the theory provides valuable insights, the paper could benefit from more concrete guidelines on how practitioners can apply these findings to improve their models, which would increase the practical impact of the work.

问题

It would be interesting to understand how the proposed theory extends to multi-variate time series forecasting and if there are any specific considerations for such cases.

The paper mentions that downsampling can sometimes improve performance, and it would be valuable for the authors to elaborate on how this relates to their theory of optimal horizon.

Lastly, given the changing nature of many real-world time series, it would be beneficial to know how sensitive the optimal horizon is to changes in the data distribution over time (concept drift) and whether the theory accounts for non-stationary time series.

局限性

The authors demonstrate transparency by addressing limitations. They acknowledge that their theory primarily covers time series forecasting and may not generalize to other time series tasks. They also note that their experiments are conducted on datasets smaller than some recently proposed large datasets, and that the theory doesn't consider self-supervised or pretrained models. Additionally, they mention that the impact of prediction length on optimal horizon is not thoroughly explored.

While the authors are commendably upfront about these limitations, they could potentially enhance this section by discussing implications for computational resources and energy consumption related to finding optimal horizons in practice. Given the theoretical nature of the work, the absence of a discussion on potential negative societal impacts may be appropriate, but a brief mention of this could further strengthen the paper's consideration of broader impacts.

作者回复

2024-08-04

Dear Reviewer urDc,

Thank you for a detailed and inspiring review! We truly appreciate your feedback and concerns! Here are our responses:

1.How the proposed theory extends to multi-variate?

While our deduction is primarily written in a single-variable manner, it can be easily adapted to the multi-variable case. As mentioned in Section 3.2.1, we made the Information-preserving assumption:

There exists a mapping $\phi$ from the original length- $L$ sequence space $\mathcal{O}(L)$ to $\mathcal{M}(L)$ , along with an inverse mapping $\phi^{-1}:\mathcal{M}(L)\to \mathcal{O}(L)$ and a constant $e \ll 1$ that is independent of $L$ , such that for any $x \in \mathcal{O}(L)$ , $\|x - \phi^{-1}(\phi(x))\|_2^2 \leq e$ .

For a certain time-series dataset, we assume it has a fixed multi-variable dimension $d_{\text{multi}}$ . By substituting $\mathcal{O}(L)$ with $\mathcal{O}(L \times d_{\text{multi}})$ , we observe the multi-variable format:

There exists a mapping $\phi$ from the original length- $L$ dimension $d_{\text{multi}}$ sequence space $\mathcal{O}(L \times d_{\text{multi}})$ to $\mathcal{M}(L)$ , along with an inverse mapping $\phi^{-1}:\mathcal{M}(L) \to \mathcal{O}(L \times d_{\text{multi}})$ and a constant $e \ll 1$ that is independent of $L$ , such that for any $x \in \mathcal{O}(L \times d_{\text{multi}})$ , $\|x - \phi^{-1}(\phi(x))\|_2^2 \leq e$ .

Another assumption to be checked is the Zip-f assumption on different dimensions of the intrinsic space for the multi-variable case. In Figure 8 in Appendix G in our original submission, the single-variable case is validated by conducting PCA on Channel-Independent time series. We further use the intermediate vector of iTransformer as the feature vector in the intrinsic space of multivariable time series and conduct PCA on them. This can be found in Figure 1 in our additional page of PDF uploaded to our “global” rebuttal response, which validates the Zip-f assumption of the ** Non-Linear Multivariable** case.

Other deductions still hold for multivariate cases (up to a constant factor). Hence our assumptions and deduction can be extended to the multivariate cases.

2. Down-sampling and theory of optimal horizon

Downsampling can be viewed as a projection to a subspace in the intrinsic space and thus has a similar effect as decreasing the horizon. Experimental results (as shown in experiments in previous works like FITS, PatchTST and in our work) show that the projected subspace of higher frequency tends to fall on the large-eigenvalue directions, or the `invisible' dimensions masked by the projection tends to be the more unimportant ones. Although the precise effect of downsampling is unknown or may need further assumptions and methods for precise consideration, it is acceptable that we may approximate the effect of downsampling to be similar to a projection to the first $d_{eff}<d_I(H)$ dimensions in the intrinsic space. The overall loss could be then expressed as:

loss_{new} = L(d_{eff})\text{ where $d_{eff} < d_I(H)$.}

Hence, if the original intrinsic dimension is larger than (local) optimal $d$ , $\partial loss/\partial d>0$ , hence reducing $d$ from $d_I(H)$ to $d_{eff}$ would help reduce loss. Otherwise if $d_I(H)$ is already smaller than (local) optimal $d^*$ , meaning $\partial loss/\partial d<0$ , we would expect no performance improvement from reducing $d$ from $d_I(H)$ to $d_{eff}$ .

3. How sensitive the optimal horizon is to changes in data distribution over time? Does the theory accounts for non-stationary time series?

This is indeed a very important question in time series forecasting. In our work we mainly consider the case where training and testing distributions approximately follow the same distribution, because this is a basic case and we can make the gap between training set and test set smaller with normalization methods (like RevIN) or online-training methods. The former methods are included in our theory; however, it is challenging for our theory to model the online-training or test-time-training methods. This is partially because these methods may employ different training and optimization strategies that are significantly different in modeling.

For traditional deep-learning methods (i.e. training on training set, test on test set, no hyper-parameter tuning, etc on test set, do not use timestamp as a parameter of model input, which is the usual case for time series forecasting models like Linear, iTransformer, ModernTCN, etc), if we assume the test set is fixed and we sample training set each time for training. If we assume sampling training set has a distribution difference compared to the test set that has expectation $0$ but a certain variance, then this shift can be modeled into a constant loss term in our theory model, thus making no effect on hyper-parameter choosing (on expectation of training-set sampling).

However, when considering more complex scenarios where distribution differences may be correlated with certain observable parameters (e.g. timestamps), or when online training methods are employed, analyzing the effect of the look-back horizon becomes more intricate. In such cases, both the exact look-back horizon (the length of past data used as input to the current model) and the implied look-back horizon (the length of past data that continues to influence the model during online training) may impact the model's performance. This likely involves a distribution shift in the intrinsic space over time, necessitating distinct considerations for these two horizons. We believe addressing this complexity would be a valuable direction for future research.

4. Others

Thank you for the constructive advice! We would add more detailed description about computational resources and energy consumption in our experiments.

作者回复

2024-08-04

Dear reviewers,

Thank you for the detailed reviews! In this global Rebuttal section, we would like to further clarify (1) Contribution and Novelty of our work, (2) contents in the Extra Page Pdf (which mainly contains more experimental results validating Zip-f law for more complicated cases) and (3) papers that are cited in our rebuttal in more detail.

1.Contribution and Novelty

1.1. We summarize our contribution as follows.

1.1.1 Novel theoretical framework elucidates Scaling Laws from the perspective of look-back horizon, dataset size and model size, with a specific emphasis on the task of Time Series Forecasting. By focusing on Time Series, our theory provides an innovative approach for understanding Scaling Laws for Time Series Forecasting, advancing theoretical comprehension and providing inspiration for further theoretical investigations in this area.

1.1.2 Comprehensive empirical investigation into the scaling behaviors of Time Series Forecasting. To our knowledge, we are among the first to carry out experiments to validate Scaling Laws for Time Series Forecasting. Notably, no previous work has validated scaling law on Time Series Forecasting that examines the effect of different look-back horizon in detail. Previous works for Scaling Laws discussing the impact of look-back horizon (or equivalently, context length) draws mainly focus on LLMs in NLP[8] which exhibits differently compared to Time Series Forecasting tasks. This work promises to bridge this gap and provide valuable insights into the scaling dynamics unique to Time Series Forecasting.

1.2. Our theoretical and experimental findings have a novel and positive impact on the time series community.

Besides the theory and experimental validation for Scaling Law for Time Series Forecasting considering look-back horizon, dataset size and model size, there are more potential positive impacts, two of which are listed as follows and there could be more:

1.2.1. The time series community has been using 'benefit from increasing horizon' as a metric for 'better models' for at least 2 years[1,2,3,4], while our work shows that it may not be necessary for good models to always benefit from increasing horizon. As shown in Section 4.2. in our original submission, longer look-back horizon may give worse test results for all models, it is not the case that better model should benefit from further extending look back horizon. Moreover, this behavior is actually very different from some behaviors observed in long-context large language models[8]. Our work could provide insight about understanding look-back horizon for the Time Series community both theoretically and experimentally.

1.2.2. Our theory could potentially explain the benefits and disadvantages of commonly used down-sampling blocks (including low-pass-filter[5], Patching[9], etc.), as shown in Appendix F in our original submission.

2.Extra Page Pdf

We further conduct PCA on intermediate vectors of the iTransformer model and obtain results, further validating the Zip-f assumption in our paper for intrinsic space for Non-Linear Channel-Dependent Multivariable cases. Please refer to Figure 1 in our Extra Page Pdf for more details.

Meanwhile, to study the property of Mixed Datasets, we further conduct PCA on the features obtained by a simple MLP trained on a Mixed Dataset of various time-series datasets under Channel-Independent settings (note that different datasets have different number of multivariables) and validate the Zipf-law. Please refer to Figure 2 in our Extra Page Pdf for more details.

Please refer to our attached PDF for more details.

3.Papers cited in our rebuttals (for this global rebuttal and reviewer-specific rebuttals):

These papers are cited in our rebuttals and could possibly act as reference or supplement to our rebuttals:

A. Zeng, M. Chen, L. Zhang, and Q. Xu, “Are transformers effective for time series forecasting?” in AAAI 2022.
H. Wang, J. Peng, F. Huang, J. Wang, J. Chen, and Y. Xiao, “Micn: Multi-scale local and global context modeling for long-term series forecasting,” in ICLR 2023.
L. donghao and wang xue, “ModernTCN: A modern pure convolution structure for general time series analysis,” in ICLR 2024.
Y. Liu, T. Hu, H. Zhang, H. Wu, S. Wang, L. Ma, and M. Long, “itransformer: Inverted transformers are effective for time series forecasting,” in ICLR 2024.
Z. Xu, A. Zeng, and Q. Xu, “Fits: Modeling time series with 10k parameters,” in ICLR 2024.
William Toner and Luke Darlow, "An Analysis of Linear Time Series Forecasting Models", in ICML 2024.
T. M. Buzug, J. von Stamm, and G. Pfister, “Characterising experimental time series using local intrinsic dimension,” Physics Letters A, vol. 202, no. 2-3, pp. 183–190, 1995.
Wenhan Xiong and Jingyu Liu and Igor Molybog and Others, "Effective Long-Context Scaling of Foundation Models".
Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam, “A time series is worth 64 words: Long-term forecasting with transformers,” in ICLR, 2023.

Thank you again for your reviews, and we are looking forward to possible further discussions.

评论- Papers cited (continue)

2024-08-10

These papers are cited in our dicussions and could possibly act as reference or supplement:

U. Sharma and J. Kaplan, “A neural scaling law from the dimension of the data manifold,”, 2020, published on JMLR, 2022.

11.Eric J. Michaud and Ziming Liu and Uzay Girit and Max Tegmark, "The Quantization Model of Neural Scaling", in NeurIPS, 2023.

12.Y. Bahri, E. Dyer, J. Kaplan, J. Lee, and U. Sharma, “Explaining neural scaling laws,” 2021.

13.Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown and Others, "Scaling laws for neural language models", 2020.

14.Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya and Others, "Training Compute-Optimal Large Language Models", 2022.

评论- Thank you, Authors!

2024-08-13

Dear Authors,

Thank you for your responses to Rviewers and your categorized overall response. On our end, we will collectively continue to review these.

I hope the Reviewers' reviewers and discussions have helped strengthen your work and highlighted its significance and areas for improvement.

Thank you for your submission and continued effort in the review process!

Best,

Area Chair Fwdj

评论- Summary of Revisions

2024-08-14

We would like to express our sincere gratitude to all reviewers and ACs for their patience and detailed feedback. Their comments are very valuable and constructive and have enhanced our work.

As summarized in our global rebuttal, in our work, we introduce a theoretical framework of scaling laws in time series forecasting, focusing on the impact of dataset size, model complexity, and look-back horizon; experimentally, we validate the existence of Scaling Law for Time Series Forecasting and further validate our theory framework. As stated both in our submission and during the global rebuttal session, our theory and experiment results could have positive impacts on the community, two of which are listed as follows:

(1) We provide a more comprehensive understanding of the impact of the look-back horizon that has not been completely understood yet.

(2) We provide novel perspective toward commonly-used components like down-sampling. We hope our work may inspire not only large foundational datasets and models, but also new models targeting datasets of limited size in the field of TSF.

We truly appreciate constructive concerns and suggestions raised by all reviewers, according to which we have made supplements and enhancements to our new version of the paper. Our new version of paper features enhancements in the following aspects compared to our original submission:

Theory: clearer presentation and more supplement for Theory

T1.We state our assumptions more clearly. Especially, we would add the Zip-f assumption into our bullet points, and make more precise discussions about intrinsic dimension. Meanwhile, we present theory deduction in a clearer way, both in main paper and in appendix. (Reviewer 7rY6, Reviewer Vwbb)

T2.We further propose weaker assumptions that could lead to similar results, the process of which is added to appendix. (reviewer 7rY6)

T3.We explain more in the theory deduction part in main paper about how our theory adapts to multivariable cases. (T3 and E1 would supplement each other, providing a more comprehensive view.) (Reviewer urDc)

T4.We extend more discussions about Mixed datasets, and the conditions required to apply our theory in these cases in appendix, hoping to provide insights about understanding works on these datasets (T4 and E1 would also supplement each other, providing a more comprehensive view). (Reviewer Vwbb)

T5.We give a more detailed explanation about the down-sampling case in appendix. (Reviewer urDc)

Experiment: more precise and comprehensive Experiment

E1.We include additional PCA results for intermediate vectors for iTransformer on Multi-variable Non-linear Channel-Dependent cases, and for MLP on a mixed dataset case, further validating the Zip-f distribution assumption for multivariable cases and encouraging future research on large mixed dataset. (Reviewer oD7J, Reviewer urDc, Reviewer 7rY6, Reviewer Vwbb)

E2.We would use more comprehensive metrics and compare against valid functions, further justifying the fit of our theory model on the experiment results. (Reviewer 7rY6)

Supplement: amendment for miscues presents, emphasize for main argument and contribution

S1.we have checked the typos carefully and further polish our paper. (Reviewer Vwbb, Reviewer 7rY6)

S2.We add discussions about possible limitations related to over-optimized Time Series DNN models in the limitation part discussed in the conclusion section. (Reviewer oD7J)

S3.We discuss more the computational resources related to optimal horizon searching in appendix. (Reviewer urDc)

S4.We emphasize more about our main argument, as well as contributions and potential novel impacts in introduction and conclusion sections. (Reviewer SJaN)

Overall, the valuable suggestions from reviewers are very helpful for us in revising our paper to a better shape. We would be delighted to have further discussions if possible.

最终决定Accept (poster)

2024-09-25

The paper introduces a novel theoretical framework for scaling laws in time series forecasting, addressing the relationship between model size, dataset size, and look-back horizon. The authors empirically validate their theory across a variety of datasets, demonstrating the practical applicability and potential of their approach to enhance forecasting accuracy in diverse scenarios.

Strengths

The integration of scaling laws into the time series forecasting domain is an innovative contribution that fills a critical gap in the existing literature (Reviewers urDc, Vwbb).
The empirical validation across multiple datasets and models further strengthens the credibility and generalizability of the proposed framework (Reviewer oD7J).
The theoretical underpinnings are well-grounded and offer a new perspective on optimizing model performance by adjusting the look-back horizon and model complexity (Reviewer 7rY6).

Weaknesses

The theoretical framework, while innovative, relies on several assumptions that are not fully justified or clearly presented. This lack of clarity could limit the framework’s generalizability (Reviewer 7rY6, Reviewer Vwbb).
The presentation of the paper is somewhat lacking, with certain mathematical derivations and assumptions not being sufficiently detailed or explained (Reviewers 7rY6, SJaN).
The experiments, although thorough, are conducted on relatively small datasets, which may raise questions about the generalizability of the findings to larger, more complex datasets (Reviewer SJaN, Reviewer Vwbb).

Based on the comprehensive reviews and the authors’ rebuttal, the paper presents a novel contribution to time series forecasting.

The authors must improve the clarity of their theoretical framework and provide further justifications for their assumptions in the camera-ready version of this work if it is accepted. Regardless of the decision, these actions would strengthen the paper, and I hope to see them done.