PaperHub
Overall rating: 6.0 / 10 (Poster, 4 reviewers)
Individual ratings: 6, 6, 6, 6 (min 6, max 6, std 0.0)
Confidence: 2.5 · Correctness: 2.8 · Contribution: 3.0 · Presentation: 2.3
ICLR 2025

Context-Alignment: Activating and Enhancing LLMs Capabilities in Time Series

Submitted: 2024-09-22 · Updated: 2025-03-02
TL;DR

LLMs for time series tasks

Abstract

Keywords
Time Series, Large Language Models, Context-Alignment

Reviews and Discussion

Official Review
Rating: 6

The authors propose a new method for adapting LLMs to time-series analysis tasks. They test on both time-series generation (forecasting) and time-series classification. The paper proposes Dual-Scale Context-Alignment GNNs, which perform context alignment of the time-series and text tokens/patches. They combine this strategy with example prompts, which they call the Demonstration Examples Prompt Technique (DECA), and show that DECA together with dual-scale context alignment improves the performance of LLMs compared to existing LLM-based baselines for time-series analysis.

Strengths

  1. Their method of context alignment seems new and interesting.
  2. The coarse- and fine-grained context alignments are logical and fit well in combination with patching, which is a very popular embedding strategy for time-series data.
  3. They run several different experiments using different tasks and experimental settings to show their model performs at a high level consistently.

Weaknesses

  1. This paper focuses on enhancing LLMs for time-series analysis: it takes a pretrained GPT model and compares it to other LLM-based approaches. This means the approach is ultimately a pretrained-model approach (with the LLM being the pretrained model). The authors need to compare to other pretrained models and not just LLM-based ones. Since it is not being used for text-generation tasks, there is no reason to limit this evaluation to LLM-based approaches only, and there are many pretrained models which follow the same experimental method.
  2. The proposed method hinges heavily on the use of GNNs, and while the authors do run an ablation study, from what I understand they do not compare other network architectures for performing the context alignment. Given that the GNN is a major part of the proposed implementation, an ablation showing other networks as coarse- and fine-grained context aligners is necessary. For example, linear layers, CNNs, and self-attention could be a good start.
  3. I would like more details on the training: the method adds a layer of complexity in adapting LLMs for time-series analysis, and the amount of added compute for finetuning these models (time and device) should be clearly stated.

Questions

  1. While your methodology considers the strengths of GNNs as context aligners, I would like to see the choice of GNNs validated experimentally against other options.
  2. In my opinion the terminology is unnecessarily complicated, and it takes away from the impact of the paper. One key example is “VCA w/o DSCA-GNNs” used in Table 1. It seems to me that this is simply prompting, since there are no context aligners used. It may be clearer to denote VCA as context alignment without DSCA-GNNs and mark in Table 1 VCA + DSCA-GNNs.
  3. What is the difference between “(DECA) DEMONSTRATION EXAMPLES BASED CONTEXT-ALIGNMENT” and few-shot prompting as in NLP? It seems like this term was invented to enhance the complexity of the paper but at the cost of reader understanding, especially if that reader has an NLP background.
  4. How does the prompt affect the context alignment? Are there some prompts that harm context alignment? Since prompting is a key component of this paper, it would be useful to know how the new method performs with different prompt types.
  5. Since DECA involves using examples for context alignment, a fair comparison against other baselines would be VCA with the GNN as a context aligner. I think that should be included beside DECA in the results table.
Comment

[1] Jin M, Wang S, Ma L, et al. Time-llm: Time series forecasting by reprogramming large language models[J]. arXiv preprint arXiv:2310.01728, 2023.

[2] Pan Z, Jiang Y, Garg S, et al. S²IP-LLM: Semantic Space Informed Prompt Learning with LLM for Time Series Forecasting[C]//Forty-first International Conference on Machine Learning. 2024.

[3] Sun C, Li H, Li Y, et al. TEST: Text Prototype Aligned Embedding to Activate LLM's Ability for Time Series[C]//The Twelfth International Conference on Learning Representations.

[4] Bian Y, Ju X, Li J, et al. Multi-Patch Prediction: Adapting Language Models for Time Series Representation Learning[C]//Forty-first International Conference on Machine Learning.

[5] Zhou T, Niu P, Sun L, et al. One fits all: Power general time series analysis by pretrained lm[J]. Advances in neural information processing systems, 2023, 36: 43322-43355.

Comment

Question 5: Since DECA involves using examples for context alignment a fair comparison against other baselines would be the VCA with the GNN as a context aligner. I think that should be included beside DECA in the results chart


Response: Thanks for your valuable suggestion. We present the results of Vanilla Context-Alignment (VCA) for the long-term forecasting and classification tasks here; the results of other tasks are in Appendix G.4. Compared to other LLM-based methods, VCA retains a significant advantage, confirming that even the simplest method based on the Context-Alignment paradigm surpasses previous methods.

Since we establish the basic steps of utilizing LLMs for TS analysis tasks (activate first, then enhance), Demonstration Examples based Context-Alignment (DECA, renamed FSCA in the revision) is our final version: it builds upon the ability of VCA to activate LLMs and further enhances performance with few-shot prompting.

Moreover, we wish to emphasize that although DECA uses few-shot prompting techniques in forecasting tasks, it still ensures fairness. DECA segments input TS data to construct examples without introducing additional data.

Classification tasks, bold is the best, italic is the second best.

| Methods | GPT4TS | Time-LLM | S²IP-LLM | VCA | DECA |
| --- | --- | --- | --- | --- | --- |
| EthanolConcentration | 34.2 | 34.6 | 35.3 | 39.2 | - |
| FaceDetection | 69.2 | 67.9 | 68.5 | 69.0 | 70.4 |
| Handwriting | 32.7 | 32.0 | 33.1 | 38.4 | - |
| Heartbeat | 77.2 | 78.0 | 77.5 | 78.5 | 79.5 |
| JapaneseVowels | 98.6 | 98.1 | 98.6 | 98.9 | - |
| PEMS-SF | 87.9 | 87.2 | 88.4 | 91.3 | - |
| SelfRegulationSCP1 | 93.2 | 92.8 | 91.4 | 93.1 | 94.2 |
| SelfRegulationSCP2 | 59.4 | 57.2 | 58.3 | 60.5 | 61.1 |
| SpokenArabicDigits | 99.2 | 99.5 | 99.0 | 99.8 | - |
| UWaveGestureLibrary | 88.1 | 89.3 | 88.7 | 91.3 | - |
| Average | 74.0 | 73.7 | 73.9 | 76.0 | - |

Long-term forecasting, bold is the best, italic is the second best.

| Dataset | DECA (MSE/MAE) | VCA (MSE/MAE) | S²IP-LLM (MSE/MAE) | Time-LLM (MSE/MAE) | GPT4TS (MSE/MAE) |
| --- | --- | --- | --- | --- | --- |
| ILI | 1.380 / 0.783 | 1.428 / 0.799 | 1.552 / 0.826 | 1.713 / 0.858 | 1.925 / 0.903 |
| Weather | 0.224 / 0.262 | 0.230 / 0.268 | 0.228 / 0.265 | 0.237 / 0.269 | 0.237 / 0.270 |
| ECL | 0.159 / 0.252 | 0.163 / 0.257 | 0.166 / 0.262 | 0.167 / 0.264 | 0.167 / 0.263 |
| Traffic | 0.386 / 0.263 | 0.389 / 0.271 | 0.405 / 0.286 | 0.407 / 0.289 | 0.414 / 0.294 |
| ETTh1 | 0.394 / 0.424 | 0.417 / 0.432 | 0.418 / 0.436 | 0.426 / 0.435 | 0.427 / 0.426 |
| ETTh2 | 0.316 / 0.375 | 0.335 / 0.382 | 0.355 / 0.399 | 0.361 / 0.398 | 0.354 / 0.394 |
| ETTm1 | 0.342 / 0.378 | 0.349 / 0.380 | 0.346 / 0.382 | 0.354 / 0.384 | 0.352 / 0.383 |
| ETTm2 | 0.250 / 0.314 | 0.259 / 0.318 | 0.262 / 0.326 | 0.275 / 0.334 | 0.266 / 0.326 |
| Avg | 0.431 / 0.381 | 0.446 / 0.388 | 0.466 / 0.398 | 0.492 / 0.404 | 0.518 / 0.407 |
Comment

Question 4: How does the prompt effect the context alignment? Are there some prompts that harm context alignment? Since prompting is a key component of this paper, it would be useful to know how their new method performs with different prompt types


Response: Thanks for your insightful feedback. We supplement experiments with two types of prompts for comparison: data domain and input statistics (following Time-LLM [1]). The following are examples of different prompt types:

  1. The original prompt is concise: "Predict future sequences using previous data."
  2. For the data domain prompt, using the ETTh dataset as an example: "[Data domain:] The Electricity Transformer Temperature (ETT) plays a vital role in the long-term management of electric power systems. ETTh1 and ETTh2 are recorded at the 1-hour level. Each data point comprises the target 'oil temperature' along with six power load characteristics. [Task:] Predict future sequences using previous data."
  3. For the input statistics prompt: "[Input statistics:] The input features a minimum value of <min_val>, a maximum of <max_val>, and a median of <median_val>. The overall trend is <upward or downward>. [Task:] Predict future sequences using previous data."

Results indicate that the data domain prompt performs almost identically to the original prompt. The input statistics prompt can slightly improve performance; however, each iteration then requires recalculating the statistical features and regenerating the corresponding embeddings with the LLM tokenizer. This significantly slows down training: the time for one iteration increases from 0.587 to 1.431 seconds.
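For illustration, here is a minimal sketch (hypothetical helper, not the paper's code) of how such an input-statistics prompt could be rebuilt for each training window; the exact statistics and trend computation used in the experiments may differ.

```python
import numpy as np

def build_statistics_prompt(ts_window: np.ndarray) -> str:
    """Rebuilds the input-statistics prompt for one training window; this
    per-iteration recomputation (plus re-tokenization) is the source of the
    reported slowdown."""
    min_val, max_val = float(ts_window.min()), float(ts_window.max())
    median_val = float(np.median(ts_window))
    # Naive trend heuristic for illustration only.
    trend = "upward" if ts_window[-1] >= ts_window[0] else "downward"
    return (
        f"[Input statistics:] The input features a minimum value of {min_val:.3f}, "
        f"a maximum of {max_val:.3f}, and a median of {median_val:.3f}. "
        f"The overall trend is {trend}. "
        "[Task:] Predict future sequences using previous data."
    )
```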

| Variant | ETTh1 | ETTh2 | ETTm1 | ETTm2 |
| --- | --- | --- | --- | --- |
| GPT4TS | 0.427 | 0.354 | 0.352 | 0.266 |
| DECA (Original) | 0.394 | 0.316 | 0.342 | 0.250 |
| DECA (Domain) | 0.396 | 0.316 | 0.343 | 0.252 |
| DECA (Statistics) | 0.392 | 0.313 | 0.346 | 0.246 |
Comment

Question 2: In my opinion the terminology is unnecessarily complicated, and it takes away from the impact of the paper. One key example is “VCA w/o DSCA-GNNs” used in table 1. It seems to me that this is simply just prompting since there are no context aligners used. It may be more clear if you denote VCA as context alignment without DSCA-GNNs and mark in table 1 VCA + DSCA-GNNs. Question 3: What is the difference between “(DECA) DEMONSTRATION EXAMPLES BASED CONTEXT-ALIGNMENT” and few-shot prompting as in NLP. It seems like this term was invented to enhance the complexity of the paper but at the cost of reader understanding, especially if that reader has an NLP background.


Response: We apologize for any confusion caused by the name DECA. To avoid confusion with the "Few-shot forecasting task" mentioned in Section 4.4, and to prevent the misconception that our method is applicable only to this task, we changed the name before submission. In the revised version, we have reverted the method name from Demonstration Examples based Context-Alignment (DECA) back to Few-Shot prompting based Context-Alignment (FSCA). Thank you for your valuable feedback.

Additionally, your understanding of "VCA w/o DSCA-GNNs" as "this is simply just prompting since there are no context aligners used" is correct. However, we are unclear about your suggestion to "denote VCA as context alignment without DSCA-GNNs," since VCA employs DSCA-GNNs to achieve context alignment.

We apologize for any ambiguity or confusion. This might have arisen from our introduction of a new paradigm, Context-Alignment, and the various frameworks or variants defined around this paradigm. We will provide a clearer definition in the revised version. Below is a brief summary:

  • Context-Alignment: A new paradigm for activating the capabilities of LLMs in TS tasks, encompassing both logical and structural alignment.

  • Dual-Scale Context-Alignment GNNs (DSCA-GNNs): DSCA-GNNs is the proposed framework that implements Context-Alignment by utilizing dual-scale edges for logical alignment and dual-scale nodes for structural alignment.

  • Vanilla Context-Alignment (VCA): A straightforward method based on DSCA-GNNs, constructing the dual-scale graph structure for TS input data and task description prompt.

  • Few-Shot Prompting based Context-Alignment (FSCA): An advanced method based on DSCA-GNNs, enhancing VCA through few-shot prompting techniques.

Comment

Weakness 3: I would like more details on the training, their method does add a layer of complexity in adapting LLM’s for time-series analysis and the amount of added compute for finetuning these models (time and device) should be clearly stated.


Response: Thank you for your valuable suggestion. We have supplemented the analysis with parameter counts and execution speed. The computational costs of our DECA method primarily arise from two parts. First, the dual-scale GNNs include two learnable weight matrices (Eq. 3); as the comparison between "w/o Dual-Scale GNNs" and DECA in the table shows, this component adds only a minor increase in computational load. Second, constructing the coarse-grained inputs requires two learnable linear layers that map fine-grained node embeddings to coarse-grained ones. The input dimension of these layers scales with the number of input TS patches, so they are the primary source of the added overhead.
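As a rough sketch of these two parameter sources, the modules below assume a standard GCN update of the form H' = act(A_hat H W) with one weight matrix per scale, and a flatten-then-project coarse-grained mapping; the exact form of Eq. 3 and the layer shapes in the paper may differ, and the names are illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph-convolution step H' = relu(A_hat @ H @ W); W is a learnable
    weight matrix of the kind referred to around Eq. 3 (assumed one per scale)."""
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, a_hat: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # a_hat: (num_nodes, num_nodes) normalized adjacency; h: (num_nodes, dim)
        return torch.relu(a_hat @ h @ self.weight)

class CoarseGrainedMapper(nn.Module):
    """Maps the fine-grained patch embeddings of one subsequence to a single
    coarse-grained node; its input dimension grows with the number of patches,
    which is why this mapping dominates the added parameter count."""
    def __init__(self, num_patches: int, dim: int):
        super().__init__()
        self.proj = nn.Linear(num_patches * dim, dim)

    def forward(self, fine_nodes: torch.Tensor) -> torch.Tensor:
        # fine_nodes: (num_patches, dim) -> (dim,)
        return self.proj(fine_nodes.reshape(-1))
```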

Overall, the additional computational cost of DECA is acceptable relative to its performance improvements. We have also added a comparison of experimental efficiency with other LLM-based methods; our approach is second only to GPT4TS [5], which merely adds linear layers at the input and output of the LLM. Other efforts to enable LLMs to understand TS data involve token-alignment, i.e., aligning TS data with embeddings from the vocabulary. Moreover, they introduce additional complex operations: Time-LLM [1] needs to regenerate prompts and obtain the corresponding embeddings in each iteration, and S²IP-LLM [2] requires decoupling of TS inputs and prompt retrieval. Thus, they require more training parameters and have slower training speeds. Our approach proposes a Context-Alignment paradigm that utilizes the inherent advantages of LLMs, achievable through GNNs and linear layers without costly token-alignment or additional complex operations, while still delivering state-of-the-art results.

Experimental details: The results in the table were obtained on the ETTh1 dataset, with an input length of 512, a forecast length of 336, a batch size of 128, the Adam optimizer, and an NVIDIA H800 80GB GPU.

| Method | Training Params | Training Params (%) | Training Time per Iteration (s) | Inference Time per Iteration (s) |
| --- | --- | --- | --- | --- |
| GPT4TS | 17.33M | 17.6 | 0.457 | 0.215 |
| Time-LLM | 70.85M | 46.37 | 2.894 | 1.723 |
| S²IP-LLM | 56.95M | 41.25 | 2.415 | 1.316 |
| DECA w/o Coarse-grained Branch | 12.43M | 13.29 | 0.348 | 0.155 |
| DECA w/o Dual-Scale GNNs | 35.83M | 30.6 | 0.556 | 0.322 |
| DECA | 37.02M | 31.3 | 0.587 | 0.331 |
Comment

Weakness 2: The proposed method hinges heavily on the use of GNN’s and while they do run an ablation study, from what I understand they don’t compare other network architectures for performing the context alignment. Given that the GNN is a major part of the proposed implementation, an ablation showing other networks as coarse and fine-grained context aligners is necessary. For example, Linear layers, CNNs and self-attention could be a good start.

Question 1: While your methodology considers the strengths of GNN’s as context aligners, I would like to see the choice of GNN’s to be validated experimentally as opposed to other options


Response: Thank you for your constructive suggestion. We have added experiments to further demonstrate the reliability of our method. We replace the GNNs in the dual-scale framework with alternative networks (linear layers, CNNs, and self-attention) to validate that graph structures are a superior choice within the Context-Alignment paradigm; a minimal sketch of such drop-in aligners is given after the table below. The supplementary results show that the GNN-based methods indeed outperform the other implementations. This is because the unique node-edge structure of GNNs represents structural-logical relationships better than other networks: here, dual-scale nodes describe the hierarchical structure, while edges depict logical relationships. Therefore, our proposed dual-scale framework based on GNNs more effectively aligns TS within a context understandable by LLMs, thus activating the capabilities of pre-trained LLMs for TS tasks. Besides, to address your Question 1, we also incorporate another popular GNN variant, GraphSAGE, to validate the robustness of our framework across different graph networks.

| Variant | ETTh1 | ETTh2 | ETTm1 | ETTm2 |
| --- | --- | --- | --- | --- |
| GPT4TS | 0.427 | 0.354 | 0.352 | 0.266 |
| DECA (GCN) | 0.394 | 0.316 | 0.342 | 0.250 |
| DECA (GraphSAGE) | 0.397 | 0.321 | 0.337 | 0.247 |
| DECA (Atten) | 0.435 | 0.347 | 0.362 | 0.271 |
| DECA (MLP) | 0.407 | 0.334 | 0.349 | 0.269 |
| DECA (CNN) | 0.411 | 0.340 | 0.354 | 0.262 |
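To make the ablation concrete, here is a hypothetical sketch of how such drop-in aligner variants could be built behind a common interface; module names and hyperparameters are illustrative, and the GCN/GraphSAGE variants (standard graph layers) are omitted for brevity.

```python
import torch
import torch.nn as nn

class AttentionAligner(nn.Module):
    """Self-attention stand-in for the GNN aligner (the 'Atten' variant)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); attend over the token axis
        out, _ = self.attn(x, x, x)
        return out

class CNNAligner(nn.Module):
    """1-D convolution over the token axis (the 'CNN' variant)."""
    def __init__(self, dim: int):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Conv1d expects (batch, channels, seq_len)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

def build_aligner(kind: str, dim: int) -> nn.Module:
    """Factory for the ablation variants; 'mlp' applies the same MLP to every token."""
    if kind == "mlp":
        return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    if kind == "cnn":
        return CNNAligner(dim)
    if kind == "atten":
        return AttentionAligner(dim)
    raise ValueError(f"unknown aligner variant: {kind}")
```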
Comment

Weakness 1: This paper focuses on enhancing LLM’s for time-series analysis and it takes a pretrained GPT model and compares it to other LLM-based approaches. This means this approach is ultimately a pretrained model approach (with the LLM being a pretrained model). The authors need to be comparing to other pretrained models and not just LLM based ones. Since its not being used for text generation-based tasks there’s no reason to limit this evaluation to only LLM based approaches and there are many pretrained models which follow the same experimental method.


Response: Thank you for your valuable suggestions, which significantly enhance the demonstrated generality of our method across various large models. We have added experimental results on additional pre-trained LLMs (BERT and T5) and on a pre-trained vision model (BEiT). It is evident that our approach maintains stable performance across different LLMs. However, the results on BEiT notably underperform those of the LLMs. This discrepancy is due to BEiT's inability to understand the logic and structure within language, rendering our method unsuitable for such pre-trained models.

It is important to emphasize that our work, along with many related studies (e.g., Time-LLM [1], S²IP-LLM [2], TEST [3], aLLM4TS [4], and so on), adopts the perspective of GPT4TS [5] that pre-trained LLMs contain exploitable generic knowledge, and utilizes LLMs to complete TS tasks. We believe this is driven by the generalization capability of LLMs across downstream tasks in various domains, inspiring researchers to use LLMs for TS tasks. The mainstream method involves aligning TS data with the language embeddings in the vocabulary to facilitate LLMs' understanding of TS data. This process also requires the base pre-trained model to have language knowledge. Our method introduces Context-Alignment, requiring pre-trained models to understand the logic and structure within language. Therefore, our work, like the other related studies, is fundamentally based on pre-trained language models.

| Dataset | GPT2 (MSE/MAE) | BERT (MSE/MAE) | T5 (MSE/MAE) | BEiT (MSE/MAE) |
| --- | --- | --- | --- | --- |
| ETTh1 | 0.394 / 0.424 | 0.416 / 0.432 | 0.403 / 0.426 | 0.443 / 0.457 |
| ETTh2 | 0.316 / 0.375 | 0.336 / 0.384 | 0.322 / 0.377 | 0.394 / 0.420 |
| ETTm1 | 0.342 / 0.378 | 0.348 / 0.381 | 0.339 / 0.375 | 0.371 / 0.388 |
| ETTm2 | 0.250 / 0.314 | 0.261 / 0.319 | 0.248 / 0.318 | 0.280 / 0.335 |
Comment

We are very grateful to Reviewer N4CB for evaluating our context alignment as new and interesting, and for affirming our implementation approach and experimental performance. Reviewer N4CB has provided us with invaluable feedback. We are committed to addressing your concerns and enhancing the quality of our paper.

NOTE: We summarize the research background, challenges, and key distinctions of our method, which can be found at the beginning of the response page.

Comment

Thank you for addressing many of my concerns regarding model ablations, which justify the proposed architecture.

As for Weakness 1, I still have some reservations here. The argument "that pre-trained LLMs contain exploitable generic knowledge—and utilizes LLMs to complete TS tasks. We believe it is driven by the generalization capability of LLMs in various domain downstream tasks, inspiring researchers to use LLMs for TS tasks" is valid and motivates the use of LLMs for time-series analysis. I still believe, however, that because pretrained models for time series have become more prominent and popular, you should provide performance references to current pretrained time-series models and not exclusively LLMs, considering they are easily benchmarked on the same datasets and tasks. This is especially prudent because the function of these two model subtypes is similar (time-series generation and classification). This would help illustrate, from a practical standpoint, why augmenting time-series analysis with text is an important direction.

Comment

Thank you for your valuable feedback. We believe both technical routes are worth exploring. Here, referencing prior research and our own understanding, we explain why we focus on utilizing pre-trained LLMs rather than pre-trained TS models:

  • From a data perspective:

    1. Training pre-trained TS models requires extensive datasets, which are more challenging to gather in the TS field than in NLP; LLM-based models can achieve desired outcomes with smaller datasets specific to downstream tasks [4].

    2. TS datasets vary significantly in frequency and cyclical patterns, leading to large differences in data distribution and posing challenges in knowledge transferability across different domains[2,4,5]. However, LLMs contain exploitable generic knowledge.

  • From a model perspective: Training pre-trained TS models is generally time-consuming; for instance, training MOMENT Small takes up to 300 GPU hours [2]. In contrast, LLM-based methods require little or even no training, making them more general and convenient.

  • From a multimodal perspective: LLM-based models typically freeze the LLM pre-trained weights, maintaining text processing capabilities. Thus, we can use NLP techniques such as prompts to guide the model (e.g., our method). Additionally, in tasks like weather and financial forecasting, incorporating real-time data from social media can enhance accuracy [5,6]. Further research into leveraging complementary insights across different modalities in multimodal large models could not only improve time series forecasting performance but also enhance interpretability.


Reference

[1] Gao S, Koker T, Queen O, et al. Units: Building a unified time series model[J]. arXiv preprint arXiv:2403.00131, 2024. (Accepted by NeurIPS2024)

[2] Goswami M, Szafer K, Choudhry A, et al. MOMENT: A Family of Open Time-series Foundation Models[C]//Forty-first International Conference on Machine Learning.

[3] Chen S A, Li C L, Arik S O, et al. TSMixer: An All-MLP Architecture for Time Series Forecasting[J]. Transactions on Machine Learning Research.

[4] Sun C, Li H, Li Y, et al. TEST: Text Prototype Aligned Embedding to Activate LLM's Ability for Time Series[C]//The Twelfth International Conference on Learning Representations

[5] Ye J, Zhang W, Yi K, et al. A Survey of Time Series Foundation Models: Generalizing Time Series Representation with Large Language Mode[J]. arXiv preprint arXiv:2405.02358, 2024.

[6] Wang X, Feng M, Qiu J, et al. From News to Forecast: Integrating Event Analysis in LLM-Based Time Series Forecasting with Reflection[J]. arXiv preprint arXiv:2409.17515, 2024. (Accepted by NeurIPS2024)

Comment

Thank you for addressing my concerns. Based on this I am happy to increase my score.

Comment

We sincerely apologize for the misunderstanding in our previous response. We have now included comparative results with pre-trained TS models (UniTS-ST [1], MOMENT [2], TSMixer [3]). As shown in the following tables, while UniTS-ST (NeurIPS 2024), the best-performing of the pre-trained TS models, surpasses some previous LLM-based methods, our method still demonstrates superior performance. We attribute this to our method's effective use of the deep logical and structural understanding of LLMs, which better harnesses LLM capabilities for TS tasks. This further confirms the potential of LLMs in TS applications.


The baseline results of the pre-trained TS models UniTS-ST, MOMENT, and TSMixer are taken from their original papers. UniTS-ST is the most recent, accepted at NeurIPS 2024. TSMixer does not report results for classification tasks.

The results of long-term forecasting tasks, bold is the best, italic is the second best.


| Dataset | FSCA (MSE/MAE) | S²IP-LLM (MSE/MAE) | Time-LLM (MSE/MAE) | GPT4TS (MSE/MAE) | UniTS-ST (MSE/MAE) | MOMENT (MSE/MAE) | TSMixer (MSE/MAE) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Weather | 0.224 / 0.262 | 0.228 / 0.265 | 0.237 / 0.269 | 0.237 / 0.270 | 0.216 / 0.259 | 0.228 / 0.270 | 0.225 / 0.264 |
| ECL | 0.159 / 0.252 | 0.166 / 0.262 | 0.167 / 0.264 | 0.167 / 0.263 | 0.156 / 0.253 | 0.165 / 0.260 | 0.160 / 0.257 |
| Traffic | 0.386 / 0.263 | 0.405 / 0.286 | 0.407 / 0.289 | 0.414 / 0.294 | 0.409 / 0.278 | 0.415 / 0.293 | 0.408 / 0.284 |
| ETTh1 | 0.394 / 0.424 | 0.418 / 0.436 | 0.426 / 0.435 | 0.427 / 0.426 | 0.405 / 0.426 | 0.418 / 0.436 | 0.412 / 0.428 |
| ETTh2 | 0.316 / 0.375 | 0.355 / 0.399 | 0.361 / 0.398 | 0.354 / 0.394 | 0.331 / 0.387 | 0.352 / 0.395 | 0.355 / 0.401 |
| ETTm1 | 0.342 / 0.378 | 0.346 / 0.382 | 0.354 / 0.384 | 0.352 / 0.383 | 0.337 / 0.376 | 0.344 / 0.379 | 0.347 / 0.375 |
| ETTm2 | 0.250 / 0.314 | 0.262 / 0.326 | 0.275 / 0.334 | 0.266 / 0.326 | 0.254 / 0.315 | 0.382 / 0.376 | 0.267 / 0.322 |
| Avg. | 0.296 / 0.324 | 0.311 / 0.337 | 0.318 / 0.339 | 0.317 / 0.337 | 0.301 / 0.328 | 0.329 / 0.344 | 0.311 / 0.333 |

The results of classification tasks, bold is the best, italic is the second best.

| Methods | GPT4TS | Time-LLM | S²IP-LLM | FSCA | UniTS-ST | MOMENT |
| --- | --- | --- | --- | --- | --- | --- |
| EthanolConcentration | 34.2 | 34.6 | 35.3 | 39.2 | 37.6 | 35.7 |
| FaceDetection | 69.2 | 67.9 | 68.5 | 70.4 | 70.5 | 63.3 |
| Handwriting | 32.7 | 32.0 | 33.1 | 38.4 | 29.7 | 30.8 |
| Heartbeat | 77.2 | 78.0 | 77.5 | 79.5 | 80.0 | 72.2 |
| JapaneseVowels | 98.6 | 98.1 | 98.6 | 98.9 | 97.8 | 71.6 |
| PEMS-SF | 87.9 | 87.2 | 88.4 | 91.3 | 93.1 | 89.6 |
| SelfRegulationSCP1 | 93.2 | 92.8 | 91.4 | 94.2 | 93.9 | 84.0 |
| SelfRegulationSCP2 | 59.4 | 57.2 | 58.3 | 61.1 | 61.1 | 47.8 |
| SpokenArabicDigits | 99.2 | 99.5 | 99.0 | 99.8 | 98.9 | 98.1 |
| UWaveGestureLibrary | 88.1 | 89.3 | 88.7 | 91.3 | 87.7 | 90.9 |
| Average | 74.0 | 73.7 | 73.9 | 76.4 | 75.0 | 68.4 |
Comment

Thank you for your positive feedback and for acknowledging the updates we made to the paper. We appreciate your decision to increase the score and are glad that the changes met your expectations. Once again, thank you for your time, effort, and support.

Official Review
Rating: 6

This paper introduces Context-Alignment, a novel paradigm for enhancing the capabilities of LLMs in time series tasks. The authors argue that leveraging LLMs' strengths in natural language processing, particularly their understanding of linguistic logic and structure, is key to improving their performance on time series data. They propose a Dual-Scale Context-Alignment Graph Neural Networks (DSCA-GNNs) framework that aligns time series data with linguistic components, enabling structural and logical alignment. This framework is used to develop Demonstration Examples based Context-Alignment (DECA), which integrates seamlessly into pre-trained LLMs to enhance their awareness of logic and structure. Extensive experiments across various time series tasks, including forecasting and classification, demonstrate DECA's effectiveness, especially in few-shot and zero-shot scenarios, highlighting the importance of context alignment in activating and enhancing LLMs' potential in time series applications.

Strengths

The paper strengthens the representation of time series data through GNNs, addressing a gap in prior research on applying LLMs to time series tasks. The method shows promising results and could inspire future work in this area.

Weaknesses

  • The paper's incorporation of GNNs is a functional approach, yet it overlooks a comparison of modeling time consumption, which is a critical aspect of efficiency that should be measured against baselines that do not employ this method.
  • Additionally, while the paper presents a generalized embedding technique, further validation across a broader range of time series scenarios is needed to establish its robustness. Testing the method in other contexts, such as time series anomaly detection and imputation, would strengthen the claims of its effectiveness.

Questions

I have raised several concerns within the weaknesses. Please address the issues I've mentioned there.

Comment

Weakness 1: The paper's incorporation of GNNs is a functional approach, yet it overlooks a comparison of modeling time consumption, which is a critical aspect of efficiency that should be measured against baselines that do not employ this method.


Response: Thanks for your constructive suggestion. We have added comparisons of experimental efficiency with other LLM-based methods, including parameter counts and execution speed. As shown in the table, our method is second only to GPT4TS [1], which merely adds linear layers at the input and output of the LLM. Other mainstream approaches require token-alignment to make LLMs comprehend TS data (aligning TS data with word embeddings in the vocabulary). Moreover, they often incorporate additional operations. For instance, Time-LLM [2] regenerates prompts and obtains the corresponding embeddings in each iteration, while S²IP-LLM [3] involves decoupling TS inputs and conducting prompt retrieval.

In contrast, our method utilizes the intrinsic advantages of LLMs to propose the Context-Alignment paradigm, eliminating the need for token-alignment and additional operations to achieve SOTA results. In our DECA method (renamed as FSCA in the revision), computational costs mainly arise from two aspects: first, the trainable weight matrix in the GCN that involves straightforward matrix multiplication (Eq. 3); second, two learnable linear layers that map fine-grained node embeddings to coarser ones when constructing coarse-grained inputs. Thus, despite introducing a dual-scale GNNs framework, our approach still consumes less time compared to other baseline methods.

Experimental details: The results in the table were obtained on the ETTh1 dataset, with an input length of 512, a forecast length of 336, a batch size of 128, the Adam optimizer, and an NVIDIA H800 80GB GPU.

| Method | Training Params | Training Params (%) | Training Time per Iteration (s) | Inference Time per Iteration (s) |
| --- | --- | --- | --- | --- |
| GPT4TS | 17.33M | 17.6 | 0.457 | 0.215 |
| Time-LLM | 70.85M | 46.37 | 2.894 | 1.723 |
| S²IP-LLM | 56.95M | 41.25 | 2.415 | 1.316 |
| DECA w/o Coarse-grained Branch | 12.43M | 13.29 | 0.348 | 0.155 |
| DECA w/o Dual-Scale GNNs | 35.83M | 30.6 | 0.556 | 0.322 |
| DECA | 37.02M | 31.3 | 0.587 | 0.331 |
Comment

We sincerely thank Reviewer oFgL for recognizing our efforts in addressing a gap in processing TS tasks with LLMs, and for affirming our experimental results and their value for future work. Reviewer oFgL's suggestions on our experimental analysis, particularly regarding cost comparisons and additional tasks, have helped us present our research more comprehensively and enhance its quality. We hope your concerns will be addressed.

NOTE: We summarize the research background, challenges, and key distinctions of our method, which can be found at the beginning of the response page.

Comment

[1] Zhou T, Niu P, Sun L, et al. One fits all: Power general time series analysis by pretrained lm[J]. Advances in neural information processing systems, 2023, 36: 43322-43355.

[2] Jin M, Wang S, Ma L, et al. Time-llm: Time series forecasting by reprogramming large language models[J]. arXiv preprint arXiv:2310.01728, 2023.

[3] Pan Z, Jiang Y, Garg S, et al. S²IP-LLM: Semantic Space Informed Prompt Learning with LLM for Time Series Forecasting[C]//Forty-first International Conference on Machine Learning. 2024.

Comment

Weakness 2: Additionally, while the paper presents a generalized embedding technique, further validation across a broader range of time series scenarios is needed to establish its robustness. Testing the method in other contexts, such as time series anomaly detection and imputation, would strengthen the claims of its effectiveness.


Response: Thanks for your constructive suggestion.

We have supplemented our experiments on anomaly detection and imputation. It is important to note that for the anomaly detection task, only the test dataset is labeled, while the training dataset does not provide labels. Therefore, to ensure a fair comparison, we implement this using our method VCA (without few-shot prompting). The imputation is performed using our advanced method DECA (with few-shot prompting, renamed as FSCA in the revision). Our method demonstrates superior performance, indicating its effective applicability across multiple TS tasks.

The results of anomaly detection of F1-score for each dataset, bold is the best, italic is the second best.

| Methods | DECA | S²IP-LLM | Time-LLM | GPT4TS | iTransformer | DLinear | PatchTST | TimesNet | FEDformer | Stationary | ETSformer |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SMD | 87.04 | 86.74 | 85.93 | 86.89 | 86.52 | 77.10 | 84.62 | 84.61 | 85.08 | 84.72 | 83.13 |
| MSL | 84.61 | 83.09 | 84.24 | 82.45 | 83.30 | 84.88 | 78.70 | 81.84 | 78.57 | 77.50 | 85.03 |
| SMAP | 73.46 | 73.11 | 73.81 | 72.88 | 69.67 | 69.26 | 68.82 | 69.39 | 70.76 | 71.09 | 69.50 |
| SWaT | 93.78 | 93.85 | 93.41 | 94.23 | 87.43 | 87.52 | 85.72 | 93.02 | 93.19 | 79.88 | 84.91 |
| PSM | 97.65 | 97.31 | 97.02 | 97.13 | 96.69 | 93.55 | 96.08 | 97.34 | 97.23 | 97.29 | 91.76 |
| Average | 87.31 | 86.82 | 86.88 | 86.72 | 84.72 | 82.46 | 82.79 | 85.24 | 84.97 | 82.10 | 82.87 |

The results of imputation, bold is the best, italic is the second best.

| Dataset | DECA (MSE/MAE) | S²IP-LLM (MSE/MAE) | Time-LLM (MSE/MAE) | GPT4TS (MSE/MAE) | iTransformer (MSE/MAE) | DLinear (MSE/MAE) | PatchTST (MSE/MAE) | TimesNet (MSE/MAE) | FEDformer (MSE/MAE) | Stationary (MSE/MAE) | ETSformer (MSE/MAE) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ETTh1 | 0.063 / 0.171 | 0.074 / 0.177 | 0.080 / 0.185 | 0.069 / 0.173 | 0.102 / 0.207 | 0.201 / 0.306 | 0.115 / 0.224 | 0.078 / 0.187 | 0.117 / 0.246 | 0.094 / 0.201 | 0.202 / 0.329 |
| ETTh2 | 0.039 / 0.132 | 0.044 / 0.139 | 0.053 / 0.150 | 0.048 / 0.141 | 0.057 / 0.154 | 0.142 / 0.259 | 0.065 / 0.163 | 0.049 / 0.146 | 0.163 / 0.279 | 0.053 / 0.152 | 0.367 / 0.436 |
| ETTm1 | 0.024 / 0.095 | 0.022 / 0.096 | 0.025 / 0.095 | 0.028 / 0.105 | 0.035 / 0.013 | 0.093 / 0.206 | 0.047 / 0.140 | 0.027 / 0.107 | 0.062 / 0.177 | 0.036 / 0.126 | 0.120 / 0.253 |
| ETTm2 | 0.021 / 0.087 | 0.026 / 0.092 | 0.034 / 0.101 | 0.021 / 0.084 | 0.034 / 0.110 | 0.096 / 0.208 | 0.029 / 0.102 | 0.022 / 0.088 | 0.101 / 0.215 | 0.026 / 0.099 | 0.208 / 0.327 |
Comment

I appreciate the additional experiments and explanations, and I'd like to raise my overall assessment. I believe that strengthening the representation of time-series data through GNNs holds significant value for future research.

Comment

We would like to express sincere gratitude for your constructive feedback and for raising your rating. The additional experiments and explanations you suggested have significantly improved our work. We appreciate your positive assessment and look forward to exploring the potential of Context-Alignment in time-series tasks. Your guidance has been invaluable.

Official Review
Rating: 6

This paper aims at addressing the alignment of time series not just at the token level, as done by previous works, but at a level that enables LLMs to contextualize and comprehend TS in the same manner as they do natural language. To this end, the authors propose Dual-Scale Context-Alignment GNNs, which achieve context-level alignment comprising structural alignment and logical alignment, thereby activating LLMs' potential capabilities in time series tasks.

Strengths

  • The paper addresses an important question of context alignment which promises better use of LLMs in the time series domain.
  • The idea of using GNNs to introduce context alignment brings novelty to this work.

Weaknesses

  • Since the authors use GNNs for context alignment, a befitting diagram showing the nodes and edges would have made it easier to follow the text.
  • Some typos here and there in the paper.
  • Datasets not described clearly in the experimental setup.

Questions

  • I am not completely sure if averaging the error metrics across different prediction lengths is the best way to report and compare the results.
  • Is there a justification as to why this averaging approach is adopted?
Comment

Question 1: I am not completely sure if averaging the error metrics across different prediction lengths is the best way to report and compare the results.

Question 2: Is there a justification as to why this averaging approach is adopted?


Response: Thank you for your valuable question. We have presented the complete results in Appendix C “Full results” of the original manuscript. We hope this could help you further understand the performance of our work. Due to space limitations, like other studies in this field, we only show average metrics in the main text, as averages facilitate quicker comparison of different methods' performance.

Comment

Weakness 3: Datasets not described clearly in the experimental setup.


Response: Thank you for your valuable suggestions.

In the experimental section of the main text, we briefly mention the datasets used for each task due to space limitations. A detailed description of the datasets, including statistics and more, is provided in Appendix A.2. We have referred to these details in Section 4 (Experiments) of the revision (Lines 301-302). If you have any further questions about the datasets, we are happy to assist you as best as we can.

Comment

Weakness 2: Some typos here and there in the paper.


Response: We apologize for the typos. We have carefully reviewed and corrected several typos in the revised version. Thank you for your attention to detail; it is crucial for improving our work.

Comment

Weakness 1: Since the authors use GNNs for context alignment, a befitting diagram showing the nodes and edges would have made it easier to follow the text.


Response: Thank you very much for your valuable advice, which is crucial for improving the clarity of our work. To illustrate our method more clearly, we have included a schematic diagram in Appendix E, including a presentation of the graph structure. Here is a brief description:

1. For VCA, we first tokenize the input TS sequence and the task prompt to obtain feature embeddings, depicted as light blue and light green blocks respectively; this is the fine-grained sequence. We then establish a fine-grained graph with directed edges from all input TS tokens to the first prompt token. Through learnable linear layers, the fine-grained nodes are mapped to coarse-grained nodes. The coarse-grained branch is represented by dark blue and dark green blocks, with similar edges constructed. In the training phase, information from the coarse-grained branch is transmitted to the fine-grained branch through the learnable mapping.

2. For FSCA (we have renamed DECA to FSCA in the revision) in the forecasting tasks, the overall process is similar to VCA, but we divide the input TS sequence into subsequences. In the diagram, 2 subsequences are used as a few-shot example: the first subsequence serves as the sequence used for prediction, and the second subsequence acts as the ground truth for the first. Based on this, we construct directed edges as shown in the diagram. It is important to note that both the first and second subsequences must be connected to the final prompt, as the entire sequence is required for the final prediction.

3. For FSCA in the classification tasks, we extract one sample for each category in the training set as a fixed example, with the remaining process similar to VCA or FSCA in forecasting.

Comment

We greatly appreciate Reviewer 2N38's recognition of our better use of LLMs and the novelty of our idea. Additionally, Reviewer 2N38 has provided valuable feedback on the presentation of our paper, and we are committed to making it clearer and avoiding ambiguities.

NOTE: We summarize the research background, challenges, and key distinctions of our method, which can be found at the beginning of the response page.

Comment

Thank you for your valuable suggestions! We have incorporated the necessary corrections and detailed explanations in the revised version, which are highlighted in red. The specific locations of these revisions have been provided in our previous itemized response.

Should you have any further questions, we are eager to address them and hope to receive your positive feedback on our updates. We deeply appreciate the time and effort you have invested in reviewing our work. Thank you once again!

Comment

Dear Reviewer 2N38,

Thank you for your feedback during the review process! We believe that our detailed response has addressed your concerns. If you have any concerns or questions, please do not hesitate to let us know before the author discussion period ends (less than two days). We will be happy to answer them during the discussion.

Thank you!

Official Review
Rating: 6

This work aims at leveraging and improving LLMs for time series tasks. Specifically, the authors propose context alignment, a technique that utilizes dual-scale GNNs in addition to the basic LLM architectures that helps LLM comprehend time series data. Few-shot prompting techniques in regular LLMs are also used in the time series design. Through various experiments on different time series benchmarks, the authors show the performance advantage of the proposed method over baseline models.

Strengths

This paper is tackling an interesting problem of leveraging and improving pretrained LLMs to do time series tasks. The authors explored a diverse set of benchmarks and baselines and showed a superior performance of the proposed method.

Weaknesses

This work motivates and explains the advantage of the proposed method by LLMs' deep understanding of linguistic logic and structure rather than superficial embedding processing. However, such explanations lack support from experiments. More analysis can be included on the LLM side. For example, does an inferior LLM basic architecture (either old design or small model sizes) or a badly trained LLM (undertrained or untrained) lead to a bad time series performance?

This work freezes most of the LLM structure and tunes dual-scale context-alignment GNNs between Transformer layers. There seems to be a lack of analysis on the effect of fully tuning the attention and feed-forward layers as well, and on whether the dual-scale context-alignment GNNs remain necessary in that case.

One of the interesting attributes of LLMs is the scaling effect (e.g., on model sizes, training data sizes, few-shot prompting amounts). The scaling aspect of LLMs and the proposed DECA seems to be unknown in the context of time series.

The writing can be improved for clarity, e.g., by providing qualitative examples besides key equations, especially in the few-shot prompting part, and adding an algorithm table.

Questions

N/A

Comment

We wish to express our sincere gratitude to Reviewer Dkrj for finding the problem we address interesting, and for acknowledging our extensive experimentation and effectiveness. We also greatly appreciate the suggestions for further experimental analysis. Rest assured, we are committed to addressing these concerns and improving our work.

NOTE: We summarize the research background, challenges, and key distinctions of our method, which can be found at the beginning of the response page.

Comment

[1] Zhou T, Niu P, Sun L, et al. One fits all: Power general time series analysis by pretrained lm[J]. Advances in neural information processing systems, 2023, 36: 43322-43355.

[2] Pan Z, Jiang Y, Garg S, et al. S²IP-LLM: Semantic Space Informed Prompt Learning with LLM for Time Series Forecasting[C]//Forty-first International Conference on Machine Learning. 2024.

[3] Jin M, Wang S, Ma L, et al. Time-llm: Time series forecasting by reprogramming large language models[J]. arXiv preprint arXiv:2310.01728, 2023.

[4] Sun C, Li H, Li Y, et al. TEST: Text Prototype Aligned Embedding to Activate LLM's Ability for Time Series[C]//The Twelfth International Conference on Learning Representations.

[5] Bian Y, Ju X, Li J, et al. Multi-Patch Prediction: Adapting Language Models for Time Series Representation Learning[C]//Forty-first International Conference on Machine Learning.

Comment

Weakness 4: The writing can be improved for clarity, e.g., by providing qualitative examples besides key equations, especially in the few-shot prompting part, and adding an algorithm table.


Response: Thank you for your valuable suggestions, which are crucial for enhancing the reading experience of our paper.

In the revised Appendix E&F, we have supplemented the qualitative examples and algorithm tables. We have also added a diagram illustrating data processing and graph construction (including the few-shot prompting and VCA part).

Here, we further explain the few-shot prompting part with an example for clarity; it can also be found in the revision.

Assume we have inputs processed through patching and token embedding: TS embeddings of length 8 and task description prompt embeddings of length 2 (the prompt is "Predict future sequences using previous data:" in our method; the length here is just an example):

Firstly, for the fine-grained branch, let's take the example where the input TS embeddings are divided into 2 subsequences, each containing 4 embeddings. Thus, $TS_{sub}^{1}$ is $[\mathbf e_{1,1},\mathbf e_{1,2},\mathbf e_{1,3},\mathbf e_{1,4}]$ and $TS_{sub}^{2}$ is $[\mathbf e_{2,1},\mathbf e_{2,2},\mathbf e_{2,3},\mathbf e_{2,4}]$, where $\mathbf e_{i,j}$ indicates the $j$-th embedding in subsequence $i$. Similarly, $\mathbf z_{i,j}$ refers to the $j$-th embedding in the task prompt of subsequence $i$. Here, $[\mathbf z_{1,1}, \mathbf z_{1,2}]=[\mathbf z_{2,1}, \mathbf z_{2,2}]$. Ultimately, Eq. 5 is instantiated as $[\mathbf e_{1,1},\mathbf e_{1,2},\mathbf e_{1,3},\mathbf e_{1,4}, \mathbf z_{1,1}, \mathbf z_{1,2}, \mathbf e_{2,1},\mathbf e_{2,2},\mathbf e_{2,3},\mathbf e_{2,4}, \mathbf z_{2,1}, \mathbf z_{2,2}]$.

Secondly, we need to construct a graph structure for this input before it enters the LLM. The basic logic for constructing the graph is that $TS_{sub}^{2}$ serves as the ground truth for $TS_{sub}^{1}$ (the latter subsequence serves as the correct label for the former subsequence). Specifically, starting with all elements in $TS_{sub}^{1}$, construct directed edges to the first item of the corresponding task description, $\mathbf z_{1,1}$. Subsequently, from the last item of the task description, $\mathbf z_{1,2}$, construct directed edges to all elements in $TS_{sub}^{2}$. Since all TS subsequences are used to predict future sequences, the first token of the last prompt, $\mathbf z_{2,1}$, needs to establish edge connections with both TS subsequences.

Thirdly, for the coarse-grained branch, it is essential to inform the LLM that a time series should be treated as a whole. Thus, each $TS_{sub}^{i}$ must be mapped to an individual node embedding by a linear layer. To align the scales, the prompt embeddings are also mapped to a node embedding. The coarse-grained sequence can be denoted as $[\tilde{\mathbf{e}}_1, \tilde{\mathbf{z}}^{(1)}, \tilde{\mathbf{e}}_2, \tilde{\mathbf{z}}^{(2)}]$ (instantiation of Eq. 6). Additionally, the graph construction logic is consistent with that of the fine-grained branch.
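To make the fine-grained construction above concrete, here is a minimal sketch (hypothetical code, not the paper's implementation) that lays out the adjacency for this 12-token example; edge-direction convention, normalization, self-loops, and the coarse-grained branch (which follows the same edge logic on the mapped nodes) are omitted or assumed.

```python
import torch

# Node order matches the instantiated Eq. 5 sequence:
# indices 0-3  -> e_{1,1..4} (TS_sub^1), 4-5  -> z_{1,1}, z_{1,2}
# indices 6-9  -> e_{2,1..4} (TS_sub^2), 10-11 -> z_{2,1}, z_{2,2}
num_nodes = 12
adj = torch.zeros(num_nodes, num_nodes)  # adj[i, j] = 1 means a directed edge i -> j

ts1, ts2 = range(0, 4), range(6, 10)
z11, z12, z21 = 4, 5, 10

# 1) every token of TS_sub^1 points to the first token of its task prompt, z_{1,1}
for i in ts1:
    adj[i, z11] = 1.0
# 2) the last prompt token z_{1,2} points to every token of TS_sub^2 (its "ground truth")
for j in ts2:
    adj[z12, j] = 1.0
# 3) the first token of the final prompt, z_{2,1}, is connected with both subsequences,
#    since the whole series is used for the final prediction
for i in list(ts1) + list(ts2):
    adj[i, z21] = 1.0
```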

Comment

Weakness 3: One of the interesting attributes of LLMs is the scaling effect (e.g., on model sizes, training data sizes, few-shot prompting amounts). The scaling aspect of LLMs and the proposed DECA seems to be unknown in the context of time series.


Response: Thank you very much for your advice. We conducted additional experiments to verify the effectiveness of our method under various conditions. We have added the relevant content in Appendix G.2 of the revised version.

  1. On model sizes, we have conducted ablation experiments on the number of GPT-2 layers (variants in Tab. 6). As the number of layers increases, performance declines, consistent with observations made in GPT4TS [1] and aLLM4TS [5].

  2. On training data sizes, we have trained models with 5% and 10% of the data in the few-shot forecasting tasks, and further extended our experiments to settings using 25%, 50%, and 75% of the training data. Increasing the data volume continuously improves outcomes, with the gain especially notable at the 50% point.

  3. Regarding the number of few-shot prompting examples, experiments with various amounts reveal that, as the number of examples increases, there is a modest improvement at the short prediction length (96), but the benefit diminishes with more examples. In contrast, for long prediction lengths such as 336 or 720, more examples lead to worse outcomes. This could be because our examples are derived from divisions of the TS input (whose total length is 512): more divisions mean shorter lengths per example, yet the required prediction length is much longer, and this mismatch leads to adverse effects.

The results (MSE) for different training data sizes

| Training data ratio | ETTh1 | ETTm1 |
| --- | --- | --- |
| 5% | 0.575 | 0.435 |
| 10% | 0.538 | 0.435 |
| 25% | 0.486 | 0.411 |
| 50% | 0.409 | 0.366 |
| 75% | 0.398 | 0.350 |
| 100% | 0.394 | 0.342 |

The results for different numbers of few-shot prompting examples

| Dataset | Prediction length | 1 example (MSE/MAE) | 2 examples (MSE/MAE) | 3 examples (MSE/MAE) | 4 examples (MSE/MAE) |
| --- | --- | --- | --- | --- | --- |
| ETTh1 | 96 | 0.349 / 0.389 | 0.343 / 0.385 | 0.341 / 0.382 | 0.356 / 0.394 |
| ETTh1 | 192 | 0.390 / 0.415 | 0.387 / 0.416 | 0.393 / 0.419 | 0.402 / 0.428 |
| ETTh1 | 336 | 0.402 / 0.432 | 0.407 / 0.439 | 0.414 / 0.445 | 0.440 / 0.456 |
| ETTh1 | 720 | 0.433 / 0.460 | 0.446 / 0.471 | 0.462 / 0.488 | 0.485 / 0.495 |
| ETTh1 | Avg | 0.394 / 0.424 | 0.396 / 0.428 | 0.403 / 0.434 | 0.421 / 0.443 |
| ETTm1 | 96 | 0.282 / 0.343 | 0.277 / 0.340 | 0.275 / 0.341 | 0.296 / 0.352 |
| ETTm1 | 192 | 0.324 / 0.369 | 0.326 / 0.374 | 0.331 / 0.385 | 0.341 / 0.377 |
| ETTm1 | 336 | 0.356 / 0.386 | 0.366 / 0.391 | 0.370 / 0.395 | 0.391 / 0.408 |
| ETTm1 | 720 | 0.405 / 0.417 | 0.412 / 0.425 | 0.428 / 0.432 | 0.451 / 0.438 |
| ETTm1 | Avg | 0.342 / 0.378 | 0.345 / 0.383 | 0.351 / 0.388 | 0.370 / 0.394 |
Comment

Weakness 2: This work freezes most of the LLM structure and tunes dual-scale context-alignment GNNs between Transformer layers. There seems to be lacking analysis on the effect of fully tuning the attention and feed-forward layers as well, and the subsequent necessity of the dual-scale context-alignment GNNs in that case.


Response: Thank you for your valuable suggestions regarding our work. We have added the relevant content in Appendix G.3 of the revised version.

In fact, GPT4TS [1] has already reported experiments that fully tune LLMs for TS tasks, which is the most straightforward way to leverage LLMs in this domain (adding only two linear layers at the input and output ends of the LLM). However, fully tuning requires higher computational overhead and yields suboptimal results. GPT4TS demonstrated through experiments and theoretical analysis that LLMs possess inherent generic knowledge that can be utilized directly, and that fully tuning LLMs would compromise this generic knowledge. Consequently, our approach, along with other TS analysis methods based on pre-trained LLMs (e.g., S²IP-LLM [2], Time-LLM [3], TEST [4], aLLM4TS [5], and so on), freezes most of the LLM structure to explore efficient ways of activating their potential for TS tasks. However, we agree that the fully tuning experiment is still necessary, given our significant differences from GPT4TS. Thank you for the reminder. We have supplemented the results of our method with full tuning, which yields suboptimal performance similar to GPT4TS.

| Method | ETTh1 | ETTm1 |
| --- | --- | --- |
| GPT4TS | 0.427 | 0.352 |
| GPT4TS (Fully tuning) | 0.469 | 0.406 |
| DECA | 0.394 | 0.342 |
| DECA (Fully tuning) | 0.457 | 0.383 |
Comment

Weakness 1: This work motivates and explains the advantage of the proposed method by LLMs' deep understanding of linguistic logic and structure rather than superficial embedding processing. However, such explanations lack support from experiments. More analysis can be included on the LLM side. For example, does an inferior LLM basic architecture (either old design or small model sizes) or a badly trained LLM (undertrained or untrained) lead to a bad time series performance?


Response: Thank you for your constructive review suggestions. We agree that this experiment is interesting and essential, as it provides more favorable validation for our method. We have added the relevant content in Appendix G.1 of the revised version.

As the table illustrates, we add experiments on the LLM side, randomly initializing various proportions of the GPT-2 pre-trained weights to emulate under-trained and untrained scenarios. As the initialization proportion rises, the LLM's contextual understanding weakens, and the performance of our method drops accordingly. When the model's capability is weak, our results are as poor as those of GPT4TS [1] (the most direct method of utilizing an LLM for TS tasks) and S²IP-LLM [2] (a token-alignment based method). However, as the LLM's capability improves, our method significantly outperforms GPT4TS and S²IP-LLM, achieving a lower MSE. Appendix G.1 of the revision includes trend-line graphs of the experimental results. This result demonstrates that our proposed Context-Alignment paradigm, which emphasizes LLMs' deep understanding of linguistic context, more effectively activates the potential of pre-trained LLMs in TS tasks.

Besides, the original text includes experimental support for this explanation on the method side. Variant A.1, shown in Table 6, removes the Dual-Scale Context Alignment GNN framework, and Variant A.2 involves random initialization of GNN connectivity (i.e., flawed logic guidance), which significantly decreases performance. The more severe impact observed in A.2 suggests that disrupted logic and structure have more severe negative effects, highlighting LLMs' deep understanding of linguistic context.

| Randomly initialized ratio | DECA (ETTh1) | GPT4TS (ETTh1) | S²IP-LLM (ETTh1) | DECA (ETTh2) | GPT4TS (ETTh2) | S²IP-LLM (ETTh2) |
| --- | --- | --- | --- | --- | --- | --- |
| 0% | 0.394 | 0.427 | 0.418 | 0.316 | 0.354 | 0.355 |
| 20% | 0.401 | 0.435 | 0.425 | 0.329 | 0.358 | 0.363 |
| 40% | 0.407 | 0.442 | 0.438 | 0.335 | 0.367 | 0.365 |
| 60% | 0.421 | 0.445 | 0.453 | 0.338 | 0.369 | 0.381 |
| 80% | 0.465 | 0.477 | 0.481 | 0.404 | 0.416 | 0.411 |
| 100% | 0.534 | 0.531 | 0.538 | 0.441 | 0.445 | 0.437 |
Comment

We sincerely thank you for your valuable questions, and we are more than happy to address them. We have supplemented the experiments and included the relevant corrections and explanations in the revised version (marked in red). The specific locations of the revisions have been noted in our previous itemized response. Based on your comments, we believe that you have well understood the background, motivation, and key contributions of our work. We hope you might consider increasing your confidence or raising your evaluation score.

Once again, we truly appreciate the time and effort you have devoted to our paper.

Comment

Dear Reviewer Dkrj,

Thank you for your feedback during the review process! We believe that our detailed response has addressed your concerns. If you have any concerns or questions, please do not hesitate to let us know before the author discussion period ends (less than two days). We will be happy to answer them during the discussion.

Thank you!

Comment

We appreciate all reviewers for recognizing the contribution of our paper, in particular reviewers 2N38, oFgL, and N4CB for recognizing its novelty, with N4CB further affirming the logic of the model design. Following our supplementary experiments and responses, reviewers oFgL and N4CB have improved their overall assessments. Our thanks go to all reviewers and the AC for their support.

We summarize the research background, challenges, and key distinctions of our method below, then discuss each comment specifically in the reviewers' responses.

Research Background: Time series (TS) datasets span multiple domains such as medical, industrial, transportation, power, etc. Thus, there is a critical need for models like LLMs that can generalize across diverse domains. Recent research, including our method, aims to efficiently use pre-trained LLMs for time series tasks. Particularly, GPT4TS [1] has proven, both theoretically and empirically, that LLMs possess generic knowledge useful for these tasks.

Challenges: The significant differences between the training text data for LLMs and TS data hinder the effective activation of LLM capabilities in TS tasks.

Distinction from Mainstream Token-Level Alignment Methods: Token-alignment methods utilize a vocabulary and employ various techniques to align TS inputs with vocabulary words that describe temporal features, such as rise, fall, periodic, steady, short, long, and so on, facilitating comprehension by LLMs. Differing from these methods, we leverage LLMs' deep understanding of linguistic context and, inspired by insights from the field of linguistics, propose Context-Alignment (comprising logical alignment and structural alignment). Context-Alignment aligns TS with a linguistic component to enable LLMs to contextualize and comprehend TS data, thereby activating their capabilities.

Distinction from Directly Enhancing LLMs' Capabilities on TS: We argue that without LLMs understanding TS data, enhancements are less interpretable and less effective. Thus, we first suggest a two-step route: activate, then enhance. Our Vanilla Context-Alignment (VCA) activates LLM capabilities through Context-Alignment, while our final method, DECA, extends VCA with few-shot prompting to boost performance, following this two-step scheme.

Summary of revisions: Following reviewer feedback, we primarily made the following modifications:

  1. Renamed DECA to FSCA (Few-Shot Prompting based Context-Alignment) as suggested by Reviewer N4CB.

  2. Added qualitative examples, algorithm tables, and a schematic diagram of the graph structure in Appendices E,F, as recommended by Reviewers Dkrj and 2N38.

  3. Incorporated additional experiments and analyses in Appendix G, focusing on computational efficiency, comparisons with TS foundation models, and ablation studies such as alternative network structures to GNNs, based on broader reviewer suggestions.

[1] Zhou T, Niu P, Sun L, et al. One fits all: Power general time series analysis by pretrained lm[J]. Advances in neural information processing systems, 2023, 36: 43322-43355.

AC Meta-Review

This paper proposes a novel approach to leveraging LLMs for time series tasks by aligning time series data to linguistic contexts through dual-scale context-alignment GNNs (DSCA-GNNs). The authors validate their method with extensive experiments across forecasting, classification, and anomaly detection, demonstrating improved performance over several baselines, particularly in few-shot and zero-shot scenarios.

Strengths: The paper tackles an interesting problem of adapting LLMs for time series tasks. The proposed context alignment induces "structural and logical alignment". The paper includes a wide range of experiments with meaningful results.

Weaknesses: Despite its strengths, the paper has several limitations. The original submission lacked sufficient comparisons to non-LLM-based pretrained models, which was mostly addressed during the rebuttal.

Decision: Based on the strong empirical performance and thorough revisions during the rebuttal, I recommend acceptance. The contributions set a strong foundation for future work in integrating LLMs with time series data.

Additional Comments from the Reviewer Discussion

The review process highlighted several important points:

  1. Comparison with non-LLM-based pretrained models: Reviewers suggested comparisons to pre-trained time series models (e.g., UniTS-ST, MOMENT), which the authors incorporated during the rebuttal. The added experiments demonstrated the superiority of the proposed method in most tasks.

  2. Evaluation of GNN choice: Concerns about GNNs' effectiveness compared to other architectures (e.g., CNNs, self-attention) were addressed with additional ablation studies. These confirmed that GNNs performed best for context alignment.

  3. Complexity and computational overhead: The authors clarified computational costs, showing that while their method only introduces modest overhead, it remains competitive with baseline methods.

Final Decision

Accept (Poster)