LAST SToP for Modeling Asynchronous Time Series
Using Large Language Models (LLMs) to model Asynchronous Time Series
Abstract
Reviews and Discussion
The authors propose a method for modeling temporal event sequences by finetuning pretrained language models. The paper shows that by using a novel prompt tuning method they are able to outperform several baselines and ablations.
Questions for Authors
- How does StoP compare to other prompt tuning techniques? It would appear that it is applicable to all kinds of NLP tasks, not just those involving time series. Your experiments show that it's really doing something different (and apparently better) than SP. Similarly, how do other techniques perform on this task?
- "For each of these datasets, the semantic meaning of the event type is unknown, and only the index of the event type is available." However, in Appendix A.4 you go on to to test the case where the event description is replaced with gibberish. This leaves me a little confused - in the main experiments (e.g. Table 1) does the model have access to semantic information about the events?
- If you tune LLMTime and LLMProcesses rather than doing zero-shot, is the comparison still as favorable? I know that these models are originally proposed as zero-shot, but it seems like it would be a fairer comparison if you tuned them as well.
- Why are you using QLoRA rather than finetuning the whole model?
- How does your method compare to simpler, non-neural baselines?
Claims and Evidence
On the whole, yes. I am generally skeptical of methods that adopt language models for time series forecasting, as prior work has shown that they are outperformed by simple linear methods (Tan et al., 2024). However, the authors of this paper do compare to relevant TPP methods (Table 2), conduct strong ablations, and compare to a random baseline.
One other result from Tan et al. (2024) is that randomly initialized language models (i.e. those without any pretraining at all) surprisingly perform just as well as pretrained models. The paper would be strengthened by showing that this is not the case for this task.
Tan, Mingtian, Mike A. Merrill, Vinayak Gupta, Tim Althoff, and Thomas Hartvigsen. 2024. “Are Language Models Actually Useful for Time Series Forecasting?” arXiv. http://arxiv.org/abs/2406.16964.
Methods and Evaluation Criteria
Yes, the evaluation methods and datasets are standard and appropriate.
Theoretical Claims
N/A
Experimental Design and Analysis
I have a few questions about the experiments that I've listed below. On the whole the experiments (including the "bonus" experiments in the appendix) are interesting and well thought out.
Supplementary Material
Yes, I read the Appendix.
Relation to Broader Scientific Literature
The contributions of this paper will be interesting to anyone working in the field of language models for time series, which is an active area of research. The StoP method is potentially of broad interest to the NLP community, although this is not the primary focus of the paper.
Essential References Not Discussed
See "Claims and Evidence"
Other Strengths and Weaknesses
The paper is very well written and contains exhaustive experiments.
Other Comments or Suggestions
- No need to reintroduce the notation in lines 210–215; removing it would help the paper flow.
- Your description of LoRA in Section 4.2 is confusing. My understanding is that you're simply applying the original technique, but the description reads as though you're making a novel contribution. This should be clarified.
We appreciate the reviewer's positive evaluation, thoughtful suggestions, and recognition of our contributions, including the LASTS representation and the Stochastic Soft Prompting (SToP) mechanism. We respond to each comment below:
Random Initialization: Thank you for highlighting this relevant literature. We evaluated pretraining on the Breakfast dataset's forecasting task and observed clear benefits: fine-tuning a randomly initialized model yielded F1-score 0.14 vs 0.26, Accuracy 0.21 vs 0.39, and MAE 39.29 vs 32.55. This highlights the value of text pretraining in our asynchronous setting, given the rich natural language input. We’re happy to include this as an additional baseline in the main table of the camera-ready version if the reviewer recommends it.
Redundant Notation (Lines 210–215): Thank you for pointing this out. We will remove the redundant notation.
Clarification on LoRA: Our goal in this section was to show that LASTS integrates easily with existing PEFT methods like LoRA and Soft Prompting for adapting an LLM backbone. We acknowledge the wording may have caused confusion and will revise it to clarify that we use the original LoRA implementation from [1].
[1] https://github.com/huggingface/peft
Questions
- We answer this in two parts:
- SToP Generality: We agree that SToP may have broader applicability, but the focus of our current work is specifically on asynchronous time series. Exploring SToP in other NLP domains is beyond the scope of this paper and is left for future work.
- Comparison to Other Prompt Tuning Techniques: We include the most widely-used PEFT methods—Soft Prompting and QLoRA—as strong representatives for LLM adaptation; please see our response to Reviewer L5nG titled "Comparison to Other Prompt-Based Adaptation Methods" for further details.
- Clarification on use of semantic information: We appreciate the reviewer’s careful reading:
- The three datasets in Table 1—Breakfast, MultiTHUMOS, and EPIC-KITCHENS—contain semantic information that is available to the model and used in our experiments. (L270-274)
- For the five TPP datasets shown in Table 2, event names are replaced by categorical indices, and neither our model nor any TPP baseline uses semantic information. (L264-267)
- The experiments in Appendix A.4 using gibberish and textual descriptions are controlled ablations designed to isolate the effect of semantic content.
These experiments demonstrate that our model is flexible and effectively utilizes semantic information when it is available. We will make this clarification more prominent in the camera-ready version of the manuscript.
- Finetuning LLMTime, LLM Processes: We focused on using these models in their zero-shot capacity, as they were specifically proposed and pretrained for that purpose. Since these methods are compared against LASTS in the same zero-shot setting, we believe our comparisons are fair and meaningful, especially to assess generalization without additional tuning. We also highlight this zero-shot comparison in our main paper (L381–384), Figure 5, and Appendix A.6. We agree that fine-tuning LLMTime and LLMProcesses could be interesting baselines, but it may require additional effort to determine the optimal fine-tuning recipe and ensure a fair evaluation between models. Therefore, we will consider it as part of our future work.
- QLoRA vs Full Finetune: We chose QLoRA because it allows parameter-efficient adaptation with low memory cost, aligning with our goal of scalable deployment. As Table 1 and Appendix A.9 show, QLoRA performs competitively, and SToP improves upon it while using only 0.02% of model parameters. While full fine-tuning could potentially yield slightly improved performance, it would be computationally impractical given the limitations of our current hardware.
- Simpler non-neural baselines: We thank the reviewer for this suggestion. Our primary baselines were state-of-the-art TPP models and recent LLM/PEFT techniques. Most recent literature on asynchronous time series modeling has shifted toward neural approaches, and we follow this trend to maintain consistency with comparable works. In non-neural TPPs, the features and functional forms used to model event intensities, dependencies, and histories often need to be manually designed or based on restrictive assumptions; this limits the model's ability to generalize to complex datasets (see the illustrative sketch below). In contrast, neural models offer greater flexibility in modeling the intensity function, enabling them to capture intricate relationships within the data. This adaptability makes neural TPPs better suited to handle diverse and complex datasets, improving their generalization and performance.
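To make the contrast concrete, here is a minimal, purely illustrative sketch (not from the paper) of a classical Hawkes-process intensity, where the base rate, excitation strength, and decay kernel are all hand-picked functional-form choices of the kind that constrain non-neural TPPs:

```python
import math

# Purely illustrative (not the authors' code): a classical Hawkes intensity
# with an exponential kernel. mu (base rate), alpha (excitation), and beta
# (decay) are fixed by hand rather than learned flexibly.
def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
    """lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i))."""
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in history if ti < t)

print(hawkes_intensity(5.0, [1.0, 2.5, 4.8]))  # intensity shortly after three events
```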
Once again, thank you for your encouraging and constructive review. We are grateful for the thoughtful feedback.
This paper introduces a novel framework for modeling asynchronous time series data using Large Language Models (LLMs). Unlike regular time series with evenly spaced time points, asynchronous time series consist of timestamped events occurring at irregular intervals, each described in natural language. This work demonstrates the potential of LLM-based approaches for asynchronous time series analysis across multiple tasks and domains, offering a flexible alternative to traditional methods while leveraging the world knowledge embedded in LLMs.
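For intuition, here is a hypothetical example of how such a sequence might be serialized as text (our own illustration; the paper's actual prompt template may differ):

```python
# Hypothetical serialization of an asynchronous event sequence as natural
# language, in the spirit of LASTS; the exact prompt format is the paper's.
events = [(0.0, "crack egg"), (12.4, "stir milk"), (47.9, "pour dough into pan")]
prompt = "\n".join(f"time: {t:.1f}s, event: {desc}" for t, desc in events)
print(prompt)
# time: 0.0s, event: crack egg
# time: 12.4s, event: stir milk
# time: 47.9s, event: pour dough into pan
```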
Questions for Authors
What exactly can we learn from Figure 5 and Figure 13?
Claims and Evidence
- LASTS as an effective framework for asynchronous time series modeling
- Evidence: Comprehensive evaluations across multiple datasets (Breakfast, MultiTHUMOS, EPIC-KITCHENS and five standard TPP datasets) show consistent performance improvements.
- The zero-shot performance of LASTS exceeds other zero-shot baselines (Tables 1, 2, Figure 4).
- Comparison with specialized TPP models shows competitive or superior performance (Table 2).
- Stochastic Soft Prompting (SToP) outperforms other PEFT methods
- Evidence: Detailed comparative results show SToP consistently outperforming SP and QLoRA across datasets and tasks (Table 1).
- Appendix Tables 6-7 quantify the performance gains (average 12.69% M-F1 improvement over SP and 13.55% over QLoRA).
- Training efficiency measurements show ~25% faster training than standard soft prompting.
- Multi-task capability without task-specific designs
- Evidence: The same LASTS representation is successfully applied to forecasting, imputation, and anomaly detection without architectural changes (Table 1).
- Performance on all three tasks significantly exceeds baselines when using the adapted models.
- Ability to handle large event spaces
- Evidence: Successful modeling of EPIC-KITCHENS dataset with ~20,000 unique event descriptions.
- Table 2 indicates that traditional TPP methods encounter OOM errors on this dataset.
Methods and Evaluation Criteria
The methods are well-designed for the problem, and the evaluation criteria are appropriate and comprehensive, covering diverse datasets, tasks, and comparison points. The authors have made sensible choices in metrics and baseline comparisons that allow for a fair assessment of their contributions.
Theoretical Claims
There is no theoretical contribution.
Experimental Design and Analysis
- Chronos's performance is poor based on the results shown in Table 1. More analysis, or additional time series foundation models such as TEMPO, should be included in the experiments to understand this fundamental result.
- Can a NeuralODE-based solution such as LipCDE [1] be used to address such irregular time series tasks?
[1] Cao, D., Enouen, J., Wang, Y., Song, X., Meng, C., Niu, H., & Liu, Y. (2023, June). Estimating treatment effects from irregular time series observations with hidden confounders. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 37, No. 6, pp. 6897-6905).
Supplementary Material
The supplementary material is the same as the submission.
Relation to Broader Scientific Literature
The key contributions of LAST SToP relate to the broader scientific literature by: (1) addressing limitations of traditional Temporal Point Processes that struggle with large event spaces and natural language descriptions; (2) bridging the gap between foundation models for time series (like Chronos) and asynchronous time series modeling; (3) extending Parameter-Efficient Fine-Tuning approaches like soft prompts with a novel stochastic training method that parallels techniques from other domains like dropout and Matryoshka Representations; and (4) demonstrating that LLMs can effectively model complex temporal data beyond their traditional text domain, complementing efforts like TimeLLM and LLMTime which focused on regular time series.
Essential References Not Discussed
As mentioned in the previous answers, there could be more NeuralODE-based related works.
Other Strengths and Weaknesses
- Creative Integration of Concepts
- The paper creatively combines soft prompting approaches with stochastic training techniques, drawing inspiration from areas like dropout and Matryoshka representations to create a novel adaptation mechanism.
- The authors' approach to viewing asynchronous time series as natural language data is an elegant shift in perspective that leverages the strengths of LLMs.
- Practical Applicability
- The method addresses real-world challenges in handling asynchronous time series data, which appear in numerous important domains (healthcare, finance, e-commerce, social media).
- The parameter-efficient nature of SToP (using only 0.02% of model parameters) makes it practical for deployment in resource-constrained environments.
- Technical Innovation in Training
- The stochastic prefix selection during training is a simple yet effective innovation that produces measurable benefits in representation quality and training speed.
- The observed coarse-to-fine structure in learned prompts suggests an interesting emergent property that could have broader applications in prompt tuning.
- Comprehensive Analysis
- The paper provides thorough analyses (t-SNE visualizations, cosine similarity measurements, model probing) that give insights into why their method works.
- The scaling experiments across different model sizes (1B, 3B, 8B) demonstrate the approach's robustness and future potential.
Weaknesses
- Limited Analysis of Domain-Specific Performance
- While the paper tests on datasets from different domains, there's limited analysis of how performance varies across domains and why certain domains might benefit more from the approach.
- A deeper exploration of domain-specific challenges and how the method addresses them would strengthen the paper.
- Theoretical Foundations
- The paper lacks theoretical grounding for why SToP works better than standard soft prompting. While empirical results are strong, a more formal analysis would strengthen the contribution.
- The connection between the stochastic training procedure and the emergence of coarse-to-fine structure could be better explained.
- Practical Implementation Details
Other Comments or Suggestions
- The paper would be strengthened by including some concrete examples of model predictions compared to ground truth, especially for cases where the model performs particularly well or poorly.
- Visualizing how predictions differ across methods could provide intuition about the advantages of LASTS.
Thank you for recognizing the creative integration of concepts, practical applicability, innovation in training techniques, and comprehensive analysis in our work. Responses to key points raised:
Chronos: Limited analysis; inclusion of TEMPO: Chronos performs poorly as expected, given its reliance on time series–specific augmentations and synthetic data, which make it ill-suited for asynchronous time series due to their fundamental differences (L046–L020). We included Chronos as a representative general-purpose TS model to highlight how such assumptions hinder performance, unlike models such as LLMTime and LLMProcesses that incorporate fewer biases, which likely contributes to their stronger performance.
Thanks for pointing us to TEMPO; it faces similar limitations due to its reliance on seasonal and trend decomposition—concepts not meaningful for asynchronous sequences. These comparisons underscore the need for purpose-built approaches like LASTS. We will revise the manuscript to clarify this discussion.
NeuralODEs, LipCDE: Thanks for bringing up NeuralODE-based approaches like LipCDE. NeuralODEs are generally ill-suited for modeling asynchronous time series due to two key limitations:
- In ODE systems, the future trajectory is entirely determined by the initial state, which implies that modeling long asynchronous time series would require the initial state to encode all future observations—an unrealistic assumption for most real-world data.
- The continuous trajectories assumed by NeuralODEs make them ill-suited for capturing abrupt changes or irregular time gaps, which are common in asynchronous time series with discrete, sudden shifts.
Thus, NeuralODEs are only appropriate when underlying dynamics are deterministic and continuous, which is rarely true in practice. To our knowledge, no NeuralODE-based models have been evaluated on the datasets used in this paper.
Weaknesses
- Domain-Specific Analysis: We agree that domain-specific analysis can offer additional insights. In our work, we chose to focus on generalizability across 8 diverse datasets spanning 7 distinct domains, given the broad ICML audience; this breadth makes it difficult to derive domain-specific conclusions. However, we make the following observations, to be included in the camera-ready version:
- Online Shopping (amazon): As discussed in L371–373, the Amazon dataset includes a mix of unrelated event types grouped under one label, which possibly hurts time prediction.
- Cooking: Datasets like Breakfast and EPIC-KITCHENS show strong performance, as our model benefits from rich natural language descriptions and meaningful event sequences.
- Sports (MultiTHUMOS): This dataset features more Markovian event transitions, making forecasting relatively easier but anomaly detection harder.
- We answer this in two parts:
- Theoretically, why SToP is better than SP: Our focus in this work is to demonstrate empirically, across multiple tasks and datasets, that SToP outperforms soft prompting—both in performance (Table 1) and in the structure of learned tokens (Section 4.5). We agree that a theoretical foundation would be valuable, but given the complexity, we are exploring this as future work.
- Coarse-to-Fine Structure Emergence: This emergence is a direct consequence of random prefix-length selection during training, which encourages early tokens to capture general patterns while later tokens refine the representation (see the sketch below). Similar behaviours are observed in prior work discussed in our manuscript (L254–260):
- SoundStream's residual vector quantization (RVQ), where selecting a random number of quantizers during training leads to coarse-to-fine audio reconstruction, and
- Matryoshka Representations, which explicitly optimize for informative prefixes.
We will incorporate this discussion in the final version of the paper.
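For intuition, here is a minimal sketch of what random prefix-length selection could look like in a PyTorch-style setup (illustrative only; module and variable names are ours, not the released implementation):

```python
import torch
import torch.nn as nn

# Minimal illustrative sketch (not the authors' code): a learnable soft
# prompt whose random-length prefix is prepended during training. Early
# tokens are included in every batch while later tokens are sampled less
# often, which is the mechanism credited above for the coarse-to-fine
# structure of the learned prompt.
class StochasticSoftPrompt(nn.Module):
    def __init__(self, num_tokens: int, embed_dim: int):
        super().__init__()
        self.soft_prompt = nn.Parameter(torch.randn(num_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) embeddings of the text prompt
        if self.training:
            # Sample a random prefix length in 1..num_tokens each step.
            k = int(torch.randint(1, self.soft_prompt.size(0) + 1, (1,)))
        else:
            k = self.soft_prompt.size(0)  # use the full prompt at inference
        prefix = self.soft_prompt[:k].unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)
```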
- Practical Implementation Details: We're happy to clarify any remaining implementation details, revise the manuscript as needed, and will release the code upon publication.
Other Comments: Examples and Visualization: Thank you for the suggestion. Due to space constraints, we cannot include this analysis in the rebuttal but will incorporate it in the camera-ready version.
Questions
- What do we learn from Figures 5 & 13: We hypothesize the emergence of a coarse-to-fine structure in SToP, where earlier tokens capture diverse high-level features, and later tokens refine them:
- Fig. 5: The t-SNE projection shows that the first 100 tokens in SToP are more spread out compared to the clustered tokens in standard soft prompting. This indicates that SToP learns more diverse representations in earlier tokens.
- Fig. 13: The cosine similarity between adjacent tokens is lower at the beginning of the SToP prompt and gradually increases, consistent with a coarse-to-fine pattern. No such organization is observed in standard soft prompting. A sketch of this diagnostic follows below.
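A minimal sketch of the adjacent-token cosine similarity diagnostic, assuming the learned prompt is available as a (num_tokens, embed_dim) tensor (names are illustrative):

```python
import torch
import torch.nn.functional as F

# Illustrative only: cosine similarity between adjacent soft-prompt tokens.
# Low similarity early and higher similarity later is the coarse-to-fine
# signature described above.
def adjacent_cosine_similarity(prompt: torch.Tensor) -> torch.Tensor:
    """prompt: (num_tokens, embed_dim) -> (num_tokens - 1,) similarities."""
    return F.cosine_similarity(prompt[:-1], prompt[1:], dim=-1)

prompt = torch.randn(100, 2048)  # stand-in for a learned prompt matrix
print(adjacent_cosine_similarity(prompt).mean())
```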
Thank you for your thoughtful review.
This paper presents LASTS (Language-modeled Asynchronous Time Series), a novel framework for modeling asynchronous time series data using Large Language Models (LLMs). The approach addresses the challenges of irregular timing and diverse event types by representing asynchronous time series as natural language prompts, allowing LLMs to leverage their broad world knowledge for reasoning across different domains and tasks. The authors introduce Stochastic Soft Prompting (StoP), a parameter-efficient adaptation technique that significantly improves model performance. Unlike traditional soft prompting, StoP randomly selects prefixes of the prompt during training, encouraging the learning of diverse representations and improving generalizability. Through extensive experiments on real-world datasets, the paper demonstrates that LASTS achieves state-of-the-art performance across forecasting, anomaly detection, and data imputation tasks. The framework outperforms existing methods including temporal point process models, foundation models for time series, and other LLM-based approaches.
Questions for Authors
- Did you perform chronological splitting of the datasets for train/validation/test, or was it done randomly? This is particularly important for time series data to avoid data leakage and better reflect real-world deployment scenarios. If random splitting was used, how might this affect the validity of your results compared to chronological splitting?
- Beyond comparing with traditional soft prompting and QLoRA, have you considered comparing with other prompt-based adaptation methods like prefix tuning or adapter layers? How might these comparisons affect your claims about the superiority of StoP?
Claims and Evidence
Yes, I think most claims in the paper are well-supported.
- LASTS effectively leverages LLMs for asynchronous time series analysis
This is supported by extensive experiments across multiple datasets and tasks (forecasting, anomaly detection, imputation) and demonstrated through comparisons with traditional temporal point process models and other LLM-based approaches.
- Stochastic Soft Prompting (StoP) improves performance over traditional soft prompting
This is supported by quantitative results showing improvements in Macro-F1 scores across datasets and visualized through t-SNE projections demonstrating more diverse token representations in StoP.
- LASTS is parameter-efficient
This is supported by implementation details showing only 1.6M trainable parameters for prompt tuning.
- LASTS outperforms existing methods
This is supported by comprehensive comparisons with temporal point process models, foundation models for time series, and other LLM-based approaches.
Methods and Evaluation Criteria
Yes, I do think they make sense for the problem studied.
For the LASTS framework, I think the approach of representing asynchronous time series as natural language prompts makes sense given the irregular timing and diverse event types characteristic of such data.
For the StoP technique, I think its modification to traditional soft prompting addresses the need for more diverse representations in prompt-based adaptation and the coarse-to-fine structure learned by StoP appears suitable for capturing both general task information and specific details in asynchronous time series.
Theoretical Claims
There are no theoretical claims in the paper.
Experimental Design and Analysis
I checked the soundness of the experimental designs and analyses of the paper. I think they are overall sound.
- Experimental designs: The authors select diverse approaches as baselines, including random baselines, foundation models for time series (Chronos), LLM-based approaches (LLMTime, LLMProcesses), and TPP models. The authors cover three text-based action datasets and five standard TPP datasets in the experiments. As for the evaluation metrics, the authors use M-F1, MAE, and RMSE to evaluate the performance of models.
- Analyses: The paper includes ablation studies comparing different prompt representations (time first vs. event first) and different time representations (inter-arrival times vs. durations). These analyses help establish the effectiveness of the chosen representations. The comparison between StoP and traditional soft prompting is thorough, with both quantitative results and qualitative analysis of learned representations. The analysis of training speed differences between StoP and traditional soft prompting adds practical value. The evaluation of few-shot learning with varying numbers of examples (k=0 to k=10) is well-designed and provides useful insights into how many examples are needed for optimal performance. The identification of k=5 as the optimal few-shot setting is justified by the results.
Supplementary Material
No, I didn't.
Relation to Broader Scientific Literature
The paper's contributions represent meaningful advancements while building on established foundations in the field. The authors have successfully connected their work to prior literature, demonstrating how LASTS and StoP address limitations in existing approaches and extend the capabilities of LLMs to new types of data and tasks.
Essential References Not Discussed
No
Other Strengths and Weaknesses
None
Other Comments or Suggestions
None
We sincerely thank the reviewer for their thoughtful summary and generous assessment of our work. We are glad that the core contributions—LASTS and Stochastic Soft Prompting (SToP)—were found meaningful and well-supported. Below, we address the specific questions raised:
Train/Validation/Test Splits
We follow the standard protocol for each dataset as adopted in prior work (e.g., [Xue et al., 2024]). The splitting is done at the sequence level—given a dataset of N independent sequences, we use 80% for training, and 10% each for validation and testing, sampled independently from a shared distribution. The model thus learns patterns governing sequences of events and their interarrival times from the training set, and generalizes to unseen sequences in the validation and test sets. We use the standard protocol because it enables us to compare our results with those already published.
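As a minimal sketch (our illustration, not the actual preprocessing code), the sequence-level split amounts to:

```python
import random

# Illustrative sequence-level 80/10/10 split: whole, independent sequences
# are the unit of splitting, never individual events within a sequence.
def split_sequences(sequences, seed=0):
    seqs = list(sequences)
    random.Random(seed).shuffle(seqs)
    n = len(seqs)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return seqs[:n_train], seqs[n_train:n_train + n_val], seqs[n_train + n_val:]
```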
We acknowledge that traditional time series tasks require chronological splits to account for drift. However, in our datasets (e.g., EPIC-KITCHENS, MultiTHUMOS), the sequences are short, self-contained, and typically do not exhibit long-term temporal drift. This mitigates the need for chronological splitting. Nevertheless, we appreciate the reviewer’s concern and will make our data split methodology more explicit in the final version of the paper.
Comparison to Other Prompt-Based Adaptation Methods
Thank you for this suggestion. Many adapters have been proposed recently, and it would not be feasible to compare all of them. Therefore, we decided to focus our comparison on widely-used PEFT techniques: Soft Prompting and QLoRA. QLoRA is the standard adapter-based method and a strong baseline, and we highlight that LASTS is compatible with such adapter techniques. Although we have not yet tried prefix tuning, there are no methodological limitations that would prevent its use with our stochastic training strategy (SToP), and we plan to investigate this in future work.
Once again, thank you for your encouraging and constructive review. We are grateful for the thoughtful feedback.
This paper presents LASTS, a novel framework that uses large language models (LLMs) to model asynchronous time series—sequences of events that occur at irregular intervals and are described in natural language. Unlike traditional methods that rely on fixed time intervals and predefined event categories, LASTS leverages the semantic richness of event descriptions and irregular timing to enable LLMs to perform tasks such as forecasting, anomaly detection, and data imputation. The authors also propose Stochastic Soft Prompting (StoP), a new prompt tuning technique that improves model performance by randomly truncating soft prompts during training, resulting in more diverse and generalizable representations. Through extensive experiments on real-world datasets, including action recognition and temporal point process benchmarks, LASTS consistently outperforms existing methods and demonstrates strong adaptability across different tasks. The approach offers a flexible and efficient alternative to conventional time series models and highlights the potential of LLM-based solutions for complex temporal reasoning.
Questions for Authors
N/A
Claims and Evidence
Most claims in the submission are supported by clear and convincing evidence. The authors provide comprehensive experimental results across multiple datasets and tasks (forecasting, imputation, anomaly detection) to demonstrate the effectiveness of their method. They also include strong baselines for comparison, such as traditional TPP models, foundation models, and other LLM-based approaches. The performance gains of the proposed LASTS framework and Stochastic Soft Prompting are consistently shown. However, one potential limitation is the relatively weaker time prediction performance compared to some TPP models, which the authors acknowledge but do not fully address with additional modeling strategies.
Methods and Evaluation Criteria
Yes, the proposed methods and evaluation criteria are appropriate for the problem of modeling asynchronous time series. The use of natural language-based prompts aligns well with the irregular and semantically rich nature of such data. The selected tasks—forecasting, imputation, and anomaly detection—are relevant and practical. The chosen datasets, including both real-world temporal point process and action recognition data, provide a comprehensive benchmark. However, incorporating more diverse anomaly detection baselines or real-world industrial datasets could further strengthen the evaluation.
Theoretical Claims
N/A
Experimental Design and Analysis
The experimental design appears generally sound, with a reasonable choice of tasks, datasets, and evaluation metrics. The use of multiple baselines and both zero-shot and fine-tuned settings adds credibility to the results. At a glance, there are no obvious flaws, but a deeper look would be needed to verify the robustness of all experimental components.
Supplementary Material
I briefly looked through the supplementary material. Most of the appendix appears to provide supporting details such as prompt templates, dataset preprocessing, and additional quantitative results. These sections mainly serve to reinforce the main paper rather than introduce new claims or critical insights. While useful for completeness and reproducibility, the appendix does not present fundamentally new contributions beyond what is already discussed in the main text.
Relation to Broader Scientific Literature
The key contributions of the paper build on and extend several existing directions in the broader scientific literature. First, it advances the emerging line of work that explores using large language models for time series tasks by adapting them to irregular, event-based sequences rather than traditional regularly sampled data. Second, it connects to research on prompt-based learning and parameter-efficient fine-tuning, introducing a novel variant—Stochastic Soft Prompting—that aligns with broader trends in reducing adaptation cost for large models. Finally, the work challenges the conventional reliance on specialized architectures like temporal point processes by demonstrating that general-purpose language models, when properly prompted, can handle a wide range of temporal reasoning tasks, contributing to the growing movement toward more unified, flexible modeling approaches across domains.
Essential References Not Discussed
N/A
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
N/A
We sincerely thank the reviewer for their detailed review. We appreciate the recognition of our method's novelty, the thoroughness of our experimental evaluation, and the relevance of our chosen tasks and benchmarks. We also thank the reviewer for highlighting how our work aligns with broader scientific trends—adapting LLMs to new domains, advancing parameter-efficient fine-tuning techniques, and challenging the reliance on specialized architectures.
We acknowledge the two specific areas of improvement identified and address them below:
- Relatively weaker time prediction performance compared to TPP models: We agree with the reviewer; however, our design prioritizes simplicity, general applicability across tasks (forecasting, imputation, and anomaly detection), and effective use of diverse natural language event descriptions—without explicitly modeling time. This leads to our model ranking as best on 13 and in the top 2 on 17 out of 18 evaluations in Table 2. Introducing explicit time modeling is a valuable next step, which we are actively exploring as future work.
- We answer this in two parts:
- Anomaly detection baselines: This task remains underexplored in the context of asynchronous time series. While we have adapted some time series methods (e.g., Chronos, LLM Processes) to the asynchronous setting for forecasting and imputation, they cannot easily be extended to anomaly detection: they are heavily forecasting-focused, and anomaly detection is not easily recast as a forecasting problem. We see our work as an early step toward bridging this gap.
- Inclusion of industrial datasets: We used publicly available datasets from standard benchmarks widely adopted in the literature to ensure fair comparison with both traditional TPP models and modern time series forecasting methods. We agree that incorporating additional real-world industrial datasets could further strengthen the evaluation.
We thank the reviewer again for their insightful feedback, and are happy to answer any additional questions.
The paper models event sequences by fine-tuning pretrained language models. It proposes SToP, a stochastic soft prompting method that adapts the trained model to downstream tasks and outperforms several baseline methods. The paper is interesting to the subfield at the intersection of LLMs and time series.
All the reviewers gave positive feedback on this paper, recognizing its technical design, experimental design, and interesting results.
I recommend accepting this paper.