PaperHub
Rating: 7.8 / 10 · Poster · 4 reviewers (min 4, max 5, std 0.4)
Scores: 4, 5, 5, 5
Confidence: 4.0 · Novelty: 3.0 · Quality: 3.0 · Clarity: 3.0 · Significance: 3.0
NeurIPS 2025

TRACE: Grounding Time Series in Context for Multimodal Embedding and Retrieval

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Abstract

Keywords: time series analysis, multimodal retrieval

Reviews and Discussion

Review (Rating: 4)

The paper proposes a multimodal framework that aligns time series data with corresponding textual context for cross-modal retrieval, forecasting, and classification. It designs tokens during the encoder pretraining stage that are amortized across downstream tasks. It also introduces a dual-level alignment strategy using channel identity tokens and contrastive learning at both the sample and channel levels. As such, it supports Text-to-TS and TS-to-Text cross-modal retrieval and also acts as a plug-and-play module for retrieval-augmented forecasting and classification, whose efficacy is validated in the experiments.

Strengths and Weaknesses

Strengths:

  1. The proposed framework formulates and tackles cross-modal retrieval between time series and texts, which enables multiple downstream tasks. The experiments with ablations demonstrate its efficacy.

  2. The idea of dual-level contrastive learning for cross-modal alignment, including the design of hard negative mining, is well-motivated and somewhat novel in terms of enabling fine-grained semantic retrieval between time series and texts.

Weaknesses:

  1. The availability of aligned time series and textual data (including not only global descriptions but also channel-specific descriptions) is assumed, which limits deployment in real-world scenarios such as finance, where texts and the stock prices of financial entities are not highly correlated, and industrial sensors.

  2. This paper evaluates its method and baselines on multiple multimodal time series datasets. However, it lacks discussion of, and baseline comparisons with, several multimodal modeling methods in the forecasting and classification subtasks. For example, the TimeCMA [1] method also proposes channel-wise similarity retrieval in cross-modality alignment that facilitates forecasting performance.

  3. The design of channel-biased attention and RoPE for encoding channel dependencies is not easy to follow, and its motivation in terms of multimodal modeling is somewhat unclear.

[1] TimeCMA: Towards LLM-Empowered Multivariate Time Series Forecasting via Cross-Modality Alignment

Questions

  1. Please refer to weaknesses 1 and 2:

1-1) The assumption of highly aligned time series–text pairs is a bit too strong in real-world time series contexts. Do the authors consider any countermeasures to mitigate this issue?

1-2) Could the authors provide more evaluations of the aforementioned baselines that share a similar idea in terms of cross-modal retrieval?

  2. For the evaluation in Table 1, the encoder of TRACE is replaced by other foundation models with frozen parameters, demonstrating significant performance gains. Did the authors consider other moderate-sized time series encoders, such as PatchTST or iTransformer, during the pre-training stage?

Limitations

Yes

Justification for Final Rating

Some of my concerns were resolved during the rebuttal. The discussion of related methods and the clarity still need improvement.

Formatting Concerns

N/A

Author Response

Thanks for your constructive feedback. We have addressed your concerns as follows.

Weakness 1: The availability of aligned time series and textual data may limit its deployment in real-world scenarios.

While perfectly aligned multimodal datasets may require additional effort, such data are increasingly accessible in practice. For instance, TimeMMD [2] already provides multi-domain, multimodal time series data across the Health, Energy, and Environment sectors. Furthermore, many time series datasets (e.g., weather, finance, traffic) can be enriched with textual context through simple and scalable methods, including web scraping, report parsing, or LLM-based description generation (as we did in our curated NOAA dataset). Therefore, multimodal alignment is practical with modest preprocessing (we refer to Appendix B for our detailed preprocessing steps on the weather dataset).

In addition, we note the growing trend in both academia and industry towards integrating language and time series for improved performance in forecasting and reasoning tasks. As the benefits of multimodal modeling become more evident, we expect more such datasets to emerge. TRACE offers a general and extensible framework that not only works with existing aligned datasets but also encourages scalable dataset construction and integration of textual context into time series modeling.

Weakness 2 and Question 1.2: The paper omits comparison with multimodal methods like TimeCMA, which proposes channel-wise retrieval for cross-modal alignment and improved forecasting.

The core objectives, architectures, and applications of TRACE and TimeCMA are fundamentally different. Specifically, TimeCMA's reliance on LLM-generated synthetic prompts for multivariate time series forecasting presents a key limitation: the lack of genuine alignment between different modalities. TimeCMA generates its own textual modality by wrapping numerical data in a prompt. This means the "textual" information isn't independently sourced or inherently meaningful in the same way an actual weather narrative would be. TimeCMA is inherently built for forecasting and classification tasks. Achieving cross-modal retrieval with TimeCMA would require substantial model modifications, making direct comparisons to dedicated retrieval models challenging. We will add a more detailed discussion on TimeCMA as relevant work in our revision.

Weakness 3: The design of channel-biased attention and RoPE for modeling channel dependencies is not easy to read and not clearly motivated in the context of multimodal modeling.

Due to space constraints, we used compact in-line formulations for both the channel-biased attention mask and RoPE integration. We acknowledge this and will provide detailed formulation and motivational context in our revision.

Here, we elaborate on the motivation behind both components. Our goal is to build interpretable and modular representations by encouraging each Channel Identity Token (CIT) to first abstract information from its own channel (via masked attention to intra-channel patch tokens). This ensures that each CIT captures localized semantics independently, which is crucial in multimodal alignment where channel-level descriptions may map to different textual spans. The global CLS token, which is allowed to attend across all tokens, then aggregates these per-channel abstractions to enable inter-channel reasoning. This staged structure balances local disentanglement (for channel grounding) and global integration (for sample-level alignment and retrieval).
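This staged attention structure can be made concrete with a small boolean mask. The sketch below is illustrative only: it assumes a token layout of C·P patch tokens followed by C CITs and one CLS token, and it assumes unrestricted attention among patch tokens, which may differ from the paper's exact mask.

```python
import numpy as np

def channel_biased_mask(C, P):
    """Boolean attention mask (True = attention allowed) for an assumed layout:
    C*P patch tokens, then C channel identity tokens (CITs), then one CLS.
    Each CIT attends only to its own channel's patch tokens (and itself);
    the global CLS token attends to everything, enabling inter-channel reasoning."""
    n = C * P + C + 1
    mask = np.zeros((n, n), dtype=bool)
    mask[:C * P, :C * P] = True              # assumption: patch tokens attend freely
    for c in range(C):
        cit = C * P + c
        mask[cit, c * P:(c + 1) * P] = True  # CIT -> its own channel's patches only
        mask[cit, cit] = True
    mask[n - 1, :] = True                    # CLS aggregates across all tokens
    return mask
```

For example, with C=2 channels and P=3 patches per channel, the CIT of channel 0 (row 6) can attend to patch tokens 0–2 but not to channel 1's patches 3–5.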

RoPE injects relative positional information while preserving sequence length generalization, and we apply it independently within each channel. This ensures that each CIT benefits from phase-aware encoding of local temporal dynamics, which is especially important when channel-level trends vary (e.g., temperature vs. humidity). Unlike absolute or learned position embeddings, RoPE is scale-invariant and better suited for cross-modal alignment across variable-length sequences.
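As a minimal sketch of this idea, standard RoPE can be applied with positions that restart at 0 within each channel; the rotation formula below is the usual one, and the per-channel application and default base are our reading of the description above, not the paper's exact implementation.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Standard rotary position embedding for x of shape [T, d] (d even) at
    integer positions `pos` of shape [T]. Under dot products, the resulting
    phases depend only on relative position differences."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # per-dimension rotation frequencies
    angles = np.outer(pos, freqs)               # [T, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_per_channel(patches):
    """Apply RoPE independently per channel: patches has shape [C, P, d], and
    positions restart at 0 in each channel, so each CIT sees phase-aware
    encoding of only its own channel's local temporal dynamics."""
    C, P, d = patches.shape
    pos = np.arange(P)
    return np.stack([rope_rotate(patches[c], pos) for c in range(C)])
```

Because each position is a pure rotation, token norms are preserved and attention scores depend on relative offsets, which is the generalization property referred to above.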

We will better communicate these motivations to aid readability in our revision.

Question 1.1: The assumption regarding highly aligned time series - text pairs, is a bit too strong in real-world time series contexts.

While full alignment is ideal, TRACE is designed to be robust in semi- or weakly-aligned settings, and we already implement several countermeasures:

  • Synthetic descriptions from raw TS: In cases where aligned text is unavailable, channel-level descriptions can be automatically generated using LLMs with structured prompts based on raw statistics (e.g., min/max/mean/volatility). This is how we constructed our NOAA dataset (see Appendix B.4).
  • Optional text conditioning: TRACE can operate in TS-only mode during inference, or use retrieved text from similar samples, even if not perfectly aligned. This makes it applicable to domains like finance or industry where explicit text pairs are sparse.
  • Flexible supervision: Our contrastive training allows text–TS alignment even with partial or noisy matches, encouraging shared semantics without requiring perfect ground-truth pairing.
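To make the contrastive training concrete, here is a generic symmetric InfoNCE over in-batch pairs. This is a standard formulation, not the paper's exact dual-level loss, temperature, or hard-negative mining scheme.

```python
import numpy as np

def log_softmax(logits):
    # numerically stable row-wise log-softmax
    m = logits.max(axis=1, keepdims=True)
    return logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))

def info_nce(z_a, z_b, tau=0.07):
    """Symmetric InfoNCE for paired embeddings z_a, z_b of shape [B, d].
    Row i of z_a is the positive for row i of z_b; the other rows act as
    in-batch negatives, so partial or noisy matches only soften the signal."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau
    diag = np.arange(len(z_a))
    loss_ab = -log_softmax(logits)[diag, diag].mean()
    loss_ba = -log_softmax(logits.T)[diag, diag].mean()
    return 0.5 * (loss_ab + loss_ba)
```

In a dual-level setup, the same loss would be applied once to sample-level (CLS embedding, report text) pairs and once to flattened channel-level (CIT, channel description) pairs, with a weight balancing the two terms.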

Question 2: Did the authors consider other moderate-sized time series encoders in Table 1?

Yes, we did explore alternative time series encoders like PatchTST and iTransformer during the pre-training stage, and our architectural design in TRACE is, in fact, informed by their respective strengths. Specifically, given a multivariate time series input of shape [N, T], where N is the number of channels and T is the temporal length, PatchTST applies token-wise attention across time (i.e., T × T), treating each channel independently. In contrast, iTransformer focuses on channel-wise attention (N × N), but ignores temporal locality. Both approaches are partially expressive. Instead, TRACE simultaneously attends over channel identity tokens (CITs) and temporal tokens, a practice increasingly standard in time-series foundation models.

This architectural choice is empirically validated. Our early experiments demonstrate that token-only variants (akin to PatchTST) or channel-only variants (akin to iTransformer) struggle to learn discriminative representations for downstream tasks (also shown in Tables 2 and 3). Moreover, dual-level cross-modal alignment consistently fails under these alternatives, as the resulting encoders cannot generate sufficiently distinct channel-level and sample-level embeddings, which are critical for our contrastive alignment framework. We will add these discussions and experimental results to the appendix in our final version.

We thank the reviewer for constructive comments. We hope the above responses clarify your questions. We are more than happy to provide more explanations if needed.

Comment

Thanks for your responses, which address some of my questions and concerns, including the motivation of channel-based design, and the choices of time series encoders. I remain positive about this paper and appreciate the provided clarifications.

Review (Rating: 5)

This paper proposes a generic multimodal retriever, TRACE, that grounds time series in aligned textual descriptions. It supports Text-to-TS and TS-to-Text retrieval modes. By retrieving relevant pairs, TRACE improves predictive accuracy in downstream applications. Besides, it can also serve as an encoder of multimodal time series data. Experiments across several datasets demonstrate the effectiveness of the method.

Strengths and Weaknesses

Strengths:

  1. The idea of retrieving across time series and textual data is compelling.
  2. To enhance the alignment of time series and textual data, it introduces channel-level alignment and hard negative mining strategies.
  3. To evaluate the performance of TRACE, the paper curated a new dataset with multivariate time series.
  4. The experiments are comprehensive and the results demonstrate the effectiveness of TRACE.

Weaknesses:

  1. The proposed method requires curated multimodal datasets, which are not easy to achieve in time series domain.

  2. Although the paper has taken much effort to curate the Weather dataset, it provides no metrics that demonstrate the quality of the dataset intuitively.

  3. More baselines should be included in Table 4, especially multimodal time series models, e.g., TimeCAP [1], Multimodal PatchTST, Multimodal iTransformer from [2]

    [1] TimeCAP: Learning to contextualize, augment, and predict time series events with large language model agents.

    [2] Time-mmd: A new multi-domain multimodal dataset for time series analysis.

Questions

  1. It says that the study uses ChatGPT to generate channel-level textual descriptions for selected TS samples. Have you evaluated the overall quality of the descriptions? How do you ensure the quality of the generated descriptions?
  2. How to obtain the soft token of a time series snippet when TRACE is used as a general-purpose retriever? Does the soft token affect the performance of TRACE?
  3. I also wonder how information leakage is avoided during experiments or dataset construction. For instance, textual descriptions of events may include the ground truth of the forecasting target. Does TRACE implement any mechanisms to prevent such leakage?

Limitations

Yes

Justification for Final Rating

The authors' rebuttal has effectively addressed my initial concerns. The paper presents a novel and compelling idea. Another significant contribution is the curation and introduction of a new multimodal time-series dataset, filling an important gap. The experimental evaluation, now strengthened by the inclusion of additional baselines during rebuttal, provides convincing evidence for the method's effectiveness. Therefore, I increase my rating and recommend acceptance.

Formatting Concerns

None

Author Response

Thanks for your positive feedback on the idea and experimental results. We have addressed your concerns as follows.

Weakness 1: The proposed method requires curated multimodal datasets, which are not easy to achieve in the time series domain.

While perfectly aligned multimodal datasets may require additional effort, such data are increasingly accessible in practice. For instance, TimeMMD [2] already provides multi-domain, multimodal time series data across the Health, Energy, and Environment sectors. Furthermore, many time series datasets (e.g., weather, finance, traffic) can be enriched with textual context through simple and scalable methods, including web scraping, report parsing, or LLM-based description generation (as we did in our curated NOAA dataset). Therefore, multimodal alignment is practical with modest preprocessing (we refer to Appendix B for our detailed preprocessing steps on the weather dataset).

In addition, we note the growing trend in both academia and industry towards integrating language and time series for improved performance in forecasting and reasoning tasks. As the benefits of multimodal modeling become more evident, we expect more such datasets to emerge. TRACE offers a general and extensible framework that not only works with existing aligned datasets but also encourages scalable dataset construction and integration of textual context into time series modeling.

Weakness 2: The paper presents a curated Weather dataset but lacks intuitive or quantitative metrics to demonstrate its quality or reliability.

We provide multiple forms of evidence to support the quality of the curated Weather dataset, detailed in Appendix B. For your reference:

  • Coverage: The dataset includes diverse weather events (e.g., flash floods, heatwaves, snowstorms) across different regions and time periods.
  • Examples: Realistic channel-level textual descriptions are paired with corresponding time series, demonstrating semantic alignment and meteorological plausibility.
  • Statistics: With over 74,000 samples across 7 variables, it is among the largest publicly available multimodal weather datasets.

Weakness 3: More baselines should be included in Table 4, especially multimodal time series models.

  • For TimeCAP: We have already reached out to the authors to request the official code and are awaiting a response to ensure a fair and reproducible evaluation.
  • For Multimodal PatchTST and iTransformer: We have implemented these models based on MM-TSlib and included initial results on the Weather dataset (see below). It is worth noting that models in Table 4 do not take textual context as input for time series forecasting. Therefore, we also implemented TRACE w. Text by concatenating the textual embedding with the time series for downstream forecasting. Our experiments show that TRACE still outperforms them across multiple forecasting horizons (H=7 and H=24) in multimodal scenarios.
Model                      H=7 MAE ↓   H=7 MSE ↓   H=24 MAE ↓   H=24 MSE ↓
Multimodal PatchTST        0.478       0.589       0.523        0.636
Multimodal iTransformer    0.453       0.561       0.508        0.616
TRACE w. Text              0.458       0.557       0.500        0.609

We commit to including comprehensive comparisons with all three models, including TimeCAP, once the official implementation becomes available. These results will be added to the camera-ready version of the paper.

Question 1: Have you evaluated the overall quality of the descriptions? How do you ensure the quality of the generated descriptions?

To ensure both factual accuracy and semantic utility, we applied a controlled generation pipeline with multiple layers of validation:

  • Structured Prompting with Constraints: As detailed in Appendix B.4, we designed a strict template that limits generation to factual, numeric summaries per channel (e.g., min, max, volatility, trends), and enforces keyword annotations from a closed set (e.g., {Clear, Rainy, Foggy}). The prompt explicitly avoids speculative or narrative content.

  • Grounded Generation Using Raw Data: The full time series values are provided as input to ChatGPT, which ensures that every generated description is directly derived from actual statistics, thus enabling traceability and reproducibility.

  • Qualitative Examples: Appendix B.4 provides representative samples. For instance:
    “The temperature ranged from a low of -21.57°C to a high of 26.44°C, with noticeable trends showing a gradual warming as the months progressed from winter to summer, indicating a clear seasonal transition. ” These examples illustrate coherent, granular, and faithful text grounded in underlying trends.

  • Empirical Validation: The alignment between text and time series is quantitatively validated in Section 4.2. TRACE achieves high cross-modal retrieval accuracy (e.g., Text-to-TS P@1 of 89.63%), and the UMAP plots in Appendix D.3 show tight clustering between paired text and time series embeddings, hence indirectly confirming consistency.

  • Scale and Balance: As shown in Appendix B.5, over 74,000 multivariate samples were processed. Channel-level text was generated for all event-labeled samples and half of the non-event samples to ensure diversity and reduce leakage.

We agree with the reviewer that showcasing a side-by-side comparison with real descriptions would further strengthen the case, and we will include such qualitative comparisons in the camera-ready version.
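For reference, the P@1 metric cited above can be computed as in the generic sketch below, using cosine similarity over paired query and candidate embeddings; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def precision_at_1(q_emb, c_emb):
    """Cross-modal retrieval P@1 sketch: row i of q_emb (e.g., text queries) is
    paired with row i of c_emb (e.g., time series candidates). P@1 is the
    fraction of queries whose nearest candidate by cosine similarity is the
    true paired item."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    c = c_emb / np.linalg.norm(c_emb, axis=1, keepdims=True)
    top1 = (q @ c.T).argmax(axis=1)        # index of most similar candidate
    return (top1 == np.arange(len(q))).mean()
```

Perfectly aligned embeddings give P@1 = 1.0; a systematic misalignment drives it toward 0.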

Question 2: How to obtain soft tokens from time series and how these tokens affect TRACE’s performance?

The process of obtaining the soft tokens from a set of retrieved time series snippets (with their associated textual descriptions) is detailed in Appendix D.5 in the supplementary. After retrieving the top-R relevant time series–text pairs, TRACE encodes both modalities into embeddings: h_ts^(i) for the time series (from the [CLS] token) and z_text^(i) for the text, using a frozen SentenceTransformer. These are fused via a trainable MLP over [h_ts^(1); z_text^(1); ...; h_ts^(R); z_text^(R)] to form the final soft tokens.

Importantly, this soft token is not part of TRACE itself, but serves as a lightweight augmentation signal for any downstream time series foundation model. Therefore, the performance of TRACE itself is not affected by the learned soft token. The purpose of soft tokens is to demonstrate that TRACE, as a retriever, can produce semantically rich, task-relevant representations to condition external models in retrieval-augmented settings. As shown in Table 2, stronger soft tokens provide richer semantic cues, leading to better predictions, particularly in models like Time-MoE and Timer-XL. This highlights TRACE’s utility as a plug-and-play retriever that delivers highly informative prompts for downstream time series models.
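A rough sketch of the fusion step described above: concatenate the R retrieved pair embeddings and pass them through a small MLP whose output is reshaped into a fixed number of soft tokens. The dimensions, the ReLU MLP, and the token count here are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def make_soft_tokens(h_ts, z_text, W1, b1, W2, b2, n_tokens=4):
    """Fuse R retrieved (time series, text) embedding pairs into soft tokens.
    h_ts: [R, d_ts], z_text: [R, d_txt]. The concatenation order mirrors
    [h_ts^(1); z_text^(1); ...; h_ts^(R); z_text^(R)] from the response above."""
    fused_in = np.concatenate([np.concatenate([h, z]) for h, z in zip(h_ts, z_text)])
    hidden = np.maximum(0.0, fused_in @ W1 + b1)   # 2-layer ReLU MLP (assumed)
    out = hidden @ W2 + b2                         # [n_tokens * d_model]
    return out.reshape(n_tokens, -1)               # soft tokens for a frozen decoder

# illustrative usage with made-up sizes: R=2 pairs -> 4 soft tokens of width 8
rng = np.random.default_rng(0)
R, d_ts, d_txt, d_model, n_tok = 2, 4, 4, 8, 4
h_ts  = rng.normal(size=(R, d_ts))
z_txt = rng.normal(size=(R, d_txt))
W1 = rng.normal(size=(R * (d_ts + d_txt), 32)); b1 = np.zeros(32)
W2 = rng.normal(size=(32, n_tok * d_model));    b2 = np.zeros(n_tok * d_model)
soft = make_soft_tokens(h_ts, z_txt, W1, b1, W2, b2, n_tokens=n_tok)
```

The resulting `soft` array would be prepended to a downstream model's input sequence as a lightweight conditioning signal.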

Question 3: How does TRACE prevent information leakage in textual description?

Yes. TRACE implemented specific safeguards during dataset construction and modeling to prevent any information leakage within the textual description as follows.

  • Time Series Anchoring: As described in Appendix B.1–B.2, each input time series is truncated to end exactly at the event start time (or the reference time for non-events). Forecasting targets lie strictly in the future, beyond the input window.

  • No Use of Post-Event Text in Training: While NOAA event reports may contain post-hoc narrative content, TRACE does not use these raw reports directly in retrieval or training. Instead, the only text used during alignment and RAG stages are:

    • (1) Channel-level descriptions synthesized from the observed input window only, and
    • (2) Sample-level context summaries, similarly generated or aligned to pre-event data.
  • Synthetic Description Control: As detailed in Appendix B.4, LLM-generated descriptions are constrained to summarize only the observed (pre-event) portion of the time series. No future values are included in the generation prompt.

This setup ensures strict temporal alignment and rules out any leakage of ground truth labels or future signals into the input space. TRACE’s design adheres to the forecasting causality constraint throughout training and evaluation.

We thank the reviewer for constructive comments. We hope the above responses clarify your questions. We are more than happy to provide more explanations if needed.

Comment

Thank you for the response. I appreciate the clarification on dataset curation, data quality control, and soft tokens. The authors also provide more results of multimodal time series models.

Comment

Dear reviewer,

Thank you for your thoughtful review! Please take a moment to acknowledge the authors' rebuttal and specify if it addresses your concerns.

Best, AC

Review (Rating: 5)

The paper proposes TRACE, a multimodal retrieval framework designed for aligning and embedding time-series data with textual context. Recognizing that time-series data in real-world scenarios (e.g., healthcare, weather, and energy) frequently accompanies domain-specific textual information, TRACE aims to bridge this multimodal gap through semantically grounded embeddings. TRACE uses a two-stage training process: first pre-training a time-series encoder via masked reconstruction with specialized Channel Identity Tokens (CITs) and Channel-biased Attention, followed by fine-grained cross-modal alignment using dual-level hard negative mining. The model serves two functions: first, as a standalone encoder offering state-of-the-art performance for forecasting and classification, and second, as a general-purpose retriever enabling retrieval-augmented generation (RAG) to enhance downstream models. Extensive experiments confirm TRACE's effectiveness across multiple retrieval and forecasting scenarios, surpassing several state-of-the-art methods.

Contributions:

  • Novel Multimodal Retriever between TS-Text: TRACE is a novel multimodal retriever explicitly designed to ground time-series embeddings with fine-grained textual context at both channel and sample levels.

  • Superior Performance: Extensive empirical validation demonstrates that TRACE consistently achieves state-of-the-art results. It outperforms existing SOTA models in cross-modal retrieval tasks (both text-to-time-series and vice versa). In retrieval-augmented generation setups, it boosts downstream model performance.

  • Hierarchical Hard Negative Mining: The paper introduces a unique hierarchical mining strategy that effectively discriminates between semantically similar but contextually divergent negatives, enhancing the semantic precision of embeddings.

  • Versatile Integration: TRACE’s architecture enables straightforward and efficient integration with existing time-series foundation models through retrieval-augmented generation, requiring minimal additional fine-tuning.

Strengths and Weaknesses

Strengths

  • Clear Problem Definition and Proposed Solution: Paper identifies the gap that time-series foundation models are usually unimodal and NLP‑style RAG cannot ingest continuous signals, then supplies a two‑stage pipeline and an additional RAG stage that closes it.

  • Architectural & Training Novelty: Channel‑identity tokens + channel‑biased attention give each variable a fixed semantic slot. Proposed dual‑granularity InfoNCE aligns both pairs of CLS and report text, CIT and channel text. Soft‑prompt RAG injects retrieved multimodal context into frozen decoders with a small number of extra parameters.

  • Broad empirical coverage. Tests three tasks (retrieval, forecasting, classification) on four different domains and compares against both scratch networks and six public foundation models (Chronos, Timer‑XL, Time‑MoE, Moment, etc.).

  • Achieves substantially superior empirical performance compared to prior state-of-the-art: Retrieval precision improves significantly, from the best prior 55.73% to 90.08% P@1 for TS-to-Text and from 64.67% to 89.63% for Text-to-TS. It has the smallest forecasting error (MAE and MSE) across datasets from different domains most of the time. By fine-tuning a pretrained model, TRACE raises classification accuracy notably, by more than 10%, surpassing prior methods.

Weaknesses:

  • Synthetic or Overly Simple Channel Texts: The channel-level textual descriptions are either artificially generated or probably extremely short (sometimes just sensor IDs), reducing external validity. Real-world scenarios involving richer, natural-language texts might challenge the proposed dual-level alignment method.

  • Lack of Purpose-built Cross-modal Baselines: Some comparisons in the experiments are restricted to adapted models (e.g., Timer-XL, Moment) not originally designed for cross-modal alignment. Missing are purpose-built baselines (e.g., TimeRAF), potentially inflating reported retrieval performance gains.

  • Quadratic Complexity of Hard-negative Mining: The hierarchical negative mining strategy seems to scale quadratically with batch size (O(B²)), potentially becoming a computational bottleneck for large-batch training scenarios. No acceleration methods or approximations are discussed.

  • No Support for Irregular or Sparse TS Data: The Rotary Positional Encoding (RoPE) approach implicitly assumes regular global time steps, and there is no experimental validation provided for event-driven or sparsely sampled time series. It may limit TRACE's applicability.

Questions

Q1: Channel‑Identity Tokens scale linearly with the number of variables. How does training/inference memory behave if C >> 1000 (e.g. high‑frequency trading data)? Is it possible that CITs be compressed (hashing, shared prototypes) without losing retrieval fidelity?

Q2: Please correct me if I am wrong. Rotary PE assumes a single global sampling interval. How would the encoder handle irregularly-sampled or event-driven series? Can RoPE be replaced by other position embedding techniques to address this issue?

Q3: In Stage 1, the masked‑reconstruction loss predicts raw values (MSE). Would frequency‑domain or season‑trend decomposition losses produce richer features and better alignment downstream?

Q4: Channel‑biased attention forbids CITs from attending to other channels. Have you tried soft gating instead of a hard mask so causal inter‑dependencies aren’t completely blocked?

Q5: Instead of averaging retrieved CLS/text vectors, have you explored attention‑based fusion or set transformers so the model can weigh items by relevance?

Limitations

The authors have partially addressed the limitations and potential negative societal impact of their work.

Limitations

  • Quantitative Robustness to Noisy Texts: Missing explicit quantitative analysis demonstrating how performance degrades when textual labels are noisy or auto-generated.
  • Recommendation: Provide clear experiments quantifying robustness or degradation under noisy or missing text conditions in the appendix.

Justification for Final Rating

Thanks to the authors for the detailed rebuttal. Most of my concerns have been answered or addressed. Raising my rating to Accept.

Best, Reviewer Lbrb

Formatting Concerns

No concerns.

Author Response

Thanks for your constructive feedback. We have addressed your concerns as follows.

W1: The channel-level descriptions may be overly simplistic, potentially limiting realism and their alignment quality.

To ensure the accuracy and usefulness of LLM-generated channel-level descriptions, we applied a carefully controlled and validated generation process. As detailed in Appendix B.4:

  • Factual Grounding: Each prompt includes raw time series data, and the model is instructed to summarize statistical properties (e.g., min/max, volatility) only from the observed input. This ensures the descriptions are grounded in actual trends, not speculative narratives.
  • Prompt Structure and Constraints: We use a strict prompt format that enforces concise, factual summaries, reducing the chance of hallucination.
  • Qualitative and Empirical Validation: TRACE achieves strong retrieval performance (Text-to-TS P@1 = 89.63%) and high cross-modal alignment in embedding space (Appendix D.3), demonstrating that the synthetic texts are semantically informative and align well with the time series.
  • Scalability and Diversity: As described in Appendix B.5, we generated over 74,000 such descriptions with balanced event/non-event coverage, confirming both scale and robustness.

W2: The experimental comparisons rely on adapted unimodal models rather than purpose-built cross-modal baselines like TimeRAF.

While TimeRAF is a strong retrieval-augmented forecasting method, it is not a cross-modal model, and the method does not involve text encoders or cross-modal contrastive alignment. In contrast, TRACE is explicitly designed for bidirectional and fine-grained alignment between time series and text, with a dual-level (sample + channel) contrastive objective that supports both text-to-TS and TS-to-text retrieval. These tasks are outside the scope of TimeRAF and thus not directly comparable.

To further strengthen our evaluation, we implemented Multimodal PatchTST and Multimodal iTransformer from MM-TSlib and compared them against TRACE w. Text on the Weather dataset:

Model               H=7 MAE ↓   H=7 MSE ↓   H=24 MAE ↓   H=24 MSE ↓
mm-PatchTST         0.478       0.589       0.523        0.636
mm-iTransformer     0.453       0.561       0.508        0.616
TRACE w. Text       0.458       0.557       0.500        0.609

We appreciate the suggestion and will explore adapting retrieval-augmented forecasters like TimeRAF into TRACE’s multimodal retrieval framework in future work.

W3: The hierarchical hard-negative mining strategy has quadratic complexity in batch size (O(B^2)), raising concerns about its scalability and computational cost.

While the full similarity computation for hard-negative mining does scale as O(B^2 d) (sample-level) and O(B^2 C^2 d) (channel-level), we mitigate the practical overhead in several ways:

  • Efficient Hard-Negative Filtering: As described in Appendix D.9.1, we use a hierarchical mining strategy that combines lightweight prototype filtering with partial sorting. This reduces the actual mining cost to O(B log R + BC log R), which is negligible compared to full pairwise similarity computation and does not scale quadratically in practice.

  • Empirical Efficiency: Table 7 (Appendix D.9.2) shows that TRACE trains faster than larger baselines like FSCA and Moment (6.05s vs. 11.7s for Moment), while activating only 0.12M parameters during fine-tuning. This confirms that our dual-level contrastive training, including hard-negative mining, is not a bottleneck at realistic batch sizes.

  • Future Scalability Plans: We plan to integrate scalable approximations for even larger settings, including:

    • Memory-bank-based mining (e.g., MoCo-style gradient caching),
    • Approximate nearest neighbor (ANN) search using tools like FAISS,
    • Stochastic sub-batch or masked-channel mining, reducing sampling complexity to O(B log B) or better.

These extensions can be integrated modularly into TRACE's framework without architectural changes.
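As one concrete instance of partial sorting for top-k selection, in-batch hard negatives can be picked with `np.argpartition`, which avoids a full sort per anchor. This is a generic top-k idiom, not the paper's exact hierarchical procedure with prototype pre-filtering.

```python
import numpy as np

def mine_hard_negatives(sim, R):
    """Given an in-batch similarity matrix sim of shape [B, B] whose diagonal
    holds the positive pairs, return the indices of the R most similar
    (hardest) negatives per anchor. np.argpartition does a partial sort, so
    per-row selection costs O(B) rather than the O(B log B) of a full sort."""
    s = sim.copy()
    np.fill_diagonal(s, -np.inf)                  # exclude positives from mining
    return np.argpartition(-s, R, axis=1)[:, :R]  # top-R negatives (unordered)
```

The returned indices are unordered within the top-R set, which is sufficient when all mined negatives enter the contrastive loss with equal weight.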

W4 and Q2: RoPE assumes regular sampling intervals. This may limit its applicability to irregularly sampled or event-driven time series.

It is correct that RoPE, in its standard form, assumes regularly sampled sequences. However, we would like to emphasize:

  • Common Assumption in the Literature: This is a shared limitation among most Transformer-based time series models (e.g., TSFormer, FSNet, Timer-XL), which use absolute or relative positional encodings tailored to uniformly sampled data. Our focus on regular time series aligns with the standard evaluation setting adopted in prior works, including those in TimeMMD and weather forecasting.

  • Modular Design of TRACE: TRACE is designed with modularity in mind. RoPE is not a fundamental dependency—it was selected due to its superior performance in structured, densely sampled domains (e.g., weather, energy). The architecture can be extended to support irregular or sparse data by swapping in alternative temporal encoding mechanisms.

  • Future Extensions: TRACE can readily incorporate time-aware mechanisms such as:

    • Delta-timestamp attention bias (e.g., Informer, RAIM),
    • Learned timestamp embeddings or time-to-vector encoders (e.g., Neural ODE-style Time2Vec),
    • Adapted RoPE with relative phase shifts or interpolation strategies to handle sparse or non-uniform time steps.

We plan to explore these directions in future work to broaden TRACE’s applicability to irregularly sampled or event-driven settings, and appreciate the reviewer’s suggestion to highlight this in the final version.
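For concreteness, here is a minimal numpy sketch of standard rotary position encoding on a uniformly sampled sequence (illustrative only, not the paper's exact implementation). Replacing the integer index `pos = np.arange(T)` with actual timestamps is one simple way to adapt it to irregular sampling, since the rotation phase then depends on elapsed time rather than position index.

```python
import numpy as np

def rope(x, pos=None, base=10000.0):
    """Apply rotary position encoding to x of shape (T, d), with d even.

    Each feature pair (2k, 2k+1) is rotated by the angle pos * base**(-2k/d).
    With pos = np.arange(T) this is standard RoPE (regular-sampling assumption);
    passing real timestamps instead is a simple irregular-sampling variant.
    """
    T, d = x.shape
    if pos is None:
        pos = np.arange(T)                              # regular sampling
    k = np.arange(d // 2)
    theta = np.outer(pos, base ** (-2 * k / d))         # (T, d/2) rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                  # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each feature pair is merely rotated, token norms are preserved, and the dot product between two rotated tokens depends only on their relative position (or relative timestamp in the irregular variant).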

Q1: How does training/inference memory behave if C >> 1000?

We thank the reviewer for the insightful question on scalability. While our CITs scale linearly with the number of channels, handling extreme-channel scenarios (C >> 1000) is a broader challenge for the entire time series field. Such inputs create ultra-long token sequences that exceed the capacity of current Transformer-based time-series models. We agree this is a crucial area for future research. Fortunately, TRACE's modular design is well-suited for future extensions to address this, such as by compressing channels via clustering or sharding them into manageable groups. We consider this a promising direction for future work.

Q2: Please refer to our response to Weakness 4

Q3: Would incorporating frequency-domain or season–trend decomposition losses help the downstream alignment?

This is an insightful question. Indeed, frequency-domain or season–trend-aware objectives can sometimes promote better modeling of temporal structures. However, we note the following:

  • Decomposition as Preprocessing: In many models (e.g., PatchTST, DLinear), seasonal–trend decomposition (e.g., STL, Fourier transforms) is applied as a preprocessing step rather than as part of the loss function. This modularity allows for flexible architecture choices and avoids hard-coding inductive biases into training.

  • TRACE’s masked MSE loss operates directly on raw values and does not assume any specific temporal structure (e.g., periodicity or stationarity). This makes it more broadly applicable across diverse domains and datasets, especially where such assumptions may not hold (e.g., medical signals, financial trends).

  • As shown in Table 1 (retrieval) and Table 3 (classification), as well as the UMAP visualization in Appendix D.3, TRACE achieves strong representation quality and alignment performance with the current loss setup. The simplicity and generality of the masked MSE loss proved to be both stable and effective in our experiments.

That said, exploring season–trend-aware objectives as auxiliary tasks or plug-in modules could be a promising extension, and we thank the reviewer for suggesting this direction.
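The structure-agnostic objective described above is simple to state. A minimal sketch (hypothetical shapes; numpy in place of the actual PyTorch loss) shows that the loss averages squared error only over the positions hidden from the encoder, with no periodicity or stationarity term:

```python
import numpy as np

def masked_mse(pred, target, mask):
    """MSE computed only on masked (reconstructed) positions.

    pred, target: (B, T) arrays of raw values; mask: boolean (B, T),
    True where the value was hidden from the encoder. No frequency-domain
    or season-trend assumption enters the objective.
    """
    diff2 = (pred - target) ** 2
    return diff2[mask].mean()
```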

Q4: Would soft gating allow better modeling of causal interdependencies?

We thank the reviewer for this observation. As correctly noted, our current design employs a hard attention mask where each Channel Identity Token (CIT) is restricted to attend only to patch tokens within its own channel. This enforces channel-specific abstraction, which is beneficial for alignment and interpretability.

However, the reverse is not restricted—patch tokens are allowed to attend to all CITs. This design enables implicit cross-channel interactions through shared CITs, allowing the model to propagate interdependencies while maintaining a disentangled structure. We agree that introducing soft gating mechanisms, such as learnable attention masks, cross-channel adapters, or attention dropout, could allow more flexible modeling of causal relationships across channels without fully discarding the benefits of CIT-driven modularity.

We consider this an avenue for future work and thank the reviewer for the constructive suggestion.
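The asymmetric attention pattern described above can be made concrete with a toy mask construction. The token layout here is hypothetical (one [CLS], then C CITs, then C·L patch tokens grouped by channel; `True` = attention allowed) and only illustrates the stated rules, not the paper's exact implementation:

```python
import numpy as np

def build_cit_mask(C, L):
    """Attention mask for [CLS] + C channel identity tokens + C*L patch tokens.

    Rules (as described in the rebuttal):
      - [CLS] attends to everything.
      - CIT c attends only to the L patch tokens of channel c (and itself).
      - Patch tokens attend to all CITs and all patch tokens.
    Hypothetical layout: [CLS | CIT_0..CIT_{C-1} | patches ch.0 | ... | ch.C-1].
    """
    N = 1 + C + C * L
    mask = np.zeros((N, N), dtype=bool)
    mask[0, :] = True                        # [CLS] sees everything
    patches_start = 1 + C
    for c in range(C):
        cit = 1 + c
        lo = patches_start + c * L
        mask[cit, cit] = True                # CIT attends to itself...
        mask[cit, lo:lo + L] = True          # ...and only its own channel's patches
    mask[patches_start:, 1:] = True          # patches see all CITs and patches
    return mask
```

A soft-gating variant would replace the boolean entries in the CIT rows with learnable logits, relaxing the hard restriction while keeping the same layout.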

Q5: Have you considered using attention-based fusion to better weigh relevance among retrieved items?

We appreciate the reviewer’s suggestion. While more expressive fusion mechanisms (e.g., Set Transformers, gated attention, or cross-attention pooling) may give certain benefits, we chose mean pooling for its simplicity, training stability, and computational efficiency. This choice aligns with our paper’s core focus on retrieval and alignment quality, rather than optimizing the fusion architecture.

Importantly, TRACE is designed to be modular. The aggregated embedding (i.e., soft prompt) is decoupled from the encoder and can be easily replaced with more advanced fusion modules, such as attention-based pooling or set-wise encoders, without retraining the time-series or text encoders. We consider exploring such plug-and-play fusion enhancements a promising direction for future work.
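The decoupling claimed above can be sketched in a few lines. Both fusers below map k retrieved embeddings to a single soft prompt; the names and the similarity-weighted variant are hypothetical illustrations of a drop-in replacement, not the paper's implementation:

```python
import numpy as np

def fuse_mean(retrieved):
    """Mean-pool k retrieved embeddings (k, d) into one soft prompt (d,)."""
    return retrieved.mean(axis=0)

def fuse_weighted(query, retrieved, tau=0.1):
    """A drop-in alternative: softmax over query-item similarity,
    so more relevant retrieved items contribute more to the prompt."""
    sims = retrieved @ query / tau
    w = np.exp(sims - sims.max())            # stable softmax weights
    w /= w.sum()
    return w @ retrieved
```

Either function produces a `(d,)` soft prompt, so swapping one for the other requires no change to the frozen encoders downstream; with uniform similarities the weighted variant reduces exactly to mean pooling.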

Review
5

The authors propose a new multimodal retrieval system, TRACE, which implements Channel Identity Tokens (CITs) within a masked autoencoder framework to capture multivariate time series patterns while maintaining channel disentanglement, and applies hierarchical hard-negative mining at both the channel and sample levels for cross-modal alignment. This work addresses the critical gap between time series analysis and semantic grounding. The key technical contributions of the paper are:

  1. The framework introduces a comprehensive system for cross-modal retrieval between time series and text.
  2. The framework introduces RAG capabilities for time series foundation models through soft token interfaces.

The paper provides a comprehensive evaluation across custom weather datasets and TimeMMD benchmark subsets, with baseline comparisons against both traditional time series models and recent foundation models. The inclusion of extensive ablation studies that validate architectural choices and hyperparameter sensitivity instills confidence in the system's design and performance.

Strengths and Weaknesses

Strengths

  1. TRACE can be used for downstream tasks (forecasting and classification) in both scenarios - external data is available (RAG mode), or no external database exists (standalone decoder mode). This dual functionality implies that the learned representations are robust and can be helpful in multiple deployment scenarios.
  2. TRACE allows frozen base models to incorporate retrieved context without architectural modifications.
  3. The learned multimodal embedding space is already aligned with textual descriptions, reducing the dependency on additional textual signals.
  4. TRACE demonstrates superior performance across multiple evaluation dimensions: cross-modal retrieval, RAG enhancement, and standalone prediction.

Questions

  1. What is the optimal ratio between the flattened sequence length L and the original temporal length T for different types of time series data? How does this ratio affect model performance, and are there recommended guidelines for practitioners? Do you observe performance degradation when L >> T (over-segmentation) or when L << T (under-segmentation)?

  2. In Figure 4, why are Relative PE results missing at d=768?
    a. Were the results excluded due to poor performance?
    b. Were there implementation difficulties?

  3. How is CLS token initialized and learned to aggregate global information?

Limitations

Yes

Formatting Issues

No formatting issues found

Author Response

Thanks for your constructive feedback. We have addressed your concerns as follows.

Question 1: What is the optimal ratio between $L$ and $T$ for different types of time series? How does this ratio affect performance, and are there any practitioner guidelines?

The ratio between the flattened token sequence length $L$ and the original temporal length $T$ is controlled by the patch length $P$ (i.e., $L = T/P$). As shown in Figure 4, we conduct a detailed ablation over $P$ and observe that mid-range patch lengths ($P = 6$–$8$) consistently yield the best performance across tasks. This corresponds to a flattened ratio $L/T \approx 1/6$ to $1/8$, though the exact value depends on the number of channels and sequence length.

Key observations:

  • Over-segmentation ($L \gg T$ / very small $P$): Leads to overly fine-grained tokenization, exposing the model to high-frequency noise and increasing computational cost. This often results in overfitting and unstable training.

  • Under-segmentation ($L \ll T$ / very large $P$): Causes excessive compression and loss of temporal detail, harming both reconstruction quality and fine-grained alignment.

We recommend tuning $P$ based on the dominant temporal resolution of the dataset (e.g., shorter $P$ for hourly signals, longer $P$ for daily/monthly). Moderate patching offers a good trade-off between local detail and global context, and we found it to be robust across diverse domains (weather, energy, health).
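The $L = T/P$ relationship can be sketched directly. This is a minimal patching illustration with hypothetical shapes (non-overlapping patches; the actual model may use different patching details):

```python
import numpy as np

def patchify(series, P):
    """Split a (C, T) multivariate series into non-overlapping patches.

    Returns an array of shape (C, L, P) with L = T // P tokens per channel,
    so the token-to-timestep ratio is L/T = 1/P; any trailing remainder
    shorter than P is dropped in this sketch.
    """
    C, T = series.shape
    L = T // P
    return series[:, :L * P].reshape(C, L, P)
```

For instance, a 24-step series with $P = 6$ yields $L = 4$ tokens per channel, i.e. the $L/T = 1/6$ ratio discussed above.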

Question 2: In Figure 4, why are the results for Relative Positional Encoding (PE) missing at embedding dimension d = 768? Were they excluded due to poor performance? Were there implementation difficulties?

Thank you for pointing this out. The omission of Relative PE results at d = 768 was due solely to space and figure-clarity constraints, not performance concerns or technical limitations. The d = 768 results show a slight improvement over Relative PE at $d = 384$ but still underperform Rotary PE, which remains our default due to its consistent superiority across tasks.

We appreciate the reviewer’s attention to detail and will make sure these results are explicitly presented in the final version.

Question 3: How is the [CLS] token initialized and trained to capture global information in TRACE?

We follow standard practice and initialize the [CLS] token using nn.init.normal_, with parameters learned during pretraining. The [CLS] token is fully trainable and optimized end-to-end as part of the masked reconstruction and contrastive objectives.

Importantly, during encoding, the [CLS] token is allowed to attend to the entire flattened sequence, including all channel identity tokens (CITs), which summarize per-channel information, and all patch tokens, which encode localized temporal patterns. This design enables the [CLS] token to act as a global aggregator, capturing context across time and variables. Its resulting embedding is used for sample-level alignment, retrieval, and forecast conditioning.
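The initialization and placement described above can be sketched in a few lines. This is a numpy stand-in for the PyTorch setup (a small-scale normal draw mimicking `nn.init.normal_`; the layout of CITs and patch tokens is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # embedding dimension (illustrative)

# Learnable [CLS] token, initialized from a small-scale normal
# distribution (numpy stand-in for nn.init.normal_).
cls_token = rng.normal(loc=0.0, scale=0.02, size=(1, d))

# Hypothetical encoded sequence: C channel identity tokens (CITs)
# followed by the flattened patch tokens.
C, n_patches = 3, 12
sequence = rng.normal(size=(C + n_patches, d))

# Prepend [CLS]; inside the encoder it attends to all CITs and patches,
# and its output embedding serves as the global sample representation.
inputs = np.concatenate([cls_token, sequence], axis=0)
assert inputs.shape == (1 + C + n_patches, d)
```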

Final Decision

TRACE proposes a multimodal retriever that grounds time-series representations in aligned text and supports both text-to-TS and TS-to-text retrieval. It appears the method also works as a context-aware encoder and a plug-and-play RAG module, using channel-level alignment and hard-negative mining to boost forecasting and classification across domains. Reviewers say the idea is novel and the results are strong across retrieval, forecasting, and classification, with a useful curated weather dataset. It seems concerns on data alignment, missing baselines, and efficiency were addressed in rebuttal, and all reviewers end up recommending acceptance.

I recommend accept. All reviewers recommend accept and raise or keep positive scores after rebuttal, and the paper seems technically solid with clear novelty and broad empirical gains. The remaining concerns on realism of texts and scalability seem manageable and are acknowledged with concrete plans. The contribution seems timely and likely impactful for multimodal time-series learning and RAG.