6.8/10

Poster · 4 reviewers

Min 3 · Max 5 · Std 0.8

Confidence: 3.3

Originality: 2.3

Quality: 2.5

Clarity: 3.5

Significance: 2.5

NeurIPS 2025

One Filters All: A Generalist Filter For State Estimation

Shiqi Liu, Wenhan Cao, Chang Liu, Zeyu He, Tianyi Zhang, Yinuo Wang, Shengbo Eben Li


Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Estimating hidden states in dynamical systems, also known as optimal filtering, is a long-standing problem in various fields of science and engineering. In this paper, we introduce a general filtering framework, LLM-Filter, which leverages large language models (LLMs) for state estimation by embedding noisy observations with text prototypes. In a number of experiments for classical dynamical systems, we find that first, state estimation can significantly benefit from the knowledge embedded in pre-trained LLMs. By achieving proper modality alignment with the frozen LLM, LLM-Filter outperforms the state-of-the-art learning-based approaches. Second, we carefully design the prompt structure, System-as-Prompt (SaP), incorporating task instructions that enable LLMs to understand tasks and adapt to specific systems. Guided by these prompts, LLM-Filter exhibits exceptional generalization, capable of performing filtering tasks accurately in changed or even unseen environments. We further observe a scaling-law behavior in LLM-Filter, where accuracy improves with larger model sizes and longer training times. These findings make LLM-Filter a promising foundation model of filtering.

Keywords

State Estimation, Large Language Model, Bayesian Filtering

Reviews and Discussion

Review

Rating: 4 · Confidence: 2 · 2025-06-26

This paper proposes a framework based on large language models to estimate the hidden states of dynamical systems. The approach consists of an encoder and a decoder, which handle the modality alignment, and a prompt to improve the LLM's performance. The authors argue that this approach beats conventional methods not only in performance but also in cross-system generalization and overall robustness, without a significant increase in runtime.

Strengths and Weaknesses

Strengths:

The approach is simple, and the experiments indicate strong performance. The adaptability of the approach is a strength. Overall, the use of large language models in dynamical systems is important and could lead to significant improvements in performance.

Weaknesses:

It is generally unclear what the technical originality of this paper is. The proposed ObsEmbedding and StateProjection are standard for time series, as the authors mention. The prompting is also straightforward. Apart from the experimental evidence provided, I do not see any additional contributions. There is no in-depth analysis, explanation, or intuition regarding why the proposed approach works better than conventional learning-based filters or online Bayes filters.

Questions

Your method significantly outperforms other learning-based filters. Have you evaluated it in a synthetic setting where the Kalman filter is known to be optimal? If so, what is the performance gap between your method and this optimal benchmark? This would be very useful to understand the limits of your approach.
The paper would benefit from a discussion on sample efficiency. Could you clarify the data regime of your experiments (e.g., low-sample vs. high-sample) and analyze how the relative performance of your method against baselines changes with the number of available samples? Such an analysis could be particularly insightful, as it might also explain why fine-tuning with LoRA leads to performance degradation.
I am surprised that frozen LLMs lead to this level of performance. Could you explain in more detail why you think LLMs can produce a representation that is only a simple MLP away from predicting the system's hidden state? In particular, would it be possible to run experiments to gain some understanding of what kind of representations the LLM produces?

Limitations

Yes.

Final Justification

My initial judgment remains but I am increasing the score to 4 as the results are extremely impressive compared to baselines.

Formatting Concerns

Author Response

2025-07-31

Thank you very much for your comments once again. I hope this clarifies your concerns.

[Question 1]: Have you evaluated it in a synthetic setting where the Kalman filter is known to be optimal? If so, what is the performance gap between your method and this optimal benchmark? This would be very useful to understand the limits of your approach.

Thank you for the insightful suggestion. We have added an experiment in a synthetic linear Gaussian system, the Tracking system, where the classical Kalman filter is known to be optimal. This system is part of the cross-system evaluation in Section 5.2, with detailed system dynamics provided in Appendix B. All training and data configurations follow those in the main paper. The results are summarized below:

| System | KF | KalmanNet | LLM-Filter | LLM-Filter-O |
|---|---|---|---|---|
| Tracking (RMSE) | 0.3231 | 0.3262 | 0.3247 | 0.3288 |

As expected, the Kalman filter achieves the best RMSE. LLM-Filter performs competitively, slightly outperforming KalmanNet and LLM-Filter-O. The improvement over LLM-Filter-O highlights the benefit of incorporating external prior knowledge via prompts.
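To make the "Kalman filter as optimal benchmark" setting concrete, here is a minimal sketch of a Kalman filter on a linear Gaussian system. It is not the Tracking system from the paper: the scalar random-walk model, the noise variances `q` and `r`, the seed, and all function names are assumptions chosen only to keep the example short.

```python
import math
import random


def simulate(n_steps, q, r, seed=0):
    """Simulate a scalar random walk x_t = x_{t-1} + w_t observed as y_t = x_t + v_t."""
    rng = random.Random(seed)
    x, states, observations = 0.0, [], []
    for _ in range(n_steps):
        x += rng.gauss(0.0, math.sqrt(q))                      # process noise, variance q
        states.append(x)
        observations.append(x + rng.gauss(0.0, math.sqrt(r)))  # observation noise, variance r
    return states, observations


def kalman_filter(observations, q, r, x0=0.0, p0=1.0):
    """Scalar Kalman filter for the random-walk model above (hypothetical stand-in system)."""
    x_hat, p, estimates = x0, p0, []
    for y in observations:
        p_pred = p + q                       # predict: the variance grows by q
        gain = p_pred / (p_pred + r)         # Kalman gain
        x_hat = x_hat + gain * (y - x_hat)   # update with the innovation
        p = (1.0 - gain) * p_pred            # posterior variance
        estimates.append(x_hat)
    return estimates


def rmse(estimates, truth):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth))


states, observations = simulate(n_steps=2000, q=0.01, r=1.0)
estimates = kalman_filter(observations, q=0.01, r=1.0)
print(f"raw observation RMSE: {rmse(observations, states):.4f}")
print(f"Kalman filter RMSE:   {rmse(estimates, states):.4f}")
```

When the model and noise statistics are known exactly, as in this toy setup, no other estimator can achieve a lower mean-squared error, which is why a learned filter landing close to the KF number is a meaningful sanity check.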

[Question 2]: The paper would benefit from a discussion on sample efficiency. Could you clarify the data regime of your experiments (e.g., low-sample vs. high-sample) and analyze how the relative performance of your method against baselines changes with the number of available samples? Such an analysis could be particularly insightful, as it might also explain why fine-tuning with LoRA leads to performance degradation.

That's a good suggestion. In our main experiments, all methods are trained using 20,000 samples per system, as detailed in Appendix B. This data volume was selected because performance across methods, including ours, saturates beyond this point. To better assess sample efficiency, we conducted additional experiments on the Selkov system by subsampling the training data to 10%, 40%, 70%, 90%, and 100% of the original size. The results are summarized in the table below:

| Method (Selkov) | 10% | 40% | 70% | 90% | 100% |
|---|---|---|---|---|---|
| LLM-Filter | 10.0823 | 1.6534 | 0.4795 | 0.4072 | 0.4061 |
| LLM-Filter-O | NaN | 1.8012 | 0.6837 | 0.6357 | 0.6369 |
| MEstimator | NaN | 3.0156 | 0.9233 | 0.8952 | 0.8864 |
| RStateNet | 20.1987 | 2.8662 | 0.7410 | 0.7256 | 0.7202 |
| ProTran | 12.5344 | 5.1287 | 1.0511 | 1.0290 | 1.0219 |
| KalmanNet | NaN | 2.2204 | 1.1783 | 1.1705 | 1.1662 |

As shown, performance degrades rapidly with reduced training data, and most methods fail to converge at the 10% level. However, LLM-Filter consistently outperforms all baselines across all data regimes.
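The subsampling protocol behind such a sweep could look roughly like the sketch below. The reply gives only the fractions and the 20,000-sample total; the nested-subset design (each smaller set contained in the larger ones), the fixed seed, and the function name are assumptions.

```python
import random

FRACTIONS = (0.10, 0.40, 0.70, 0.90, 1.00)  # fractions reported in the table above


def nested_subsets(n_samples, fractions=FRACTIONS, seed=0):
    """Return {fraction: sorted index list}; smaller subsets are contained in larger ones."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)            # one fixed shuffle shared by all fractions
    return {f: sorted(order[: round(f * n_samples)]) for f in fractions}


subsets = nested_subsets(20_000)                  # 20,000 training samples per system
for fraction, indices in subsets.items():
    print(f"{fraction:.0%}: {len(indices)} samples")
```

Nesting the subsets means that differences between columns come from the amount of data rather than from which samples happened to be drawn.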

Regarding the observed performance drop when fine-tuning with LoRA, we attribute this to the limited expressiveness of low-rank parameter updates in capturing the precision required for accurate state estimation.
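For readers unfamiliar with the term, the "low-rank parameter update" refers to the standard LoRA formulation, in which a frozen weight matrix $W_0$ receives a trainable correction of rank at most $r$. The symbols are the usual ones from the LoRA literature; the rank used in these experiments is not stated in the reply.

$$
W = W_0 + \Delta W, \qquad \Delta W = BA, \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k)
$$

Because $\Delta W$ can span at most $r$ directions, the update is deliberately restricted, which is the property the authors point to when explaining the degradation.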

[Question 3]: I am surprised that frozen LLMs lead to this level of performance. Could you explain in more detail why you think LLMs can produce a representation that is only a simple MLP away from predicting the system's hidden state? In particular, would it be possible to run experiments to gain some understanding of what kind of representations the LLM produces?

Thank you for the thoughtful question. First, we observed that in the time-series forecasting domain, some studies [1, 2] have leveraged frozen LLMs. These works suggest that the pretrained knowledge embedded in LLMs can be effectively applied to forecasting tasks. This inspired us to ask: could pretrained LLMs also contain knowledge that is helpful for filtering tasks?

To investigate this, we attempted to isolate the contribution of pretrained knowledge. As suggested by Reviewer #HiuW in Question 3, we compared a pretrained LLM with a randomly initialized LLM of the same size. Training the randomly initialized model from scratch led to divergence, whereas the pretrained LLM successfully performed the filtering task and demonstrated generalization in the cross-system task. This suggests that pretrained LLMs encode useful knowledge and priors that benefit the filtering task.

Additionally, as shown in Figure 6, removing the SaP results in significant performance degradation. This suggests that LLMs are able to extract and utilize structured information about the system from the prompt, further supporting their utility in state estimation.

Despite these observations, we acknowledge the difficulty in providing a system-level explanation of the internal representations LLMs produce. We tried various visualization techniques, such as attention heatmaps, hidden state visualizations, and adversarial perturbations. Unfortunately, none of these methods provided a clear explanation of how LLMs work in estimation tasks.

We have reviewed a broad body of related literature and found that even successful applications of LLMs to specialized domains (e.g., [1–6]) typically do not offer clear explanations of why LLMs perform so well. We would greatly appreciate any insights or suggestions you might have for deeper interpretability or representation analysis in this context.

[Weakness 1]: It is generally unclear what the technical originality of this paper is. The proposed ObsEmbedding and StateProjection are standard for time series, as the authors mention. The prompting is also straightforward. Apart from the experimental evidence provided, I do not see any additional contributions. There is no in-depth analysis, explanation, or intuition regarding why the proposed approach works better than conventional learning-based filters or online Bayes filters.

Thank you for the opportunity to clarify our contributions. The primary contribution of this work lies not in proposing novel architectural modules or theoretical frameworks, but in demonstrating a new paradigm for filtering: leveraging the in-context learning capabilities of LLMs to build a generalizable filter. This approach enables adaptation to a wide range of dynamical systems without requiring system-specific retraining, thereby addressing a major limitation of conventional learning-based filters that rely heavily on system-specific data and training procedures.

To be honest, our motivation originated from practical challenges in robotic state estimation, particularly in quadruped robots. We initially employed KalmanNet [7], a representative learning-based filter. However, we observed that KalmanNet's performance degrades significantly when the environment changes (e.g., from indoor flat surfaces to outdoor sandy terrain), and adapting it to new settings required additional data collection and retraining, highlighting the limitations in generalization.

Simultaneously, we observed that the control and time-series domains have explored the use of pretrained LLM knowledge to improve generalization. Inspired by OpenVLA [3], we initially attempted to represent continuous system states as discrete tokens within the LLM’s vocabulary space. However, we found that this discretization introduced significant precision loss, making it unsuitable for high-accuracy tasks like continuous-state estimation in filtering.
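The precision loss from token discretization can be illustrated with a small sketch. The value range, the number of bins, and the uniform binning scheme are hypothetical choices for illustration, not the configuration the authors tried.

```python
def quantize(value, low, high, n_bins):
    """Map a continuous value to the center of one of n_bins uniform bins on [low, high]."""
    width = (high - low) / n_bins
    index = min(int((value - low) / width), n_bins - 1)   # clamp value == high into the last bin
    return low + (index + 0.5) * width


low, high, n_bins = -1.0, 1.0, 256                        # hypothetical range and vocabulary size
grid = [low + (high - low) * i / 10_000 for i in range(10_001)]
max_error = max(abs(quantize(v, low, high, n_bins) - v) for v in grid)
print(f"bin width: {(high - low) / n_bins:.6f}, worst-case error: {max_error:.6f}")
```

The worst-case error is half a bin width no matter how good the model is, so the quantization floor alone can be comparable to the differences between methods when RMSE values are compared at the third or fourth decimal place.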

Following insights from time-series forecasting works [4, 5], we then designed a modality alignment architecture to connect observations and states to the LLM. However, early versions of this setup struggled to generalize across systems, largely because the model lacked explicit knowledge of the underlying system dynamics. This led us to introduce the SaP, which encodes system-specific information in a format interpretable to the LLM. This component was crucial for enabling the LLM to effectively reason over both the observation history and the system characteristics.

We believe our work contributes both a practical solution and a new perspective on using pretrained LLMs for general filtering, which has not been previously explored in this form.

References

[1] Gruver, N., Finzi, M., Qiu, S., & Wilson, A. G. (2023). Large language models are zero-shot time series forecasters. Advances in Neural Information Processing Systems, 36, 19622-19635.

[2] Liu, Y., Qin, G., Huang, X., Wang, J., & Long, M. (2024). AutoTimes: Autoregressive time series forecasters via large language models. Advances in Neural Information Processing Systems, 37, 122154-122184.

[3] Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., ... & Finn, C. (2024). OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246.

[4] Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., ... & Han, K. (2023, December). RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning (pp. 2165-2183). PMLR.

[5] Liu, X., Hu, J., Li, Y., Diao, S., Liang, Y., Hooi, B., & Zimmermann, R. (2024, May). UniTime: A language-empowered unified model for cross-domain time series forecasting. In Proceedings of the ACM Web Conference 2024 (pp. 4095-4106).

[6] Shojaee, P., Meidani, K., Gupta, S., Farimani, A. B., & Reddy, C. K. (2024). LLM-SR: Scientific equation discovery via programming with large language models. arXiv preprint arXiv:2404.18400.

[7] Revach, G., et al. (2022). KalmanNet: Neural network aided Kalman filtering for partially known dynamics. IEEE Transactions on Signal Processing, 70, 1532-1547.

2025-08-06

Thank you for your explanations. The additional experiments in Q1 and Q2 show that the performance is indeed extremely impressive. I think the inclusion of these experiments in the manuscript would increase the quality of the paper. My reservations regarding the technical contribution of the paper still remain.

2025-08-06

We sincerely thank you for recognizing the significance of the additional experiments and for the constructive suggestion to include them in the manuscript.

Regarding the concerns about the technical contribution, we would like to reiterate that our work opens a promising direction by demonstrating that LLMs originally designed for language tasks can be effectively adapted to state estimation problems through careful modality alignment and system encoding. This approach bridges two traditionally separate fields, natural language processing and state-space filtering, and opens new opportunities for future research.

To the best of our knowledge, our work is the first to propose a general filtering framework that leverages the in-context learning capabilities and pretrained knowledge of LLMs to generalize across diverse dynamical systems without system-specific retraining. This represents a new paradigm for filtering and addresses a key limitation of existing learning-based and Bayesian methods, which typically require system-specific architectures or retraining pipelines.

Once again, we appreciate the reviewer’s insightful feedback and hope the revised manuscript clearly communicates the novelty and potential impact of our contribution.

2025-08-08

We sincerely thank you for your feedback.

Our work does not primarily focus on innovating model architectures. Instead, we address the challenge of generalizing filtering techniques for diverse systems. Rather than manipulating discrete characters or tokens, we operate directly in the continuous embedding space of LLMs, thereby aligning the filtering task with the LLM’s native modality. Empirically, we find that LLMs struggle to interpret explicit mathematical equations. Consequently, our in-context prompting is exemplar-driven rather than equation-driven: each system is characterized by a handful of input–output demonstrations, letting the filter adapt on the fly without any system-specific retraining.
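As a rough illustration of an exemplar-driven prompt, the sketch below assembles a system description and a few observation/state demonstrations into text. The template wording, the numeric formatting, the function name, and the choice to render demonstrations as text at all are assumptions; the actual SaP format is defined in the paper, not here.

```python
def build_system_prompt(system_description, demonstrations, precision=3):
    """Assemble a text prompt from a system description plus observation/state exemplars."""
    def fmt(vector):
        return "[" + ", ".join(f"{v:.{precision}f}" for v in vector) + "]"

    lines = [
        f"System: {system_description}",
        "Task: estimate the hidden state from noisy observations.",
        "Examples:",
    ]
    for observation, state in demonstrations:
        lines.append(f"  observation {fmt(observation)} -> state {fmt(state)}")
    return "\n".join(lines)


prompt = build_system_prompt(
    "two-dimensional nonlinear oscillator, Gaussian observation noise",  # hypothetical system
    [((0.52, 1.10), (0.50, 1.08)), ((0.61, 0.97), (0.60, 1.00))],       # made-up demonstrations
)
print(prompt)
```

Swapping the description and the demonstrations is the only change needed to target a different system, which is the sense in which such a prompt replaces system-specific retraining.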

As the review period is concluding shortly, we kindly request confirmation on whether our clarifications have adequately addressed your concerns regarding the technical contribution. Should any points require further elaboration, we remain ready to provide additional information immediately.

Review

Rating: 5 · Confidence: 4 · 2025-06-30

Summary

This paper introduces a "universal filter" framework for dynamic system state estimation called LLM-Filter, which repurposes large language models (LLMs) as filters through three key steps: First, it segments continuous, noisy observation sequences and embeds them into the same representation space as LLM tokens using an ObsEmbedding module. Second, during inference, it organizes system/task descriptions and a few examples into a prompt, which is concatenated with the observation embedding as context tokens and input to the frozen LLM core layer, a process termed System-as-Prompt (SaP). Third, it maps the token representation predicted by the LLM back to the continuous state space using a lightweight MLP, training only the ObsEmbedding and StateProjection modules while keeping the LLM weights frozen. Across multiple benchmarks, including low-dimensional nonlinear systems and high-dimensional chaotic atmospheric models, LLM-Filter achieves significantly lower RMSE than state-of-the-art learning-based filters like KalmanNet and ProTran, as well as various Bayesian filters such as EnKF and PF, all while maintaining millisecond-level inference efficiency. Extended experiments highlight its robustness in model mismatch and cross-system generalization scenarios, with a clear scaling law showing error reduction as LLM size increases. The main contributions include proposing the universal LLM-Filter framework that enables direct state estimation with frozen pre-trained LLMs, designing the SaP mechanism for quick adaptation to new systems without retraining, verifying performance and generalization through substantial error reductions on multiple systems, and open-sourcing the code and datasets to provide a reproducible benchmark and promote LLM applications in engineering control.
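The first stage described above, segmenting an observation window before embedding, can be sketched as follows. Non-overlapping segments, the flattening order, and the function name are assumptions; the trainable embedding layer that would map each segment to an LLM-sized vector is omitted.

```python
def segment_observations(window, segment_length):
    """Split a window of T observation vectors into T / segment_length flattened segments."""
    if len(window) % segment_length != 0:
        raise ValueError("window length must be a multiple of segment_length")
    segments = []
    for start in range(0, len(window), segment_length):
        chunk = window[start:start + segment_length]
        # Flatten the chunk: one segment holds segment_length * obs_dim numbers.
        segments.append([value for observation in chunk for value in observation])
    return segments


window = [[float(t), float(t) + 0.5, float(t) - 0.5] for t in range(20)]  # 20 steps, 3 channels
segments = segment_observations(window, segment_length=5)
print(len(segments), len(segments[0]))
```

Each flattened segment plays the role of one token, so the sequence length seen by the frozen LLM shrinks by a factor of `segment_length` relative to the raw window.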

Strengths and Weaknesses

Strengths

Rigorous Methodology: The paper instantiates a clear three-stage pipeline—ObsEmbedding, System-as-Prompt, and StateProjection—with frozen LLM weights and only two lightweight trainable modules. This modular design simplifies training and ablation studies, demonstrating careful experimental control.
Comprehensive Evaluation: Benchmarks span low-dimensional nonlinear systems, high-dimensional chaotic atmospheric simulations, model-mismatch scenarios, and cross-system generalization. Metrics (e.g., RMSE) are reported against both classical Bayesian filters and state-of-the-art learned filters, showing statistically significant improvements.
System-as-Prompt Paradigm: While in-context learning is well known in NLP, applying it as a “prompt-based filter” with formal embedding/projection modules is novel. It reconceives LLMs as black-box dynamical system solvers rather than pure sequence predictors.
Scaling Law Analysis: The finding that filtering accuracy improves monotonically with LLM scale (and limited training) provides fresh insight into how model scale interacts with downstream control tasks.

Weaknesses

Theoretical Justification: The paper lacks a deeper theoretical analysis of why LLM latent spaces align so well with continuous state spaces. A formal connection or bounds on estimation error would strengthen the work’s foundations.
Limited Noise Models: Experiments focus on Gaussian observation noise; robustness to heavy-tailed or adversarial noise is untested, leaving open questions about performance in more challenging real-world settings.

Questions

Question about baseline parity and training budget: KalmanNet is reported to diverge on two high-dimensional tasks, and all learning baselines are trained from scratch for every system, while LLM-Filter freezes 7 B parameters and tunes only ≈ 4 M trainable weights (Table 5). Could you please detail the exact wall-clock training time, compute, and early-stopping criteria for each baseline and clarify whether any hyper-parameter sweeps used validation data inaccessible to the baselines after split?
Question about observation-to-token embedding fidelity: The ObsEmbedding groups L numeric readings into one token, and performance is claimed to be “stable” across window/segment lengths (Table 11), but Figure 10 shows marked variance on VL20 when hidden dims shrink. Could you please provide visualisations of reconstruction error when mapping observations → token → projection and discuss whether discretisation causes information loss for high-frequency dynamics?
Question about scaling-law interpretation vs. parameter count confound: Figure 7 shows lower RMSE with larger backbones, yet the ablation replaces the LLM with a 4-layer Transformer of ~10 M params (Table 3). Could you please run a control where you keep ~7 B parameters but randomly initialise the core layers to disentangle pre-training knowledge from model capacity?
Question about runtime and memory claims: LLM-Filter peaks at 27 GB memory yet runs at ~1 ms/step (Table 6, hardware: H800-80G), while RStateNet is two orders of magnitude smaller but reported slower on high-dim tasks (Table 6). Could you please specify whether baselines were executed on identical GPU kernels, precision (fp16/bf16), and batch sizes and include a throughput vs. latency plot for varying batch sizes?

Limitations

Please see the Weaknesses.

Final Justification

The authors’ response has addressed most of my concerns, so I will retain my original score.

Formatting Concerns

None

Author Response

2025-07-31

Thank you very much for your insightful question. We appreciate the opportunity to clarify and hope the following explanation addresses your concerns.

[Question 1]: Baseline parity and training budget.

Thank you for your question. We provide a detailed breakdown of the training protocols and implementation details of each baseline below to clarify training budget, early stopping, and tuning fairness.

| Method | Implementation | Early Stopping | Training Time | HP Tuning |
|---|---|---|---|---|
| KalmanNet | Official repo (TSP KalmanNet) | ✘ (not supported) | ~15 min | Grid search on val set |
| RStateNet | Re-implemented | ✅ (patience=3) | ~13.69 min | Grid search on val set |
| MEstimator | Re-implemented | ✅ (patience=3) | ~12.16 min | Grid search on val set |
| ProTran | Re-implemented | ✅ (patience=3) | ~11.03 min | Grid search on val set |
| LLM-Filter | Ours | ✅ (patience=3) | ~30.16 min | Grid search on val set |

For KalmanNet, we directly used the official implementation. This version does not include built-in early stopping; instead, it saves the best-performing checkpoint on the validation set and reports its test results.
For RStateNet, MEstimator, and ProTran, we re-implemented the architectures and applied the same early stopping strategy used for LLM-Filter (patience=3, based on validation loss).
For all methods, we conducted modest grid searches on core hyperparameters (e.g., learning rate, hidden size), using the same training and validation data. Test data was strictly held out until final evaluation.
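The patience-based rule mentioned above can be sketched as follows. The exact criterion used in the experiments (for example, whether a minimum improvement threshold applies) is not given, so the strict "lower than best so far" test and the function name are assumptions.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a sequence of validation losses."""
    best_loss, best_epoch, bad_epochs = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0   # improvement resets the counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                           # patience exhausted: stop here
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch                       # ran to the end without stopping


stop, best = early_stopping_epoch([0.90, 0.70, 0.65, 0.66, 0.68, 0.67, 0.60])
print(stop, best)
```

In this made-up loss curve, training stops at epoch 5 and keeps the checkpoint from epoch 2, so the later improvement at epoch 6 is never seen. That trade-off is inherent to a short patience.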

Regarding the divergence of KalmanNet on high-dimensional systems, we emphasize that this issue is not due to insufficient training time or suboptimal hyperparameter tuning. Instead, we believe it arises from KalmanNet’s architectural reliance on Kalman structure and a hidden state representation whose dimensionality scales quadratically with system size. These structural choices, while suitable for low-dimensional linear systems, significantly limit scalability and representational capacity for high-dimensional nonlinear systems.

We hope this addresses your concern and assures a fair comparison was made across all baselines.

[Question 2]: Observation-to-token embedding fidelity.

Thank you for your question. While visual inspection might suggest instability, the actual numerical results show that performance variance decreases as the hidden dimension shrinks. The complete RMSE values across different segment lengths and hidden dimensions are provided in the table below:

| Hidden Dim | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|
| $T = 10$ | 0.6825 | 1.0353 | 0.9307 | 1.0100 |
| $T = 20$ | 0.6200 | 0.5610 | 0.4632 | 1.0823 |
| $T = 40$ | 0.9834 | 0.6873 | 0.7509 | 0.4525 |
| $T = 50$ | 0.7705 | 0.7226 | 0.9686 | 1.0144 |
| Variance | 0.0252 | 0.0406 | 0.0532 | 0.0861 |

Overall, the performance variance remains within a small range (generally under 0.1), indicating that our method is stable across window lengths. Additionally, increasing the hidden dimension generally leads to improved performance, but also results in slightly higher variance across different window lengths. This is because larger hidden dimensions give the MLP more capacity to capture complex patterns from LLM embeddings, leading to better accuracy. However, the increased capacity can also make the model more sensitive to training data and hyperparameters, resulting in slightly higher variance.
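The Variance row can be reproduced from the four RMSE values in each column. The sketch below assumes it is the sample variance (denominator $n-1$) across window lengths; that reading is inferred from the numbers and is not stated in the reply.

```python
import statistics

# RMSE per hidden dimension, ordered by window length T = 10, 20, 40, 50 (from the table above).
rmse_by_hidden_dim = {
    128:  [0.6825, 0.6200, 0.9834, 0.7705],
    256:  [1.0353, 0.5610, 0.6873, 0.7226],
    512:  [0.9307, 0.4632, 0.7509, 0.9686],
    1024: [1.0100, 1.0823, 0.4525, 1.0144],
}

# statistics.variance uses the n-1 denominator (sample variance).
variances = {dim: round(statistics.variance(v), 4) for dim, v in rmse_by_hidden_dim.items()}
print(variances)
```

Under that assumption the computed values match the reported row to four decimals.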

We apologize if we misunderstood your request. We assumed that "reconstruction error visualization" refers to visualizing the residuals between the estimated states and ground-truth values. We have plotted the estimation vs. ground truth trajectories for the Selkov system to analyze the residuals. Due to rebuttal constraints, we are unable to include figures here, but we will add them in the final version of the paper. We apologize for this limitation.

We would also like to clarify that our work focuses on the discrete filtering problem [2], where both the system and observations are discretized. For each dynamical system, we apply discretization beforehand, as described in Appendix B, which introduces unavoidable information loss equally to all methods. To ensure fairness, all baseline methods we compare against are also discrete filtering approaches and are trained and evaluated on data collected from the same discretized system.

[Question 3]: Scaling-law interpretation vs. parameter count confound.

This is a great question, which also helps explain why large models are effective. We initially considered conducting an ablation study by randomly initializing the 7B-parameter LLM backbone to disentangle the effects of model capacity from pretraining knowledge. However, in practice, we found that training such a large model from scratch on our task-specific dataset leads to divergence, even after trying a range of stabilization techniques such as warmup schedules, gradient clipping, and regularization. This instability may partly be due to our current hardware limitations, which make it infeasible to fully train a 7B-parameter model from scratch. More fundamentally, it aligns with prior observations that training large models such as LLaMA-7B typically requires extensive data, multi-stage curricula, and significant computational resources to achieve stable convergence [1].

A promising direction for future work is to train a foundation filtering model from scratch by first collecting a comprehensive and unified dataset encompassing multiple dynamical systems. Following this, a multi-stage training procedure could be employed, beginning with large-scale pretraining on the diverse dataset to learn general filtering capabilities, and then fine-tuning on task-specific datasets to adapt to particular system dynamics.

[Question 4]: Runtime and memory claims.

Thank you for your question. All methods were evaluated on the same hardware: a single NVIDIA H800-80G GPU. For LLM-Filter, we used bf16 precision, while for the smaller baseline models, we followed PyTorch's default fp32 precision.

We would also like to clarify that the runtimes reported in Table 6 reflect per-step inference latency with batch size = 1, which is typical for real-time filtering tasks. As such, batch size does not influence the latency values presented in the table.

It is important to note that inference speed is not solely determined by model size. Differences in network structure are also crucial. For example, LLMs often benefit from optimized inference kernels, fused attention mechanisms, and high degrees of parallelism, which can result in faster execution even with larger parameter counts. In contrast, small custom models such as RStateNet may not take advantage of such low-level optimizations, potentially resulting in slower inference despite their compact size.
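A per-step latency measurement with batch size 1 can be sketched as below. The dummy filter step, the warm-up count, and the function names are hypothetical; on a GPU the timer would also need a device synchronization before each clock read, which this CPU-only sketch does not cover.

```python
import time


def per_step_latency_ms(step_fn, inputs, warmup=10):
    """Average wall-clock latency in milliseconds of step_fn over inputs, one input at a time."""
    for x in inputs[:warmup]:
        step_fn(x)                                  # warm-up calls are not timed
    start = time.perf_counter()
    for x in inputs:
        step_fn(x)                                  # batch size 1: one observation per call
    return (time.perf_counter() - start) * 1000.0 / len(inputs)


def dummy_filter_step(observation):                 # hypothetical stand-in for a real filter
    return [0.5 * v for v in observation]


latency = per_step_latency_ms(dummy_filter_step, [[0.1, 0.2, 0.3]] * 1000)
print(f"{latency:.4f} ms/step")
```

Measuring one call at a time matches the online filtering setting, where each observation must be processed before the next one arrives.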

[Weakness 1]: Theoretical Justification.

Thank you for your insightful comment. We acknowledge that establishing formal connections or deriving estimation error bounds would certainly strengthen the theoretical foundation of this work, and we view this as a promising direction for future research.

We also wish to clarify that the primary contribution of this work lies not in proposing novel theoretical analysis, but in demonstrating a new paradigm for filtering: leveraging the in-context learning capabilities of LLMs to build a generalizable filter. This approach enables adaptation to a wide range of dynamical systems without requiring system-specific retraining, thereby addressing a major limitation of conventional learning-based filters that rely heavily on system-specific data and training procedures.

[Weakness 2]: Limited Noise Models.

Thank you for your comment. We would like to clarify that a classical observation-disturbance scenario, featuring heavy-tailed noise, has been included in Appendix E.2.

As shown in Table 8, although LLM-Filter did not achieve the best performance in this setting, it performed comparably to specialized robust filtering algorithms. This result highlights LLM-Filter’s ability to generalize well even in the presence of significant outliers or system disturbances, making it a promising solution for diverse and challenging scenarios.

References

[1] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., ... & Lample, G. (2023). LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

[2] Särkkä, S., et al. Bayesian filtering and smoothing (Vol. 17). Cambridge University Press.

2025-08-08

Thank you for your valuable feedback on our manuscript "One Filters All: A Generalist Filter For State Estimation".

In our response, we supplemented experiments and addressed your concerns regarding baseline parity, embedding fidelity, and so on. As the review period concludes tomorrow, we wish to confirm whether our clarifications have adequately resolved your questions. We understand your schedule may be demanding, and we are happy to provide additional information promptly if needed.

Thank you again for your essential contribution to improving this work.

2025-08-08

Thank you for your valuable feedback on our manuscript "One Filters All: A Generalist Filter For State Estimation".

In our response, we supplemented experiments and addressed your concerns regarding baseline parity, embedding fidelity, and so on. As the review period is concluding shortly, we wish to confirm whether our clarifications have adequately resolved your questions. We understand your schedule may be demanding, and we are happy to provide additional information promptly if needed.

Thank you again for your essential contribution to improving this work.

2025-08-09

Thank you for your response—it has resolved most of my concerns, so I will keep my original score.

2025-08-09

Thank you very much for your thoughtful and constructive feedback throughout the review process. We sincerely appreciate your recognition of the value in demonstrating a new paradigm for filtering. We respect your perspective on technical novelty and hope our work can inspire future research that builds on this direction.

Review

Rating: 3 · Confidence: 3 · 2025-06-30

The work proposes adapting an LLM for state estimation. By designing an embedding scheme and a text-based prompt, the proposed LLM-Filter outperforms other learning-based baselines. Notably, LLM-Filter also performs robustly in model-mismatch and unseen settings. Furthermore, performance scales with the size of the base LLM. In an ablation study, when the LLM is replaced with non-pretrained alternative neural models, the performance drops significantly.

Strengths and Weaknesses

Strengths:

The proposed work significantly outperforms previous SoTA for dynamical systems' state estimation.
Evaluation on different architectures / different base LMs provides strong evidence that the proposed method is effective when true state observations are available.

Weaknesses (disclaimer: I am not an expert in the particular subfield of dynamical systems' state estimation, and am open to further discussion regarding comparison against previous work.)

The paper assumes that we have access to the true states $\mathbf{X}$ in Eq 4. But I wonder if this is a very strong assumption.
Moreover, there have been prior papers on neural approximation of hidden state models (e.g. Krishnan et al. (2016) and Lin and Eisner (2016)), which may be discussed in the paper. In particular, Lin and Eisner considered a smoothing scenario where the entire sequence of $\mathbf{Y}$ is given beforehand, which can likely further improve accuracy in this paper's setting as well.

Questions

For very long sequences it may not be feasible to feed the whole embedding sequence $\mathbf{e}_{1:N_s}$ into the Transformer. Did the authors consider what are good ways to compress the context in this case? Also does the quality drop significantly?

Limitations

yes

Final Justification

My concerns remain that the strong assumptions limit the impact of this paper.

Formatting Concerns

n/a

Author Response

2025-07-31

Thank you very much for your insightful question. We appreciate the opportunity to clarify and hope the following explanation addresses your concerns.

[Question 1]: For very long sequences it may not be feasible to feed the whole embedding sequence $\mathbf{e}_{1:N_s}$ into the Transformer. Did the authors consider what are good ways to compress the context in this case? Also does the quality drop significantly?

Thank you for your question. Indeed, we have adopted a sliding window mechanism, which corresponds to the commonly used Moving Horizon Estimation (MHE) framework [1] with theoretical guarantees, as described in Section 3.1 of the paper. This approach ensures that only a fixed number of recent observations are included in the input to the LLM, keeping the embedding sequence length bounded and computationally tractable.

Additionally, as discussed in Appendix E.4, we observe that increasing the window length beyond 20 provides diminishing returns in performance. With a window length of 20, the resulting embedding sequence typically contains fewer than 200 tokens, which is well within the effective context length of modern large language models [2, 3]. Based on this observation, we use a window length of 20 in all experiments.
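The sliding-window idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sliding_windows`, the use of a `deque` buffer, and the random test data are assumptions made purely for illustration. Only the window length of 20 comes from the response.

```python
from collections import deque

import numpy as np

WINDOW_LENGTH = 20  # window length reported in the response above


def sliding_windows(observations, window_length=WINDOW_LENGTH):
    """Yield (t, window) pairs (hypothetical helper).

    `window` holds at most the `window_length` most recent observations
    up to and including time step t, so its length stays bounded.
    """
    buffer = deque(maxlen=window_length)  # oldest entries drop out automatically
    for t, y in enumerate(observations):
        buffer.append(y)
        yield t, np.stack(list(buffer))


# Hypothetical data: 50 time steps of a 3-dimensional observation.
ys = np.random.default_rng(0).normal(size=(50, 3))
windows = list(sliding_windows(ys))

print(windows[0][1].shape)   # only one observation is available at t = 0
print(windows[-1][1].shape)  # bounded by WINDOW_LENGTH once enough data has arrived
```

Because the buffer never exceeds the window length, the input passed to the model at each step has a fixed upper bound regardless of how long the full trajectory is.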

[Weakness 1]: The paper assumes that we have access to the true states $\mathbf{X}$ in Eq 4. But I wonder if this is a very strong assumption.

In practice, this assumption is not as strong as it may initially seem. Supervised learning-based filters typically require access to ground-truth states during training [4, 5, 6], which are often obtained using high-precision sensors or external tracking systems. This setup is standard in many real-world applications.

For example, in quadruped robot state estimation [7], ground-truth data is commonly collected using an indoor motion capture system to supervise the training process. Once trained, the filter can operate using only onboard measurements such as IMUs and leg odometry, while still producing accurate state estimates.

[Weakness 2]: Moreover, there have been prior papers on neural approximation of hidden state models (e.g. Krishnan et al. (2016) and Lin and Eisner (2016)), which may be discussed in the paper. In particular, Lin and Eisner considered a smoothing scenario where the entire sequence of $\mathbf{Y}$ is given beforehand, which can likely further improve accuracy in this paper's setting as well.

That's a good question. We would like to clarify that smoothing and filtering correspond to different problem formulations [8]. Smoothing estimates the current state based on both past and future observations, whereas filtering relies solely on observations available up to the current time to infer the current state. While smoothing can improve accuracy by leveraging future information, it is generally infeasible in real-time applications such as robotic control. For example, a legged robot must make decisions (e.g., control actions, trajectory planning, collision avoidance) based on the current belief of its state, which must be inferred from past and immediately available sensor data. Future sensor readings are inherently unavailable during execution, making smoothing inapplicable in such real-time systems. Therefore, this is a typical filtering problem, and our paper is designed with this class of applications in mind.
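In the standard Bayesian formulation, the distinction drawn above can be written as a difference in the conditioning set. The notation below is an assumption for illustration ($\mathbf{x}_t$ for the hidden state, $\mathbf{y}_t$ for the observation at step $t$, $T$ for the final time index) and is not taken from the paper.

```latex
% Assumed notation: \mathbf{x}_t = hidden state, \mathbf{y}_t = observation at step t
\begin{align*}
\text{Filtering:} \quad & p(\mathbf{x}_t \mid \mathbf{y}_{1:t}) \\
\text{Smoothing:} \quad & p(\mathbf{x}_t \mid \mathbf{y}_{1:T}), \qquad T > t
\end{align*}
```

The smoothing posterior conditions on $\mathbf{y}_{t+1:T}$, which does not yet exist when an online controller needs its estimate at time $t$.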

We also appreciate your pointing us to the work by Lin and Eisner. We acknowledge that our paper currently lacks a discussion of smoothing-based methods, and we will include a more complete review of related smoothing literature in the appendix to better situate our work. Thank you again for your thoughtful feedback.

Reference

[1] Alessandri, A., Baglietto, M., & Battistelli, G. (2008). Moving-horizon state estimation for nonlinear discrete-time systems: New stability results and approximation schemes. Automatica, 44(7), 1753-1765.

[2] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., ... & McGrew, B. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.

[3] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., ... & Lample, G. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

[4] Revach, G., et al. KalmanNet: Neural network aided Kalman filtering for partially known dynamics. IEEE Transactions on Signal Processing, 70, 1532-1547.

[5] Dahal, P., Mentasti, S., Paparusso, L., Arrigoni, S., & Braghin, F. (2024). RobustStateNet: Robust ego vehicle state estimation for Autonomous Driving. Robotics and Autonomous Systems, 172, 104585.

[6] Ji, G., Mun, J., Kim, H., & Hwangbo, J. (2022). Concurrent training of a control policy and a state estimator for dynamic and robust legged locomotion. IEEE Robotics and Automation Letters, 7(2), 4630-4637.

[7] Buchanan, R., et al. Learning inertial odometry for dynamic legged robot state estimation. In Conference on robot learning (pp. 1575-1584). PMLR.

[8] Särkkä, S., et al. Bayesian filtering and smoothing (Vol. 17). Cambridge university press.

2025-08-06

I thank the authors for their detailed and thoughtful response.

My primary remaining concern, however, centers on the assumption of having access to ground-truth states ($\mathbf{X}$) for supervised training. I appreciate the authors' explanation that this is a common and practical setup in specific applied domains like robotics, where high-fidelity motion capture systems can provide such data during a training phase.

While this assumption is valid for the target application, it is crucial to recognize that this frames the problem in a way that significantly narrows its impact from a machine learning methodology perspective, particularly for the NeurIPS audience. A substantial body of work on state-space models within the NeurIPS community is dedicated to the more challenging (and general) setting of learning from observations ($\mathbf{Y}$) alone, without access to the true latent states. While the results are strong for the chosen problem, the work's primary contribution is a successful application of LLMs to a specific, fully-observed filtering task, rather than a fundamental advance in state-space modeling or inference that would be broadly applicable to the latent variable modeling community.

Therefore, while I acknowledge the paper is technically sound for its target application, my assessment of its limited significance and impact for the broader NeurIPS audience remains.

2025-08-06

We sincerely thank you for your continued engagement and thoughtful critique.

As you correctly pointed out, our method assumes access to ground-truth states during training and therefore does not directly address the unsupervised setting. From a theoretical standpoint [1], learning-based filtering methods generally fall into two categories: those that rely on ground-truth states (supervised) [2–4], and those that require accurate system modeling (unsupervised) [6, 7]. Supervised approaches learn system characteristics directly from true states, whereas unsupervised methods aim to infer latent states from observations via precise modeling. Our work falls within the supervised category with ground-truth states available, which is a common setting in many real-world applications.

Additionally, we would like to clarify that our primary contribution is not to introduce a new supervised filter in the traditional sense, but rather to propose a general filtering framework based on LLMs. Unlike current learning-based filters that are tightly coupled with specific system training, our approach is broadly applicable across different systems by leveraging the in-context learning capabilities and pretrained knowledge of LLMs. To the best of our knowledge, this type of framework has not been explored in the existing filtering literature.

We appreciate the reviewer’s suggestion that adapting our framework to settings without ground-truth supervision is a meaningful and promising direction. We will include a discussion of this point in the revised manuscript as a potential avenue for future research.

Reference

[1] Särkkä, S., et al. Bayesian filtering and smoothing (Vol. 17). Cambridge university press.

[2] Dahal, P., Mentasti, S., Paparusso, L., Arrigoni, S., & Braghin, F. (2024). RobustStateNet: Robust ego vehicle state estimation for Autonomous Driving. Robotics and Autonomous Systems, 172, 104585.

[3] Ji, G., Mun, J., Kim, H., & Hwangbo, J. (2022). Concurrent training of a control policy and a state estimator for dynamic and robust legged locomotion. IEEE Robotics and Automation Letters, 7(2), 4630-4637.

[4] Buchanan, R., et al. Learning inertial odometry for dynamic legged robot state estimation. In Conference on robot learning (pp. 1575-1584). PMLR.

[5] Särkkä, S., et al. Bayesian filtering and smoothing (Vol. 17). Cambridge university press.

[6] Revach, G., Shlezinger, N., Locher, T., Ni, X., van Sloun, R. J., & Eldar, Y. C. (2022, August). Unsupervised learned Kalman filtering. In 2022 30th European Signal Processing Conference (EUSIPCO) (pp. 1571-1575). IEEE.

[7] McCabe, M., & Brown, J. (2021). Learning to assimilate in chaotic dynamical systems. Advances in neural information processing systems, 34, 12237-12250.

Comment: Thanks again for the response!

2025-08-08

"From a theoretical standpoint [1], learning-based filtering methods generally fall into two categories: those that rely on ground-truth states (supervised) [2–4], and those that require accurate system modeling (unsupervised) [6, 7]."

I am not sure if such a categorization is entirely accurate, since there are certainly methods that jointly learn to estimate the states and to model the world, for example deep state-space models such as DVBF and EKVAE. These models typically require powerful model families. With powerful LLMs as base models, from a methodology research perspective, it would seem surprising that, with their pretrained world knowledge and sheer parameter sizes, ground-truth state as a supervision signal still strains their capabilities.

Meanwhile, I do acknowledge this submission as a solid solution to the specific state estimation problems the authors consider.

DVBF: https://arxiv.org/abs/1605.06432
EKVAE: https://openreview.net/forum?id=-WEryOMRpZU

Comment: Thank you for the stimulating discussion.

2025-08-08

Thank you for the stimulating discussion. We note that there is, in fact, no universally accepted classification in this area; the taxonomy we adopt reflects our own perspective.

We appreciate the references you provided. We are familiar with both DVBF and EKVAE, yet we respectfully regard them as representation learning frameworks rather than rigorous state estimation methods. In our view, while these models map observed variables into a latent space, the resulting latent variables serve primarily as abstract representations and are difficult to align with the actual system states required for real-world applications. Even if one constrains the model family to restrict the state space, such constraints do not ensure that the learned latent manifold aligns with the true physical state manifold. This makes their direct application to practical state estimation challenging. Moreover, because the latent variables have no explicit physical meaning, performance is often reported only in terms of correlation coefficients or KL divergences with ground-truth states, rather than RMSE, the canonical metric in filtering.
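For readers less familiar with the metric mentioned above, a minimal sketch of RMSE between estimated and ground-truth states follows. The function name `rmse` and the choice to average over all time steps and state dimensions are assumptions; papers differ on the exact averaging convention, and the one used in this submission is not stated in the thread.

```python
import numpy as np


def rmse(estimates, true_states):
    """Root-mean-square error averaged over all time steps and state dimensions.

    Assumed convention: both inputs have the same shape, e.g. (T, state_dim).
    """
    estimates = np.asarray(estimates, dtype=float)
    true_states = np.asarray(true_states, dtype=float)
    return float(np.sqrt(np.mean((estimates - true_states) ** 2)))


# Hypothetical two-step, two-dimensional trajectory.
x_true = [[0.0, 1.0], [1.0, 2.0]]
x_hat = [[0.1, 0.9], [1.2, 1.8]]
print(rmse(x_hat, x_true))
```

Computing this requires estimates expressed in the same coordinates as the physical state, which is the crux of the authors' argument about unaligned latent spaces.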

Scalability also remains a concern: DVBF and EKVAE have thus far been demonstrated only on toy or low-dimensional problems. High-dimensional chaotic regimes, such as large-scale fluid dynamics or weather forecasting, demand substantial model capacity and numerical stability—capabilities that these architectures have yet to exhibit. By contrast, our approach leverages the pretrained world knowledge and expressive power of LLMs, achieving strong RMSE performance while retaining robust generalization.

We sincerely appreciate your recognition of our work and look forward to continued scholarly exchange.

2025-08-08

Thank you again for your valuable suggestions and feedback.

We emphasize that supervised and unsupervised learning represent distinct approaches to filtering. Our core innovation and contribution specifically lie in the supervised filtering methods with comparative evaluations conducted against supervised baselines.

As the review period is concluding shortly, we kindly request confirmation on whether our clarifications have adequately addressed your concerns regarding the supervised setting and other related aspects. Should any points require further elaboration, we remain ready to provide additional information immediately.

Review

Rating: 5 | Confidence: 4 | 2025-07-03

The paper addresses state estimation of dynamical systems (hidden states in Markov chains). Previously in this field, LLMs and VLMs have shown positive results. This paper introduces a general filter to integrate LLMs into state estimation. The system is able to take in disparate data modalities, then uses inputs augmented with System-as-Prompt to run inference with an LLM; the output of that inference is then mapped into a state estimate.

Strengths and Weaknesses

Quality: Good idea, seemingly done well. Good benchmarking, good results. Initially I was very skeptical of the SaP idea; however, [Remark 1] mentions they tested without it, and sure enough Table 1 shows LLM-Filter-O performing substantially worse in multiple tests.

Clarity: Very clear, easy-to-read paper. Some hyperparameter details were only mentioned in the appendix (like window size, prompt length, etc.), which could confuse some people.

Significance: Seems to be reasonably significant; to my knowledge they're the first to repurpose LLMs for generalist filtering. They outperform specialist filters. Some limitations like all experiments pressuring observation and state dimensionality or the latency of the filter remain for future work.

Originality: Novel system integration; SaP seems novel as well (albeit mostly just clever prompt engineering).

Questions

1. Did you do any experiments playing with the levels of SaP, such as omitting task examples or varying the number of examples provided? I know you included LLM-Filter-O, but I would like to see a bit more experimentation with how SaP added to the efficacy of the filter.
2. LLM-Filter optimizes MSE but does not output filter covariance. Can the authors quantify residual error distributions for the Hopf system, and identify conditions where the filter diverges or large errors occur?
3. Could you clarify a bit more about where the latency/memory overhead is coming from in the LLM filter?

Limitations

yes

Final Justification

My review stands for the reasons outlined in my comment. I think this is a solid paper with a solid goal; there is just nothing massive or novel enough for me to raise my score.

Formatting Concerns

None

Author Response

2025-07-31

Thank you very much for your insightful question. We appreciate the opportunity to clarify and hope the following explanation addresses your concerns.

[Question 1]: Did you do any experiments playing with the levels of SaP, such as omitting task examples or varying the number of examples provided? I know you included LLM-Filter-O, but I would like to see a bit more experimentation with how SaP added to the efficacy of the filter.

Thank you for the insightful suggestion. To better understand the role of SaP, we conduct an ablation study by varying the number of task examples in the prompt, while keeping the task instruction fixed. The results are summarized in the table below.

Example Num	LLM-Filter-O	0	2	5	10
Selkov	0.6369	0.6321	0.5882	0.4061	0.4164
Oscillator	0.5753	0.5692	0.5241	0.5247	0.5099
Hopf	0.8180	0.8128	0.7382	0.5751	0.5514
Pendulum	0.9218	0.9145	0.8597	0.8348	0.8267
Lorenz96	0.9735	0.9806	0.9226	0.9149	0.9223
VL20	0.8433	0.8469	0.7970	0.7717	0.7708

We observe that adding a few task examples improves performance on every system, confirming their value as in-context cues. However, the marginal gain diminishes as the number increases. This is likely due to attention dilution, where too many examples reduce focus on the actual input segment. Moreover, with 0 examples (task instruction only) the model outperforms LLM-Filter-O (neither instruction nor examples) on four of the six systems (Selkov, Oscillator, Hopf, and Pendulum), which suggests that the task instruction alone provides a meaningful signal; on Lorenz96 and VL20 the two settings are within about 0.01 of each other.

These findings suggest that both task instruction and examples are beneficial, but the number of examples should be balanced to avoid overwhelming the attention capacity of the model.
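To make the trend in the ablation table easier to read, the following sketch computes, for each system, the example count with the lowest reported error and the relative reduction against LLM-Filter-O. The numbers are copied from the table above; the variable and function names are illustrative assumptions, and the table itself does not name the error metric.

```python
EXAMPLE_COUNTS = (0, 2, 5, 10)

# system: (LLM-Filter-O error, errors for 0 / 2 / 5 / 10 examples), lower is better
ABLATION = {
    "Selkov":     (0.6369, (0.6321, 0.5882, 0.4061, 0.4164)),
    "Oscillator": (0.5753, (0.5692, 0.5241, 0.5247, 0.5099)),
    "Hopf":       (0.8180, (0.8128, 0.7382, 0.5751, 0.5514)),
    "Pendulum":   (0.9218, (0.9145, 0.8597, 0.8348, 0.8267)),
    "Lorenz96":   (0.9735, (0.9806, 0.9226, 0.9149, 0.9223)),
    "VL20":       (0.8433, (0.8469, 0.7970, 0.7717, 0.7708)),
}


def best_example_count(errors):
    """Return the example count whose reported error is lowest."""
    return min(zip(errors, EXAMPLE_COUNTS))[1]


def relative_reduction(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline


for system, (baseline, errors) in ABLATION.items():
    best = best_example_count(errors)
    best_error = errors[EXAMPLE_COUNTS.index(best)]
    print(f"{system:10s} best count = {best:2d}, "
          f"reduction vs LLM-Filter-O = {relative_reduction(baseline, best_error):.1f}%")
```

A per-system view like this shows that the best example count is not the same everywhere, which is consistent with the diminishing-returns observation above.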

[Question 2]: LLM-Filter optimizes MSE but does not output filter covariance: Can the authors: quantify residual error distributions for the Hopf system, and identify conditions where the filter diverges or large errors occur?

Thank you for your question. As far as we know, existing learning-based filters [1-4] generally do not provide an explicit quantification of residual error distributions either. They are usually trained to directly predict outcomes (e.g., filtering decisions), but they don't come with a built-in probabilistic model that would allow for estimating uncertainty in the residual error.

One possible direction to address this limitation is to adapt uncertainty estimation techniques developed for LLMs. For example, some recent work has proposed using Laplace approximation [5] to estimate the variance of the model's output, which provides a principled way to quantify uncertainty. Incorporating such techniques into LLM-Filter could enable future extensions that include uncertainty estimates alongside state predictions.

[Question 3]: Could you clarify a bit more about where the latency/memory overhead is coming from in the LLM filter?

Thank you for the question. Below, we provide a detailed breakdown of the training time and GPU memory usage for each component of the LLM-Filter in the Selkov system. The 'Ops' column primarily consists of reshape and normalization operations. The pre-allocated memory includes storage for input data, model weights, and key-value (KV) caches, which remain mostly static during inference and training.

Component	Encoder	LLM Core	Decoder	Ops	Total
Time (ms)	1.00	79.28	0.58	2.66	73.08

Component	Encoder	LLM Core	Decoder	Pre-allocated	Total
Memory (MB)	4.19	12416.85	4.19	12647.80	25073.03

The latency and memory overhead primarily stem from the LLM core, while the remaining components and operations add only minor costs. This is consistent with the expected resource demands of LLMs during filtering tasks.
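As a quick arithmetic illustration of the memory row, the sketch below sums the reported components and expresses each as a share of the total. The figures are copied from the memory table above; the dictionary and variable names are assumptions for illustration.

```python
import math

# GPU memory per component in MB, copied from the memory table above.
memory_mb = {
    "Encoder": 4.19,
    "LLM Core": 12416.85,
    "Decoder": 4.19,
    "Pre-allocated": 12647.80,
}
reported_total_mb = 25073.03

total_mb = sum(memory_mb.values())
# The component figures add up to the reported total (up to rounding).
print(math.isclose(total_mb, reported_total_mb, abs_tol=0.01))

# Share of the total memory attributed to each component, in percent.
shares = {name: 100.0 * mb / total_mb for name, mb in memory_mb.items()}
for name, share in shares.items():
    print(f"{name:14s} {share:6.2f}%")
```

On these figures, the encoder and decoder together account for well under one percent of the memory, with the remainder split between the LLM core and the pre-allocated storage.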

We hope this clarifies where the latency and memory overheads arise.

Reference

[1] Revach, G., et al. KalmanNet: Neural network aided Kalman filtering for partially known dynamics. IEEE Transactions on Signal Processing, 70, 1532-1547.

[4] Buchanan, R., et al. Learning inertial odometry for dynamic legged robot state estimation. In Conference on robot learning (pp. 1575-1584). PMLR.

[5] Yang, A. X., Robeyns, M., Wang, X., & Aitchison, L. Bayesian Low-rank Adaptation for Large Language Models. In The Twelfth International Conference on Learning Representations.

Comment

2025-08-06

Thank you for addressing my concerns. I don't have many remaining concerns that I think could be illuminated through discussion. I think there is great value in demonstrating a new paradigm for filtering -- even if there is not great technical innovation. However, since there is no massive technical innovation, I am hesitant to change my review. I will be staying with my original evaluation.

2025-08-06

Final Decision: Accept (poster)

2025-09-17

This paper introduces LLM-Filter, a framework that leverages large language models for state estimation in dynamical systems through modality alignment and a System-as-Prompt (SaP) mechanism. The approach is presented clearly and evaluated rigorously across diverse benchmarks, showing consistent gains over classical filters and learning-based baselines. Results highlight strong cross-system generalization, robustness under varied conditions, and a scaling effect with larger LLMs. These findings position the work as a compelling demonstration of pretrained LLMs as a technique for filtering.

The main concerns from reviewers were limited methodological novelty—since ObsEmbedding/StateProjection modules are standard from previous time series work and SaP could be viewed as prompt engineering—and the reliance on supervised training with ground-truth states. Theoretical justification for why LLM latent spaces align well with continuous states also remains underexplored. Nonetheless, reviewers agreed that the contribution lies in establishing a new paradigm rather than in architectural novelty. It is a borderline paper but given the breadth of experiments, clarity, and potential to shape future research directions, the paper perhaps offers a meaningful contribution. I recommend acceptance.