PaperHub
Overall score: 6.8/10
Poster · 4 reviewers
Ratings: 4, 5, 4, 4 (min 4, max 5, std 0.4)
Confidence: 3.8
Novelty: 2.8 · Quality: 3.0 · Clarity: 3.5 · Significance: 2.8
NeurIPS 2025

Training-free Online Video Step Grounding

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Keywords
Video step grounding, training-free, large multimodal models

Reviews and Discussion

Review (Rating: 4)

This paper tackles the problem of Video Step Grounding (VSG)—identifying which instructional steps are being performed in a video. The key novelties of the paper are: 1) using an LMM for step recognition, which can operate online, and 2) using a Bayesian framework to incorporate past predictions into the decision making. The paper shows that an LMM alone achieves SoTA results on this problem, and the Bayesian framework further improves the performance. The experiments show performance gains on three public benchmarks: HT-Step, CrossTask, and Ego4D Goal-Step.

Strengths and Weaknesses

Strengths:

  • A very timely approach: showing that one can use current SoTA LMMs to solve classical computer vision problems zero-shot, without task-specific training, is an important realization for the community.
  • I like the Bayesian approach to incorporating previous predictions into the current decision making, as the problem is indeed sequential.
  • Using an LLM to elicit the transition probabilities is a really good idea, especially in the cooking domain, where the LLM has seen so many text recipes.

Weaknesses:

  • While the method is advertised as online, it is important to show that it is indeed practical to run it in an online fashion (i.e., near real-time), but a runtime analysis is missing from the paper. If the method cannot run in real time, the online assumption is not very useful. If that is so, one could instead estimate the probabilities of steps for the whole video sequence, and then run classical offline algorithms (which could also be Bayesian). As it stands, the pitch about the method being online isn't helping the paper.
  • Is it really reasonable to ask an LLM for a probability score? It's unlikely the LLMs "know" the actual probability, making this a threat to validity.

Questions

  • It looks like step validity doesn't add much value on top of step readiness. Would it make sense to use only one of them, to have a simpler method?
  • I wonder if the authors have thought of estimating transition probabilities differently. Instead of asking an LLM for the probability of co-occurrence of two steps, an interesting approach would be to sample task step sequences from an LLM and calculate the probability empirically.

Limitations

Nothing beyond what the authors already mention in the paper.

Justification for Final Rating

I thank the authors for providing their response. It answers all of my questions. I recommend the paper be accepted.

Formatting Concerns

no concerns

Author Response

We thank reviewer pBnT for recognizing our timely use of zero-shot LMMs to solve classical computer vision tasks. We are glad the Bayesian integration of past predictions into current decisions is well received. We also appreciate the positive feedback on using an LLM to estimate transition probabilities.

1. Runtime analysis:

Section C "Inference Speed Analysis" in our Appendix reports the average processing time per video segment for computing VSG scores, estimating step progress, and video decoding/preprocessing on the HT-Step dataset. Videos are grouped by the number of steps in their tasks, and the times were measured on a single H100 GPU with 64GB of memory. For tasks with up to 24 steps, which exceeds the average number of steps in any of the datasets we used, the total inference time remains under 2 seconds per segment. This suggests that BaGLM can process 2-second video segments in real-time. Importantly, the time needed to estimate step progress can be reduced linearly by either batching multiple inferences on a single GPU (if there is sufficient memory) or by distributing them across multiple GPUs.

2. Validity of LLM-estimated probabilities:

We would like to clarify a possible misunderstanding: we do not ask the LLM/LMM for a direct probability score. Instead, we prompt it to generate an answer by choosing from a multiple-choice list. We then take the logits output by the LLM for the first token generated after the prompt, convert these logits into probabilities over the vocabulary using a softmax, and select the probability corresponding to the first token of each choice. For Video Step Grounding, the choices correspond to the full set of steps of the task associated with the video (see prompt $\pi_{\mathrm{VSG}}$ in the Appendix). For step progress estimation, the choices are a list of discrete progress values (see prompt $\pi_{\mathrm{prog}}$ in the Appendix). For step dependencies, the choices are Yes or No (see prompt $\pi_{\mathrm{prereq}}$ in the Appendix).
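
For concreteness, a minimal sketch of this scoring scheme on a text-only causal LM (the model, prompt, and helper names are illustrative assumptions; the paper applies the same idea to a multimodal model with video input):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical text-only model; the paper uses InternVL2.5-8B with its own prompts.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def choice_probabilities(prompt: str, choices: list[str]) -> list[float]:
    """Score each multiple-choice option by the softmax probability of its first token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # (1, seq_len, vocab_size)
    probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the first generated token

    scores = []
    for choice in choices:
        first_id = tokenizer(choice, add_special_tokens=False).input_ids[0]
        scores.append(probs[first_id].item())
    total = sum(scores)                            # renormalize over the choice set
    return [s / total for s in scores]

# e.g., for the step-dependency prompt the choices are simply ["Yes", "No"].
```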

3. Step validity:

While step validity alone brings only a small improvement (i.e., +0.7% on Ego4D Goal-Step, as shown in Table 2 of the main manuscript), the best performance is achieved by combining it with step readiness. This combination leads to gains of 1.9%, 2.5%, and 1.2% across datasets, representing additional improvements of +0.3%, +0.6%, and +0.6% over using step readiness alone. Importantly, step validity adds negligible computational overhead compared to the variant that does not use it, and proves especially useful in scenarios where certain steps should not be repeated once completed. For example, without step validity, a model might repeatedly predict "turn on the stove" after it has already been activated. By enforcing validity, the model can avoid such errors, producing more reliable predictions.

4. Alternative transition probability:

Sampling step sequences from an LLM and estimating transition probabilities empirically can reveal how often steps appear together, but our goal is different: we aim to model causal dependencies, not just co-occurrence. In procedural tasks, it is important to know what must happen before something else. Rather than asking the LLM which steps co-occur, we ask whether one step is a prerequisite of another. This allows us to build a dependency matrix that captures step ordering constraints, which we then use to define the transition model (L197-205).

To see why this matters, consider the steps: (1) attach the tabletop, (2) attach the table legs, and (3) tighten the screws. A sampling-based approach might generate plausible sequences, for example:

  • attach the tabletop → tighten the screws → attach the table legs → tighten the screws
  • attach the table legs → tighten the screws → attach the tabletop → tighten the screws

From these samples, a co-occurrence-based method might incorrectly infer that attach the tabletop and attach the table legs depend on tighten the screws, simply because they appear after it in the sequence. In reality, both are prerequisites for tighten the screws, which should only happen after either the tabletop or the legs are attached. Only a method that explicitly models dependencies, not just co-occurrence, can capture this causal structure.
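
To make the distinction concrete, a minimal sketch of how pairwise prerequisite queries could populate a dependency matrix and feed a discrete Bayes filter (the prompt wording, the `ask_yes_no` helper, and the filter signature are simplified assumptions, not the paper's exact formulation):

```python
import numpy as np

def build_dependency_matrix(steps, ask_yes_no):
    """D[i, j]: probability that step j is a prerequisite of step i,
    obtained from one Yes/No query per ordered pair of steps."""
    n = len(steps)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            prompt = (f"Is the step '{steps[j]}' a prerequisite of the step "
                      f"'{steps[i]}'? Answer Yes or No.")
            D[i, j] = ask_yes_no(prompt)   # probability of "Yes"
    return D

def bayes_filter_step(belief, likelihood, transition):
    """One predict/update cycle of a discrete Bayes filter over steps.
    belief:     (n,) posterior from the previous video segment
    likelihood: (n,) per-step scores from the LMM for the current segment
    transition: (n, n) matrix, derived in the paper from D (L197-205)."""
    predicted = transition @ belief     # prediction stage
    posterior = likelihood * predicted  # update stage
    return posterior / posterior.sum()
```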

Comment

Thank you again for taking the time to review our paper. We would like to know if our rebuttal has addressed your concerns. If any part of our response remains unclear, please feel free to let us know. We would be happy to provide further clarification.

Review (Rating: 5)

The paper proposes an online, training-free approach to step grounding in procedural videos by leveraging the inherent procedural and task knowledge of LLMs, combined with the visual understanding capabilities of LMMs. The authors first demonstrate that LMMs like InternVL2.5-8B can outperform state-of-the-art VSG methods out of the box. This motivates further improvement through Bayesian filtering and the incorporation of language and progress priors, such as step transition probabilities, step readiness, and step validity. The resulting method, BAGLM, is fully online and training-free, and outperforms existing state-of-the-art offline VSG methods across three datasets.

Strengths and Weaknesses

Strengths:

S1. The paper is very well-written, and the motivations are clearly articulated.
S2. The method is simple and elegant, demonstrating that VSG can be effectively addressed using LMMs. This opens up a promising, training-free, and resource-efficient alternative to existing SOTA approaches.
S3. The experimental results are strong, and the ablations are thorough. The accompanying discussion is particularly well-explained and contributes meaningfully to understanding the factors behind performance gains.

Weaknesses: The paper does not have any major flaws, but there are a few limitations related to the online setup that prevent me from rating it higher.

W1. In L32, the authors mention “operating on a streaming video.” While the method does function in an online manner, there may be practical latency concerns, as video is processed in fixed 2-second chunks. Streaming VideoLLMs have recently gained attention, and I would like to point the authors to models such as Dispider [a] and LiveCC [b], which operate more efficiently in streaming, real-time settings. Replacing the offline LMMs with a truly streaming VideoLLM and demonstrating performance in a low-latency, real-time setup would significantly strengthen the paper and likely raise my rating to a strong accept.

W2. Another aspect missing from the paper is the analysis of step prediction performance across different positions in the sequence. Specifically, in a video with steps $[s_1, s_2, \ldots, s_n]$, one would expect early steps (e.g., $s_1$) to be more difficult to predict due to limited context, while later steps (e.g., $s_n$) may benefit from accumulated temporal evidence. It would be helpful to understand whether this trend holds in practice. Additionally, if early predictions are incorrect, how does the error propagate or accumulate over the rest of the video?

[a] R. Qian et al., Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction, CVPR 2025

[b] J. Chen et al., LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale, CVPR 2025

Questions

I expect a response to the following questions in the rebuttal:

  • Q1. Response to W2.

  • Q2. While using LMMs for both offline and online VSG, all the steps are provided as multiple-choice options in the prompt. For datasets like HT-Step and Ego4D Goal-Step, which contain over 500 steps in the form of sentence descriptions, doesn’t the performance degrade due to long-context limitations of the LMM? I could not find the actual LMM prompts with the full list of steps in the main paper or supplementary. Is there any filtering process used to reduce the number of tokens required to represent each step? Otherwise, it would be helpful to include an analysis showing how performance varies as the number of candidate steps increases.

  • Q3. A small clarification: is the ground-truth start and end information used at all in the original BAGLM setup in Table 1, or is it only used in the Oracle setting reported in Table 4?

  • Q4. This is low priority, but I would appreciate it if the authors could consider W1 and benchmark against any existing streaming VideoLLMs.

Limitations

yes

Justification for Final Rating

I thank the authors for their detailed responses to my queries. I am satisfied with the rebuttal and am maintaining my rating of Accept.

I would recommend that the authors include the ablations concerning Q1 and Q2 in the revised supplementary material.

Formatting Concerns

I did not notice any major formatting issues.

Author Response

We appreciate reviewer WbTE's positive feedback on the clarity and motivation of our paper. We are glad that the simplicity and elegance of our method, leveraging LMMs for Video Step Grounding, is recognized as a promising, training-free, and resource-efficient alternative to existing training-based approaches. We also thank the reviewer for highlighting the strength of our experimental results, thorough ablations, and the well-explained discussion.

1.1 Step prediction performance by temporal position:

We thank the reviewer for this insightful suggestion. To analyze performance across step positions, we compute R@1 on HT-Step by grouping steps based on their normalized position within each video. For every step, we normalize its occurrence to the $[0, 1]$ range. We then divide these positions into five bins and compute R@1 for each.
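
For clarity, a minimal sketch of this binning procedure (array names and the per-step correctness criterion are illustrative assumptions):

```python
import numpy as np

def r_at_1_by_position(norm_positions, hits, n_bins=5):
    """norm_positions: step positions normalized to [0, 1] within each video;
    hits: 1 if the top-1 prediction for that step is correct, else 0."""
    norm_positions = np.asarray(norm_positions)
    hits = np.asarray(hits, dtype=float)
    bins = np.minimum((norm_positions * n_bins).astype(int), n_bins - 1)
    return [hits[bins == b].mean() if np.any(bins == b) else float("nan")
            for b in range(n_bins)]
```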

The table below shows R@1 performance across these bins, from early (bin 0) to late (bin 4) steps. The off-the-shelf model (InternVL2.5-8b), which does not use past information, maintains relatively stable performance across bins. It performs best on early steps (bin 0) and worst on the final ones (bin 4), suggesting that initial steps are more semantically distinctive, while later ones are harder to recognize. In contrast, BaGLM shows an upward trend from early to mid-range bins, likely due to its use of accumulated temporal context. For example, its R@1 increases by +1.0% from bin 0 to bin 1, +1.7% from bin 1 to bin 2, and +0.7% from bin 2 to bin 3. The only exception is the final bin, likely due to limitations inherited from the base model (i.e., InternVL2.5-8b performs worst in the final part of the video) as well as error propagation.

Model            Bin 0   Bin 1   Bin 2   Bin 3   Bin 4
InternVL2.5-8b   55.1%   50.7%   53.8%   50.7%   49.0%
BaGLM            55.3%   56.3%   58.0%   58.7%   54.5%

1.2 Error propagation:

We analyze error propagation by measuring R@1 performance on HT-Step across five temporal bins for BaGLM. To gain insights into the source of mistakes, we show the results of BaGLM variants augmented with oracle signals. BaGLM with Oracle Progress (but predicted dependencies) outperforms BaGLM across all bins, highlighting the importance of an accurate progress estimate for stable performance. Instead, BaGLM with Oracle Dependency Matrix (but predicted progress) does not show the same upward trend as BaGLM. This suggests that when the progress contains errors, the oracle dependency matrix amplifies those, ruling out valid steps based on flawed readiness/validity. When both oracle signals are provided, performance improves across the mid-range bins (bins 1, 2, and 3), showing that accurate progress and dependencies together help the model select the correct step. The two exceptions are the initial and final steps, where performance drops likely due to noise in the estimation of the belief in the initial states for the former, and the limits of the base model for the latter.

Model                         Bin 0   Bin 1   Bin 2   Bin 3   Bin 4
BaGLM                         55.3%   56.3%   58.0%   58.7%   54.5%
BaGLM w/ Oracle Progress      58.6%   60.0%   59.3%   58.7%   57.8%
BaGLM w/ Oracle Dep. Matrix   56.1%   54.3%   59.0%   54.4%   49.0%
BaGLM w/ Oracle               56.3%   65.2%   66.0%   71.6%   53.7%

2.1 Step preprocessing:

For HT-Step and CrossTask, we follow the standard evaluation protocol [4,14] by providing, for each video, the full set of steps associated with its corresponding task as multiple-choice options in the prompt. On average, each task includes about 10 steps in HT-Step and 7.5 in CrossTask. As reviewer WbTE correctly noted, Ego4D Goal-Step includes a larger number of annotated segments, which are divided into steps and substeps. To reduce the number of candidate steps in our approach, we only use the step-level descriptions (i.e., no substeps). Additionally, many of these step descriptions differ only in minor grammatical ways (e.g., capitalization). To address this, we apply a simple preprocessing step using the spaCy Python library: we lowercase the text, lemmatize it (while preserving plural nouns and verbal adjectives), and normalize whitespace and punctuation. After this preprocessing, the average number of steps per video is reduced to 17.
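
For reference, a minimal sketch of this normalization with spaCy (the rule for preserving plural nouns and verbal adjectives is approximated here via POS tags; the exact pipeline may differ):

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def normalize_step(text: str) -> str:
    """Lowercase, lemmatize (keeping plural nouns and participles), and
    normalize whitespace/punctuation of a step description."""
    doc = nlp(text.lower())
    tokens = []
    for tok in doc:
        # Approximation: keep plural nouns (NNS/NNPS) and participles used
        # adjectivally (VBG/VBN) as-is, lemmatize everything else.
        if tok.tag_ in {"NNS", "NNPS", "VBG", "VBN"}:
            tokens.append(tok.text)
        else:
            tokens.append(tok.lemma_)
    out = " ".join(tokens)
    out = re.sub(r"\s+([.,;:!?])", r"\1", out)  # no space before punctuation
    return re.sub(r"\s+", " ", out).strip()
```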

2.2 Analysis on the number of candidate steps:

We analyze how performance changes with the number of candidate steps in the HT-Step dataset by grouping videos based on the number of steps in their respective tasks. For each group, we compute the average R@1 using the off-the-shelf InternVL2.5-8B model. We summarize the results in the table below. The values in the table depict a non-monotonic trend. Except for tasks with very few steps (e.g., fewer than 5), there is no clear decline in performance as the number of steps increases. This suggests that InternVL2.5-8B is capable of handling prompts with a large number of candidate steps. Note that for some bins, the number of matching videos is small, which can lead to large fluctuations in the reported measurements (e.g., from 24.1% to 48.1% for 19 to 20 steps).

Candidate Steps   3     4      5      6      7      8      9      10     11     12     13     14
Mean R@1 (%)      100   72.2   42.1   52.9   54.7   44.4   49.0   49.4   62.4   50.9   59.1   45.0

Candidate Steps   15     16     17     18     19     20     22     23     25     26     28
Mean R@1 (%)      53.9   65.7   50.5   50.0   24.1   48.1   44.4   70.6   43.8   53.8   42.9

3. Ground-truth start and end information:

The ground-truth start and end information are only used in the oracle analysis reported in Table 4. All other experiments are performed without access to any ground-truth timestamps.

4. Streaming VideoLLMs:

We thank the reviewer for the suggestion! We benchmarked Dispider [a3] and LiveCC [b3] using the authors' official implementations on the HT-Step dataset, following the temporal video grounding prompt from ETBench [c3] (also used in Dispider's evaluation). The prompt is:

You are given a video about daily activities. Watch the video carefully and find a visual event described by the sentence: '{step}'. The format of your response should be: 'The event happens in <start time> - <end time>'. You must represent start and end times in seconds. For example: 'The event happens in 10.2 - 12.8 seconds'.

Unfortunately, both models either failed to produce outputs or simply repeated the example from the prompt. We believe it is worthwhile to further investigate how these models can be adapted to online video step grounding. However, this was not feasible within the limited time of the rebuttal period.

It is worth noting that, as we report in Appendix Section C (Inference Speed Analysis), our method achieves low latency even with a traditional LMM. On HT-Step, the total processing time per 2-second segment, including video decoding, VSG scoring, and progress estimation, remains under 2 seconds for tasks with up to 24 steps (more than the average step count of any dataset we evaluate on), using a single H100 GPU with 64GB of memory. This suggests that BaGLM can operate in real time. We can further reduce latency by parallelizing step progress estimation, which requires as many independent forward passes as the number of steps in a task, either by batching multiple inferences on a single GPU (if there is sufficient memory) or by distributing them across multiple GPUs. In the optimal scenario, latency can be reduced approximately linearly with the number of parallel executions.

References:

[a3] Qian, Rui, et al. "Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction." CVPR 2025.

[b3] Chen, Joya, et al. "Livecc: Learning video llm with streaming speech transcription at scale." CVPR 2025.

[c3] Liu, Ye, et al. "Et bench: Towards open-ended event-level video-language understanding." NeurIPS 2024.

Comment

I thank the authors for their detailed responses to my questions. I have no further comments.

Comment

We thank the reviewer for the feedback and are pleased that our responses addressed your questions.

Review (Rating: 4)

The paper addresses the task of Video Step Grounding (VSG), which aims to detect specific procedural steps described in text within corresponding video frames. Traditionally, this task relies heavily on offline models trained on annotated datasets, limiting real-time applicability and generalizability. To overcome these limitations, the authors propose Bayesian Grounding with Large Multimodal Models (BAGLM), a novel online and training-free approach. BAGLM leverages the zero-shot capabilities of recent large multimodal models (LMMs), particularly InternVL2.5-8B, and integrates temporal dependency modeling through Bayesian filtering. The dependencies between steps are inferred from a large language model (LLM), refining the predictions of LMMs by incorporating knowledge of past video frames. Experiments conducted on three benchmarks (HT-Step, CrossTask, and Ego4D Goal-Step) demonstrate that BAGLM significantly outperforms state-of-the-art training-based offline methods.

Strengths and Weaknesses

Strengths:

  • The paper presents the first study to address Video Step Grounding in an online, training-free setting, which reduces data annotation and training overhead and better aligns with real-world, real-time application scenarios.

  • Authors presented BAGLM, the first training-free, online approach that incorporates priors from past video frames into LMMs through Bayesian filtering, modeling the temporal dependencies across steps via LLM queries.

  • Extensive experiments show compelling results, where the proposed BAGLM surpasses existing state-of-the-art methods, including those specifically trained or fine-tuned on the respective datasets. The authors also provide detailed analyses to validate design choices, such as the influence of different LMMs/LLMs and the transition model, demonstrating their impact on model performance.

Weaknesses:

  • The authors show that zero-shot LMMs achieve better performance than previous offline training-based methods. I wonder if this is because the LMMs, e.g., InternVL2.5-8B, have seen the instructional videos of those evaluation benchmarks during pre-training. The authors should provide a more detailed analysis of why such zero-shot LMMs achieve even better performance than training-based methods.

  • BAGLM’s performance strongly relies on the quality of the dependency matrix generated by the LLM. Although the authors tested two LLMs (GPT-4.1-mini and LLaMA-3.3-70B-Instruct), the robustness of this dependency inference across other domains and tasks remains unclear.

Questions

Please see my questions in the weakness points.

Limitations

The authors didn't discuss the limitations or potential negative societal impact of their work

Justification for Final Rating

My questions have been well addressed by the authors. This has been reflected in my current rating.

Formatting Concerns

No

Author Response

We thank reviewer JSj1 for recognizing our work as the first training-free, online approach for Video Step Grounding, which reduces annotation and training costs and better aligns with real-world, real-time applications. We also appreciate the acknowledgment of our extensive experiments showing that BaGLM outperforms state-of-the-art methods, and the analysis validating our key design choices.

1. Zero-shot LMMs vs. training-based methods:

The strong performance of zero-shot LMMs is not due to data contamination. As shown in Table 4 of the InternVL 2.5 report [6], HT-Step, CrossTask, and Ego4D Goal-Step are not included in its pre-training data mixture.

Training-based methods like NaSVA predict alignment score matrices, used to localize each step, by computing similarities between visual features of video segments and textual features of steps. These alignments are typically obtained using discriminatively pre-trained vision-language models such as CLIP. However, CLIP struggles with fine-grained alignment [a2,b2,c2,d2], particularly in instructional videos where different steps often involve similar objects or scenes. For example, the steps "tighten the screw" and "loosen the screw" involve the same objects in nearly identical visual settings. What distinguishes them is the direction of motion and the intent, subtle cues that CLIP-like models often miss due to their limited temporal modeling and contextual reasoning (L77-88).

In contrast, LMMs have stronger alignment capabilities thanks to the reasoning power of their underlying LLMs, which are trained on massive text corpora with complex question-answer tasks. Recent work [16] shows that computing the generative likelihood of answers to simple questions leads to more reliable alignment scores than traditional similarity metrics like CLIPScore, even outperforming models fine-tuned with human feedback. This explains why zero-shot LMMs can outperform training-based methods, even without having seen the evaluation datasets during pre-training.

2. Dependency matrix robustness:

Following reviewer JSj1's suggestion, we evaluate the robustness of the LLM-generated dependency matrices on the HT-Step dataset, which includes several (120) cooking-related tasks, each with 5 videos (600 videos total). For each task, we have a soft dependency matrix generated using either GPT-4.1-mini or LLaMA-3.3-70B-Instruct. To evaluate robustness, we measure the ratio of violated dependencies. We first threshold the soft matrices to obtain binary (hard) dependencies. A dependency $\mathbf{D}_{i,j}$, indicating that step $a_j$ is a prerequisite for step $a_i$, is considered violated if $a_i$ appears before the first occurrence of $a_j$ in any of the videos for that task (i.e., the prerequisite should occur before the dependent step whenever both are present).
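
As an illustration, a minimal sketch of this violation check (the data layout, variable names, and thresholding below are assumptions for exposition, not the evaluation code itself):

```python
import numpy as np

def violation_ratio(D_soft, video_step_orders, threshold=0.5):
    """D_soft[i, j]: soft probability that step j is a prerequisite of step i.
    video_step_orders: one list per video with step indices in order of first
    appearance. A dependency is violated if the dependent step i shows up
    before its prerequisite j in any video where both occur."""
    D = D_soft >= threshold
    n = D.shape[0]
    violated, total = 0, 0
    for i in range(n):
        for j in range(n):
            if i == j or not D[i, j]:
                continue
            total += 1
            for order in video_step_orders:
                if i in order and j in order and order.index(i) < order.index(j):
                    violated += 1
                    break
    return violated / max(total, 1)
```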

The tables below report the percentage of violated dependencies and the percentage of tasks with at least one violation for both LLaMA-3.3-70B-Instruct and GPT-4.1-mini. Both models perform reasonably well across different tasks in this domain, with roughly 10% of dependencies being violated, depending on the threshold value. In particular, LLaMA-3.3-70B-Instruct shows a flat violation ratio (~8%) across a wide range of thresholds, suggesting that its outputs are mostly binary (i.e., close to 0 or 1). In contrast, GPT-4.1-mini shows a smoother decline in violation rate as the threshold increases, indicating a more graded confidence distribution. Even at low thresholds, a portion of tasks (28% for LLaMA-3.3 and 26% for GPT-4.1-mini at threshold 0.1) show no violated dependencies, suggesting that the predicted matrices can approximate the oracle matrix.

LLaMA-3.3

Threshold                 0.00    0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00
Dependency violated (%)   26.1    8.6    8.4    8.4    8.4    8.4    8.4    8.3    8.3    8.1    8.1
Tasks w/ violation (%)    100.0   72.0   71.3   71.3   69.3   69.3   69.3   68.7   68.7   68.7   68.7

GPT-4.1-mini

Threshold                 0.00    0.10   0.20   0.30   0.40   0.50   0.60   0.70   0.80   0.90   1.00
Dependency violated (%)   26.1    15.4   12.5   11.6   10.8   10.2   9.6    8.9    8.2    7.6    0.0
Tasks w/ violation (%)    100.0   74.0   65.3   64.0   59.3   54.0   51.3   48.0   44.7   42.0   0.0

References

[a2] Bagad, Piyush, Makarand Tapaswi, and Cees GM Snoek. "Test of time: Instilling video-language models with a sense of time." CVPR 2023.

[b2] Momeni, Liliane, et al. "Verbs in action: Improving verb understanding in video-language models." ICCV 2023.

[c2] Wang, Zhenhailong, et al. "Paxion: Patching action knowledge in video-language foundation models." NeurIPS 2023.

[d2] Yuksekgonul, Mert, et al. "When and why vision-language models behave like bags-of-words, and what to do about it?." ICLR 2023.

Comment

Thank you again for taking the time to review our paper. We would like to know if our rebuttal has addressed your concerns. If any part of our response remains unclear, please feel free to let us know. We would be happy to provide further clarification.

Review (Rating: 4)

This paper introduces the first training-free online video step grounding method, powered by large multimodal models (LMMs). The authors start by demonstrating the capability of vanilla LMMs on the task of video step grounding, where they show fairly competitive results. To further improve the performance, Bayesian filtering is applied with two stages, i.e., prediction and update, which delicately model the inter-dependency among steps. Through the extensive experiments on three benchmarks, the effectiveness of the proposed method is validated.

Strengths and Weaknesses

Strengths

  • This paper showcases the successful exploitation of large multimodal models within the field of temporal step grounding.
  • The idea of adopting Bayesian filtering is interesting, and it is seamlessly integrated within the framework.
  • The proposed method largely surpasses the existing state of the art, despite the more challenging online setting.

Weaknesses

  • Despite the effectiveness of the Bayesian filtering method, its practicality is quite limited, considering the performance degradation on Ego4D. Although the authors demonstrate that the performance could be further improved under the Oracle situations (Table 4), it does not seem to be the case for a real-world scenario. Obtaining more accurate transition matrices of the target dataset would lessen the advantage of the training-free setting.
  • The baselines under the online setting are missing. I would encourage the authors to build simple baselines or extend existing approaches to be capable of handling online scenarios. This would make the superiority of the proposed method clear.

Questions

Please refer to weaknesses.

Limitations

Yes.

Justification for Final Rating

This paper introduces a method for incorporating LLMs into a streamlined pipeline for online video step grounding.

Despite its strengths, I find some major concerns that need to be handled, including the practicality and the selection of online baselines.

Through the rebuttal, the authors addressed most of my concerns.

After all, I would vote for a borderline accept.

Formatting Concerns

No.

Author Response

We thank reviewer oKLx for acknowledging the successful exploitation of large multimodal models for temporal step grounding, the interesting and seamless use of Bayesian filtering within the framework, and that our method largely surpasses state-of-the-art methods despite the more challenging online setting.

1.1 Applicability in real-world scenarios:

Our method performs consistently well on CrossTask and HT-Step, which are as representative of real-world scenarios as Ego4D Goal-Step. This suggests that the method remains practical despite the lower performance on Ego4D. Each dataset used in our evaluation has its own challenges. CrossTask includes over 2,300 videos spanning 18 diverse everyday tasks beyond cooking, such as household and craft activities. These videos are shorter, averaging under 5 minutes, with concise action segments (~10.7 seconds). HT-Step focuses on cooking-related tasks with medium-length videos (~6 minutes) and an average segment duration of 16 seconds. Ego4D Goal-Step videos are significantly longer, with an average duration of about 28 minutes, step annotations of about 53 seconds per segment, and a focus on the cooking domain. These longer segments result in more generic step descriptions, which in turn make it more difficult to infer accurate dependencies between steps.

1.2 Availability of step transitions in practice:

The Oracle analysis (Table 4) shows the potential performance gain when step dependencies are derived from the actual execution order. In some real-world scenarios, such transition matrices may already be available. For example, in industrial manufacturing, processes often follow standardized, predefined protocols. In these settings, step transition data can be reliably obtained (e.g., from existing documentation), making the Oracle setup not just a theoretical upper bound, but a realistic scenario. Our method is flexible and works in both cases, whether the dependency matrix is inferred or provided as ground truth.

2. Online baselines:

We thank the reviewer for the constructive feedback. Following the reviewer's suggestions, we developed an online version of NaSVA [14], our best-performing offline competitor (see Table 1), which is based on the Transformer architecture. In this model, visual features extracted from each 1-second clip are encoded and treated as key-value pairs, while the textual features of steps serve as queries. The queries then attend to the visual features through cross-attention, and alignment scores are computed based on their similarity. The original NaSVA is offline because its transformer encoder allows each video segment to attend to both past and future segments. To enable online processing, we introduce causal masking within the transformer encoder’s self-attention layers, restricting attention to only past segments and preventing access to future information. The online version of NaSVA achieves an R@1 of 46.1 on HT-Step and 24.2 on Ego4D Goal-Step, representing a drop of 7% and 4.9% compared to its offline variant. These scores are also substantially lower than BaGLM (i.e., -10.2% on HT-Step and -18.6% on Ego4D Goal-Step). These results further highlight the effectiveness of our approach in the challenging online setting.
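
For concreteness, a minimal sketch of the causal-masking modification described above, using a standard PyTorch transformer encoder (this illustrates the idea only; the actual NaSVA architecture and feature dimensions are assumptions here):

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6  # illustrative sizes
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
    num_layers=n_layers,
)

def encode_online(segment_feats: torch.Tensor) -> torch.Tensor:
    """segment_feats: (batch, n_segments, d_model) visual features.
    The boolean causal mask blocks attention to future segments, turning the
    offline encoder into an online (causal) one."""
    n = segment_feats.size(1)
    causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    return encoder(segment_feats, mask=causal_mask)
```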

Comment

I appreciate the detailed response from the authors.

I have carefully read it as well as other fellows' reviews.

The authors handled my concerns, although some remain unresolved (e.g., I am still not convinced of the generalizability of the proposed method, considering the performance gains on only two out of three datasets).

After all, I would stick to my original rating.

Comment

We thank the reviewer for the feedback, and we are glad to hear that the rebuttal addressed most of the initial concerns. Regarding generalizability, the lower performance on Ego4D Goal-Step, which focuses only on cooking videos, is likely due to some challenges specific to this dataset, such as the more general step descriptions (e.g., Prepares the bread), complicating modeling step dependencies.

To further demonstrate the generalizability of our method, we evaluate it on the COIN dataset, which contains instructional YouTube videos spanning various domains. Specifically, we use the testing split, consisting of 2,797 videos. Out of these, we successfully downloaded 2,354 videos because the remaining videos were either unavailable or set to private. These videos span 12 domains (i.e., Dish, Drink and Snack, Electrical Appliance, Furniture and Decoration, Gadgets, Housework, Leisure and Performance, Nursing and Care, Pets and Fruit, Science and Craft, Sport, and Vehicle) and are categorized into 180 distinct tasks.

We compare the performance of BaGLM with three baselines: (1) InternVL2.5-8b (off-the-shelf), (2) NaSVA [14] (best offline competitor), (3) an online version of NaSVA that we developed to address W2 (Online baselines). As shown in the table below, BaGLM outperforms NaSVA by 14.7 (offline) and 19.4 (online), and achieves a gain of 4.2 over InternVL2.5-8b. These results suggest that our method generalizes well across diverse domains and tasks.

Method                 R@1 (COIN) ↑
Offline
  NaSVA [14]           36.8
Online
  NaSVA [14]           32.1
  InternVL2.5-8b [6]   47.3
  BaGLM                51.5
Final Decision

This paper presents a training-free method for step-wise grounding of the individual steps in a given procedural video. It incorporates Bayesian filtering together with LLMs to model the step-wise transitions in the procedure. Performance is excellent, especially when compared to supervised offline methods.

All four reviewers support this paper and think it would be a nice contribution to the NeurIPS program, and the AC agrees. The extensive discussion between the authors and the four reviewers highlights various areas of confusion and missing considerations in the analysis. The authors are requested to incorporate these points when revising their manuscript for the final version.