PaperHub
Overall: 6.4 / 10 · Poster · 4 reviewers
Ratings: 5, 4, 3, 4 (mean 4.0; min 3, max 5, std dev 0.7)
Confidence
Novelty: 2.8 · Quality: 2.8 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

OpenReview · PDF
Submitted: 2025-05-06 · Updated: 2025-10-29
TL;DR

Current multi-modal LLMs struggle with live, step-by-step task guidance. We built Qualcomm Interactive Cooking (a new dataset with mistake videos and timed feedback) and LiveMamba (a streaming model) to enable better real-time interactive guidance.

Abstract

Keywords
Multi-modal, Large Language Models, Vision and Language

Reviews and Discussion

Review (Rating: 5)

The paper tackles the problem of how to enable multi-modal LLMs to provide step-by-step guidance in an interactive fashion. It introduces the first benchmark for the task, LIVECOOK, which features dense, timestamped instructions and feedback as well as mistake alerts. A model, LIVEMAMBA, is then proposed to give instructions and feedback (when mistakes happen) in real time, along with an extra re-planner module for giving next-step instructions.

For evaluation, two modes are considered: real-time mode and turn-based mode. For the real-time mode, the paper considers several existing video LLMs as zero-shot baselines, and an existing online video LLM for the fine-tuned setting. For the turn-based mode, comparisons against existing video LLMs are considered. The evaluations cover three aspects: i) instruction completion accuracy, ii) mistake prediction performance, and iii) language fluency of the mistake feedback.

Strengths and Weaknesses

Strengths:

  1. The task setup as well as the benchmark is novel. As far as I know, this is the first benchmark that targets evaluating MLLMs' instruction-guidance capability in an interactive mode.

  2. LIVEMAMBA's design seems interesting to me; the “when-to-say” mechanism as well as the extra re-planner effectively enable the model to interact in real time.

Weaknesses:

  1. From an application perspective, since the model is designed for efficiency and is intended to be deployable in real time, I am wondering whether the paper benchmarks the latency aspect of real-time performance. Can such a model run successfully on memory-limited devices (memory consumption)? Some discussion around these points would be helpful.

  2. How does pretraining affect the final performance of the model? Since the model specifically targets the cooking setting rather than being a general-purpose LLM, perhaps pretraining won't help here.

  3. In Table 3, the mistake prediction results for the first 5 baseline models are all zeros in precision, recall, F1, and the language metrics. Why this happens needs more explanation; otherwise, it seems a bit suspicious to me.

  4. (More of a suggestion) The benchmark itself is still built on top of offline videos rather than a simulation environment. I think a real interactive environment, in which an MLLM guides a simulated user (or agent), is more suitable for building a benchmark, since the user/agent can change its actions rather than being fixed every time an instruction or feedback is given, which is closer to the real-world scenario. For example, some prior works in embodied navigation [1] can achieve such functionalities.

[1] CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments. Xiulong Liu et al.

Questions

See weaknesses above.

Limitations

Yes

Justification for Final Rating

After carefully reading through the rebuttal, I think the paper's merit outweighs its limitation. With enough novelty and solid experiments, I lean more towards accept. Please incorporate all the discussion points during the rebuttal into the final version of the manuscript.

Formatting Issues

No

Author Response

We thank the reviewer for recognizing the novelty of our benchmark and task setup for interactive guidance, and for finding the design of our LiveMamba model, in particular the “when-to-say” mechanism and the re-planner, interesting and effective in enabling real-time interaction.

Latency: On a Nvidia A100 GPU, our LiveMamba model (with a vision encoder, Q-former, Mamba model, and a re-planning module) has a real-time factor of 2 on average. This means that it can process input data twice as fast as it becomes available. In detail, we can process 4.1 frames per second on average and the input frame rate is 2 frames per second. A key advantage of the Mamba backbone of our LiveMamba model is its low memory requirement, which scales linearly with sequence length, compared to transformers whose memory requirements scale quadratically with sequence length. The peak GPU memory usage during inference of the LiveMamba model (with the Qwen2.5-32B-Instruct re-planner) is ~32GB to process ~500 frames. This makes it more suitable for deployment on edge devices such as phones or smart glasses with limited memory.
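For concreteness, here is a minimal sketch of how the real-time factor quoted above relates to the stated frame rates (4.1 frames/s processed vs. 2 frames/s input). The numbers come from this rebuttal; the helper function itself is purely illustrative and not part of our released code.

```python
# Illustrative only: real-time factor (RTF) from the throughput numbers quoted above.
# RTF > 1 means the model keeps up with the incoming video stream.

def real_time_factor(processed_fps: float, input_fps: float) -> float:
    return processed_fps / input_fps

if __name__ == "__main__":
    rtf = real_time_factor(processed_fps=4.1, input_fps=2.0)
    print(f"real-time factor: {rtf:.2f}")           # ~2.05
    print(f"per-frame budget: {1.0 / 2.0:.2f} s")   # 0.50 s available per input frame
    print(f"per-frame cost:   {1.0 / 4.1:.2f} s")   # ~0.24 s actually spent per frame
```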

Pre-Training: As described in L184-191 in the main paper, the pre-training phase aligns the Q-Former’s vision embeddings with the language backbone’s text embeddings. We employ a diverse set of image and video datasets for object grounding (LVIS, EPIC-KITCHENS) and fine-grained action understanding (SSv2, EPIC-KITCHENS, Ego4D); these skills are essential for the cooking domain. Without pre-training (LiveMamba (w/o-PT)), we see a significant degradation of performance (on the simple planning set):

| Model | IC-Acc↑ | Prec.↑ | Rec.↑ | F1↑ | BERT↑ | ROUGE-L↑ |
|---|---|---|---|---|---|---|
| LiveMamba (w/o-PT) | 12.9 | 0.08 | 0.11 | 0.09 | 0.423 | 0.355 |
| LiveMamba (Ours) | 19.2 | 0.12 | 0.10 | 0.11 | 0.491 | 0.460 |

We will clarify this in the final version.

Mistake Prediction Accuracy: The reason the baseline VideoLLaMA3-7B [54], Video-LLaVA [59, 33], VideoChat2 [29], Video-ChatGPT [37], and LLaVA-NeXT [35] models perform poorly on the mistake precision, recall, F1, and language metrics is that they cannot detect mistakes. The reason for this poor mistake detection performance is that our LiveCook benchmark requires these models to detect mistakes in a streaming setup, whereas these models were trained for offline question-answering tasks. More powerful models such as Qwen-2.5-VL-7B [15] can adapt to our streaming setup due to their strong instruction-following capabilities. We explain the reasons for the poor performance of the baseline models in L243-253 of the main paper, and we provide the details of the prompts used for zero-shot evaluation in Appendix G of the supplementary material. We will clarify this further in the final version.

Offline Videos: Thank you for the reference to the highly relevant work of Xiulong Liu et al. that highlights the advantages of a real interactive environment. However, there are no extant simulators that can simulate fine-grained actions which are common in cooking, e.g., cutting vegetables or adding 1 tablespoon of salt to a dish. Therefore, as described in L88-94 of the main paper, we use real-world offline videos, which allows us to create a setup akin to a “non-reactive simulation”, where we task the multi-modal LLM to produce the right instruction or feedback at the appropriate time, but the subject is non-compliant. We will discuss and clarify these points further in the final version.

Comment

Thank you for acknowledging the rebuttal.

Please let us know if the rebuttal has addressed your concerns, and whether you have any additional questions.

Comment

Yes, my concerns have been addressed, and I have already raised my rating.

Review (Rating: 4)

This paper tackles the challenge of enabling multi-modal large language models to provide live, step-by-step task guidance in interactive settings. The authors introduce LIVECOOK, a new benchmark and dataset built on top of CaptainCook4D, specifically annotated with timestamped instructions and mistake feedback in cooking scenarios. Additionally, the authors propose LIVEMAMBA, a lightweight streaming multi-modal LLM that features a "when-to-say" mechanism, mistake-aware data augmentation, and an iterative re-planning module. Through comprehensive evaluations across streaming and turn-based settings, the paper shows that LIVEMAMBA significantly outperforms existing baselines in recognizing completed instructions and detecting fine-grained user mistakes.

Strengths and Weaknesses

Strengths

  • The paper addresses an interesting and meaningful problem: enabling live, step-by-step task guidance in interactive settings.

  • The paper is well-structured, clearly written, and easy to follow.

  • The authors conduct extensive experiments, and the observed performance improvements are promising, demonstrating the potential of their proposed approach.

Weakness

  • The paper lacks details on how annotation quality was ensured, making it difficult to assess the reliability of the LIVECOOK dataset.

  • The model is only tested on cooking tasks within LIVECOOK, limiting insight into its generalizability across other domains.

  • The LIVEMAMBA model depends on a powerful external LLM (Qwen2.5-32B-Instruct) for re-planning, which makes it harder to isolate the model’s own capabilities.

Questions

  • Lack of annotation quality control. The paper introduces LIVECOOK with dense annotations of instructions and feedback, but does not provide sufficient details about the annotation protocol's consistency or quality. A deeper analysis of annotation quality would significantly strengthen the credibility of the dataset.

  • The evaluation allows a 30-second window for detecting instruction completion or mistakes, but this choice appears arbitrary and is not empirically justified. It is unclear whether 30 seconds is too lenient or too strict. Maybe an ablation study is needed.

  • Benchmark is domain-specific and may limit generalization. Although the authors acknowledge that LIVECOOK is focused on cooking, the benchmark does not include any other experiments or analysis showing how transferable the model is to other task domains. A promising direction would be to evaluate the model on other public datasets, which also involve step-by-step tasks.

If my concerns can be adequately addressed, I would consider raising my score.

Limitations

yes

Justification for Final Rating

The authors’ rebuttal has addressed my concerns, and I believe the paper’s contributions and novelty are sufficient. I will increase my score by one point.

Formatting Issues

no major formatting issue

Author Response

We thank the reviewer for recognizing the importance of the problem we are addressing through the novel LiveCook benchmark and dataset, the clarity and structure of our paper, and the potential of our proposed LiveMamba approach demonstrated through extensive and promising experiments.

Annotation Quality: All annotations are created manually, and to ensure the quality of the annotations, they are manually verified independently by another annotator. The quality of the dataset is further highlighted in the qualitative example in Figure 1 of the main paper and the additional qualitative examples in Figures 4, 5 and 6 of the supplementary material. We also include the LiveCook benchmark data files in the uploaded supplementary materials. The annotation process is detailed in Section 3 of the main paper and Appendix B of the supplementary material. We will clarify this further in the final version.

Generalization: To evaluate the ability of multi-modal LLMs to provide step-by-step guidance, an ideal test domain should be both practical and accessible. We chose cooking because it meets these criteria: its tasks are relevant to daily life, and the specific recipes in our LiveCook benchmark do not require specialized knowledge. These qualities make cooking a robust environment for our study. Moreover, our model can be trained with data from other domains based on data availability, and generalizing to other domains is an interesting direction for future work. We will clarify this in the final version.

The Powerful Re-Planning Module: We would like to begin by clarifying that (as described in L172-181 in the main paper), if a recipe step is performed out of sequence, the re-planning module invokes the Qwen2.5-32B-Instruct model to decide if the current recipe step must be repeated and also update the task plan. Thus, note that the Qwen2.5-32B-Instruct model is invoked only on the advanced planning set in Table 4 of the main paper. In the simple planning set, the re-planning module only returns the next recipe step (thus, the re-planning module ensures that the LiveMamba model does not need to memorize the recipe steps). To better highlight the capabilities of the LiveMamba model without the re-planning module, we add a LiveMamba (w/o-reP) ablation to the results in Table 4 on the advanced planning set:

| Model | IC-Acc↑ | Prec.↑ | Rec.↑ | F1↑ | BERT↑ | ROUGE-L↑ |
|---|---|---|---|---|---|---|
| Videollm-online | 9.3 | 0.09 | 0.04 | 0.05 | 0.457 | 0.431 |
| LiveMamba (w/o-reP) | 8.7 | 0.07 | 0.03 | 0.04 | 0.414 | 0.389 |
| LiveMamba (Ours) | 11.6 | 0.10 | 0.04 | 0.05 | 0.470 | 0.447 |

These results show that even without the re-planning module, our LiveMamba model is comparable to the Videollm-online baseline.

Furthermore, in the turn-based evaluation in Table 5 of the main paper, we do not use the re-planning module, as the models are evaluated on each recipe step independently. We again see that our LiveMamba model outperforms state-of-the-art multi-modal LLMs, even without the powerful re-planning module.

IC-Acc 30-second Window: Note that the 30 second window is centered at the end of each recipe step and thus includes the last 15 seconds of the current recipe step and the first 15 seconds of the next recipe step. The average length of a recipe step is 62.05 seconds. Thus the window for calculating the IC-accuracy covers on average the last 24% of the duration of a recipe step and the first 24% of the beginning of the next recipe step. In practice, this is sufficient to account for any ambiguities in the data along with any residual inaccuracies in the annotations while not sacrificing on accuracy (L214 in the main paper). We will clarify this in the final version.

Finally, note that even with this 30 second window, the results in Table 3 of the main paper highlight that the LiveCook benchmark is challenging even for current state of the art models like Qwen-2.5-VL-7B.
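To make the matching rule above concrete, here is a small sketch of the ±15 s check implied by a 30-second window centered at the annotated end of a recipe step. The function and variable names are our own illustrative choices, not the LiveCook evaluation code.

```python
# Illustrative check for the 30 s instruction-completion window described above:
# a predicted completion counts if it falls within ±15 s of the annotated step end.

def within_ic_window(pred_time_s: float, step_end_s: float, half_window_s: float = 15.0) -> bool:
    return abs(pred_time_s - step_end_s) <= half_window_s

# Example: a recipe step annotated to end at t = 120 s.
assert within_ic_window(112.0, 120.0)       # 8 s early  -> counted as correct
assert not within_ic_window(140.0, 120.0)   # 20 s late  -> counted as a miss
```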

Other Public Datasets: The proposed LiveCook dataset is the only dataset that includes step-by-step task guidance along with a diverse range of timed feedback covering mistakes, as highlighted in Table 1 of the main paper. Therefore, we restrict our evaluation to our proposed LiveCook dataset.

Comment

Thank you for your response. My concerns have been mostly addressed.

Review (Rating: 3)

This paper focuses on the challenge of enabling multi-modal large language models (LLMs) to provide real-time, interactive step-by-step task guidance. The authors point out that current multi-modal LLMs excel in conversational abilities but struggle to offer live, interactive step-by-step guidance, which is a critical capability for future AI assistants. Effective guidance requires not only delivering instructions but also detecting their successful execution and promptly identifying and alerting users to mistakes—all in real time. To address this challenge, the authors introduce LIVECOOK, a new benchmark and dataset built upon the CaptainCook4D dataset, which includes user errors during task execution. LIVECOOK features densely annotated, timestamped instructions and feedback messages, particularly mistake alerts precisely aligned with their visual occurrences in the video.

Strengths and Weaknesses

Strengths

  1. Novelty of the benchmark dataset: The LIVECOOK benchmark addresses the limitations of existing datasets by incorporating user errors and providing densely annotated, timestamped instructions and feedback messages. This enables more realistic evaluation of multi-modal LLMs' ability to provide step-by-step task guidance.
  2. Innovative model design: The LIVEMAMBA model introduces several novel features, such as the "when-to-say" mechanism, data augmentation for mistake recognition, and iterative re-planning. These features enhance the model's ability to provide real-time, interactive guidance.

Weakness

  1. Limited generalizability: The research is primarily focused on the cooking domain. The generalizability of the proposed method to other task domains remains unclear.
  2. Handling complex tasks: The model struggles with advanced planning scenarios (e.g., order errors or missed steps), as highlighted in the "Advanced Planning Set" experiments. This suggests limitations in reasoning about non-linear task structures.
  3. Computational efficiency: While LIVEMAMBA is lightweight compared to models like VideoLLaMA, its reliance on external re-planning modules (e.g., Qwen-2.5-32B) may hinder scalability for real-time edge deployment.
  4. Annotation cost: The paper’s annotation process for the LIVECOOK dataset relies heavily on manual labeling of step-by-step instructions and feedback, particularly for mistake detection and timing. While the authors describe leveraging existing annotations from the CaptainCook4D dataset and manually curating timestamps for mistakes, the methodology lacks a scalable, automated pipeline for generating such interactive feedback data. This manual approach is labor-intensive and domain-specific (focused on cooking), making it difficult to generalize to other procedural tasks (e.g., assembly, fitness, or medical procedures) where large-scale, temporally aligned feedback is required.

Minor mistakes:
  5. The paper's data augmentation for mistakes relies heavily on counterfactual action descriptions (e.g., change nouns/actions), which may not fully capture real-world variability in user errors. Real-world user errors are often nuanced (e.g., partial successes, context-dependent mistakes). Augmenting training data with diverse, real-world examples would improve robustness.
  6. The primary metrics focus on binary correctness but lack nuanced evaluation of feedback quality.

Questions

See weaknesses above.

Limitations

See weaknesses above.

Formatting Issues

None

Author Response

We thank the reviewer for recognizing the novelty of the LiveCook benchmark and dataset in enabling more realistic evaluation, and the innovative design of the LiveMamba model, which enhances real-time, interactive task guidance.

Generalization: To evaluate the ability of multi-modal LLMs to provide step-by-step guidance, an ideal test domain should be both practical and accessible. We chose cooking because it meets these criteria: its tasks are relevant to daily life, and the specific recipes in our LiveCook benchmark do not require specialized knowledge. These qualities make cooking a robust environment for our study. Moreover, our model can be trained with data from other domains based on data availability, and generalizing to other domains is an interesting direction for future work. We will clarify this in the final version.

Reasoning About Non-Linear Task Structures: We agree that the advanced planning set is very challenging for all models, as it requires reasoning about non-linear structures. However, the use of the re-planning module in our LiveMamba model leads to improved performance, as it is able to update the task plan on the fly as described in L168-181 in the main paper. This is highlighted in Table 4 of the main paper where our LiveMamba model performs significantly better compared to the prior work such as Videollm-online [6] on the advanced planning set of our LiveCook benchmark.

Computational Efficiency: Re-planning using the Qwen2.5-32B-Instruct re-planner takes 8.2 seconds on average on a Nvidia A100 GPU. Note that, as mentioned in L109 in the supplementary material, the Qwen2.5-32B-Instruct re-planner is called once every 2.5 minutes on average on the advanced planning set. In terms of throughput, on a Nvidia A100 GPU, our LiveMamba model (system with a vision encoder, Q-former, Mamba model and a re-planning module) has a real-time factor of 2 on average. This means that it can process input data twice as fast as it becomes available. In detail, we can process 4.1 frames per second on average and the input frame rate is 2 frames per second.

Annotation Cost: We rely on manual annotations for our LiveCook dataset and benchmark to maintain accuracy. It is currently not possible to automate this data generation pipeline as demonstrated by the poor performance of state of the art multi-modal LLMs in detecting mistakes in Table 3 of the main paper. Thus, it is crucial to have human expert annotators provide the ground-truth for the mistakes and their timestamps in a video sequence. This also highlights the importance of our LiveCook benchmark and dataset as the first to provide timed instruction and feedback messages covering a diverse range of mistakes in the cooking domain.

Counterfactual Action Descriptions: The counterfactual action descriptions generated using our data augmentation scheme lead to improved performance of our LiveMamba model, as highlighted in Table 4 of the main paper (LiveMamba (w/o-DA) vs LiveMamba (ours), where LiveMamba (w/o-DA) is the model trained without the proposed data augmentation scheme). We see a significant increase in performance across all metrics with our data augmentation scheme. We agree that the addition of more nuanced errors could improve performance even further; however, generating such data automatically is challenging with current state-of-the-art multi-modal large language models, and is thus an interesting direction for future research. We will include samples of the augmented data in the final version.
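Purely as a toy illustration of the noun/action swapping idea mentioned above, a counterfactual description could be generated roughly as follows. This is not our actual augmentation pipeline; the word lists and function are hypothetical.

```python
import random

# Toy illustration (NOT the paper's pipeline): create a counterfactual mistake
# description by swapping the action verb or ingredient noun in a ground-truth step.
VERB_SWAPS = {"chop": ["grate", "slice"], "boil": ["fry", "steam"]}
NOUN_SWAPS = {"onion": ["carrot", "potato"], "salt": ["sugar", "pepper"]}

def counterfactual(step: str, rng: random.Random) -> str:
    tokens = step.split()
    for i, tok in enumerate(tokens):
        swaps = VERB_SWAPS.get(tok.lower()) or NOUN_SWAPS.get(tok.lower())
        if swaps:
            tokens[i] = rng.choice(swaps)  # swap the first matching verb/noun
            break
    return " ".join(tokens)

rng = random.Random(0)
print(counterfactual("chop the onion finely", rng))  # e.g. "grate the onion finely"
```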

Nuanced Evaluation of Feedback Quality: In addition to the “binary” Instruction Completion Accuracy (IC-Acc) and mistake detection precision (Prec.), recall (Rec.) and F1 metrics, we also include the mistake feedback fluency (BERT and ROUGE-L) metrics. This allows for a more nuanced evaluation of the feedback quality in Tables 3, 4 and 5 in the main paper.
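For reference, the fluency metrics above (ROUGE-L and BERTScore) are commonly computed with the open-source rouge-score and bert-score packages; whether our evaluation uses these exact implementations is an assumption here, and the example sentences are invented.

```python
# Sketch of how ROUGE-L and BERTScore are commonly computed with open-source packages.
# pip install rouge-score bert-score
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "You added the onion too early; wait until the oil is hot."
prediction = "Mistake: the onion went in before the oil was hot."

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure

P, R, F1 = bert_score([prediction], [reference], lang="en")
print(f"ROUGE-L F: {rouge_l:.3f}  BERTScore F1: {F1.item():.3f}")
```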

Comment

I've read the rebuttal from the authors as well as the comments from other reviewers. I still have a few remaining concerns. For example, the rebuttal claimed that the proposed method can be generalized to other domains. Which domains? How would it generalize? After reading the rebuttal, I have decided to keep my original rating.

Comment

Thank you for acknowledging the rebuttal. In this paper, we propose the LiveMamba model for interactive instructional guidance, along with the LiveCook dataset and benchmark, which use cooking data as an example to train and evaluate the model.

Regarding generalization to other domains, our LiveMamba model can be trained and evaluated, in the same way as in LiveCook, on data relevant to other procedural tasks, for example, mechanical assembly or repair, if such data are available. In other words, if any interactive instructional guidance datasets and benchmarks exist for procedure‐based tasks, our LiveMamba model should naturally adapt to them. As far as we know, our proposed LiveCook dataset and benchmark are the first resources of their kind for interactive instructional guidance.

Please let us know if there are any concerns that remain unresolved.

Review (Rating: 4)

This paper describes work around building a cooking dataset and a model for providing live instructions and task guidance where the model/system can also identify mistakes (in-step or out of order), re-plan and suggest the next steps. The research addresses gaps in the capabilities of current multi-modal LLMs, which, despite their advancements, largely operate in a turn-based manner and are not suited for providing continuous, real-time guidance and feedback for task operation. The authors introduce the LIVECOOK dataset, which is built upon the CaptainCook4D dataset and is additionally annotated with timed instructions and feedback on user mistakes. This dataset is designed to train and evaluate models on their ability to not only provide instructions but also to detect successful completion and identify errors as they happen.

The paper proposes LIVEMAMBA, a streaming multi-modal LLM designed for this interactive guidance task. Key features of LIVEMAMBA include its ability to decide "when to speak," a data augmentation scheme for recognizing a diverse set of mistakes, and an iterative re-planning module to adapt to user actions. The evaluations demonstrate that existing state-of-the-art multi-modal LLMs perform poorly in this live guidance scenario, while the fine-tuned LIVEMAMBA model establishes a slightly better baseline.

Strengths and Weaknesses

Strengths:

  • Real-time multimodal task guidance for complex tasks is a hard problem that hasn't been solved yet, since most work in the space aims to address turn-by-turn task following. This paper establishes an initial baseline for this more ambitious interactive task guidance and error correction problem. Follow-up work can build on this dataset for dynamic planning (when an out-of-order step is executed), identification of crucial mistakes (which in some other, non-creative domain could result in more serious harm), etc.
  • The notion of timely feedback via the <response> and <vision> tags, as well as adapting to human error, are gaps that this paper addresses appropriately. This dataset, with ~100 hrs of annotations, will help address the task of real-time guidance in this domain.
  • The overall system design is interesting, with the use of efficient backbone models like Mamba to focus on the core task of vision and language grounding and a more expensive re-planner to address complex reasoning.
  • Good evaluations: the zero-shot studies, fine-tuned model performance, and ablations (w/o data augmentation and re-planner) are very interesting. The choice of creating both a simple and an advanced dataset makes the problem approachable with existing models as well as simple in design (this adds to the clarity of the research direction).

Weaknesses

  • Authors mention that "inference using our model should ideally be possible using compute constrained edge devices such as mobile phones or smart glasses." However, there is no discussion whatsoever about the latency and throughput of the models and of a system that uses a vision encoder, Q-former, Mamba model, and a Qwen2.5-32B-Instruct re-planner - what kind of backend infrastructure would this be able to run on (mobile phones, smart glasses?), etc. While the use of the Mamba backbone is motivated by efficiency, concrete numbers on processing times are missing, especially when this work is differentiated by its ability to do live task guidance.
  • Many details are missing: how good is the timeliness of the feedback (the use of 30-second windows for calculating IC-accuracy may be too long)? How good is the re-planner? The re-planner is able to skip steps completely or create new steps, etc. This kind of re-planning may work in the cooking domain but not in a mission-critical domain for which building these systems may be more pressing (and therefore the use of cooking as a proxy, as in many R&D projects).
  • There are no generalization studies on this model. Can the model capture errors in other domains? Is this a solid foundation on which to build models for other domains?
  • This research aligns well with the recent trend of agentic AI models like MM-ReAct and SayCan (also in the robotics domain) for task execution. While these models are aimed at better empowering the human, they can still learn from work in the agentic AI multimodal task execution literature.

Questions

(from comments above:)

  • how good is the timeliness of the feedback (use of 30 second windows for calculating IC-accuracy may be too high)?
  • how good is the re-planner? the re-planner is able to skip steps completely or create new steps, etc - this kind of replanning may work in the cooking domain but not in any mission critical domain for which building these systems may be more pressing (and therefore use of cooking as a proxy as used in many R&D projects)
  • what's the compute and memory requirements for the models?
  • is there a demo for the system?
  • Are there plans to open-source any part of the research?

Limitations

yes

Justification for Final Rating

Based on the authors' responses (about latency details, re-planner details, etc.), I've increased my score by a point. However, I am still not clear about the re-planner details (besides repeating steps, what other actions can it take, what are the consequences if the re-planner introduces a new step, how can one verify it, etc.). The authors mention they use an Nvidia A100 for inference and still claim the model can run on mobile devices; I am not clear about this.

Formatting Issues

none

Author Response

We thank the reviewer for recognizing our contributions: establishing a baseline for the challenging problem of real-time interactive task guidance, the value of our ~100-hour annotated dataset, and the strengths of our evaluations and system design.

Latency and Throughput of the Models: Our LiveMamba model consists of a vision encoder, Q-former, Mamba model, and a re-planning module. As explained in L172-176 in the main paper, only when the user does not perform the intended step (on the advanced planning set) is the Qwen2.5-32B-Instruct re-planner invoked. Otherwise, the re-planning module just returns the next step in the plan. In terms of throughput, on a Nvidia A100 GPU, our LiveMamba model has a real-time factor of 2 on average. This means that it can process input data twice as fast as it becomes available. In detail, we can process 4.1 frames per second on average and the input frame rate is 2 frames per second. In terms of latency, the time to generate the first token is 1.1 seconds on average. Re-planning using the Qwen2.5-32B-Instruct re-planner takes 8.2 seconds on average. Note that, as mentioned in L109 in the supplementary material, the Qwen2.5-32B-Instruct re-planner is called once every 2.5 minutes on average on the advanced planning set.

The target backend infrastructure for our LiveMamba model are edge devices such as mobile phones and smart glasses. A key advantage of the Mamba backbone of our LiveMamba model is its low memory requirement, which scales linearly with sequence length, compared to transformers whose memory requirements scale quadratically with sequence length. The peak GPU memory usage during inference of the LiveMamba model (with the Qwen2.5-32B-Instruct re-planner) is ~32GB to process ~500 frames. This makes our LiveMamba model more suitable for deployment on edge devices with limited memory compared to standard transformer architectures. We will clarify this in the final version.

Windows for Calculating IC-Accuracy: Note that the 30 second window is centered at the end of each recipe step and thus includes the last 15 seconds of the current recipe step and the first 15 seconds of the next recipe step. The average length of a recipe step is 62.05 seconds. Thus the window for calculating the IC-accuracy covers on average the last 24% of the duration of a recipe step and the first 24% of the beginning of the next recipe step. In practice, this is sufficient to account for any ambiguities in the data along with any residual inaccuracies in the annotations while not sacrificing on accuracy (L214 in the main paper). We will clarify this in the final version.

Finally, note that even with this 30 second window, the results in Table 3 of the main paper highlight that the LiveCook benchmark is challenging even for current state of the art models like Qwen-2.5-VL-7B.

The Re-Planning Module: As described in L172-176 in the main paper, if a recipe step is performed out of sequence (on the advanced planning set), the re-planning module in our LiveMamba model decides whether the current recipe step must be repeated and also updates the plan; this is when the Qwen2.5-32B-Instruct re-planner is invoked. Otherwise, the re-planning module simply returns the next step in the plan. We highlight the advantage of the re-planning module in Table 4 of the main paper on the simple planning set through an ablation of our LiveMamba model without the re-planning module (w/o-reP). Furthermore, here we add this LiveMamba (w/o-reP) ablation to the results in Table 4 on the advanced planning set:

| Model | IC-Acc↑ | Prec.↑ | Rec.↑ | F1↑ | BERT↑ | ROUGE-L↑ |
|---|---|---|---|---|---|---|
| LiveMamba (w/o-reP) | 8.7 | 0.07 | 0.03 | 0.04 | 0.414 | 0.389 |
| LiveMamba (Ours) | 11.6 | 0.10 | 0.04 | 0.05 | 0.470 | 0.447 |

These results again highlight the advantage of the re-planning module in our LiveMamba setup.

Finally, note that in this work we are primarily focused on the cooking domain, and we find that this re-planning module is sufficient in such cases. Extension of our re-planning module to other domains is an interesting direction for future research.
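To make the re-planning decision above concrete, here is a minimal sketch of the control flow: only the out-of-sequence case consults the large re-planner LLM, otherwise the next step of the current plan is returned. All names and the llm callable are hypothetical placeholders, not our actual implementation.

```python
from typing import Callable, List

# Illustrative sketch of the re-planning decision described above. The `llm` callable
# stands in for a large re-planner model (e.g. Qwen2.5-32B-Instruct) and is assumed,
# not taken from the paper's code.

def replan(plan: List[str], step_idx: int, detected_step: str,
           llm: Callable[[str], List[str]]) -> List[str]:
    expected = plan[step_idx]
    if detected_step == expected:
        return plan  # in sequence: keep the plan; the next instruction is plan[step_idx + 1]
    prompt = (
        f"Recipe plan: {plan}\n"
        f"Expected step: {expected}\n"
        f"Observed step: {detected_step}\n"
        "Decide whether the expected step must be repeated and return the updated plan."
    )
    return llm(prompt)  # out of sequence: ask the re-planner LLM for an updated plan
```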

Generalization Studies: To evaluate the ability of multi-modal LLMs to provide step-by-step guidance, an ideal test domain should be both practical and accessible. We chose cooking because it meets these criteria: its tasks are relevant to daily life, and the specific recipes in our LiveCook benchmark do not require specialized knowledge. These qualities make cooking a robust environment for our study. Thus, we propose LiveCook, the first large scale benchmark and dataset with timed instruction and feedback messages for step by step task guidance. We leave the generalization of our method to other procedural tasks as a direction for future work. We will clarify this in the final version.

Because the LiveMamba model is only tuned to recognize mistakes on the cooking data, it currently may not be able to capture errors in other domains. However, our model can be trained with data from other domains based on data availability.

Learning from Works in Agentic AI: This is a great point and an interesting direction for future research. We will discuss the MM-ReAct and SayCan papers in the final version.

Demo: We will include demo videos of the LiveMamba system with the final version. We also include qualitative examples in Figure 3 in the main paper and Figure 7 in the supplementary material.

Open-source: We will release both the LiveCook benchmark and dataset and the LiveMamba model code (pending legal approval).

Comment

Thank you for acknowledging the rebuttal.

Please let us know if the rebuttal has addressed your concerns, and whether you have any additional questions.

Final Decision

The paper introduces LIVECOOK, a densely annotated and timestamped benchmark for real-time, multimodal instruction-following and mistake feedback within the cooking domain. The paper also proposes LIVEMAMBA, a multi-modal LLM designed for this interactive guidance task.

The paper received scores of 4, 3, 4, 5. The reviewers agree that using multimodal LLMs to support live, interactive, error-aware guidance is timely and important. However, they also pointed out that the dataset is limited to the cooking domain and thus generalizability to other tasks remains unclear. The reviewers also questioned the feasibility of the annotation pipeline for LIVECOOK, which is labor-intensive and domain-specific; its scalability to other tasks is not demonstrated. However, given the complexity of the problem space and the absence of comparable prior work, I recommend this paper for acceptance.