PaperHub

Overall rating: 5.5 / 10
Poster · 4 reviewers
Ratings: 3, 2, 3, 4 (min 2, max 4, std 0.7)
ICML 2025

WorldSimBench: Towards Video Generation Models as World Simulators

Submitted: 2025-01-22 · Updated: 2025-07-24
TL;DR

Towards Video Generation Models as World Simulators

Abstract

Keywords
Embodied Vision, World Model, Dataset and Benchmark

Reviews and Discussion

Official Review
Rating: 3

This paper introduces WorldSimBench, a benchmark for evaluating video predictive models from an embodied perspective, in contrast to previous approaches focused on generative evaluation. It features Explicit Perceptual Evaluation and Implicit Manipulative Evaluation, combining human preference assessments from a visual standpoint with action-level evaluations in embodied tasks. WorldSimBench covers three key embodied scenarios: Open-Ended Embodied Environments, Autonomous Driving, and Robot Manipulation. Experiments on multiple existing models reveal the strengths and limitations of current world simulators. Additionally, it provides the HF-Embodied Dataset, comprising 35,701 entries, to further support research in this domain.

Questions for the Authors

  1. This paper evaluates multiple video generation methods. What are the key conclusions and insights? How can the findings guide improvements in model design and methodology?

Claims and Evidence

This is a benchmark paper and does not make traditional claims.

The authors propose:

  1. Evaluating the capability of predictive models from a hierarchical perspective.
  2. Assessing models from an embodied perspective by leveraging video-action models to evaluate driving and manipulation tasks, revealing significant shortcomings in existing models.

Overall, evaluating from an embodied perspective is a valuable approach, as current generative models are clearly insufficient. However, regarding the first point, I have some doubts. As the authors mentioned, prediction can occur across different modalities with varying focuses, but it does not necessarily imply a clear hierarchical structure.

Methods and Evaluation Criteria

The authors evaluate video prediction models from explicit perceptual and implicit manipulative perspectives. They assess models based on visual quality, condition consistency, and embodiment, which is a reasonable approach. However, the work lacks significant innovation, serving more as an integrative evaluation of previous research.

Theoretical Claims

No theoretical claims.

Experimental Design and Analysis

  1. Running only 20 trials on CALVIN is not reasonable, as the standard evaluation setting typically involves 1,000 trials.
  2. The evaluation of AD also feels somewhat problematic. It essentially involves predefining trajectories as instructions, using those instructions to guide video generation, and then converting them into corresponding actions. Throughout this process, information loss occurs at each transformation step.

Supplementary Material

Yes, the supplementary materials provide related works, additional experiments, and implementation details.

Relation to Broader Scientific Literature

This work attempts to evaluate the capabilities of existing video generation models from an embodied perspective, aiming to make them more practically applicable.

Essential References Not Discussed

Predictive models have been widely explored in both autonomous driving and robotics. For instance, in autonomous driving, works such as GAIA-1, Drive-WM, and DriveDreamer focus on world models. However, this paper's related works section lacks discussion on these contributions.

Other Strengths and Weaknesses

I appreciate the authors' contributions, as they have conducted extensive experiments across gaming, driving, and robotics. Although some metric evaluations may not be entirely reasonable, their work still contributes to the advancement of the field.

However, the evaluation lacks strong persuasiveness and innovation, as it primarily follows existing methods to assess current models. Additionally, video-to-action approaches are not universally applicable across different domains.

Other Comments or Suggestions

  1. How can the human feedback annotations be used for further improvement?
  2. Lines 84-85: duplicate “output modality”
  3. Line 1238: cite PVDM
Author Response

Thank you for your constructive and thoughtful comments. They were indeed helpful in improving the paper. We take this opportunity to address your concerns:

Hierarchical Structure of Modalities

The hierarchical structure we propose is not based on semantic abstraction, but on the practical difficulty of turning different predicted modalities into real-world executable actions—a key consideration for evaluating video generation models as world simulators. Language is high-level but often ambiguous and under-specified for action planning; images are more concrete but lack temporal and causal context; videos, while potentially modeling transitions, may produce physically inconsistent content. To address this, we introduce the notion of actionable video prediction—predicted videos that are both instruction-aligned and physically plausible, making them suitable for downstream decision-making in embodied agents. This task-driven, grounded perspective is essential for advancing video generation toward real-world utility.


CALVIN Evaluation Trials

While the main text initially reported 20-trial results on the CALVIN benchmark, we also conducted 100-trial evaluations following prior works such as UniPi and SuSIE (Appendix Table 15). In response to your suggestion, we have now further extended our evaluation to 1,000 trials to improve statistical reliability.

Task completed in a row (%) ↑ for (1,2,3,4,5)

| Method | 1 | 2 | 3 | 4 | 5 | Avg. Len. ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Open-Sora-Plan | 91.8 | 73.2 | 61.2 | 35.3 | 32.2 | 3.06 |
| DynamiCrafter | 92.4 | 69.2 | 50.1 | 29.3 | 18.7 | 2.71 |
| EasyAnimate | 88.3 | 57.3 | 33.9 | 17.3 | 12.2 | 2.05 |
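
For context on how these numbers are typically computed in CALVIN-style long-horizon evaluation, a minimal sketch follows; the `rollout_chain` callable and function name are hypothetical illustrations, not the paper's code, and only show how the "in a row" percentages and Avg. Len. aggregate per-chain results.

```python
from typing import Callable, List

def calvin_metrics(rollout_chain: Callable[[], int], n_trials: int = 1000, chain_len: int = 5):
    """Aggregate CALVIN-style long-horizon results.

    `rollout_chain` is a hypothetical callable: it runs one 5-instruction chain in the
    simulator and returns how many consecutive tasks succeeded (0..chain_len).
    """
    completed: List[int] = [rollout_chain() for _ in range(n_trials)]
    # Percentage of chains that completed at least k tasks in a row, for k = 1..chain_len.
    in_a_row = {k: 100.0 * sum(c >= k for c in completed) / n_trials
                for k in range(1, chain_len + 1)}
    avg_len = sum(completed) / n_trials  # the "Avg. Len." column
    return in_a_row, avg_len
```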

AD Evaluation Setup

We would like to clarify that our AD evaluation does not rely on predefined trajectories. Instead, the agent receives language instructions, and a video generation model predicts future frames accordingly. These are passed to a fixed video2action policy—trained only on ground-truth data and shared across all models—to generate control actions. As this policy is not trained on generated videos, task success rates directly reflect the quality and physical plausibility of the predictions. Our setup is designed to evaluate whether models can produce instruction-aligned, physically grounded outputs, rather than reproduce preset trajectories.
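
As a rough illustration of the closed-loop protocol described above, here is a sketch under assumed interfaces; `video_model`, `video2action_policy`, and `env` are hypothetical placeholders, not the paper's API.

```python
def implicit_manipulative_eval(video_model, video2action_policy, env, instructions,
                               max_steps=40):
    """Closed-loop sketch: the frozen video-to-action policy is shared across all
    video models, so task success reflects only the predicted videos."""
    successes = 0
    for instruction in instructions:
        obs = env.reset(instruction)
        success = False
        for _ in range(max_steps):
            frames = video_model.predict(instruction, obs)   # predicted future frames
            actions = video2action_policy(frames)            # decode frames into controls
            obs, done, success = env.step(actions)           # execute in the simulator
            if done:
                break
        successes += int(success)
    return successes / len(instructions)
```

Because the video-to-action policy is fixed and trained only on ground-truth data, differences in success rate between models can be attributed to the generated videos themselves.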


Related Work in Autonomous Driving

We will revise the Related Work section to include these approaches.


Evaluation Persuasiveness and Domain Generality

We respectfully clarify that our evaluation framework is tailored for embodied settings, where video predictions must be both visually plausible and physically actionable. Unlike prior work focused on perceptual quality, our task-driven protocol assesses a model’s ability to support real-world decision-making by using predicted videos to control agents via a shared video-to-action policy. We introduce the concept of actionable video prediction—emphasizing instruction alignment and physical feasibility—and combine perceptual and task-based evaluations. While video-to-action may not apply universally, it is well-suited for embodied tasks like robotics and autonomous driving, where grounding in action space is essential. This embodiment-centered perspective brings both novelty and practical value to evaluating generative world models.


Suggestions #1

We address this point in Appendix C.1, where we explore how human feedback annotations can be leveraged to further improve the video model’s capabilities through reinforcement learning. Specifically, we conducted experiments using the Open-Sora-Plan framework to fine-tune the model based on human feedback. The results of these experiments are reported in Appendix Table 5 and demonstrate that incorporating human feedback can lead to measurable improvements in planning and control performance.
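
The appendix describes this fine-tuning only at a high level; one common way to exploit scalar preference scores is reward-weighted regression, sketched below under that assumption (the `preference_scorer` and `denoising_loss` interfaces are hypothetical, not the authors' implementation).

```python
import torch

def reward_weighted_step(video_model, optimizer, videos, prompts, preference_scorer, beta=1.0):
    """One reward-weighted fine-tuning step: preference scores reweight the
    ordinary per-sample denoising loss so preferred clips contribute more."""
    with torch.no_grad():
        rewards = preference_scorer(videos, prompts)       # scalar preference score per clip
        weights = torch.softmax(beta * rewards, dim=0)     # normalize into sample weights
    per_sample_loss = video_model.denoising_loss(videos, prompts)  # assumed per-sample losses
    loss = (weights * per_sample_loss).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```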


Suggestions #2, #3

We will fix the typos in the revised version.


Questions: Conclusions and Insights

Our paper highlights several key insights across different sections. In Section 4.3, we analyze limitations of current video generation models as world simulators with several physical aspects. Appendix C.1 shows how human feedback from the HF-Embodied dataset can enhance video quality via reinforcement learning, offering a path to integrate human preferences into training. Appendix H further demonstrates how generated videos and our benchmark can benefit VLA methods, revealing broader utility in downstream embodied tasks. Together, these findings provide actionable guidance for future model design, including better multi-modal supervision, improved control fidelity, and the use of synthetic data for generalization.


Thank you for the constructive feedback. We will incorporate the above clarifications, examples, and additional experiments into the paper, and we hope these revisions address your concerns.

Official Review
Rating: 2

This paper proposes a framework that can be used to evaluate world simulators, such as video generation models. The framework is divided into two components: a perceptual evaluation and an embodied evaluation. The perceptual evaluation uses a model trained on human-collected data to score the results; the embodied evaluation measures task success by combining the video model with a trained video-to-action model.
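
As a rough sketch of the perceptual component just summarized, assuming a trained preference evaluator that returns per-dimension scores; the dimension names and the `evaluator.score` interface below are placeholders, not the released code.

```python
from statistics import mean

DIMENSIONS = ["visual_quality", "condition_consistency", "embodiment"]  # placeholder names

def explicit_perceptual_eval(evaluator, clips, prompts):
    """Score every (clip, prompt) pair on each perceptual dimension with the trained
    evaluator, then average per dimension across the evaluation set."""
    per_dim = {d: [] for d in DIMENSIONS}
    for clip, prompt in zip(clips, prompts):
        scores = evaluator.score(clip, prompt)   # hypothetical: returns {dimension: score}
        for d in DIMENSIONS:
            per_dim[d].append(scores[d])
    return {d: mean(v) for d, v in per_dim.items()}
```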

Questions for the Authors

Please see the comments above

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

n/a

Experimental Design and Analysis

Yes

Supplementary Material

No

Relation to Broader Scientific Literature

I believe this paper will have some impact to a sizable audience in computer vision and ML communities

Essential References Not Discussed

References are adequate

Other Strengths and Weaknesses

  1. There are limited qualitative results. This makes it very difficult to evaluate the performance.

  2. Can the generated videos be applied to improve vision-language-action models?

  3. I feel that the proposed method is more like a survey paper, which means that the novelty is limited.

Other Comments or Suggestions

n/a

Author Response

Thank you for your recognition of our paper:

  • Inspiring for the Community: "I believe this paper will have some impact to a sizable audience in computer vision and ML communities."
  • Comprehensive Literature Review: "References are adequate."

Supplementary Material

Our supplementary material is included in the same PDF file, appended to the end of the main paper.


Qualitative Experimental Results

We appreciate your suggestion regarding the presentation of results. Section D.2 (Qualitative Results) and Figure 7 in the supplementary material show some of the qualitative results, and we have also provided corresponding analyses. We have also created an Anonymous Website to showcase more visualized results, which we believe will enhance the clarity and impact of our paper. We hope this will provide a more comprehensive understanding of the performance of different models on our benchmark.


Discussion of VLA

We appreciate the reviewer for bringing this up—connecting our work to Vision-Language-Action (VLA) models is indeed a valuable perspective. We would like to clarify that we have already discussed this in the supplementary material, specifically in Section H. Discussion of Vision-Language-Action Models.

As summarized there, the generated videos can contribute to VLA models in two key ways:

  1. As a source of diverse and scalable training data for imitation learning, including through hindsight relabeling, which helps models learn from generated successful trajectories.
  2. As a means of reward generation in reinforcement learning settings, where generated videos can simulate goal states or desirable behaviors, enabling the creation of dense and context-aware reward signals.

This is particularly beneficial for bridging the sim-to-real gap in real-world tasks where explicit reward specification is often challenging.
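
As an illustrative sketch of the second use, a dense reward could be derived from the similarity between the agent's current observation and frames of a generated goal video; the `frame_encoder` and the cosine-similarity choice below are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def video_goal_reward(current_frame, predicted_goal_frames, frame_encoder):
    """Dense reward sketch: cosine similarity between the current observation and
    the closest frame of a generated goal video, both embedded by a (hypothetical)
    pretrained frame encoder."""
    with torch.no_grad():
        cur = F.normalize(frame_encoder(current_frame.unsqueeze(0)), dim=-1)   # (1, D)
        goals = F.normalize(frame_encoder(predicted_goal_frames), dim=-1)      # (T, D)
    similarity = goals @ cur.squeeze(0)   # cosine similarity per goal frame, shape (T,)
    return similarity.max().item()
```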


Survey Paper Concern

We respectfully disagree with the assessment that our paper resembles a survey. While our work does provide a structured taxonomy and draws insights from multiple perceptual and task-based settings, these components are not retrospective summaries, but rather integral parts of a novel and operational evaluation framework.

Specifically, we make the following contributions that go beyond a survey:

  • We design and implement a novel evaluation framework tailored for assessing video generation models as world simulators—a key open challenge in the community. This framework is not merely a collection of metrics; it includes:

    • A human-aligned perceptual evaluation protocol that maps cognitive dimensions to measurable factors, supported by the HF-Embodied dataset with costly human preference annotations that enrich the quality and depth of the evaluation.
    • A physics-driven embodied task evaluation, which leverages embodied agents in simulated environments to test the rationality and functional validity of generated videos—something traditional metrics and human studies often overlook.
  • We introduce and apply a taxonomy of perceptual attributes to systematically assess different facets of human judgment. This taxonomy is used to guide newly designed human studies, not to summarize existing ones.

  • We conduct extensive experiments across diverse tasks and settings, showcasing the practical utility and diagnostic power of our framework. Our results not only highlight model behaviors but also reveal consistent patterns between perceptual and task-level evaluations—demonstrating the framework’s effectiveness and explanatory value.

In summary, rather than surveying existing work, our paper offers a principled and implementable approach to a previously underexplored problem. We believe this constitutes a significant and novel contribution to the evaluation of video generation systems, and we hope the reviewer will reconsider their assessment in this light.


Thank you for the constructive feedback. We will incorporate the above clarifications, examples, and additional experiments into the paper, and we hope these revisions address your concerns.

Official Review
Rating: 3

The lack of categorization based on inherent characteristics hinders predictive model development, and existing benchmarks fail to evaluate highly embodied models effectively. To address this, this paper introduces WorldSimBench, a dual evaluation framework for World Simulators. It includes Explicit Perceptual Evaluation, using the HF-Embodied Dataset and a Human Preference Evaluator for visual fidelity assessment, and Implicit Manipulative Evaluation, measuring video-action consistency in dynamic environments. The authors evaluate across three embodied scenarios: open-ended environments, autonomous driving, and robot manipulation. It provides key insights to advance video generation models and strengthen their role in embodied AI.

Questions for the Authors

I have mentioned this in the previous sections

Claims and Evidence

As video generation models continue to advance, we lack a more powerful and comprehensive benchmark. I completely agree with this point, especially since earlier evaluation metrics focused too much on video quality rather than its practical usability—particularly whether the generated videos can serve as a world simulator.

This viewpoint is somewhat obvious to practitioners in the field. Therefore, the key to evaluating this paper lies in whether the benchmark it provides aligns with researchers' needs. Its contribution ultimately depends on the novelty and relevance of the new dataset and evaluation metrics.

Methods and Evaluation Criteria

WorldSimBench evaluates World Simulators at two levels: Explicit Perceptual Evaluation, assessing human-perceived quality, and Implicit Manipulative Evaluation, testing video-to-control translation in closed-loop tasks. It covers three key embodied scenarios: Open-Ended Environments (OE) using Minecraft for complex tasks, Autonomous Driving (AD) for stability in dynamic conditions, and Robot Manipulation (RM) for precise control in physical interactions. These benchmarks provide a comprehensive assessment of a simulator’s effectiveness in real-world tasks.

Theoretical Claims

No theoretical claims are required for review in this work.

Experimental Design and Analysis

Eight popular video generation models were evaluated, including Open-Sora-Plan, Lavie, ModelScope, Open-Sora, AnimateDiff, DynamiCrafter, and EasyAnimate. These models are assessed through both Explicit Perceptual Evaluation and Implicit Manipulative Evaluation across three scenarios: Open-Ended Embodied Environment, Autonomous Driving, and Robot Manipulation. Each model is fine-tuned on specific datasets corresponding to these embodied scenarios for a comprehensive evaluation.

Supplementary Material

Yes, I’ve gone through it. However, I feel that the authors could present the results of different models on their new benchmark more clearly. Providing more visualized results on the website would be essential for increasing the paper’s impact.

Relation to Broader Scientific Literature

A strong benchmark should highlight the shortcomings of current models and indicate future directions for improvement. Right now, I think the paper could do a better job of making these aspects more visually explicit.

Essential References Not Discussed

NA

Other Strengths and Weaknesses

Overall, the evaluation of this work should focus on 3 key aspects:

  1. Data Contribution – The novelty and usefulness of the proposed dataset.
  2. Metrics and evaluation – Whether the evaluation metrics effectively capture the critical aspects of video generation. How well the benchmark reveals the limitations of current models.
  3. Guiding Future Research – Whether it provides clear insights into the next steps for advancing video generation models.

Data contribution: HF-Embodied Dataset is useful for constructing evaluation metrics. Large-scale embodied videos sourced from the internet, paired with captions, are used to train data generation models. Fine-grained human feedback annotation is then applied based on the corresponding Task Instruction Prompt List, ensuring coverage across multiple embodied dimensions.

Additionally, instructions for video generation are expanded using GPT and manually verified to construct a comprehensive Task Instruction Prompt List, which serves as a guideline for both data generation and evaluation.

I think this aspect of the contribution is positive, but there is still room to expand the scope of the dataset. For example, since the paper emphasizes embodiment, it would be beneficial to include additional annotated information, such as camera poses and other egomotion data. This could further enhance the dataset’s utility. This is just a suggestion for the authors to consider in future iterations.
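
To make the annotation structure concrete, below is a hypothetical sketch of what a single HF-Embodied entry might contain based on the description above; the field names are illustrative rather than the released schema, and the reviewer's suggested camera-pose/egomotion information is shown as an optional extension.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class HFEmbodiedEntry:
    """One annotated sample: an instruction-conditioned clip plus fine-grained
    human feedback across embodied dimensions (field names are illustrative)."""
    video_path: str                                   # sourced or generated clip
    caption: str                                      # paired caption
    task_instruction: str                             # entry from the Task Instruction Prompt List
    scenario: str                                     # "OE", "AD", or "RM"
    feedback: Dict[str, float]                        # per-dimension human scores
    camera_poses: Optional[List[List[float]]] = None  # suggested extension: egomotion / poses
```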

Metrics and evaluation: The authors correctly focus on instruction alignment and train a Human Preference Evaluator using the HF-Embodied Dataset to evaluate eight models. They then observe that these models struggle with instruction alignment. However, I have two concerns:

  1. Effectiveness of Fine-Tuning: The authors fine-tune these models on the new dataset, but different models can exhibit significantly different behaviors depending on how they are fine-tuned. Whether SFT, LoRA, or other techniques are used can also impact the results. How do the authors ensure that their fine-tuning is effective and comparable across models?

  2. Choice of Models: The models used in the paper seem somewhat outdated. It would be much more compelling if the evaluation included newer models like CogVideoX, as this would provide a better assessment of current model capabilities.

Future Direction: I think this section could be further improved. I would like to see more failure cases of the Human Preference Evaluator, as well as a deeper analysis of when and why current video generation models fail.

The paper could further categorize different prompts to identify specific scenarios where models are more prone to failure. This would provide a more detailed and actionable understanding of instruction alignment challenges in video generation.

Other Comments or Suggestions

NA

Author Response

Thank you for your recognition of our paper:

  • Inspiring and Novelty: "It provides key insights to advance video generation models and strengthen their role in embodied AI."
  • Reasonable Evaluation Criteria: "These benchmarks provide a comprehensive assessment of a simulator’s effectiveness in real-world tasks."
  • Useful Dataset: "HF-Embodied Dataset is useful for constructing evaluation metrics."

Quantitative Experimental Results

We appreciate your suggestion regarding the presentation of results. To address this, we have created an Anonymous Website to showcase more visualized results, which we believe will enhance the clarity and impact of our paper. We hope this will provide a more comprehensive understanding of the performance of different models on our benchmark.


Broader Scientific Literature

We agree that a strong benchmark should clearly expose the limitations of current models and point toward directions for future improvement. While our paper already discusses these aspects—for example, in Section 4.3 we analyze the physical limitations of current video generation models as world simulators; Appendix C.1 demonstrates how human feedback from the HF-Embodied dataset can improve video quality via reinforcement learning; and Appendix H shows how generated videos can benefit VLA methods and downstream embodied tasks—we acknowledge that these insights could be presented more intuitively. In the revised version, we will make these findings more visually explicit and accessible, helping readers better grasp the limitations of current approaches and the future opportunities they reveal.


Suggestion of Dataset Expansion

Thank you for your thoughtful and constructive feedback. We’re glad to hear that you find the HF-Embodied Dataset valuable, especially in terms of evaluation and task diversity. We appreciate your suggestion regarding the inclusion of additional annotations such as camera poses and egomotion data. This is indeed a valuable direction, and we agree that such information could further enrich the dataset and benefit downstream embodied tasks.

Due to the diverse sources and large scale of our current dataset, obtaining accurate camera pose and egomotion annotations presents certain challenges. However, we recognize their importance and are actively exploring methods—both automated and semi-automated—to incorporate such annotations in future versions of the dataset. We believe this will significantly boost its applicability in more complex embodied scenarios.

Thank you again for the suggestion, it provides a meaningful direction for our future work.


Effectiveness of Fine-Tuning

Thank you for your insightful comment. To ensure the effectiveness and comparability of our fine-tuning process across different models, we used the same dataset for training all models. In order to maintain fairness and optimize performance, we followed the fine-tuning strategies recommended by the authors of each respective model. This approach allows us to ensure that each model is trained in the most optimal way according to its specific architecture and design.

By adhering to the suggested fine-tuning protocols, we aim to eliminate potential biases and ensure that the models' performances are evaluated under comparable conditions. We hope this approach addresses your concern regarding the impact of different fine-tuning methods on the results. Thank you again for raising this important point!


Choice of Models

We have added evaluations on HunyuanVideo, Cosmos Diffusion 7B, and CogVideoX1.5-5B. Due to space limits, please see the results in our reply to Reviewer yx3f under "More Video Models".


Future Direction

We appreciate the reviewer’s insightful suggestions regarding failure analysis and prompt categorization. We fully agree that understanding the limitations of the Human Preference Evaluator and identifying failure modes in instruction alignment are meaningful directions for future research.

Specifically, analyzing when and why video generation models fail, and categorizing prompts to uncover scenario-specific weaknesses, could offer valuable insights into the robustness and generalization capabilities of current models. These efforts would require dedicated experiments and analysis, which we believe merit a standalone investigation.

We will incorporate these ideas into the Future Work section of the paper to inspire and guide follow-up research in this important area. Thank you again for highlighting this valuable direction.


Thank you for your thoughtful and constructive feedback; it has been instrumental in helping us improve the paper. We will incorporate the above clarifications, examples, and additional experiments into the paper, and we hope these revisions address your concerns. We would be glad to continue the discussion if you have any further questions or suggestions.

Reviewer Comment

My concerns are well addressed. Therefore, I will keep my original rating and recommend this work for acceptance.

Official Review
Rating: 4

This paper proposes WorldSimBench, a benchmark used to evaluate the world simulation performance of video generative models. It surveys several existing video evaluation benchmarks, introduces a hierarchy for classifying video models, and evaluates several video generative models through the proposed Explicit Perceptual Evaluation and Implicit Manipulative Evaluation.

Questions for the Authors

  • Since you mainly evaluate text-conditioned video models, I am curious whether these models exhibit superior world-simulation ability in the embodied tasks.

Claims and Evidence

Yes. This paper includes extensive empirical results to support the main claim made in this paper.

Methods and Evaluation Criteria

This paper introduces a Human Preference Evaluator, which can be quite interesting and applicable. Other metrics introduced in this paper also make sense.

Theoretical Claims

N/A

Experimental Design and Analysis

Yes. However, only quantitative experimental results are included in this paper, while qualitative results are missing.

Supplementary Material

No

Relation to Broader Scientific Literature

N/A

Essential References Not Discussed

I encourage the authors to include more advanced open-sourced video models for evaluation, such as Hunyuan and Cosmos.

Other Strengths and Weaknesses

Strengths

  • This paper proposes a new benchmark to evaluate video-based world models, which can be valuable for future research in this area.
  • This paper introduces a human preference evaluator and video-to-action evaluation metrics to evaluate the world simulation abilities of video models, which are interesting.
  • This paper investigates the world simulation performance of several video generative models.

Weaknesses

  • Only quantitative experimental results are included in this paper, while qualitative results are missing. Including some qualitative visualizations would let readers know whether the benchmark is challenging or easy for today's video models.
  • Some world foundation models instead of traditional video generative models should be evaluated by the proposed benchmark, such as the newly introduced Cosmos.

Other Comments or Suggestions

NA

Author Response

Thank you for your recognition of our paper:

  • Solid Experiment: "This paper includes extensive empirical results to support the main claim made in this paper."
  • Novelty and Soundness: "This paper introduces a human preference evaluator and video-to-action evaluation metrics to evaluate the world simulation abilities of video models, which are interesting."
  • Reasonable Evaluation Criteria: "Other metrics introduced in this paper also make sense."
  • Inspiring for the Community: "This paper proposes a new benchmark to evaluate video-based world models, which can be valuable for future research in this area."

More Video Models

We have added evaluations on HunyuanVideo, Cosmos Diffusion 7B, and CogVideoX1.5-5B:

Evaluation results in OE, which can be incorporated into Table 6 of the supplementary material. The abbreviations are listed in Table 2.

| Model | BC | FC | IA | SA | VC | TJ | EI | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HunyuanVideo | 1.95 | 2.0 | 2.0 | 1.6 | 2.0 | 2.0 | 1.65 | 1.89 |
| Cosmos | 1.9 | 2.0 | 2.0 | 1.8 | 2.0 | 2.0 | 1.8 | 1.93 |
| CogVideoX | 1.95 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.55 | 1.93 |

Evaluation results in AD, which can be incorporated into Table 7 of the supplementary material. The abbreviations are listed in Table 2.

| Model | AE | IA | PV | TJ | KE | SF | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HunyuanVideo | 3.75 | 5.0 | 4.1 | 4.8 | 3.4 | 5.0 | 4.34 |
| Cosmos | 4.15 | 5.0 | 4.6 | 4.9 | 3.65 | 5.0 | 4.55 |
| CogVideoX | 3.45 | 5.0 | 3.9 | 4.8 | 3.0 | 5.0 | 4.20 |

Evaluation results in RM, which can be incorporated into Table 8 of the supplementary material. The abbreviations are listed in Table 2.

| Model | AE | BC | FC | IA | PV | TJ | EI | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HunyuanVideo | 4.0 | 3.9 | 4.0 | 1.9 | 5.0 | 5.0 | 4.1 | 3.99 |
| Cosmos | 4.2 | 4.26 | 4.0 | 2.8 | 5.0 | 5.0 | 4.44 | 4.24 |
| CogVideoX | 3.85 | 4.0 | 4.0 | 2.2 | 5.0 | 5.0 | 4.23 | 4.04 |
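
As a reading aid for these tables, the Overall column appears consistent with the arithmetic mean of the per-dimension scores; for example, for HunyuanVideo in the OE setting:

```python
# Sanity check (assumption: Overall is the arithmetic mean of the per-dimension scores).
hunyuan_oe = [1.95, 2.0, 2.0, 1.6, 2.0, 2.0, 1.65]   # BC, FC, IA, SA, VC, TJ, EI from the OE table
print(round(sum(hunyuan_oe) / len(hunyuan_oe), 2))    # -> 1.89, matching the reported Overall
```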

Qualitative Experimental Results

Section D.2 (Qualitative Results) and Figure 7 in the supplementary material show some of the qualitative results, and we have also provided corresponding analyses for these results. Additionally, we have showcased more visual results on an Anonymous Website to address your concerns.


Questions

There are already quite a few papers and projects based on video models for specific tasks. Methods [1, 2, 3, 4] have demonstrated impressive results in robot manipulation and navigation tasks. Furthermore, future video prediction models can leverage Internet-scale pretraining and visual understanding to guide low-level goal-conditioned policies, potentially showing better generalization in new scenarios. However, during the inference phase, these methods tend to incur additional computational overhead, making them slower compared to traditional methods like imitation learning. Nevertheless, with the continued development of models and acceleration techniques, these approaches may hold significant potential and be worth further exploration.


Thank you for your constructive and thoughtful comments. They were indeed helpful in improving the paper. We will incorporate the above clarifications, examples, and additional experiments into the paper, and we hope these revisions address your concerns. If you have any further questions, please do not hesitate to continue the discussion with us. We hope these additions will provide a more comprehensive understanding of the performance of different models on our benchmark.


References
[1] Learning Universal Policies via Text-Guided Video Generation
[2] Compositional Foundation Models for Hierarchical Planning
[3] VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation
[4] Grounding Video Models to Actions through Goal Conditioned Exploration

Reviewer Comment

Thanks to the authors for adding additional experiments to address my concerns. Since my main concerns have been addressed adequately, I decided to raise my score to 4 to support this work.

Final Decision

This paper proposes a benchmark for "World Simulation", essentially, applying text-to-video models to embodied AI tasks. This is becoming an increasingly popular application so it seems sensible to have strong benchmarks to test capabilities here.

The authors introduce three problem settings, Minecraft, driving and robotic manipulation, and conduct a series of evaluations in each. In general there are some issues with respect to these being somewhat disjoint tasks, but the diversity could also be seen as a strength. In the rebuttal, the authors included the recent Cosmos work which makes the paper stronger given its recent prominence at this exact task.

Side note: The authors incorrectly cite Ha and Schmidhuber for this; they worked on "World Models" as opposed to "World Simulators". The former has a specific meaning, predicting action-conditioned dynamics. The latter is largely a marketing term from Sora that is not really defined.