How Far Is Video Generation from World Model: A Physical Law Perspective
Summary
Reviews and Discussion
This paper investigates whether state-of-the-art video generative models can learn fundamental physical laws from purely visual data. Inspired by the vision of video models as “world simulators” (e.g. OpenAI’s Sora), the authors conduct a systematic study using a controlled 2D physics environment. They construct a simulation testbed (based on Box2D) where simple geometric objects move and collide according to known classical mechanics laws (uniform motion, elastic collision, parabolic motion). Using this testbed, they generate large training datasets (up to 3 million videos) and train diffusion-based video generation models (with a VAE + transformer architecture similar to Sora) to predict future frames from initial conditions. The goal is to evaluate if these models, given enough data and scale, can infer and obey the underlying physical laws without explicit supervision.
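For readers unfamiliar with the setup, below is a minimal sketch of the kind of Box2D rollout such a testbed could build on. This is illustrative only, not the authors' released code: all names and parameters are assumptions, and rendering to pixel frames and dataset packaging are omitted.

```python
# Illustrative sketch (not the authors' released code) of a Box2D rollout:
# simulate one elastic ball and record its trajectory over a short clip.
from Box2D import b2World

FPS = 30
world = b2World(gravity=(0.0, -9.8))  # parabolic motion; use (0, 0) for uniform motion

ball = world.CreateDynamicBody(position=(1.0, 8.0), linearVelocity=(3.0, 0.0))
ball.CreateCircleFixture(radius=0.4, density=1.0, restitution=1.0)  # fully elastic

ground = world.CreateStaticBody(position=(0.0, 0.0))
ground.CreatePolygonFixture(box=(20.0, 0.5), restitution=1.0)

trajectory = []
for _ in range(2 * FPS):  # a two-second clip
    world.Step(1.0 / FPS, 6, 2)  # (dt, velocity iterations, position iterations)
    trajectory.append((ball.position.x, ball.position.y))
```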
Questions for Authors
N/A
Claims and Evidence
The authors claim that current video generation models fail to infer universal physical laws and instead generalize by referencing similar training cases. This is supported by experiments showing that, while models can perfectly generalize within the training distribution, they break down on novel scenarios. The structured scaling analysis provides convincing evidence: even large diffusion models trained on massive data cannot correctly predict physics in unseen setups, underscoring that scaling alone is insufficient. The claim is well-supported by quantitative metrics (e.g. high error in predicting object velocity in OOD tests) and qualitative observations of physically implausible generated motions.
Methods and Evaluation Criteria
The methods and evaluation criteria for the problem are well-chosen. The 2D simulator approach is an excellent strategic decision to tackle an otherwise intractable evaluation problem. The use of a strong diffusion model ensures the study tests the frontier of what’s possible. The experiments are structured to answer specific questions (ID vs OOD vs combinatorial generalization), and the metrics directly measure success on those terms. I also appreciate that the authors validated their VAE wasn’t a bottleneck – they report in the appendix that VAE reconstructions of videos have minimal error, so any mistakes are from the diffusion model learning, not from lossy compression. This attention to detail in evaluation bolsters confidence. Overall, the methods are appropriate and quite thorough for this study.
Theoretical Claims
This work is primarily empirical.
Experimental Design and Analysis
The experimental design is well-structured. The authors systematically scale up training data size and model capacity to test how generalization improves (or doesn’t) with scale. They evaluate model performance on: (1) the training distribution, (2) held-out but similar (in-distribution) scenarios, (3) genuinely novel (OOD) scenarios that involve new object properties or dynamics, and (4) combinatorial cases that mix seen components in new ways. This comprehensive coverage provides a strong empirical foundation to analyze generalization. The results show perfect in-distribution generalization (models can interpolate within the training range) and some combinatorial generalization (performance improves gradually with scale for mixtures of seen concepts). Crucially, they demonstrate complete failure in true OOD generalization, even for the largest models.
Supplementary Material
The supplementary material includes additional implementation details, model architectures, and further quantitative results. It also provides extra visualization samples of the generated videos, which help in assessing qualitative performance. These additions enhance transparency and reproducibility of the research. However, the supplement could be improved by including more qualitative examples of failure cases in OOD or combinatorial scenarios. For instance, showing a side-by-side comparison of ground-truth vs. generated trajectories in challenging cases would illustrate the model’s shortcomings more concretely and help readers visually grasp how the model deviates from true physics.
Relation to Prior Literature
This paper builds upon prior work in video generation and world models, engaging with the question of whether large generative models can learn physics without explicit supervision. It connects to a growing literature on using foundation models as simulators for real-world processes, referencing studies that scale video models (like OpenAI’s Sora) and those investigating physical common sense in AI. By demonstrating the limits of current diffusion models, the paper contributes to ongoing discussions about the necessity of structure and inductive biases in AI for learning physics. A point that could strengthen the literature comparison is a deeper discussion of alternative approaches such as physics-informed neural networks, symbolic regression of physical laws, or structured simulation engines. Contrasting the diffusion model’s performance with these could highlight what is missing (e.g., explicit enforcement of Newtonian mechanics or object permanence) and emphasize that certain insights from physics-based learning might be needed in conjunction with data-driven models.
Missing Important References
N/A
Other Strengths and Weaknesses
While I like the paper overall, I don't know how I feel about its overall intuition. Of course video models cannot model physics OOD, but what we want is for scaling to make a large-scale pretrained video model a sufficient approximation of a real-world projection. The authors primarily conduct experiments by training on smaller-domain, physics-informed data (with some further tests on CogVid and SVD) and conclude that VDMs do not model OOD physical scenarios --- well, I personally just think that is too obvious, and isn't the whole point of scaling to make more things ID? Nevertheless, I think the paper could be accepted, but I will not object if it is rejected.
Other Comments or Suggestions
N/A
Thank you for your thorough review and for appreciating our method design. Please find our responses to your questions below.
1. More qualitative examples of failure cases in OOD or combinatorial scenarios
Thank you for the helpful suggestion. In Appendix A.8, we have included failure cases with side-by-side comparisons in both the OOD (Figure 20) and the Combinatorial Generalization setting (Figure 22).
For the OOD setting, Figure 20 already illustrates all representative failure modes, as failure in this setting primarily involves inaccuracies in velocity prediction.
For the Combinatorial setting, we are now providing 8 additional failure cases via the anonymous links below. (In accordance with the ICML rebuttal policy, we include images only, not videos.) These examples will be added to the revision. We will also include demo videos on our project webpage once the paper is published.
2. Discussion of alternative approaches such as PINNs, symbolic regression of physical laws, or structured simulation engines
We appreciate your suggestion to expand the discussion to compare video diffusion models (VDMs) with alternative approaches for physics modeling and to highlight what may be missing in purely data-driven models.
- Physical Consistency and Explicit Inductive Biases: Current VDMs lack explicit representations of physical laws and instead rely on statistical correlations learned from data. As our findings show, this can result in failure cases under conditions such as unseen object velocities or mass values. In contrast, approaches like PINNs, symbolic regression, and structured simulation engines encode or recover governing equations, offering stronger guarantees of physical consistency and better extrapolation to unseen velocity and mass values [1, 2]; a toy PINN sketch follows this list.
- Fidelity and Scalability: Structured methods often lack visual fidelity and scalability. For example, PINNs are typically tailored to a single equation and require retraining when parameters or initial conditions change [3]. Most existing work also focuses on small-scale, low-dimensional problems, limiting applicability to realistic video generation. In contrast, VDMs generate high-fidelity visuals and scale more effectively across diverse scenarios.
- Complementarity, Not Competition: These observations point toward a promising direction: combining the strengths of both approaches. For instance, physics engines or PINNs could be used to predict future physical states, while VDMs handle rendering and visual synthesis. Such hybrid systems could preserve both physical accuracy and visual realism.
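As noted in the first bullet, here is a toy, hedged sketch (our own construction, not from the paper or [1]) of how a PINN encodes a governing law: a small network fits projectile height y(t) by penalizing the ODE residual y'' = -g together with the initial conditions y(0) = y0 and y'(0) = v0.

```python
# Toy PINN sketch (our construction): learn y(t) for projectile motion by
# penalizing the physics residual y'' + g and the two initial conditions.
import torch

g, y0, v0 = 9.8, 0.0, 10.0
net = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(5000):
    t = (2.0 * torch.rand(128, 1)).requires_grad_()  # collocation points in [0, 2] s
    y = net(t)
    dy = torch.autograd.grad(y, t, torch.ones_like(y), create_graph=True)[0]
    d2y = torch.autograd.grad(dy, t, torch.ones_like(dy), create_graph=True)[0]

    t0 = torch.zeros(1, 1, requires_grad=True)
    y_init = net(t0)
    v_init = torch.autograd.grad(y_init, t0, torch.ones_like(y_init), create_graph=True)[0]

    # ODE residual + initial-condition penalties
    loss = ((d2y + g) ** 2).mean() + ((y_init - y0) ** 2).mean() + ((v_init - v0) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note the contrast with a VDM: the equation is hard-coded into the loss, so the fitted solution extrapolates along the governing law, but the network solves only this one ODE and must be retrained if g or the initial conditions change.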
We will incorporate this discussion into the revision.
[1] Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations.
[2] Genesis: A Universal and Generative Physics Engine for Robotics and Beyond.
[3] Learning the solution operator of parametric partial differential equations with physics-informed DeepONets.
3. The conclusion that VDMs do not model OOD physical scenarios is too obvious; isn't the whole point of scaling to make more things ID?
Thank you for raising this important point. While the failure of current VDMs to model OOD physics may seem expected, our paper provides deeper insight:
- Generalization-by-memorization mechanism: Despite optimism that scaling VDMs enables generalization to complex, unseen scenarios [4, 5], our experiments show that VDMs often rely on retrieving patterns from similar training examples rather than learning underlying physical principles. This generalization-by-memorization mechanism, which had not been clearly articulated prior to our work, underscores the limitations of VDMs and the need for structural priors and inductive biases in physical modeling.
- The Limits of Turning OOD into ID — and Actionable Insights: While we agree that the goal of scaling is to absorb more variation into the in-distribution regime, real-world video data is vast, continuous, and high-dimensional, making it more difficult to fully cover than pure language. For example, in robotics, variables such as object velocity, joint configurations, camera angles, noisy backgrounds, and task goals vary across a continuous and combinatorially large space.
Our paper contributes actionable insights to this challenge: we demonstrate that scaling combinatorial diversity in the training data—rather than simply increasing dataset size—is significantly more effective for improving physical video modeling.
We hope this helps clarify the intuition and contributions of our work.
[4] OpenAI. Sora Technical Report: Video Generation Models as World Simulators.
[5] X.AI. 1X World Model. https://www.1x.tech/discover/1x-world-model
Summary
We hope our responses have addressed your concerns. If you have any further questions, please feel free to reach out.
The authors create a benchmark to evaluate the physical understanding of large video models at scale. Specifically, they measure the generalization performance of the model under variations of physically meaningful quantities such as color, shape, size and velocity.
Questions for Authors
N/A
Claims and Evidence
They claim that the model does not internalize fundamental physical laws, which they demonstrate by showing that scaling the dataset size and parameter count enables the model to generalize inside the training distribution but not outside it. They also claim that the model has a priority order when generalizing (color > size > velocity > shape), which they demonstrate by switching attributes between training and testing.
Methods and Evaluation Criteria
The authors build their own benchmark: a class of 2D physics environments with multiple colliding objects, parameterized by shape, color, size, and velocity, which enables an isolated evaluation of physics understanding.
Theoretical Claims
N/A
Experimental Design and Analysis
They use a substantial number of test cases (100 to 1000 per experiment).
Supplementary Material
All parts.
Relation to Prior Literature
Evaluating the adherence of large models to physical laws is a very important direction, given the wide adoption of these models in the industry.
Missing Important References
There are several previous works evaluating the physical understanding of large video models that are not cited:
- Videophy: Evaluating physical commonsense for video generation
- T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation
- Devil: A comprehensive benchmark for dynamics evaluation in video generation
Furthermore, benchmarks evaluating large video models more broadly:
- VBench: Comprehensive benchmark suite for video generative models
- Evalcrafter: Benchmarking and evaluating large video generation models
Other Strengths and Weaknesses
Strength: the paper makes an effort to investigate why such results are observed, unlike many benchmark papers on the physical accuracy of video generative models.
Other Comments or Suggestions
N/A
Thank you for your thorough review and for supporting the acceptance of our work.
We appreciate you pointing out these relevant works on evaluating the physical understanding of large video models. We agree that incorporating them will help position our contributions within the broader landscape of physical evaluation in video generation research. In the revised version of the paper, we will include a discussion of these works—Videophy, T2V-CompBench, Devil, VBench, and EvalCrafter—in the Related Work section under the “Video Generation” subsection.
This paper explores whether scaling video generation models enables them to learn physical laws. It first provides a thorough problem definition and then evaluates video generation models under three scenarios: in-distribution, out-of-distribution, and combinatorial generalization. The authors develop a 2D simulation testbed that simulates three fundamental physical principles. Through comprehensive experiments on simulated data, they conclude that scaling video generation models alone is insufficient for effective world modeling.
Questions for Authors
Line 258: It is mentioned that only the first frame is used as a condition. How does the model infer the physical attributes of objects and accurately predict subsequent events?
Claims and Evidence
The central claim of this paper is that scaling video generation models is insufficient to uncover physical laws. This claim is well-supported by extensive experimental validation.
A secondary claim suggests that video models prioritize different factors when referencing training data. However, this conclusion—particularly the ranking of these factors—is drawn from a single scenario, uniform linear motion, which raises concerns about its generalizability to other physical contexts.
Methods and Evaluation Criteria
The paper introduces three physical scenarios to assess the model’s ability to infer physical laws. This approach is a reasonable simplification, as the selected rules are fundamental and representative of broader physical principles.
Theoretical Claims
To the best of my knowledge, the theoretical claims appear to be correct.
Experimental Design and Analysis
- In the Combinatorial Generalization setting, the paper selects eight types of objects but does not provide a rationale for this choice.
- Additionally, in Combinatorial Generalization, the evaluation excludes velocity as a metric and instead relies on FVD, SSIM, PSNR, and LPIPS. However, these are image/video metrics and do not guarantee physical correctness.
- As mentioned in Section 3.2, the model fails to improve performance as data or parameters scale up. However, Figure 3 shows that as the training region expands, accuracy improves for OOD data. How should this discrepancy be interpreted?
Supplementary Material
There is no supplementary material.
Relation to Prior Literature
This paper is related to machine learning, particularly in video generation and world modeling.
Missing Important References
None.
Other Strengths and Weaknesses
This paper focuses on an important problem—whether scaling video generation models can enable world modeling. I believe the insights provided will be beneficial to the broader research community.
Other Comments or Suggestions
Minor Issues:
- Incorrect characters in Line 80 ("color ¿ size ¿ velocity ¿")
- Some sections lack textual content, e.g., A.4.1, A.4.2, and A.4.4.
1. The prioritization conclusion is drawn from a single scenario (uniform linear motion), raising concerns about its generalizability to other physical contexts
Thank you for your thoughtful comment. We selected the uniform linear motion scenario as it provides a highly representative and clean setting—velocity is easy to measure, and with only one object, the pixel regions corresponding to attributes such as size, color, and speed are not affected by interference from other objects.
To verify the robustness of our conclusion, we also ran experiments on parabolic motion and collisions. The factor prioritization remained consistent, supporting the generalizability of our findings. We will include these additional results in the revision.
2. A rationale for selecting eight types of objects
Thank you for your question. To evaluate combinatorial generalization, we define a "combination" as the physical interaction between different object types. The Phyre simulator we use provides a total of eight distinct object types that enable a wide range of physical interactions. We include all of them to create as complex a combinatorial space as possible for evaluating combinatorial generalization. We appreciate your suggestion and will clarify this rationale in the revision.
3. In Combinatorial Generalization, the evaluation excludes velocity as a metric and instead relies on FVD, SSIM, PSNR, and LPIPS. However, these are image/video metrics and do not guarantee physical correctness
Thank you for your valuable comment. Please allow us to clarify why velocity was excluded as a metric in the Combinatorial Generalization setting, unlike in the ID/OOD setting.
In the ID/OOD setting, scenes are simple (e.g., 1–2 colored balls on a plain background), enabling reliable position and velocity estimation via pixel averaging and frame differencing (line 203). In contrast, the Combinatorial setting includes many irregularly shaped and similarly colored objects, making pixel assignments ambiguous and position estimation unreliable. We also tested methods such as Hough circle detection and pretrained object detectors, but they resulted in significant errors.
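For concreteness, here is a minimal NumPy sketch of the pixel-averaging and frame-differencing estimator described above. This is our own illustration, not the paper's code: the function names and threshold are assumptions, and it presumes a single colored ball on a plain background.

```python
# Hypothetical sketch of the position/velocity estimator described above.
import numpy as np

def ball_center(frame, background, thresh=30.0):
    """Centroid of pixels that differ from the plain background (pixel averaging)."""
    diff = np.abs(frame.astype(float) - background.astype(float)).sum(axis=-1)
    ys, xs = np.nonzero(diff > thresh)
    return np.array([xs.mean(), ys.mean()])

def estimate_velocity(frames, background, fps=30):
    """Finite-difference velocities (pixels/second) from consecutive centroids."""
    centers = np.stack([ball_center(f, background) for f in frames])
    return np.diff(centers, axis=0) * fps
```

With many irregularly shaped, similarly colored objects, the foreground mask can no longer be assigned to a single object, which is exactly why this estimator breaks down in the Combinatorial setting.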
Given these challenges, we excluded velocity as a metric and instead relied on a combination of objective video fidelity metrics and human evaluation:
- FVD, SSIM, PSNR, and LPIPS measure the fidelity of generated videos to ground-truth videos governed by physical laws. While not explicitly designed for physical correctness, they reflect plausibility to some extent by measuring consistency and realism with the ground truth, an approach consistent with prior work like Genie [1] (a per-frame sketch follows after this list).
- We also conducted human evaluation, where each evaluator was specifically instructed to focus on assessing violations of physical laws. We believe this combined approach offers a comprehensive evaluation of physical correctness in the complex setting. We will clarify this in the revised paper.
[1] Bruce, Jake, et al. "Genie: Generative interactive environments." ICML 2024, Best paper.
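As referenced above, here is a hedged per-frame sketch of the fidelity metrics. It is our own illustration, not the paper's evaluation code: it assumes uint8 RGB frames and scikit-image >= 0.19, and omits FVD and LPIPS, which require pretrained networks.

```python
# Illustrative per-frame PSNR/SSIM against ground truth (uint8 RGB assumed).
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_fidelity(gen_frame, gt_frame):
    psnr = peak_signal_noise_ratio(gt_frame, gen_frame, data_range=255)
    ssim = structural_similarity(gt_frame, gen_frame, channel_axis=-1, data_range=255)
    return psnr, ssim

# Stand-in clips of shape (frames, H, W, 3); replace with decoded model outputs.
gen_video = np.random.randint(0, 256, (16, 64, 64, 3), dtype=np.uint8)
gt_video = np.random.randint(0, 256, (16, 64, 64, 3), dtype=np.uint8)
scores = [frame_fidelity(g, t) for g, t in zip(gen_video, gt_video)]
```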
4. How should the discrepancy between Section 3.2 and Figure 3 be interpreted?
Thank you for your insightful question. The difference arises from the type of generalization being evaluated.
In Section 3.2, the evaluation is strictly OOD extrapolation. For example, if the training velocity range is [1.0, 4.0], then the test set contains velocities outside that range, such as [0.0, 0.8] ∪ [4.5, 6.0]. In this setting, simply increasing the amount of data or the model size does not significantly improve performance, because the model would have to truly grasp the physical law and extrapolate beyond the training range.
In contrast, Figure 3 reveals a transition from extrapolation to interpolation: the test set lies between two disjoint training subsets, making the task closer to interpolation. As the gap between the two training regions narrows, model performance on the test set improves, reflecting the model's stronger ability to interpolate. A toy construction of the two regimes is sketched below.
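The following minimal illustration uses the velocity ranges from the example above; the sample counts are hypothetical.

```python
# Illustrative train/test splits for the two regimes discussed above.
import numpy as np

rng = np.random.default_rng(0)

# Extrapolation (Section 3.2): train in [1.0, 4.0], test strictly outside it.
v_train = rng.uniform(1.0, 4.0, 10_000)
v_test = np.concatenate([rng.uniform(0.0, 0.8, 500), rng.uniform(4.5, 6.0, 500)])

# Interpolation (Figure 3): train on two disjoint ranges, test in the gap.
v_train_gap = np.concatenate([rng.uniform(1.0, 2.0, 5_000), rng.uniform(3.0, 4.0, 5_000)])
v_test_gap = rng.uniform(2.0, 3.0, 500)
```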
We will clarify this distinction in the revision.
5. No supplementary material
We have well-organized code and datasets and will make them public once the paper is officially published to support future research.
6. With only the first frame as input, how does the model infer object properties and predict future events?
In the Combinatorial Generalization setting, each object has a distinct visual appearance and color, allowing the model to infer object type and physical attributes by learning associations from training data.
As all objects start static with zero velocity, the model can use these inferred attributes to predict future events.
7. Minor Issues
Thank you for pointing out these minor issues. We have corrected them in the revision.
Summary
We hope our responses have addressed your concerns and strengthened confidence in our paper.
This paper evaluates if scaling video generation models enables learning physical laws. Using a 2D simulator for object motion governed by classical mechanics, experiments reveal: (1) near-perfect in-distribution (ID) generalization with scaling, (2) failure in out-of-distribution (OOD) scenarios despite scaling, and (3) improved combinatorial generalization via increased data diversity. Models exhibit "case-based" generalization, prioritizing attributes (color > size > velocity > shape) rather than abstract rules. Key contributions include systematic scaling analysis and insights into model biases.
Questions for Authors
N/A
Claims and Evidence
Claims are largely supported. OOD failure is evidenced by consistent high errors across scaling levels. Case-based generalization is validated through experiments with flipped training data and attribute conflicts. However, the prioritization hierarchy (color > size > velocity > shape) is tested only in controlled scenarios; broader validation (e.g., real-world textures) is needed to confirm universality.
Methods and Evaluation Criteria
The 2D simulator effectively isolates physical variables, and velocity error metrics align with the goal of assessing physical law adherence. Human evaluations for combinatorial cases add robustness. However, pixel-level metrics (SSIM/PSNR) may not fully capture physical plausibility, and fixed VAE usage limits exploration of end-to-end training benefits.
Theoretical Claims
No formal theoretical claims are made. The framework for evaluating generalization (ID/OOD/combinatorial) is conceptual but well-defined. The empirical focus aligns with the paper’s goals.
Experimental Design and Analysis
Scaling experiments (model/data size) are rigorous, and controlled attribute comparisons clarify prioritization. However, testing only three physical scenarios (linear motion, collision, parabola) limits generalizability.
Supplementary Material
I've reviewed all parts.
Relation to Prior Literature
Connects strongly to prior work on world models and LLM memorization. Challenges assumptions in video generation (e.g., Sora’s physical reasoning claims) by demonstrating OOD limitations. Advances understanding of scaling’s role in combinatorial generalization, aligning with trends in foundation models.
Missing Important References
N/A
Other Strengths and Weaknesses
Novel insights into case-based generalization; actionable scaling guidelines for combinatorial diversity; clear challenge to prevailing narratives about video models as world simulators. Weaknesses: Limited scenario diversity (2D, synthetic data); over-reliance on human evaluation for combinatorial cases; minimal discussion of real-world applicability.
Other Comments or Suggestions
N/A
Thanks for supporting the acceptance of our work. Please find our responses to your questions below.
1. Pixel-level metrics (SSIM/PSNR) may not fully capture physical plausibility; over-reliance on human evaluation for combinatorial cases
Thank you for your comment. We agree that pixel-level metrics alone are insufficient to fully capture physical plausibility. To mitigate this, we relied on a combination of objective video fidelity metrics and human evaluation:
- FVD, SSIM, PSNR, and LPIPS measure the fidelity of generated videos to ground-truth videos governed by physical laws. While not explicitly designed for physical correctness, they reflect plausibility to some extent by measuring consistency and realism with the ground truth, an approach consistent with prior work like Genie [1].
- As you pointed out, fidelity alone may not fully represent physical plausibility. To address this, we also conducted a human evaluation, where each evaluator was specifically instructed to focus on assessing violations of physical laws.
We believe this combined approach offers a comprehensive evaluation of physical correctness in the complex setting.
[1] Bruce, Jake, et al. "Genie: Generative interactive environments." ICML 2024, Best paper.
2. Fixed VAE usage limits exploration of end-to-end training benefits
Thanks for your question. Here we explain why we chose to fix the VAE and why doing so does not limit the performance of the diffusion model in our setting:
- Training stability: During diffusion training, the VAE is used to define the training objective. Updating the VAE during diffusion training makes the latent space unstable, slowing convergence. Hence, widely used architectures like Stable Diffusion 3, DALL·E 3, and Hunyuan Video pretrain and fix the VAE.
- The VAE is not a bottleneck in our setup: As shown in Appendix A.3.2, we validate that VAE reconstructions of the input videos exhibit minimal error, indicating that most inaccuracies arise from the diffusion model rather than the VAE. This was also positively noted by Reviewer hEH6.
We hope this explanation clarifies our rationale and addresses your concerns about fixing the VAE.
3. The prioritization hierarchy (color > size > velocity > shape) is tested only in controlled scenarios; testing only three physical scenarios (linear motion, collision, parabola) limits generalizability; limited scenario diversity (2D, synthetic data); broader validation (e.g., real-world textures) is needed to confirm universality
Thank you for your valuable insights. We use simplified synthetic scenarios for the following reasons:
- Abundant and Controllable Data: Synthetic settings enable large-scale, controlled data generation, allowing systematic study of specific physical principles. Defining settings like ID/OOD or combinatorial generalization is challenging in real-world datasets.
- Isolated Physical Laws: Each synthetic scenario is governed by a single, well-defined kinematic law. In contrast, real-world videos often involve multiple entangled factors (e.g., unknown friction, unobservable forces), making it hard to attribute behavior to specific laws.
- Measurable Physical Quantities: In our controlled setup, physical quantities like velocity and mass can be reliably extracted from video frames. In real-world scenarios, such values are often unobservable, making it hard to verify whether generated videos obey physical laws.
By simplifying the rendering process, we isolate core challenges in learning physical dynamics, making our experiments quantitatively tractable and our findings interpretable.
However, we agree that broader validation with realistic data is important for future work. This would require substantial effort in collecting and curating controllable real-world data and in developing new metrics for evaluating physical consistency. We appreciate your suggestions and welcome further discussion.
4. Minimal discussion of real-world applicability
Our work focuses on scientific insights, particularly regarding the underlying mechanisms of generalization in physical video modeling. These insights can inform real-world scenarios as follows:
Real-world video data is vast, continuous, and high-dimensional, making it more difficult to fully cover than pure language. For example, in real-world robotics, variables such as object velocity, joint configurations, camera angles, noisy backgrounds, and task goals vary across a continuous and combinatorially large space.
Our paper contributes actionable insights to this challenge: we show that scaling combinatorial diversity in the training data—rather than simply increasing dataset size—is significantly more effective for improving physical video modeling. It also implies that model scaling is effective when supported by diversity.
We will include this discussion in the revision.
Summary
We hope our responses improved your confidence in our paper.
This paper evaluates whether scaling video generation models enables learning physical laws. Using a 2D simulator for object motion governed by classical mechanics, the experiments yield several interesting conclusions. I recommend accepting this paper; however, the authors should revise it carefully according to the comments above.