How Far Is Video Generation from World Model: A Physical Law Perspective
Summary
Reviews and Discussion
This paper investigates whether state-of-the-art video generative models can learn fundamental physical laws from purely visual data. Inspired by the vision of video models as “world simulators” (e.g. OpenAI’s Sora), the authors conduct a systematic study using a controlled 2D physics environment. They construct a simulation testbed (based on Box2D) where simple geometric objects move and collide according to known classical mechanics laws (uniform motion, elastic collision, parabolic motion). Using this testbed, they generate large training datasets (up to 3 million videos) and train diffusion-based video generation models (with a VAE + transformer architecture similar to Sora) to predict future frames from initial conditions. The goal is to evaluate if these models, given enough data and scale, can infer and obey the underlying physical laws without explicit supervision.
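For readers unfamiliar with the setup, below is a minimal sketch of the kind of Box2D rollout such a testbed could build on. This is illustrative only, not the authors' released code: all names and parameters are assumptions, and rendering to pixel frames and dataset packaging are omitted.

```python
# Illustrative sketch (not the authors' released code) of a Box2D rollout:
# simulate one elastic ball and record its trajectory over a short clip.
from Box2D import b2World

FPS = 30
world = b2World(gravity=(0.0, -9.8))  # parabolic motion; use (0, 0) for uniform motion

ball = world.CreateDynamicBody(position=(1.0, 8.0), linearVelocity=(3.0, 0.0))
ball.CreateCircleFixture(radius=0.4, density=1.0, restitution=1.0)  # fully elastic

ground = world.CreateStaticBody(position=(0.0, 0.0))
ground.CreatePolygonFixture(box=(20.0, 0.5), restitution=1.0)

trajectory = []
for _ in range(2 * FPS):  # a two-second clip
    world.Step(1.0 / FPS, 6, 2)  # (dt, velocity iterations, position iterations)
    trajectory.append((ball.position.x, ball.position.y))
```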
Questions for Authors
N/A
Claims and Evidence
The authors claim that current video generation models fail to infer universal physical laws and instead generalize by referencing similar training cases. This is supported by experiments showing that, while models can perfectly generalize within the training distribution, they break down on novel scenarios. The structured scaling analysis provides convincing evidence: even large diffusion models trained on massive data cannot correctly predict physics in unseen setups, underscoring that scaling alone is insufficient. The claim is well-supported by quantitative metrics (e.g. high error in predicting object velocity in OOD tests) and qualitative observations of physically implausible generated motions.
Methods and Evaluation Criteria
The methods and evaluation criteria for the problem are well-chosen. The 2D simulator approach is an excellent strategic decision to tackle an otherwise intractable evaluation problem. The use of a strong diffusion model ensures the study tests the frontier of what’s possible. The experiments are structured to answer specific questions (ID vs OOD vs combinatorial generalization), and the metrics directly measure success on those terms. I also appreciate that the authors validated their VAE wasn’t a bottleneck – they report in the appendix that VAE reconstructions of videos have minimal error, so any mistakes are from the diffusion model learning, not from lossy compression. This attention to detail in evaluation bolsters confidence. Overall, the methods are appropriate and quite thorough for this study.
Theoretical Claims
This work is primarily empirical.
Experimental Design and Analysis
The experimental design is well-structured. The authors systematically scale up training data size and model capacity to test how generalization improves (or doesn’t) with scale. They evaluate model performance on: (1) the training distribution, (2) held-out but similar (in-distribution) scenarios, (3) genuinely novel (OOD) scenarios that involve new object properties or dynamics, and (4) combinatorial cases that mix seen components in new ways. This comprehensive coverage provides a strong empirical foundation to analyze generalization. The results show perfect in-distribution generalization (models can interpolate within the training range) and some combinatorial generalization (performance improves gradually with scale for mixtures of seen concepts). Crucially, they demonstrate complete failure in true OOD generalization, even for the largest models.
Supplementary Material
The supplementary material includes additional implementation details, model architectures, and further quantitative results. It also provides extra visualization samples of the generated videos, which help in assessing qualitative performance. These additions enhance transparency and reproducibility of the research. However, the supplement could be improved by including more qualitative examples of failure cases in OOD or combinatorial scenarios. For instance, showing a side-by-side comparison of ground-truth vs. generated trajectories in challenging cases would illustrate the model’s shortcomings more concretely and help readers visually grasp how the model deviates from true physics.
Relation to Prior Literature
This paper builds upon prior work in video generation and world models, engaging with the question of whether large generative models can learn physics without explicit supervision. It connects to a growing literature on using foundation models as simulators for real-world processes, referencing studies that scale video models (like OpenAI’s Sora) and those investigating physical common sense in AI. By demonstrating the limits of current diffusion models, the paper contributes to ongoing discussions about the necessity of structure and inductive biases in AI for learning physics. A point that could strengthen the literature comparison is a deeper discussion of alternative approaches such as physics-informed neural networks, symbolic regression of physical laws, or structured simulation engines. Contrasting the diffusion model’s performance with these could highlight what is missing (e.g., explicit enforcement of Newtonian mechanics or object permanence) and emphasize that certain insights from physics-based learning might be needed in conjunction with data-driven models.
Missing Important References
N/A
Other Strengths and Weaknesses
While I like the paper overall, I don't know how I feel about its overall intuition. Of course video models cannot model physics OOD, but what we want is for scaling to make a large-scale pretrained video model a sufficient approximation of a real-world projection. The authors primarily conduct experiments by training on smaller-domain, physics-informed data (with some further tests on CogVid and SVD) and conclude that VDMs do not model OOD physical scenarios --- well, I personally just think that is too obvious, and isn't the whole point of scaling to make more things ID? Nevertheless, I think the paper could be accepted, but I will not object if it is rejected.
Other Comments or Suggestions
N/A
Thank you for your thorough review and for appreciating our method design. Please find our responses to your questions below.
1. More qualitative examples of failure cases in OOD or combinatorial scenarios
Thank you for the helpful suggestion. In Appendix A.8, we have included failure cases with side-by-side comparisons in both the OOD (Figure 20) and the Combinatorial Generalization setting (Figure 22).
For the OOD setting, Figure 20 already illustrates all representative failure modes, as failure in this setting primarily involves inaccuracies in velocity prediction.
For the Combinatorial setting, we are now providing 8 additional failure cases via the anonymous links below. (In accordance with the ICML rebuttal policy, we include images only, not videos.) These examples will be added to the revision. We will also include demo videos on our project webpage once the paper is published.
2. Discussion of alternative approaches such as PINNs, symbolic regression of physical laws, or structured simulation engines
We appreciate your suggestion to expand the discussion to compare video diffusion models (VDMs) with alternative approaches for physics modeling and to highlight what may be missing in purely data-driven models.
- Physical Consistency and Explicit Inductive Biases: Current VDMs lack explicit representations of physical laws and instead rely on statistical correlations learned from data. As our findings show, this can result in failure cases under conditions such as unseen object velocities or mass values. In contrast, approaches like PINNs, symbolic regression, and structured simulation engines encode or recover governing equations, offering stronger guarantees of physical consistency and better extrapolation to unseen velocity and mass values [1, 2]; a toy PINN sketch follows this list.
- Fidelity and Scalability: Structured methods often lack visual fidelity and scalability. For example, PINNs are typically tailored to a single equation and require retraining when parameters or initial conditions change [3]. Most existing work also focuses on small-scale, low-dimensional problems, limiting applicability to realistic video generation. In contrast, VDMs generate high-fidelity visuals and scale more effectively across diverse scenarios.
- Complementarity, Not Competition: These observations point toward a promising direction: combining the strengths of both approaches. For instance, physics engines or PINNs could be used to predict future physical states, while VDMs handle rendering and visual synthesis. Such hybrid systems could preserve both physical accuracy and visual realism.
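As noted in the first bullet, here is a toy, hedged sketch (our own construction, not from the paper or [1]) of how a PINN encodes a governing law: a small network fits projectile height y(t) by penalizing the ODE residual y'' = -g together with the initial conditions y(0) = y0 and y'(0) = v0.

```python
# Toy PINN sketch (our construction): learn y(t) for projectile motion by
# penalizing the physics residual y'' + g and the two initial conditions.
import torch

g, y0, v0 = 9.8, 0.0, 10.0
net = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(5000):
    t = (2.0 * torch.rand(128, 1)).requires_grad_()  # collocation points in [0, 2] s
    y = net(t)
    dy = torch.autograd.grad(y, t, torch.ones_like(y), create_graph=True)[0]
    d2y = torch.autograd.grad(dy, t, torch.ones_like(dy), create_graph=True)[0]

    t0 = torch.zeros(1, 1, requires_grad=True)
    y_init = net(t0)
    v_init = torch.autograd.grad(y_init, t0, torch.ones_like(y_init), create_graph=True)[0]

    # ODE residual + initial-condition penalties
    loss = ((d2y + g) ** 2).mean() + ((y_init - y0) ** 2).mean() + ((v_init - v0) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note the contrast with a VDM: the equation is hard-coded into the loss, so the fitted solution extrapolates along the governing law, but the network solves only this one ODE and must be retrained if g or the initial conditions change.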
We will incorporate this discussion into the revision.
[1] Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations.
[2] Genesis: A Universal and Generative Physics Engine for Robotics and Beyond.
[3] Learning the solution operator of parametric partial differential equations with physics-informed DeepONets.
3. The conclusion that VDMs do not model OOD physical scenarios is too obvious; isn't the whole point of scaling to make more things ID?
Thank you for raising this important point. While the failure of current VDMs to model OOD physics may seem expected, our paper provides deeper insight:
- Generalization-by-memorization mechanism: Despite optimism that scaling VDMs enables generalization to complex, unseen scenarios [4, 5], our experiments show that VDMs often rely on retrieving patterns from similar training examples rather than learning underlying physical principles. This generalization-by-memorization mechanism, which had not been clearly articulated prior to our work, underscores the limitations of VDMs and the need for structural priors and inductive biases in physical modeling.
- The Limits of Turning OOD into ID — and Actionable Insights: While we agree that the goal of scaling is to absorb more variation into the in-distribution regime, real-world video data is vast, continuous, and high-dimensional, making it more difficult to fully cover than pure language. For example, in robotics, variables such as object velocity, joint configurations, camera angles, noisy backgrounds, and task goals vary across a continuous and combinatorially large space.
Our paper contributes actionable insights to this challenge: we demonstrate that scaling combinatorial diversity in the training data—rather than simply increasing dataset size—is significantly more effective for improving physical video modeling.
We hope this helps clarify the intuition and contributions of our work.
[4] OpenAI. Sora Technical Report: Video Generation Models as World Simulators.
[5] X.AI. 1X World Model. https://www.1x.tech/discover/1x-world-model
Summary
We hope our responses have addressed your concerns. If you have any further questions, please feel free to reach out.
The authors create a benchmark to evaluate the physical understanding of large video models at scale. Specifically, they measure the generalization performance of the model under variations of physically meaningful quantities such as color, shape, size and velocity.
Questions for Authors
N/A
Claims and Evidence
They claim that the model does not internalize fundamental physical laws, which they demonstrate by showing that scaling the dataset size and parameter count enables the model to generalize inside the training distribution but not outside it. They also claim that the model has a priority order when generalizing (color > size > velocity > shape), which they demonstrate by switching attributes between training and testing.
Methods and Evaluation Criteria
The authors build their own benchmark: a class of 2D physics environments with multiple colliding objects, parameterized by shape, color, size, and velocity, which enables an isolated evaluation of physics understanding.
Theoretical Claims
N/A
Experimental Design and Analysis
They use a substantial number of test cases (100 to 1000 per experiment).
Supplementary Material
All parts.
Relation to Prior Literature
Evaluating the adherence of large models to physical laws is a very important direction, given the wide adoption of these models in the industry.
Missing Important References
There are several previous works evaluating the physical understanding of large video models that are not cited:
- Videophy: Evaluating physical commonsense for video generation
- T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation
- Devil: A comprehensive benchmark for dynamics evaluation in video generation
Furthermore, benchmarks evaluating large video models more broadly:
- VBench: Comprehensive benchmark suite for video generative models
- Evalcrafter: Benchmarking and evaluating large video generation models
Other Strengths and Weaknesses
Strength: the paper makes an effort to investigate why such results are observed, unlike many benchmark papers on the physical accuracy of video generative models.
Other Comments or Suggestions
N/A
Thank you for your thorough review and for supporting the acceptance of our work.
We appreciate you pointing out these relevant works on evaluating the physical understanding of large video models. We agree that incorporating them will help position our contributions within the broader landscape of physical evaluation in video generation research. In the revised version of the paper, we will include a discussion of these works—Videophy, T2V-CompBench, Devil, VBench, and EvalCrafter—in the Related Work section under the “Video Generation” subsection.
This paper explores whether scaling video generation models enables them to learn physical laws. It first provides a thorough problem definition and then evaluates video generation models under three scenarios: in-distribution, out-of-distribution, and combinatorial generalization. The authors develop a 2D simulation testbed that simulates three fundamental physical principles. Through comprehensive experiments on simulated data, they conclude that scaling video generation models alone is insufficient for effective world modeling.
Questions for Authors
Line 258: It is mentioned that only the first frame is used as a condition. How does the model infer the physical attributes of objects and accurately predict subsequent events?
Claims and Evidence
The central claim of this paper is that scaling video generation models is insufficient to uncover physical laws. This claim is well-supported by extensive experimental validation.
A secondary claim suggests that video models prioritize different factors when referencing training data. However, this conclusion—particularly the ranking of these factors—is drawn from a single scenario, uniform linear motion, which raises concerns about its generalizability to other physical contexts.
Methods and Evaluation Criteria
The paper introduces three physical scenarios to assess the model’s ability to infer physical laws. This approach is a reasonable simplification, as the selected rules are fundamental and representative of broader physical principles.
Theoretical Claims
To the best of my knowledge, the theoretical claims appear to be correct.
Experimental Design and Analysis
- In the Combinatorial Generalization setting, the paper selects eight types of objects but does not provide a rationale for this choice.
- Additionally, in Combinatorial Generalization, the evaluation excludes velocity as a metric and instead relies on FVD, SSIM, PSNR, and LPIPS. However, these are image/video metrics and do not guarantee physical correctness.
- As mentioned in Section 3.2, the model fails to improve performance as data or parameters scale up. However, Figure 3 shows that as the training region expands, accuracy improves for OOD data. How should this discrepancy be interpreted?
Supplementary Material
There is no supplementary material.
Relation to Prior Literature
This paper is related to machine learning, particularly in video generation and world modeling.
Missing Important References
None.
Other Strengths and Weaknesses
This paper focuses on an important problem—whether scaling video generation models can enable world modeling. I believe the insights provided will be beneficial to the broader research community.
Other Comments or Suggestions
Minor Issues:
- Incorrect characters in Line 80 ("color ¿ size ¿ velocity ¿")
- Some sections lack textual content, e.g., A.4.1, A.4.2, and A.4.4.
1. The prioritization conclusion is drawn from a single scenario (uniform linear motion), raising concerns about its generalizability to other physical contexts
Thank you for your thoughtful comment. We selected the uniform linear motion scenario as it provides a highly representative and clean setting—velocity is easy to measure, and with only one object, the pixel regions corresponding to attributes such as size, color, and speed are not affected by interference from other objects.
To verify the robustness of our conclusion, we also ran experiments on parabolic motion and collisions. The factor prioritization remained consistent, supporting the generalizability of our findings. We will include these additional results in the revision.
2. A rationale for selecting eight types of objects
Thank you for your question. To evaluate combinatorial generalization, we define a "combination" as the physical interaction between different object types. The Phyre simulator we use provides a total of eight distinct object types that enable a wide range of physical interactions. We include all of them to create as complex a combinatorial space as possible for evaluating combinatorial generalization. We appreciate your suggestion and will clarify this rationale in the revision.
3. In Combinatorial Generalization, the evaluation excludes velocity as a metric and instead relies on FVD, SSIM, PSNR, and LPIPS. However, these are image/video metrics and do not guarantee physical correctness
Thank you for your valuable comment. Please allow us to clarify why velocity was excluded as a metric in the Combinatorial Generalization setting, unlike in the ID/OOD setting.
In the ID/OOD setting, scenes are simple (e.g., 1–2 colored balls on a plain background), enabling reliable position and velocity estimation via pixel averaging and frame differencing (line 203). In contrast, the Combinatorial setting includes many irregularly shaped and similarly colored objects, making pixel assignments ambiguous and position estimation unreliable. We also tested methods such as Hough circle detection and pretrained object detectors, but they resulted in significant errors.
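For concreteness, here is a minimal NumPy sketch of the pixel-averaging and frame-differencing estimator described above. This is our own illustration, not the paper's code: the function names and threshold are assumptions, and it presumes a single colored ball on a plain background.

```python
# Hypothetical sketch of the position/velocity estimator described above.
import numpy as np

def ball_center(frame, background, thresh=30.0):
    """Centroid of pixels that differ from the plain background (pixel averaging)."""
    diff = np.abs(frame.astype(float) - background.astype(float)).sum(axis=-1)
    ys, xs = np.nonzero(diff > thresh)
    return np.array([xs.mean(), ys.mean()])

def estimate_velocity(frames, background, fps=30):
    """Finite-difference velocities (pixels/second) from consecutive centroids."""
    centers = np.stack([ball_center(f, background) for f in frames])
    return np.diff(centers, axis=0) * fps
```

With many irregularly shaped, similarly colored objects, the foreground mask can no longer be assigned to a single object, which is exactly why this estimator breaks down in the Combinatorial setting.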
Given these challenges, we excluded velocity as a metric and instead relied on a combination of objective video fidelity metrics and human evaluation:
- FVD, SSIM, PSNR, and LPIPS measure the fidelity of generated videos to ground-truth videos governed by physical laws. While not explicitly designed for physical correctness, they reflect plausibility to some extent by measuring consistency and realism with the ground truth, an approach consistent with prior work like Genie [1] (a per-frame sketch follows after this list).
- We also conducted human evaluation, where each evaluator was specifically instructed to focus on assessing violations of physical laws. We believe this combined approach offers a comprehensive evaluation of physical correctness in the complex setting. We will clarify this in the revised paper.
[1] Bruce, Jake, et al. "Genie: Generative interactive environments." ICML 2024, Best paper.
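As referenced above, here is a hedged per-frame sketch of the fidelity metrics. It is our own illustration, not the paper's evaluation code: it assumes uint8 RGB frames and scikit-image >= 0.19, and omits FVD and LPIPS, which require pretrained networks.

```python
# Illustrative per-frame PSNR/SSIM against ground truth (uint8 RGB assumed).
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_fidelity(gen_frame, gt_frame):
    psnr = peak_signal_noise_ratio(gt_frame, gen_frame, data_range=255)
    ssim = structural_similarity(gt_frame, gen_frame, channel_axis=-1, data_range=255)
    return psnr, ssim

# Stand-in clips of shape (frames, H, W, 3); replace with decoded model outputs.
gen_video = np.random.randint(0, 256, (16, 64, 64, 3), dtype=np.uint8)
gt_video = np.random.randint(0, 256, (16, 64, 64, 3), dtype=np.uint8)
scores = [frame_fidelity(g, t) for g, t in zip(gen_video, gt_video)]
```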
4. How should the discrepancy between Section 3.2 and Figure 3 be interpreted?
Thank you for your insightful question. The difference arises from the type of generalization being evaluated.
In Section 3.2, the evaluation is strictly OOD extrapolation. For example, if the training velocity range is [1.0, 4.0], then the test set contains velocities outside that range, such as [0.0, 0.8] ∪ [4.5, 6.0]. In this setting, simply increasing the amount of data or the model size does not significantly improve performance, because the model would have to truly grasp the physical law and extrapolate beyond the training range.
In contrast, Figure 3 reveals a transition from extrapolation to interpolation: the test set lies between two disjoint training subsets, making the task closer to interpolation. As the gap between the two training regions narrows, model performance on the test set improves, reflecting the model's stronger ability to interpolate. A toy construction of the two regimes is sketched below.
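The following minimal illustration uses the velocity ranges from the example above; the sample counts are hypothetical.

```python
# Illustrative train/test splits for the two regimes discussed above.
import numpy as np

rng = np.random.default_rng(0)

# Extrapolation (Section 3.2): train in [1.0, 4.0], test strictly outside it.
v_train = rng.uniform(1.0, 4.0, 10_000)
v_test = np.concatenate([rng.uniform(0.0, 0.8, 500), rng.uniform(4.5, 6.0, 500)])

# Interpolation (Figure 3): train on two disjoint ranges, test in the gap.
v_train_gap = np.concatenate([rng.uniform(1.0, 2.0, 5_000), rng.uniform(3.0, 4.0, 5_000)])
v_test_gap = rng.uniform(2.0, 3.0, 500)
```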
We will clarify this distinction in the revision.
5. No supplementary material
We have well-organized code and datasets and will make them public once the paper is officially published to support future research.
6. With only the first frame as input, how does the model infer object properties and predict future events?
In the Combinatorial Generalization setting, each object has a distinct visual appearance and color, allowing the model to infer object type and physical attributes by learning associations from training data.
As all objects start static with zero velocity, the model can use these inferred attributes to predict future events.
7. Minor Issues
Thank you for pointing out these minor issues. We have corrected them in the revision.
Summary
We hope our responses have addressed your concerns and strengthened confidence in our paper.
This paper evaluates if scaling video generation models enables learning physical laws. Using a 2D simulator for object motion governed by classical mechanics, experiments reveal: (1) near-perfect in-distribution (ID) generalization with scaling, (2) failure in out-of-distribution (OOD) scenarios despite scaling, and (3) improved combinatorial generalization via increased data diversity. Models exhibit "case-based" generalization, prioritizing attributes (color > size > velocity > shape) rather than abstract rules. Key contributions include systematic scaling analysis and insights into model biases.
Questions for Authors
N/A
Claims and Evidence
Claims are largely supported. OOD failure is evidenced by consistent high errors across scaling levels. Case-based generalization is validated through experiments with flipped training data and attribute conflicts. However, the prioritization hierarchy (color > size > velocity > shape) is tested only in controlled scenarios; broader validation (e.g., real-world textures) is needed to confirm universality.
Methods and Evaluation Criteria
The 2D simulator effectively isolates physical variables, and velocity error metrics align with the goal of assessing physical law adherence. Human evaluations for combinatorial cases add robustness. However, pixel-level metrics (SSIM/PSNR) may not fully capture physical plausibility, and fixed VAE usage limits exploration of end-to-end training benefits.
Theoretical Claims
No formal theoretical claims are made. The framework for evaluating generalization (ID/OOD/combinatorial) is conceptual but well-defined. The empirical focus aligns with the paper’s goals.
Experimental Design and Analysis
Scaling experiments (model/data size) are rigorous, and controlled attribute comparisons clarify prioritization. However, testing only three physical scenarios (linear motion, collision, parabola) limits generalizability.
Supplementary Material
I've reviewed all parts.
Relation to Prior Literature
Connects strongly to prior work on world models and LLM memorization. Challenges assumptions in video generation (e.g., Sora’s physical reasoning claims) by demonstrating OOD limitations. Advances understanding of scaling’s role in combinatorial generalization, aligning with trends in foundation models.
Missing Important References
N/A
Other Strengths and Weaknesses
Novel insights into case-based generalization; actionable scaling guidelines for combinatorial diversity; clear challenge to prevailing narratives about video models as world simulators. Weaknesses: Limited scenario diversity (2D, synthetic data); over-reliance on human evaluation for combinatorial cases; minimal discussion of real-world applicability.
Other Comments or Suggestions
N/A
Thanks for supporting the acceptance of our work. Please find our responses to your questions below.
1. Pixel-level metrics (SSIM/PSNR) may not fully capture physical plausibility; over-reliance on human evaluation for combinatorial cases
Thank you for your comment. We agree that pixel-level metrics alone are insufficient to fully capture physical plausibility. To mitigate this, we relied on a combination of objective video fidelity metrics and human evaluation:
- FVD, SSIM, PSNR, and LPIPS measure the fidelity of generated videos to ground-truth videos governed by physical laws. While not explicitly designed for physical correctness, they reflect plausibility to some extent by measuring consistency and realism with the ground truth, an approach consistent with prior work like Genie [1].
- As you pointed out, fidelity alone may not fully represent physical plausibility. To address this, we also conducted a human evaluation, where each evaluator was specifically instructed to focus on assessing violations of physical laws.
We believe this combined approach offers a comprehensive evaluation of physical correctness in the complex setting.
[1] Bruce, Jake, et al. "Genie: Generative interactive environments." ICML 2024, Best paper.
2. Fixed VAE usage limits exploration of end-to-end training benefits
Thanks for your question. Here we explain why we chose to fix the VAE and why doing so does not limit the performance of the diffusion model in our setting:
- Training stability: During diffusion training, the VAE is used to define the training objective. Updating the VAE during diffusion training makes the latent space unstable, slowing convergence. Hence, widely used architectures like Stable Diffusion 3, DALL·E 3, and Hunyuan Video pretrain and fix the VAE.
- The VAE is not a bottleneck in our setup: As shown in Appendix A.3.2, we validate that VAE reconstructions of the input videos exhibit minimal error, indicating that most inaccuracies arise from the diffusion model rather than the VAE. This was also positively noted by Reviewer hEH6.
We hope this explanation clarifies our rationale and addresses your concerns about fixing the VAE.
3. The prioritization hierarchy (color > size > velocity > shape) is tested only in controlled scenarios; testing only three physical scenarios (linear motion, collision, parabola) limits generalizability; limited scenario diversity (2D, synthetic data); broader validation (e.g., real-world textures) is needed to confirm universality
Thank you for your valuable insights. We use simplified synthetic scenarios for the following reasons:
- Abundant and Controllable Data: Synthetic settings enable large-scale, controlled data generation, allowing systematic study of specific physical principles. Defining settings like ID/OOD or combinatorial generalization is challenging in real-world datasets.
- Isolated Physical Laws: Each synthetic scenario is governed by a single, well-defined kinematic law. In contrast, real-world videos often involve multiple entangled factors (e.g., unknown friction, unobservable forces), making it hard to attribute behavior to specific laws.
- Measurable Physical Quantities: In our controlled setup, physical quantities like velocity and mass can be reliably extracted from video frames. In real-world scenarios, such values are often unobservable, making it hard to verify whether generated videos obey physical laws.
By simplifying the rendering process, we isolate core challenges in learning physical dynamics, making our experiments quantitatively tractable and our findings interpretable.
However, we agree that broader validation with realistic data is important for future work. This would require substantial effort in collecting and curating controllable real-world data and in developing new metrics for evaluating physical consistency. We appreciate your suggestions and welcome further discussion.
4. Minimal discussion of real-world applicability
Our work focuses on scientific insights, particularly regarding the underlying mechanisms of generalization in physical video modeling. These insights can inform real-world scenarios as follows:
Real-world video data is vast, continuous, and high-dimensional, making it more difficult to fully cover than pure language. For example, in real-world robotics, variables such as object velocity, joint configurations, camera angles, noisy backgrounds, and task goals vary across a continuous and combinatorially large space.
Our paper contributes actionable insights to this challenge: we show that scaling combinatorial diversity in the training data—rather than simply increasing dataset size—is significantly more effective for improving physical video modeling. It also implies that model scaling is effective when supported by diversity.
We will include this discussion in the revision.
Summary
We hope our responses improved your confidence in our paper.
This paper evaluates whether scaling video generation models enables learning physical laws. Using a 2D simulator for object motion governed by classical mechanics, the experiments yield several interesting conclusions. I recommend accepting this paper; however, the authors should revise it carefully according to the comments above.