PaperHub
Rating: 6.4 / 10
Poster · 4 reviewers
Scores: 4, 4, 4, 4 (min 4, max 4, std 0.0)
Confidence: 4.3
Novelty: 2.0 · Quality: 2.3 · Clarity: 2.8 · Significance: 2.3
NeurIPS 2025

VideoMAR: Autoregressive Video Generation with Continuous Tokens

OpenReview · PDF
Submitted: 2025-04-14 · Updated: 2025-10-29

Abstract

Masked-based autoregressive models have demonstrated promising image generation capability in continuous space. However, their potential for video generation remains under-explored. In this paper, we propose VideoMAR, a concise and efficient decoder-only autoregressive image-to-video model with continuous tokens, combining temporal frame-by-frame and spatial masked generation. We first identify temporal causality and spatial bi-directionality as the first principle of video AR models, and propose the next-frame diffusion loss for the integration of mask and video generation. Besides, the huge cost and difficulty of long-sequence autoregressive modeling is a basic but crucial issue. To this end, we propose temporal short-to-long curriculum learning and spatial progressive resolution training, and employ a progressive temperature strategy at inference time to mitigate the accumulation error. Furthermore, VideoMAR replicates several unique capacities of language models in video generation. It inherently bears high efficiency due to simultaneous temporal-wise KV cache and spatial-wise parallel generation, and presents the capacity of spatial and temporal extrapolation via 3D rotary embeddings. On the VBench-I2V benchmark, VideoMAR surpasses the previous state-of-the-art (Cosmos I2V) while requiring significantly fewer parameters (9.3%), training data (0.5%), and GPU resources (0.2%).
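The "first principle" above (temporal causality combined with spatial bi-directionality) can be pictured as a block attention mask over the token sequence. Below is a minimal PyTorch sketch under the assumption that tokens are laid out frame by frame; the helper name and shapes are illustrative and not taken from the paper.

```python
import torch

def temporal_causal_spatial_bidirectional_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend). A token attends to every token
    in its own frame (spatial bi-directionality) and to all tokens of earlier
    frames (temporal causality), but never to tokens of future frames."""
    n = num_frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame       # frame index of each position
    return frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)

# e.g. 3 frames x 4 tokens per frame -> a 12x12 block-lower-triangular mask
print(temporal_causal_spatial_bidirectional_mask(3, 4).int())
```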
Keywords
Video generation · Autoregressive model · Continuous tokens

Reviews and Discussion

Review
Rating: 4

This paper presents VideoMAR, a masked autoregressive (MAR) framework for video generation. VideoMAR applies a causal masked autoregressive formulation for temporal modelling, while using bidirectional attention for spatial modelling. This separation enables KV caching, significantly improving inference efficiency compared to standard next-frame prediction models. In addition, the authors employ progressive training strategies (short-to-long and low-res-to-high-res) to enable 768p resolution video generation. Evaluated on VBench, VideoMAR outperforms Cosmos (another autoregressive baseline) while remaining less competitive than recent diffusion-based models that require much more compute.

Strengths and Weaknesses

The paper proposes a clean, well-motivated design. The use of KV caching provides a substantial inference speed-up, and the generation quality over Cosmos demonstrates a solid performance improvement.

[Weaknesses and Suggestions]

  • The core design of VideoMAR appears to be a direct extension of masked autoregressive (MAR) frameworks originally developed for image modeling. Similar MAR-based video modelling designs have already been explored in prior works such as NOVA, MAGI, and MarDini. The progressive training techniques for longer videos and higher resolution are widely adopted in the literature, especially in diffusion-based models, and are also present in the aforementioned autoregressive works. The discussion in Lines 109–121 attempts to contrast VideoMAR with non-MAR video models, but fails to distinguish VideoMAR meaningfully from existing MAR-based models. Specifically, point 3 seems to critique the separation of temporal and spatial modeling, yet this is often considered a strength (e.g., in MarDini), as it enables more scalable architectures in an even more parameter-efficient way. More clarification is needed on why this is framed as a limitation in existing work and how VideoMAR differs.

  • In Table 3, the categorisation of models into “Diffusion” and “Autoregressive” is somewhat inconsistent and confusing:

    • The proposed model still relies on a diffusion framework, so calling it purely autoregressive may be misleading.
    • Models such as MAGI-1, which appear in the “Diffusion” category, are actually autoregressive in design.
    • Cosmos, the main baseline, is not a pure video generation model—it is more of a pretrained system for robotics and control tasks, which makes it an imperfect point of comparison. Including more relevant autoregressive video generation baselines would strengthen the empirical evaluation.
  • From the appendix samples, the generated videos appear to exhibit low frame rates and minimal object motion. It would be helpful for the paper to (1) explicitly report the frame rate (FPS) of the generated videos, and (2) include the training and generation configurations that impact motion dynamics (e.g., frame stride, temporal length, augmentation strategy).

  • The term “video-to-video” used in the paper may be misleading. The setup described, generating future frames conditioned on two reference frames, is better described as video extension. Typically, “video-to-video” refers to style transfer, semantic layout preservation, or conditioned transformation between videos.

Questions

See weaknesses.

Limitations

No. It would be great to add a small limitations section discussing the model's current shortcomings as well as areas for improvement.

Final Rating Justification

The rebuttal has resolved most of my concerns.

Formatting Concerns

No.

Author Response

Q1: VideoMAR appears to be a direct extension of masked autoregressive (MAR) frameworks.

We respectfully disagree with this assessment. VideoMAR represents a significant advancement over standard MAR frameworks by specifically tailoring the masking mechanism to the unique properties of video.

First, our core contribution lies in the exploration and integration of video-specific masking strategies. We systematically investigated multiple approaches, including a total mask strategy that randomly masks entire frames, to find the optimal design. This thorough investigation goes far beyond a direct extension.

Second, the paper dedicates significant discussion to justifying our choice of the next-frame diffusion loss. This design is not arbitrary: it was chosen because it uniquely eliminates the training-inference discrepancy common in generative models, since training with the next-frame diffusion loss aligns exactly with the next-frame generation paradigm used at inference, leaving no training-inference gap.

Third, we clearly distinguish VideoMAR from related works. As for the mentioned related works, NOVA and MAGI, we comprehensively discussed their differences in Lines 37-42 and 73-77. NOVA decomposes temporal and spatial generation by producing coarse features frame-by-frame and refining each frame with a spatial layer, but this complicates the framework and weakens temporal smoothness. MAGI mitigates this issue by appending a complete copy of the video sequence during training, but doubles the sequence length and training cost. MarDini is similar to NOVA in design, and proposes that MAR handles temporal planning while a DM focuses on spatial generation. We do not include MarDini in the paper since it actually differs from MAR: MAR focuses on token-wise modeling with a small diffusion MLP (less than 10M parameters), whereas MarDini focuses on image-wise modeling with a large transformer architecture (288M parameters).


Q2: Presenting progressive training as a primary contribution.

We respectfully argue that this perspective misses the specific context of our contribution. The innovation is not progressive training in isolation, but its strategic and effective integration within a masked autoregressive system for video generation.

(1) Contextual Novelty: While progressive training is indeed prevalent in diffusion models, its application to autoregressive video models is notably rare and poses unique challenges. To our knowledge, our work is the first to successfully implement and scale progressive training for masked video generation. Unlike prior methods, our system demonstrates its viability and effectiveness at a large scale, which we believe is a novel contribution.

(2) Systematic Impact and Efficiency: The use of progressive training is not an isolated trick but a cornerstone of our system's design. This strategy is directly responsible for the remarkable efficiency gains that allow our model to surpass the strong Cosmos baseline using significantly fewer computational resources. The from-scratch training cost of our method is highly competitive and sets a new benchmark for efficiency in this domain. Furthermore, the short-to-long curriculum aligns perfectly with the sequential nature of autoregressive modeling, making it a principled rather than an ad-hoc choice.
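To make the curriculum concrete, the sketch below lists hypothetical stages of a temporal short-to-long and spatially progressive-resolution schedule; the frame counts and resolutions are illustrative placeholders rather than the exact configuration used for VideoMAR.

```python
# Hypothetical curriculum stages (illustrative values, not the paper's exact schedule).
# Each stage resumes from the previous checkpoint: clips are lengthened first,
# then the spatial resolution is raised.
CURRICULUM_STAGES = [
    {"num_frames": 1,  "resolution": (256, 256)},   # image-only warm-up
    {"num_frames": 9,  "resolution": (256, 256)},   # short clips
    {"num_frames": 25, "resolution": (256, 256)},   # temporal short-to-long
    {"num_frames": 25, "resolution": (512, 768)},   # spatial progressive resolution
    {"num_frames": 49, "resolution": (512, 768)},   # final long, high-resolution stage
]

for stage in CURRICULUM_STAGES:
    print(f"train on {stage['num_frames']} frames at {stage['resolution']}")
```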


Q3: The discussion in Lines 109–121.

Thanks for your attention to this discussion. We wish to clarify that the primary focus of this paragraph is to analyze different masking strategies within the MAR framework, not to contrast VideoMAR with non-MAR models. The temporal and spatial separation we critique refers specifically to MAR-based models like NOVA. As we demonstrate, such designs can lead to severe temporal flickering and inconsistency. We did not include MarDini in this specific comparison because its image-wise modeling paradigm falls outside the scope of MAR's token-wise approach, and it remains closed-source without a public benchmark for evaluation.


Q4: The categorization in table 3.

Our categorization in Table 3 follows established practices in generative modeling (e.g., Fluid [1] and RandAR [2]), based on the fundamental modeling scope (e.g. token-wise modeling or not).

(1) We classify models as "diffusion-type" if they operate holistically at the image or clip level, and "autoregressive-type" if they operate sequentially at the token level. This is a standard distinction.

(2) Under this principle, VideoMAR, which models continuous tokens sequentially, is correctly categorized as autoregressive, consistent with prior work like Fluid [1] and RandAR [2].

(3) Conversely, MAGI-1 performs clip-wise modeling and is thus considered a diffusion-type model. This also explains its strong performance, as it can leverage powerful pre-trained diffusion backbones, an advantage not available to token-wise models trained from scratch.

(4) Cosmos is explicitly described as an autoregressive world model in its original paper. Its applications in robotics are downstream fine-tuning tasks, not a reflection of its core generative architecture. We also note that Cosmos was trained with massive resources (e.g., 10,000 H100 GPUs), which underscores the efficiency and scalability of our approach.

[1] Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

[2] RandAR: Decoder-only Autoregressive Visual Generation in Random Orders


Q5: Frame rate and related parameters.

Thanks for your interest in these details. We are happy to clarify. The motion dynamics in our generated videos are primarily determined by the training data distribution. To validate VideoMAR's capability on more complex scenarios, we direct the reviewer to our supplementary material (Section 8). As shown in Figure 14, by fine-tuning on specialized datasets, our model successfully replicates challenging motions like "blinking" and "blossoming". The generated videos have a frame rate of 12 fps. During training, we subsampled the source videos by a factor of 2 (i.e., sampling one frame for every two).


Q6: The term "video-to-video".

Thank you for the excellent suggestion. We have corrected this term in the revised manuscript.

Comment

Dear Reviewer Xwue,

Thank you again for your insightful review and suggestions. We were encouraged that you acknowledged our design, inference efficiency, and generation quality. Following your valuable feedback, we have provided a rebuttal to address your concerns. We also gently note that Reviewer EZ8c initially held similar concerns, but raised their score to positive once all of their concerns were addressed.

We believe the rebuttal resolves all of your concerns. As a gentle reminder, the discussion period is closing. We would appreciate your feedback confirming that our responses have resolved your concerns, or letting us know of any remaining concern we can address to ensure a quality work.

Yours sincerely,
Authors of Paper 2250

Review
Rating: 4

The paper proposes an MAR-based continuous-token video generation framework and introduces a suite of efficient training and inference strategies that markedly accelerate inference speed. Notably, the method attains competitive performance while using substantially fewer parameters and training resources.

Strengths and Weaknesses

Strengths:

  1. The authors are, to the best of my knowledge, the first to introduce an MAR-style generation strategy into the video domain and achieve promising results.
  2. With markedly fewer parameters and reduced training resources, the method outperforms Cosmos I2V on the VBench-I2V benchmark.

Weaknesses:

  1. The degree of novelty and technical depth is insufficient for a top-tier conference. The work largely adapts an existing image-masking strategy to video, but does not discuss or experiment with more video-specific adaptations—for instance, the masking scheme still targets only the current frame.
  2. Presenting progressive training (increasing resolution and clip length) as a primary contribution is problematic; this is now a standard practice in image/video generation, so the added value here is limited.
  3. The paper lacks qualitative (visual) comparisons with competing methods, which are essential for assessing video generation quality.

Questions

  1. For mitigating error accumulation, the authors simply tune hyper-parameters. Have Framepack[1] or similar training strategies been explored?

    [1] Zhang, Lvmin, and Maneesh Agrawala. "Packing input frame context in next-frame prediction models for video generation." arXiv preprint arXiv:2504.12626 (2025).

  2. Table 2 is unclear. Under the full VideoMAR method, what differentiates “NTP” from “w/o KV Cache”? Moreover, the reported 672 s for “P6-180” is unexplained—please clarify the source of this figure.

  3. In the training setup, different resolution stages employ visual tokenizers with varying compression ratios. Please justify this choice and add experiments using a consistent compression ratio for comparison.

Limitations

The paper does not investigate MAR’s adaptation to video in sufficient depth. As the authors themselves note, images and videos differ; yet the masking strategy merely reveals preceding frames to the current frame while excluding future frames from mask prediction. This design discards MAR’s bidirectional capability along the temporal axis, effectively reducing it to an image-level bidirectional operation.

Final Rating Justification

After reviewing the authors' rebuttal, I find that most of my concerns have been addressed through clear explanations and additional experiments. Therefore, I raise my rating to 4 (Borderline Accept).

Formatting Concerns

None.

Author Response

Q1: The degree of novelty and technical depth.

We respectfully disagree. Our paper's contribution is a novel masking framework specifically engineered for video.

(1) Contrary to the reviewer's claim, our work moves beyond simple adaptation. We conducted extensive experiments on various video-specific masking strategies to validate our final model design. For example, we experimented with the "total mask" strategy, which randomly masks entire frames of the video (Lines 96-116, 253-256, and Figure 6).

(2) Our central innovation, the next-frame diffusion loss, is explicitly designed to solve a fundamental problem: the mismatch between training objectives and inference-time generation. By perfectly aligning the training loss with the next-frame prediction task, our method eliminates the common training-inference gap. This principled solution is the primary contribution of our paper and represents a non-trivial advancement in video modeling.


Q2: Presenting progressive training as a primary contribution.

We respectfully argue that this perspective misses the specific context of our contribution. The innovation is not progressive training in isolation, but its strategic and effective integration within a masked autoregressive system for video generation.

(1) Contextual Novelty: While progressive training is indeed prevalent in diffusion models, its application to autoregressive video models is notably rare and poses unique challenges. To our knowledge, our work is the first to successfully implement and scale progressive training for masked video generation. Unlike prior methods, our system demonstrates its viability and effectiveness at a large scale, which we believe is a novel contribution.

(2) Systematic Impact and Efficiency: The use of progressive training is not an isolated trick but a cornerstone of our system's design. This strategy is directly responsible for the remarkable efficiency gains that allow our model to surpass the strong Cosmos baseline using significantly fewer computational resources. The from-scratch training cost of our method is highly competitive and sets a new benchmark for efficiency in this domain. Furthermore, the short-to-long curriculum aligns perfectly with the sequential nature of autoregressive modeling, making it a principled rather than an ad-hoc choice.


Q3: Qualitative comparisons with competing methods.

Thanks for the comment. Our selection of baselines was deliberate and follows the established standards in autoregressive video generation. We specifically chose Cosmos as our primary comparison because it represents the state-of-the-art and is the most relevant baseline for our image-to-video task. We are confident that this comparison provides a clear and rigorous evaluation of our method's performance. If the reviewer has other specific methods in mind, we would be happy to discuss their relevance.


Q4: The exploration of error accumulation.

Thank you for the feedback. We will break down our response into three parts:

On Our Temperature Strategy: We argue that the perceived simplicity of our temperature strategy is its primary strength, not a weakness. To our knowledge, it is the first training-free method proposed to mitigate error accumulation in this context. This is a significant advantage over the few existing alternatives, which typically rely on complex and costly training-based perturbations. Our approach is therefore highly novel, practical, and demonstrably effective, representing a valuable contribution to the field.
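As a rough illustration of what such a training-free schedule can look like, the sketch below varies the sampling temperature of the per-token diffusion head with the frame index at inference time; the linear form, direction, and endpoint values are assumptions for exposition, not the schedule reported in the paper.

```python
# Illustrative progressive temperature schedule (assumed linear form and values;
# the paper's actual schedule may differ). No retraining is required: the
# temperature is simply changed per generated frame at inference time.
def frame_temperature(frame_idx: int, num_frames: int,
                      t_first: float = 1.0, t_last: float = 0.7) -> float:
    """Linearly interpolate temperature from t_first (first frame) to t_last (last)."""
    if num_frames <= 1:
        return t_first
    alpha = frame_idx / (num_frames - 1)
    return (1.0 - alpha) * t_first + alpha * t_last

# e.g. a 17-frame rollout
temperatures = [frame_temperature(i, 17) for i in range(17)]
```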

On the Mention of Framepack: We must firmly state that citing Framepack is inappropriate and falls outside standard reviewing principles. Framepack was publicly released on April 17th, less than one month before the NeurIPS submission deadline of May 15th. It is a classic case of concurrent work. More importantly, from a technical standpoint, Framepack's approach is fundamentally incompatible with our model's principles. Its anti-drifting method sacrifices the strict causal prediction chain by incorporating bi-directional context, meaning it is no longer a purely autoregressive model. This design also limits its flexibility, as it demands specialized training for different sampling methods. Our method, in contrast, preserves the causal structure and requires no such modifications.

On "Similar Training Strategies": We find the comment on "similar training strategies" too general to be actionable. For a constructive discussion, we would have appreciated specific references that the reviewer believes are relevant. Without concrete examples, we are unable to provide a direct comparison. We remain confident, based on our thorough literature review, that our system design is novel and state-of-the-art for this specific problem domain.


Q5: NTP explanation in Table 2.

Thanks for the suggestion. We will clarify in three points:

(1) The "NTP" entry in Table 2 denotes replacing the mask-based generation paradigm of VideoMAR with the vanilla next-token-prediction paradigm, while keeping all other parts identical (e.g., the same architecture and token-wise diffusion loss). The "w/o KV Cache" entry follows the mask-based generation design of VideoMAR but omits the frame-wise KV cache. The comparison between "NTP" and "w/o KV Cache" thus reflects the inference-speed difference between the next-token-prediction and mask-based generation paradigms (a rough forward-pass comparison is sketched after this list). We will also append the above explanation in the revised manuscript.

(2) This is a common operation in previous MAR-series methods (such as MAR and Fluid), where NTP and mask-based paradigms are extensively compared and discussed.

(3) The value 672 is a typo; it should be 671, consistent with Table 2. Thank you for pointing this out.
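As referenced in point (1), a back-of-envelope count of forward passes shows where the speed gap between the two paradigms comes from; the per-frame token count and step counts below are assumed round numbers, not the paper's exact configuration.

```python
# Illustrative forward-pass counting (assumed numbers, for intuition only).
tokens_per_frame = 16 * 16      # assumed latent token grid per frame
num_frames = 17
mask_steps_per_frame = 16       # assumed number of parallel unmasking steps

# Vanilla NTP decodes one token per forward pass.
ntp_passes = num_frames * tokens_per_frame
# Mask-based generation updates all tokens of a frame in parallel at each step.
mask_passes = num_frames * mask_steps_per_frame

print(ntp_passes, mask_passes)  # 4352 vs 272, before any frame-wise KV-cache savings
```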


Q6: Different compression ratios.

Thanks for the suggestion. This is a deliberate design choice to maintain computational feasibility for higher resolution training. By increasing the compression ratio, we effectively manage the token sequence length, ensuring that training remains efficient and fits within our available GPU memory budget. Our strong results at these resolutions validate the success of this practical strategy.


Q7: The limitation summarized by the reviewer.

Causality is the basic concept of autoregressive generation, and it is also the core concept of our method: generation is causal along the temporal dimension and bidirectional along the spatial dimension. Therefore, the reviewer's claim that our design "discards MAR's bidirectional capability along the temporal axis" is incorrect. Our model, by design, never had temporal bidirectionality, as this is antithetical to causal generation. We believe the reviewer has confused our model's spatial bidirectionality with its strict temporal causality.

Comment

After reading the author rebuttal, I acknowledge that the authors have addressed most of my concerns with detailed clarifications and new experiments. I will raise my rating.

Review
Rating: 4

This paper presents VideoMAR, a decoder‑only, mask‑based autoregressive (AR) framework for image‑to‑video synthesis using continuous tokens. Instead of the classic next‑token (NTP) paradigm with discrete vocabularies, VideoMAR compresses video into continuous latent tokens via a VideoVAE, then generates each frame sequentially (temporal causality) while predicting masked tokens within a frame bidirectionally. On the VBench‑I2V benchmark, VideoMAR (1.4 B params, 0.5 M videos) outperforms the prior AR state‑of‑the‑art (Cosmos) across total and sub‐dimension scores (84.82 vs. 84.22), while using an order of magnitude less data, compute, and model size. It also approaches some diffusion‑based I2V methods at far lower resource cost.

Strengths and Weaknesses

Strengths:

  1. VideoMAR sets a new AR I2V baseline, delivering strong VBench‑I2V performance with minimal resources. This demonstrates that continuous‑token, mask‑based AR can rival heavy diffusion pipelines.
  2. The integration of a next‑frame diffusion loss into mask‑based AR video, combined with curriculum strategies and temperature scheduling, is novel.
  3. The paper is well‑organized; Figures 2–7 clearly illustrate the framework, training curriculum, inference strategies, and qualitative gains.
  4. KV‑cache and spatial grouping cut inference from thousands to ~100 s, making AR I2V viable in practice. The 3D‑RoPE extrapolation results (longer/ higher‑res videos without retraining) are compelling.

Weaknesses:

  1. Evaluation focuses exclusively on VBench‑I2V. It remains unclear how VideoMAR generalizes across other I2V datasets or domain shifts.
  2. Ablations cover frame loss, causal mask, and temperature, but omit: effect of curriculum vs. fixed schedules, and comparison to NTP in quality (beyond speed).
  3. Reported VBench scores lack variance estimates or multiple runs. Small differences (e.g., 84.22→84.82) may lie within noise.

Questions

  1. Have you evaluated VideoMAR on any external I2V datasets (e.g. WebVid, YouCook2) or real‑world videos?
  2. An ablation comparing curriculum learning (short‑to‑long; progressive res) to a single‑stage training schedule would clarify its contribution to final quality vs. merely speeding convergence.
  3. Can you quantify per‑frame error accumulation (e.g., LPIPS drift) with and without temperature scheduling?
  4. Although NTP makes inference intractable, do discrete‑token AR baselines (e.g. raster‑scan MAR) achieve similar VBench scores if trained on the same data?

Limitations

Yes, it's discussed.

Final Rating Justification

I thank the authors' reply and decide to maintain my score.

Formatting Concerns

N/A

Author Response

Q1: Exploration on diverse datasets or complex video scenarios.

Thank you for this insightful question regarding generalization. We agree that generalization is fundamentally linked to the scale and diversity of the training data. While our model excels on distributions seen during training, pushing the boundaries of generalization further is a broader challenge for the field, and we believe expanding datasets with richer motion is a critical direction for future work.

In this paper, we demonstrate our model's robust generalization capabilities through both state-of-the-art benchmark performance and targeted experiments on high-motion scenarios. (1) Benchmark Performance: VideoMAR achieves state-of-the-art results on the comprehensive VBench benchmark. This performance across diverse scenarios inherently validates our model's strong generalization on established tasks. (2) Adaptation to Complex Motion: We specifically tested the model's capacity to handle high-motion dynamics. As detailed in Supplement Section 8, after fine-tuning on specialized datasets (e.g., eye blinks, flower blossoming), our model successfully captures these complex movements (Figure 14). This proves the architecture's ability to adapt and generalize to new, challenging motion profiles when provided with targeted data.


Q2: Comparison between curriculum and fixed schedule.

Thank you for the question about our training schedule; we are happy to elaborate on the rationale. Our adoption of a short-to-long curriculum schedule is a deliberate design choice, grounded in both a clear rationale and decisive empirical validation from our early-stage experiments. We selected this approach for two primary reasons. First, it significantly improves training efficiency, and its progressive nature aligns perfectly with the sequential modeling principles of autoregressive frameworks. Second, crucially, this rationale was confirmed by direct empirical comparison: we trained two models from scratch for 20 epochs at 25x256x256 resolution using 16 H20 GPUs. The model with the curriculum schedule began to converge and generate roughly plausible videos. In stark contrast, the model trained with a fixed-length schedule failed to converge and produced no meaningful output under the identical setting. This result confirms that the curriculum is not merely a minor optimization but a fundamental component that enables the successful training of our model.


Q3: Comparison between VideoMAR and NTP.

We appreciate the opportunity to clarify our comparison strategy and the rationale for our model's design. Our work aligns with findings from prior research (e.g., MAR, Fluid), which has consistently shown that mask-based generation with bidirectional context outperforms causal NTP for visual tasks under fair comparisons.

To provide a robust empirical comparison, we benchmarked our model against Cosmos, a leading NTP-style baseline. We were pleased to find that our method achieves superior results on VBench while being significantly more resource-efficient. Given these results and the considerable resources required to train a new large-scale NTP model from scratch, we believe this comparison provides a clear and fair assessment of our method's advantages.


Q4: Variance of the VBench score.

Thank you for emphasizing the importance of statistical rigor. We provide a three-part analysis that confirms the robustness and significance of our results.

(1) Score Stability and Robustness: VBench is already designed for reliability by averaging over five runs per sample. To provide further validation, we conducted three additional evaluation runs for both our model and the Cosmos baseline. As shown in the table below, the variance is negligible (less than 0.1), confirming that the reported scores are stable and the performance gap is consistent.

Methods | Model size | Total Score | I2V Score | Quality Score
Cosmos | 5B | 84.18 (±0.04) | 92.39 | 75.97
VideoMAR-stage2 | 1.4B | 84.82 (±0.02) | 93.94 | 75.69

(2) Significance of the 0.6 Point Margin: We wish to emphasize that a 0.6 point difference on VBench is significant in the context of state-of-the-art models. On the official VBench leaderboard, top-performing methods are often separated by margins of less than 0.1 points. Therefore, our lead of 0.6 points represents a substantial improvement.

(3) The Overarching Context of Efficiency: Our model achieves this superior score while demanding only a fraction of the resources: 0.5% of the training data and 0.2% of the GPU hours compared to the baseline. This monumental leap in efficiency, combined with a statistically significant performance gain, validates the efficacy and novelty of our approach.


Q5: Quantification of per-frame error accumulation.

Thanks for your valuable suggestions. We have performed the suggested LPIPS analysis, which decisively confirms the effectiveness of our temperature schedule in combating temporal drift.

(1) We list the LPIPS drift (lower is better) comparison in the following table. Experimental setting: (a) we adopt VideoMAR to generate videos of resolution 49x512x768; (b) the experiments are conducted on VBench-I2V with 246 samples; (c) we generate the corresponding videos with and without the temperature schedule; (d) we calculate the LPIPS distance between the i-th frame and the first frame (0-th frame) and average across all 246 videos (a sketch of this computation is given after point (2) below). Experimental conclusion: the quantitative results are consistent with our claim that the temperature schedule helps smooth the generated video.

LPIPS drift (×10^-2) | 0-th | 8-th | 16-th | 24-th | 32-th | 40-th | 48-th
w/o temperature schedule | 0 | 2.51 | 3.03 | 3.36 | 3.55 | 3.66 | 3.80
w/ temperature schedule | 0 | 1.96 | 2.38 | 2.63 | 2.86 | 2.99 | 3.08

(2) We also quantify the performance gain of the temperature schedule via VBench scores in Table 4 of the main manuscript, where the temperature schedule elevates the VBench score from 80.72 to 82.56.
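For completeness, here is a minimal sketch of the drift metric described in (1), assuming the open-source lpips package and videos stored as float tensors in [-1, 1] with shape [T, 3, H, W]; variable names are illustrative.

```python
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net="alex")  # standard AlexNet-backed LPIPS

def lpips_drift(video: torch.Tensor, frame_ids=(0, 8, 16, 24, 32, 40, 48)) -> dict:
    """LPIPS distance of each selected frame to the first frame of the video."""
    ref = video[0:1]                      # [1, 3, H, W]
    with torch.no_grad():
        return {i: loss_fn(video[i:i + 1], ref).item() for i in frame_ids}

# Averaging lpips_drift(v) over all evaluation videos, with and without the
# temperature schedule, yields rows like the table above.
```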

Comment

I thank the author's reply. Most of my concerns are resolved and I maintain my score.

Review
Rating: 4

The paper introduces VideoMAR, a decoder-only autoregressive video generation framework using continuous tokens. Unlike prior work that relies on discrete tokens or diffusion methods, VideoMAR combines temporal frame-by-frame generation with spatial masked modeling, enabling efficient and high-quality video synthesis. VideoMAR outperforms prior AR models (e.g., Cosmos) on the VBench-I2V benchmark while being more efficient in terms of model size, data, and compute. It also demonstrates promising generation quality and extrapolation capabilities.

Strengths and Weaknesses

Strengths: Quality & Clarity: The paper is well-written and clearly structured, providing extensive experiments, detailed ablations, and visual comparisons to validate the effectiveness of each proposed component.

Significance: VideoMAR demonstrates superior performance compared to previous autoregressive models while significantly reducing model size, data requirements, and computational resources, making it practically impactful.

Originality: Introduces novel training techniques (temporal short-to-long curriculum, spatial progressive resolution, progressive temperature schedule) and extends continuous token-based masked autoregressive modeling to video generation.

Weaknesses: Scope of Validation: Limited exploration on diverse datasets or complex video scenarios (e.g., long narratives) to robustly demonstrate generalization.

Novelty: Components such as masked-based generation and curriculum learning individually have precedent in prior works, somewhat reducing overall conceptual novelty.

Computational Bottlenecks: Although efficient, the method’s scalability to significantly longer sequences or higher resolutions without substantial computational overhead requires clearer exploration.

Questions

How does VideoMAR perform on longer or more complex video scenarios (e.g., narrative structure, multiple interacting objects, real-world dynamics)?

How critical is 3D RoPE to the spatial-temporal extrapolation ability? Have you compared against other encoding strategies?

How does the inference speed compare to modern diffusion-based methods?

Limitations

Yes.

Final Rating Justification

I am satisfied with the rebuttal and will keep my rating

Formatting Concerns

NA

Author Response

Q1: Exploration on diverse datasets or complex video scenarios.

Thank you for this insightful question regarding generalization. We agree that generalization is fundamentally linked to the scale and diversity of the training data. While our model excels on distributions seen during training, pushing the boundaries of generalization further is a broader challenge for the field, and we believe expanding datasets with richer motion is a critical direction for future work.

In this paper, we demonstrate our model's robust generalization capabilities through both state-of-the-art benchmark performance and targeted experiments on high-motion scenarios. (1) Benchmark Performance: VideoMAR achieves state-of-the-art results on the comprehensive VBench benchmark. This performance across diverse scenarios inherently validates our model's strong generalization on established tasks. (2) Adaptation to Complex Motion: We specifically tested the model's capacity to handle high-motion dynamics. As detailed in Supplement Section 8, after fine-tuning on specialized datasets (e.g., eye blinks, flower blossoming), our model successfully captures these complex movements (Figure 14). This proves the architecture's ability to adapt and generalize to new, challenging motion profiles when provided with targeted data.


Q2: Scaling efficiency for larger spatial-temporal resolutions.

Thanks for highlighting efficiency. Our model's efficiency is empirically demonstrated by its ability to outperform a powerful baseline like Cosmos with a fraction of the resources. Furthermore, we have validated its scalability by successfully training at high resolutions (49x512×768). The fact that this was achieved under standard hardware constraints is not a limitation but rather a powerful validation of our model's efficient design. We contend that our architecture is fully equipped for further scaling, which is a matter of resource allocation, not a fundamental challenge to our method.


Q3: 3D RoPE and extrapolation ability.

We appreciate the opportunity to elaborate on our model's extrapolation capabilities. This ability is thoughtfully designed and stems from two key elements working in concert. First, the autoregressive design of VideoMAR provides a natural foundation for generating longer videos by modeling them sequentially. This sets the stage for extrapolation. Second, our experiments, detailed in Section 1 of the supplement, show why 3D-RoPE is so important. We found that models with absolute position encodings could not extrapolate to new lengths, whereas our use of 3D-RoPE, a relative encoding, successfully enabled this capability. This demonstrates that it is a critical part of our model's design for achieving robust generalization in sequence length.
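To illustrate the general form of such an encoding, below is a minimal sketch of a factorized 3D rotary embedding, assuming the channel dimension splits evenly into three even-sized chunks for the (t, h, w) axes; the split and frequency base are assumptions, not the paper's exact implementation.

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply 1D rotary embedding along the last dim of x (shape [N, d], d even)
    for integer positions pos (shape [N])."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    ang = pos[:, None].float() * inv_freq[None, :]            # [N, d/2]
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x: torch.Tensor, t: torch.Tensor, h: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Split the channels into three chunks and rotate each with its own axis.
    Because the encoding is purely relative, positions beyond the training range
    remain well-defined, which is what permits spatial/temporal extrapolation."""
    d = x.shape[-1] // 3
    return torch.cat(
        [rope_1d(x[..., :d], t), rope_1d(x[..., d:2 * d], h), rope_1d(x[..., 2 * d:], w)],
        dim=-1,
    )

# e.g. 12 tokens with 24 channels and (t, h, w) coordinates for 3 frames of 2x2 tokens
x = torch.randn(12, 24)
t_ids = torch.arange(12) // 4
h_ids = (torch.arange(12) % 4) // 2
w_ids = torch.arange(12) % 2
y = rope_3d(x, t_ids, h_ids, w_ids)
```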


Q4: Inference speed compared to modern diffusion-based methods.

We append the inference speed comparison with modern diffusion-based methods, including HunyuanVideo and Wan2.1, in the following table. All values are measured with a batch size of 1 on a single H20 GPU. Our method demonstrates competitive inference speed. We will also append this result in the revised version. Note that the inference speed of VideoMAR can be further improved with fewer mask steps and diffusion steps.

Methods | Model size | Resolution | Inference time (s)
HunyuanVideo | 13B | 49x608x800 | 428
Wan2.1 | 1.3B | 49x480x832 | 103
Wan2.1 | 14B | 49x480x832 | 414
VideoMAR | 1.4B | 49x512x768 | 134
Final Decision

This paper presents VideoMAR, a decoder-only, mask-based autoregressive (AR) framework for image-to-video synthesis using continuous tokens. After rebuttal, it received scores of 4, 4, 4, 4. All the reviewers are positive about the paper, commenting that the proposed method is novel, achieves superior performance with minimal resources, and is practically impactful. The paper is also well-written and contains comprehensive experiments and detailed ablations. Reviewers also commented that their concerns were largely addressed during the rebuttal. Therefore, the AC would like to recommend acceptance of the paper.