PaperHub

Overall rating: 5.5 / 10
Decision: Rejected (4 reviewers)
Individual ratings: 6, 5, 5, 6 (min 5, max 6, std 0.5)
Confidence: 4.3
Correctness: 2.5
Contribution: 2.5
Presentation: 3.3

Abstract

Keywords

Video Generation, Diffusion Model, Masked Auto-regressive Model

Reviews and Discussion

Review (Rating: 6)

The paper proposes a novel masked autoregressive diffusion model architecture that leverages a planning component, which processes downsampled masked videos to learn high-level and long-term dependencies, coupled with a smaller diffusion model that learns high-resolution spatial dependencies. In addition, the authors propose a progressive training strategy for better training stability.

Strengths

  • The paper is generally well-written and clear
  • The architecture is simple, and a smaller diffusion head can help speed up inference by caching MAR outputs.
  • Experimental results show better performance on various masked generation tasks, such as image-to-video generation and video interpolation, as well as ablations showing the benefits of the model design

Weaknesses

  • I am not yet convinced of the scalability of this model. From Table (4), it seems that the “L” variants of MarDini perform worse than the “S” variants? Is this correct?
  • Related to the above point, would this model have a harder time scaling to learn distributions that are more stochastic? Since a vast majority of parameters are in MAR and not in the diffusion model. To leverage the greater number of params in the MAR, the MAR would need to encode as many possible futures into its representation for the diffusion model to read from, which may be difficult if the complexity of the distribution grows (unconditional video generation where the MAR output is not useful / constant, or simultaneously predicting / interpolating longer videos).
  • The training recipe is a little bit complicated, requiring a combination of stages such as separate / joint training, and different masking ratios.

Questions

  • Do you have video samples with more deformable motion (e.g. humans running / doing stuff, or animals moving around)? Most examples shown in the supplementary show motion related to fire / water / clouds, camera translations, or object translation, all of which are generally among the easier things for a video generation model to learn.
  • Instead of the initial training stage, would it work to jointly train the model from scratch, with the MAR simultaneously optimizing the MSE loss? Could the diffusion model also be stabilized if cross attention were initialized to zero / a zero-init scaling term were learned (if that was not done already)?
  • Is masked autoregression used during generation? E.g. similar to MaskGit, where in this case you would progressively sample subsets of frames following some schedule (a toy frame-level schedule is sketched below). If not, what is the masked autoregression aspect of the model (i.e., what is autoregressive here)?
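
For reference, here is a minimal sketch of the MaskGit-style frame-level schedule the last question refers to. It is purely illustrative: the frame count, number of rounds, cosine schedule, and `score_fn` are assumptions for the sketch, not details from the paper.

```python
import math
import torch

def progressive_frame_sampling(score_fn, num_frames=16, rounds=4):
    """MaskGit-style schedule at the frame level: fix the most confident frames each round."""
    masked = torch.ones(num_frames, dtype=torch.bool)    # True = frame not yet generated
    order = []
    for r in range(rounds):
        conf = score_fn(masked).clone()                  # (num_frames,) confidence per frame
        conf[~masked] = float("-inf")                    # already-fixed frames are never re-picked
        # cosine schedule: fewer frames stay masked as rounds progress (none after the last round)
        keep_masked = math.floor(num_frames * math.cos(math.pi / 2 * (r + 1) / rounds))
        num_to_fix = int(masked.sum()) - keep_masked
        fixed = conf.topk(max(num_to_fix, 0)).indices    # most confident still-masked frames
        masked[fixed] = False
        order.append(sorted(fixed.tolist()))
    return order                                         # which frames get generated in each round

# toy usage with random "confidences" standing in for a real model's scores
print(progressive_frame_sampling(lambda m: torch.rand(m.numel())))
```
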
Review (Rating: 5)

This paper integrates masked auto-regression (MAR) and diffusion model (DM) in a ControlNet-like fashion, combining the efficient generation capability of MAR with the high-quality generation power of DM to achieve video generation conditioned on any number of masked frames. The network architecture adopts an asymmetric design, allocating most of the parameters to the MAR branch and using spatial-temporal attention. The MAR branch takes masked low-resolution frames as input and outputs conditioning signals for the DM network. The DM network, with fewer parameters and utilizing only spatial and temporal separated attention, generates masked high-resolution frames. Under the influence of the conditioning, the DM network generates the missing masked parts. This approach achieves state-of-the-art video interpolation and image-to-video performance.
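
To make the asymmetric design described above concrete, here is a minimal sketch of the general pattern: a heavy planning model runs once on low-resolution masked frame tokens, and a light diffusion model reuses its cached output at every denoising step. All module names, sizes, and the Euler-style update are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PlanningMAR(nn.Module):
    """Heavy planning model: one forward pass over low-res masked frame tokens."""
    def __init__(self, in_dim=256, dim=1024, depth=6):           # much deeper in a real model
        super().__init__()
        self.embed = nn.Linear(in_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=16, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, lowres_tokens, token_mask):
        # lowres_tokens: (B, N, in_dim); token_mask: (B, N) bool, True = token to be generated
        x = self.embed(lowres_tokens)
        x = torch.where(token_mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        return self.blocks(x)                                     # planning signal: (B, N, dim)

class SmallDM(nn.Module):
    """Light diffusion model: evaluated at every denoising step on high-res latents."""
    def __init__(self, in_dim=1024, dim=384, cond_dim=1024, depth=2):
        super().__init__()
        self.in_proj = nn.Linear(in_dim, dim)
        self.cond_proj = nn.Linear(cond_dim, dim)                 # ControlNet-like injection of the planning signal
        self.t_embed = nn.Linear(1, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.out_proj = nn.Linear(dim, in_dim)

    def forward(self, noisy, cond, t):
        x = self.in_proj(noisy) + self.cond_proj(cond) + self.t_embed(t)
        return self.out_proj(self.blocks(x))                      # predicted noise

@torch.no_grad()
def sample(mar, dm, lowres_tokens, token_mask, hires_init, steps=25):
    cond = mar(lowres_tokens, token_mask)                         # heavy model: run once, then cached
    x = hires_init
    for i in reversed(range(steps)):                              # cheap model: run at every step
        t = torch.full((x.shape[0], x.shape[1], 1), (i + 1) / steps)
        eps = dm(x, cond, t)
        x = x - eps / steps                                       # placeholder Euler-style update, not a real sampler
    return x

# toy usage: one video, 256 tokens shared between the low-res and high-res grids (a simplification)
B, N = 1, 256
mar, dm = PlanningMAR(), SmallDM()
lowres = torch.randn(B, N, 256)
to_generate = torch.rand(B, N) > 0.5
latents = sample(mar, dm, lowres, to_generate, hires_init=torch.randn(B, N, 1024))
print(latents.shape)  # torch.Size([1, 256, 1024])
```

The point of the sketch is the cost asymmetry: the expensive `PlanningMAR` forward happens once per video and its output is cached, while only the small `SmallDM` is evaluated inside the sampling loop.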

Strengths

  1. This paper proposes a computationally balanced asymmetric network architecture. The MAR part has a larger number of parameters and higher computational complexity, but it requires fewer forward passes and takes low-resolution frames as input. In contrast, the DM part has fewer parameters and lower computational complexity, but requires more forward passes and models high-resolution frames. This is an ingenious approach.

  2. The paper introduces Identity Attention, which cleverly addresses the issue of unstable training and theoretically reduces the computational load of the attention module (although it is unclear whether the authors actually implemented this in the code).

Weaknesses

  1. This paper does not provide sufficient ablation experiments to demonstrate the effectiveness of the progressive training strategy. At the very least, experiments are needed to show that under the same training cost, the combination of Joint-Model Stage + Joint-Task Stage outperforms the model with only the Joint-Task Stage.

  2. In Table 3, the paper does not compare its results with SEINE [1], which, as far as I know, also focuses on video interpolation.

  3. Although the proposed method offers some advantages in inference latency, it introduces a large number of parameters compared to previous diffusion methods, which increases the memory requirements of the device. However, Table 3 lacks a direct comparison of memory usage.

  4. Despite the introduction of numerous additional parameters, the image-to-video performance in Table 4 still lags behind previous work [2, 3].

[1] Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. SEINE: Short-to-long video diffusion model for generative transition and prediction. In Proceedings of the International Conference on Learning Representations (ICLR), 2024.

[2] Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Xintao Wang, Tien-Tsin Wong, and Ying Shan. Dynamicrafter: Animating open-domain images with video diffusion priors. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.

[3] Zuozhuo Dai, Zhenghao Zhang, Yao Yao, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. Fine-grained open domain image animation with motion guidance. arXiv preprint arXiv:2311.12886, 2023.

Questions

  1. Directly using a pre-trained VideoMAE [4, 5] to initialize the MAR branch might be more elegant and efficient for several reasons:

    ① Leverage Pre-trained Features: VideoMAE is pre-trained on a large corpus of video data and has already learned spatiotemporal features. Initializing the MAR branch with VideoMAE would allow the model to start with a strong foundation, potentially leading to faster convergence and better performance, especially for tasks related to video generation.

    ② Reduced Training Time: By starting with a pre-trained model, the MAR branch wouldn't need to learn basic spatiotemporal representations from scratch, reducing the overall training time and computational cost. (A minimal warm-start sketch is given after this list.)

  2. Table 4 has typos in the last two rows: "MARDini-L/T-17" and "MARDini-S/T-17" should be corrected to "MarDini-L/T-17" and "MarDini-S/T-17" to maintain consistent naming conventions. Additionally, these two rows are missing the Latency metric.
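
Regarding question 1 above, a rough sketch of the warm-start idea: load a pre-trained VideoMAE encoder and copy over whatever parameters match by name and shape. This is hypothetical (not something done in the paper); `mar_branch` is a stand-in module, the checkpoint is the public MCG-NJU release on Hugging Face, and the `transformers` library is assumed to be installed.

```python
import torch.nn as nn
from transformers import VideoMAEModel

def warm_start(mar_branch: nn.Module, donor: nn.Module):
    """Copy every parameter that matches the target by name and shape; leave the rest untouched."""
    donor_sd = donor.state_dict()
    own_sd = mar_branch.state_dict()
    copied = {k: v for k, v in donor_sd.items()
              if k in own_sd and own_sd[k].shape == v.shape}
    own_sd.update(copied)
    mar_branch.load_state_dict(own_sd)
    return sorted(copied)                                   # names of the tensors actually transferred

videomae = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")   # public ViT-B video encoder
# transferred = warm_start(mar_branch, videomae)            # `mar_branch` is a hypothetical module defined elsewhere
```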

[4] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. Advances in Neural Information Processing Systems (NeurIPS), 2022.

[5] Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

Review (Rating: 5)

The authors propose a method for image animation, video interpolation and extrapolation. The novelty of the method lies in the use of a small diffusion model (DM) augmented by the feature representations (planning signals) produced by a large masked autoregressive (MAR) model operating at low resolution. By allocating a large amount of compute to the MAR model, diffusion sampling can happen efficiently, as the MAR model needs to be executed only once. The proposed research question of improving the efficiency of DMs is of large interest to the community, though the current work only tackles image animation, video interpolation and extrapolation, without treating the most widespread text-to-video scenario. The paper is mostly clearly written. The high-level method idea is sound, though several architectural and training-recipe details are not fully justified, reducing the soundness of the approach.

Strengths

  • The authors treat the relevant problem of improving efficiency of DM-based video generators
  • The proposed idea of using a MAR model as a feature extractor is interesting
  • The paper is well-written, though many choices are not fully justified
  • Ablations show that the small DM benefits from access to MAR features, validating the design choice.

Weaknesses

  • Absolute quality of the generated videos as shown in the supplementary material is low
  • Tab 1. presents evaluation of the method vs a DM with same compute and no MAR. A missing baseline would be comparing the full approach with a DM of a capacity rescaled to match the compute of DM + MAR. Without such baseline it is not possible to assert that the combination of DM + MAR outperforms DM under equal computational constraints.
  • Several details in the architecture and training recipe are unclear or appear not sound (see questions)
  • Fig 5 (c) shows bad convergence behavior, even after the proposed fixes. The behavior is justified by the use of warmup. Warmup is unlikely to generate a sharp transition at 6k steps as the one shown in the plot which appears to be more likely justified by optimization challenges for the current architecture, raising doubts on how scalable and user-friendly the proposed framework would be.
  • In Tab. 3. The model underperforms LDMVFI and VIDIM in MidF-LPIPS and FID. A discussion in LL430 attributes this to defective metrics behavior with respect to blur. This appears unconvincing as, differently from PSNR and SSIM, LPIPS and FID would strongly be penalized by the presence of blur. In particular, FVD, would have a behavior substantially similar to FID, being both distribution based metrics. Lower performance under these metrics raises concerns about the quality of the method. The sample in Fig. 6 does not convincingly show the author's point. I would suggest showing multiple samples from regions containing large amounts of high frequency details where differences would appear more clearly.
  • Text to video generation is not treated in this work
  • (minor) The employed VAE does not perform temporal compression, contrary to modern T2V generators performing 4x (Cogvideo-X) to 8x (MovieGen) temporal compression. This appears as a lost opportunity for speedup. Note that temporal compression still allows for masking on each frame independently, by applying masks to the input video before encoding.

Questions

  • The model naturally lends itself to the tasks of video inpainting. I.e. considering frames with partial masks. Did the authors consider such scenario? Considering partial masks would enable model training on larger datasets of images that would improve performance.
  • LL123-126 highlight using only videos without labels as a model advantage. The model however shows low quality video generations that could have been improved through joint image/video training and considering text captions that can be automatically generated for the training data. Could the authors discuss why the chosen approach is desirable?
  • Why does the MAR model present a Spatial-only attention block? Modern T2V generators tend to use spatial blocks only as a way to preserve generation capabilities if the model was retrained on image-only data. Why would the MAR model benefit from Spatial-only attention blocks that would have been traded for more Spatio-Temporal blocks for which importance is advocated in the paper?
  • Why does the Temporal attention block in the DM not have modulation applied to it?
  • Why does modulation involve shifting? As modulation is applied after RMSNorm operations it would be natural to only consider scale operations. This would come at the benefit of reducing the amount of parameters devoted to modulation by 33% (see the sketch after this list).
  • LL194 mention the use of Layer Normalization. Is layer normalization used to perform QK normalization? Why do the authors not employ RMSNorm for QK normalization as previous work (e.g. Stable Diffusion 3)? This would be a more logical choice given the use of RMSNorm layers as the main normalization mechanism.
  • LL200-203 mention the use of [NEXT] tokens as Lumina-T2X (Gao 2024). In their follow up work Lumina-Next, the authors advocate for the removal of the mechanism in favor of 3D RoPE. Why was this approach adopted in this work?
  • LL345 could the authors clarify how FSDP would contribute to inference speedups? FSDP would only allow sharding of model parameters to reduce memory usage, but would require additional parameter gathering overhead to achieve this. Please elaborate.
  • LL292 does not appear to be well justified. Could the authors describe why it would not be more desirable to train the DM from scratch with such architecture?
  • In Fig. 7, the discussed training recipe for the "Initial Stage" of the DM comprises continuous changes in the set of training data (20M Clips -> 80M Clips -> 40M Clips) and FPS (8 -> 4 -> 8). The MAR is trained on 200M clips. Could the authors describe the rationale for adopting the chosen strategy and discuss how readers should instantiate their training recipe in practice?
  • The proposed idea of using a MAR model as a feature extractor is interesting. The MAR model needs to only be run a single time during the whole DM sampling process. However, single-step distillation (see "SF-V: Single Forward Video Generation Model") of a large video DM also appears as viable route for achieving good quality-cost tradeoff with a single evaluation of a large model. Could the authors describe why MAR + small DM represent a more attractive solution than large video DM + single-step distillation?
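
To make the modulation question above (scale-and-shift vs scale-only after RMSNorm) concrete, here is an illustrative AdaLN-style block; it is a hypothetical layer, not the paper's implementation. With shift + scale + gate the conditioning projection emits 3×dim values per block; dropping the shift leaves 2×dim, i.e. the roughly one-third reduction mentioned in the question.

```python
import torch
import torch.nn as nn

def rms_norm(x, eps=1e-6):
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

class ModulatedBlock(nn.Module):
    def __init__(self, dim, use_shift=True):
        super().__init__()
        self.use_shift = use_shift
        n_mods = 3 if use_shift else 2                       # (shift, scale, gate) vs (scale, gate)
        self.to_mod = nn.Linear(dim, n_mods * dim)           # conditioning -> modulation parameters
        self.ff = nn.Linear(dim, dim)                        # stand-in for the attention / MLP sublayer

    def forward(self, x, cond):
        mods = self.to_mod(cond).chunk(3 if self.use_shift else 2, dim=-1)
        if self.use_shift:
            shift, scale, gate = mods
            h = rms_norm(x) * (1 + scale) + shift
        else:
            scale, gate = mods
            h = rms_norm(x) * (1 + scale)                    # scale-only: one third fewer modulation params
        return x + gate * self.ff(h)

x, cond = torch.randn(2, 16, 512), torch.randn(2, 1, 512)    # cond broadcasts over the sequence
n_full = sum(p.numel() for p in ModulatedBlock(512).to_mod.parameters())
n_slim = sum(p.numel() for p in ModulatedBlock(512, use_shift=False).to_mod.parameters())
print(ModulatedBlock(512)(x, cond).shape, 1 - n_slim / n_full)   # torch.Size([2, 16, 512]) ~0.33
```
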
Comment

LL194 mention the use of Layer Normalization. Is layer normalization used to perform QK normalization? Why do the authors not employ RMSNorm for QK normalization as previous work (e.g. Stable Diffusion 3)? This would be a more logical choice given the use of RMSNorm layers as the main normalization mechanism.

Use of LayerNorm is an unusual choice for QK normalization. The referenced ablation would have been valuable if shown.
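
For reference, QK normalization with RMSNorm, as asked about above, amounts to normalizing queries and keys per head before the dot product. A minimal sketch follows; the head count and dimensions are arbitrary, and the learnable per-head scale that is common in practice is omitted.

```python
import torch
import torch.nn.functional as F

def rms_norm(x, eps=1e-6):
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def attention_with_qk_norm(q, k, v):
    # q, k, v: (batch, heads, tokens, head_dim); normalize q and k per head before the dot product
    q, k = rms_norm(q), rms_norm(k)
    return F.scaled_dot_product_attention(q, k, v)

q = k = v = torch.randn(1, 8, 64, 64)
print(attention_with_qk_norm(q, k, v).shape)  # torch.Size([1, 8, 64, 64])
```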

LL200-203 mention the use of [NEXT] tokens as Lumina-T2X (Gao 2024). In their follow up work Lumina-Next, the authors advocate for the removal of the mechanism in favor of 3D RoPE. Why was this approach adopted in this work?

Using 3D RoPE on features not divisible by 3 is possible through uneven allocation of channels to the spatial and temporal dimensions, a commonly followed practice in modern video generators [5]. Adopting a technique that was abandoned by its proposing authors themselves [5] casts doubt on the design choices, compounding the doubts raised by the other unusual modeling choices reported and the lack of corresponding ablation results. My concern is unaddressed.
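
For concreteness, the "uneven allocation of channels" mentioned above can be sketched as follows: a head dimension that is not divisible by 3 (e.g. 64) is split into, say, 16 channels for time and 24 each for height and width, and an ordinary 1D rotary embedding is applied per axis. The split and dimensions are illustrative assumptions, not taken from any specific model.

```python
import torch

def rope_1d(x, pos):
    # x: (..., d) with d even; pos: integer positions broadcastable against x's leading dims
    d = x.shape[-1]
    freqs = 1.0 / (10000 ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = pos[..., None].float() * freqs                 # (..., d/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.cat([x1 * angles.cos() - x2 * angles.sin(),
                      x1 * angles.sin() + x2 * angles.cos()], dim=-1)

def rope_3d(x, t, h, w, split=(16, 24, 24)):                # uneven split of a 64-dim head
    xt, xh, xw = x.split(split, dim=-1)
    return torch.cat([rope_1d(xt, t), rope_1d(xh, h), rope_1d(xw, w)], dim=-1)

q = torch.randn(1, 17 * 32 * 32, 64)                        # (batch, frames*height*width tokens, head_dim)
t, h, w = torch.meshgrid(torch.arange(17), torch.arange(32), torch.arange(32), indexing="ij")
print(rope_3d(q, t.flatten(), h.flatten(), w.flatten()).shape)  # torch.Size([1, 17408, 64])
```
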

LL345 could the authors clarify how FSDP would contribute to inference speedups? FSDP would only allow sharding of model parameters to reduce memory usage, but would require additional parameter gathering overhead to achieve this. Please elaborate.

Thank you for the discussion

LL292 does not appear to be well justified. Could the authors describe why it would not be more desirable to train the DM from scratch with such architecture?

Thank you for the discussion. My concern is partially addressed, an ablation would have been valuable to support the decision and add rigor to the work.

In Fig. 7, the discussed training recipe for the "Initial Stage" of the DM comprises continuous changes in the set of training data (20M Clips -> 80M Clips -> 40M Clips) and FPS (8 -> 4 -> 8). The MAR is trained on 200M clips. Could the authors describe the rationale for adopting the chosen strategy and discuss how readers should instantiate their training recipe in practice?

Thank you for the discussion. Please see my previous comments.

The proposed idea of using a MAR model as a feature extractor is interesting. The MAR model needs to only be run a single time during the whole DM sampling process. However, single-step distillation (see "SF-V: Single Forward Video Generation Model") of a large video DM also appears as viable route for achieving good quality-cost tradeoff with a single evaluation of a large model. Could the authors describe why MAR + small DM represent a more attractive solution than large video DM + single-step distillation?

Thank you for the discussion. My concern is addressed

Comment

Absolute quality of the generated videos as shown in the supplementary material is low

The authors report training with 256 H100 GPUs (equivalent to 512 to 768 A100 GPUs). [1], [2], and [3] show that higher-quality video generation is possible with a more limited amount of resources. Failure to obtain good video quality with such resources, while using stronger conditioning, is indicative of problematic aspects of the method highlighted throughout the review. My concern remains not addressed.

[1] Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
[2] Pyramidal Flow Matching for Efficient Video Generative Modeling
[3] SF-V: Single Forward Video Generation Model

Tab 1. presents evaluation of the method vs a DM with same compute and no MAR. A missing baseline would be comparing the full approach with a DM of a capacity rescaled to match the compute of DM + MAR. Without such baseline it is not possible to assert that the combination of DM + MAR outperforms DM under equal computational constraints.

As reported by other reviewers, the scalability problem of training a larger diffusion model baseline is not clearly explained and may be indicative of problematic aspects of the experimental setting.

Fig 5 (c) shows bad convergence behavior, even after the proposed fixes. The behavior is justified by the use of warmup. Warmup is unlikely to generate a sharp transition at 6k steps as the one shown in the plot which appears to be more likely justified by optimization challenges for the current architecture, raising doubts on how scalable and user-friendly the proposed framework would be.

Several video generation works are able to scale diffusion transformers to achieve high-quality video generation results by adopting rather standard techniques [4][5]; please also see the previously cited examples. The proposed Identity Attention does not appear to fully address the instability problems that emerge in the loss plot, and further method scalability remains a concerning aspect of the work. I understand the design proposed by the authors is prone to make training more unstable. The complex training recipe appears to be the result of trial-and-error solutions to address such instability, and limited insight is given into the reason behind each choice. I appreciate the transparency in reporting it, but without deeper insights and guidelines that future work can follow, reporting such a complex training recipe does not give many valuable insights to the reader. My concern remains not addressed.

[4] CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
[5] Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

Comment

We greatly appreciate the reviewer’s detailed and prompt feedback and are pleased to see that most concerns have been addressed. Below, we provide additional clarifications regarding your remaining questions.

Performance and Resource Claims

We would like to highlight two key points regarding our model’s performance and the reported compute resources:

  1. Comprehensive Compute Accounting
    Our reported compute includes not only the training of the final model but also the exploration of model variations, ablation studies, evaluation benchmarks, and the full model training. We apologize for not making this explicit earlier and will clarify this in the updated paper.

  2. Differences in Training Configurations
    Comparing MarDini’s training resources with those of SnapVideo, Pyramidal Flow Matching, and SF-V may be misleading due to significant differences in model configurations and training setups:

    • SnapVideo operates in a significantly smaller generation space, producing videos with lower resolution and shorter durations. Specifically, SnapVideo generates 16-frame videos at 512 × 288 resolution and 24 fps (roughly 0.7 seconds of video), as stated in the original paper:

      "We generate 16 frames videos in 512 ×\times 288 px resolution at 24 fps for all settings."

      In contrast, MarDini generates 16-frame videos at 1024 × 1024 resolution and 8 fps, producing 2-second-long videos. This difference in resolution and duration highlights the increased complexity of MarDini's generation capabilities, as demonstrated in our attached video examples.

    • Pyramidal Flow Matching relies on pre-training from a strong text-to-image model (SD3-medium), as stated in the paper:

      "We adopt the MM-DiT architecture, based on SD3 Medium, which comprises 24 transformer layers and a total of 2B parameters. The weights of the MM-DiT are initialized from the SD3 medium."

      In contrast, MarDini is trained from scratch on unlabelled video data only. This gives Pyramidal Flow Matching a clear advantage in convergence and generation quality, leveraging additional textual encoders and pre-trained models along with captioned image and video data.

    • SF-V builds directly on a pre-trained video model (SVD), as stated in its paper:

      "We apply Stable Video Diffusion as the base model across our experiments."

      MarDini, again being trained without any pre-trained models, cannot match such starting advantages.

To conclude, direct comparisons to these models are challenging due to different scopes and goals:

  • Pyramidal Flow Matching was released after the ICLR submission deadline, so a direct comparison is not required under the official reviewing guidelines.
  • SnapVideo focuses on shorter, lower-resolution video generation, limiting its applicability for direct benchmarking with MarDini.
  • SF-V, as a distillation-based approach, underperforms SVD when the latter uses 25 inference steps.

In contrast, MarDini matches SVD’s performance on VBench (as shown in Table 4) and outperforms many state-of-the-art image-to-video models.

Again, we would like to highlight that MarDini emphasizes the exploration of efficient temporal modelling rather than text guidance, making direct comparisons with text-to-video models less appropriate. We believe other similar image-to-video models, such as SVD and DynamiCrafter, serve as more relevant baselines here. Notably, most image-to-video baselines (excluding those fine-tuned from text-to-video models) are also trained on datasets comprising short video clips, making this a fair comparison. DynamiCrafter generates videos with 16 frames at a resolution of 576×1024, while SVD produces 25 frames at the same resolution; MarDini, in contrast, is designed to generate 16 frames but at a higher resolution of 1024×1024.

Comment

Scalability Concerns

While scaling the size of diffusion models has been shown to improve results in recent works (e.g., Meta Video Gen, OpenAI Sora), we prioritize the exploration of alternative architectural design spaces over model scaling. Our focus is on proposing new design strategies that enhance flexibility and efficiency, enabling MarDini to address diverse video generation tasks within a unified framework.

-- Empirical Comparison with Diffusion Models of Similar Parameter Size

We understand the reviewer’s interest in comparing MarDini with a diffusion model of the same parameter size and we agree this is a meaningful comparison. However, re-training a diffusion model of comparable size from scratch would require substantial computational resources and is beyond our current budget. Instead, we argue that SVD, which has a comparable parameter count to MarDini-S (~1.5B), already serves as a strong diffusion-based baseline, as suggested by the reviewer.

Our experiments demonstrate that MarDini-S achieves similar performance on VBench compared to SVD, as shown in our paper. This highlights MarDini's ability to maintain competitive results while offering greater flexibility and efficiency.

-- Rationale for Using a 300M Parameter Diffusion Model

We selected a diffusion model with 300M parameters primarily to align with the commonly used ViT-L size in ImageNet-trained models. This choice ensures sufficient capacity for generating realistic frames, as evidenced by prior work [A].

Additionally, constraining the size of the diffusion model opens up design space for scaling the planning model (MAR) to further improve temporal modelling capabilities. This deliberate design decision emphasizes MarDini's goal of exploring efficient temporal architectures while maintaining high-quality frame generation.

[A] Wei, Chen, et al. "Diffusion models as masked autoencoders." Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.

We hope these clarifications address the reviewers’ concerns and provide a deeper understanding of our design choices and priorities.

Stability Concerns.

We would like to clarify that, with our proposed training recipe and architectural design, MarDini's training is NOT unstable. However, we emphasize that addressing training instability through thoughtful design is not unique to MarDini but rather a common challenge in many video generation models.

-- Insights from Related Work

As highlighted in both our paper and the reviewer feedback, many of the aforementioned video generation baselines have also exhibited instability during training. For example:

  • Lumina-T2X explicitly identifies training instability as one of its challenges:

    "Despite its promising capabilities, Lumina-T2X still encounters challenges, including training, instability, slow inference, and extrapolation artifacts."

    To mitigate these issues, Lumina-T2X introduces a specialized component:

    "Next, we task an in-depth analysis of the instability during both training and sampling and find it stems from the uncontrollable growth of network activations across layers. Therefore, we introduce the sandwich normalization block in attention modules, which is proven to effectively control the activation"

  • CogVideoX similarly reports challenges with training convergence. As shown in Figure 10 of their manuscript, using 2D + 1D attention aggravates loss convergence issues, necessitating careful parameter configuration.

These examples demonstrate that directly adopting a transformer-based architecture for a new task often results in unstable training. This underscores the critical need for carefully designed innovations to address the unique challenges of the task at hand.

MarDini exemplifies this approach by introducing novel solutions such as Identity Attention and a progressive training strategy to ensure stability and scalability in video generation tasks. These contributions are fundamental to the success of our model and represent a significant step forward in addressing this common challenge.

We hope this clarification addresses the reviewers’ concerns and provides valuable context regarding the broader challenges of training stability in video generation.

Comment

Metric Reliability and Qualitative Comparisons

We thank the reviewers for their valuable suggestions. Regarding the evaluation metrics, we would like to emphasize that FVD remains the primary metric for assessing video generation tasks within the research community, as demonstrated by several key works [A, B]. While the reliability and limitations of such metrics are indeed an important topic, we believe this discussion extends beyond the scope of our current work.

Specifically, we agree that extensive qualitative comparisons can provide additional insights. However, the state-of-the-art video interpolation baseline, VIDIM [C], has not yet released its model weights. This presents significant challenges for us in conducting a rigorous and fair user study to compare qualitative results.

[A] Wang, Xiaojuan, et al. "Generative inbetweening: Adapting image-to-video models for keyframe interpolation." arXiv preprint arXiv:2408.15239 (2024).

[B] Blattmann, Andreas, et al. "Stable video diffusion: Scaling latent video diffusion models to large datasets." arXiv preprint arXiv:2311.15127 (2023).

[C] Jain, Siddhant, et al. "Video interpolation with diffusion models." CVPR 2024.

Additional Ablations

We appreciate the reviewer’s suggestion regarding the ablation of additional design elements and agree that such experiments can provide valuable insights. However, conducting these specific ablations in a fair and accurate manner would require a significant computational budget. Given the scope and resource constraints of this study, we have prioritized exploring and analyzing critical design choices, such as Identity Attention and the training strategies for the MAR and generation models.

Comment

Additional Support on Using [NEXT] Token

We appreciate the reviewer’s attention to the [NEXT] token and would like to further elaborate on its design and functionality in comparison to Lumina [A, B].

Lumina-T2X utilizes learnable special tokens such as [nextline] and [nextframe] to encode video data as a unified one-dimensional sequence, as described in their work:

"Since the original DiT can only handle a single image at a fixed size, we further introduce learnable special tokens including the [nextline] and [nextframe] tokens to transform training samples with different scales and durations into a unified one-dimensional sequence."

In contrast, MarDini introduces the [NEXT] token, which is conceptually similar to Lumina’s [nextline] but diverges in its implementation. Specifically, MarDini adopts a 2D embedding representation for the [NEXT] token, as opposed to the one-dimensional [nextframe] token used in Lumina-T2X. This 2D representation is designed to better align with our model’s architectural requirements for video data processing and use of Identity Attention.

While the [NEXT] token is not entirely novel, it builds upon established practices from other domains, such as in language modelling. For example, BERT [C] uses [CLS] and [SEP] tokens to demarcate and organize input sequences, another concept that inspires our [NEXT] token design. These standardized and empirically validated designs have proven effective across domains, and we prioritize adopting such stable methodologies to ensure higher reliability in MarDini's design.
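
As a generic illustration of the separator-token mechanism discussed above (Lumina's [nextline]/[nextframe], BERT's [SEP]), the sketch below inserts a learnable token after each frame's tokens when the video is flattened into one sequence. It shows only the generic idea, not MarDini's 2D-embedding variant; the module and its dimensions are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class FrameSequencer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.next_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)   # learnable [NEXT] embedding

    def forward(self, frame_tokens):
        # frame_tokens: (B, F, T, dim) -> (B, F*(T+1), dim) with a [NEXT] token closing each frame
        B, F, T, D = frame_tokens.shape
        sep = self.next_token.expand(B, F, 1, D)
        return torch.cat([frame_tokens, sep], dim=2).reshape(B, F * (T + 1), D)

seq = FrameSequencer(dim=512)(torch.randn(2, 16, 64, 512))
print(seq.shape)  # torch.Size([2, 1040, 512])
```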

[A] Gao, Peng, et al. "Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers." arXiv preprint arXiv:2405.05945 (2024).

[B] Zhuo, Le, et al. "Lumina-next: Making lumina-t2x stronger and faster with next-dit." arXiv preprint arXiv:2406.18583 (2024).

[C] Devlin, Jacob. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

We sincerely appreciate the reviewer’s thoughtful and careful assessment of our work. We hope our responses have resolved your concerns, and we welcome any further questions or feedback you may have. Please feel free to let us know if there are additional points you would like us to address.

Comment

Snap Video produces videos at variable framerates, and its qualitative results on the project website are almost exclusively shown at 6 fps, producing 2.67 s videos. Snap Video is trained from scratch without image pretraining, and its authors report using a lower amount of resources to perform training.

While the authors are correct in saying that a direct comparison to text-to-video methods is not possible, the magnitude of the gap in the quality of generated results between such methods and the proposed approach is large enough to be concerning.

The small quantity of qualitative results supplied by the authors in support of the proposed method compounds this concern.

Comment

Specifically, we agree that extensive qualitative comparisons can provide additional insights. However, the state-of-the-art video interpolation baseline, VIDIM [C], has not yet released its model weights. This presents significant challenges for us in conducting a rigorous and fair user study to compare qualitative results.

The authors could have reinforced their evaluation by providing a large set of qualitative results as part of their supplementary independently from the availability of baseline methods. In addition, there are at least 16 publicly available videos from VIDIM on which a user study could have been run.

This 2D representation is designed to better align with our model’s architectural requirements for video data processing and use of Identity Attention.

Could you please specify in detail in which way 3D positional embeddings would not be adequate as a substitute for the [NEXT] construct abandoned by Lumina?

Comment

In Tab. 3. The model underperforms LDMVFI and VIDIM in MidF-LPIPS and FID. A discussion in LL430 attributes this to defective metrics behavior with respect to blur. This appears unconvincing as, differently from PSNR and SSIM, LPIPS and FID would strongly be penalized by the presence of blur. In particular, FVD, would have a behavior substantially similar to FID, being both distribution based metrics. Lower performance under these metrics raises concerns about the quality of the method. The sample in Fig. 6 does not convincingly show the author's point. I would suggest showing multiple samples from regions containing large amounts of high frequency details where differences would appear more clearly.

I appreciate the clarification about the metrics, though my concern remains unaddressed as FVD alone is not sufficient to support better method performance. The supplementary material could have been used to showcase an extensive set of qualitative results supporting the authors claims or a user study could have been added to support claimed flaws in the metrics. My concern remains not addressed.

(minor) The employed VAE does not perform temporal compression, contrary to modern T2V generators performing 4x (Cogvideo-X) to 8x (MovieGen) temporal compression. This appears as a lost opportunity for speedup. Note that temporal compression still allows for masking on each frame independently, by applying masks to the input video before encoding.

Thank you for the discussion. My concern is addressed.

The model naturally lends itself to the tasks of video inpainting. I.e. considering frames with partial masks. Did the authors consider such scenario? Considering partial masks would enable model training on larger datasets of images that would improve performance.

Thank you for the discussion. My concern is addressed.

LL123-126 highlight using only videos without labels as a model advantage. The model however shows low quality video generations that could have been improved through joint image/video training and considering text captions that can be automatically generated for the training data. Could the authors discuss why the chosen approach is desirable?

Thank you for the discussion. My concern is addressed.

Why does the MAR model present a Spatial-only attention block? Modern T2V generators tend to use spatial blocks only as a way to preserve generation capabilities if the model was retrained on image-only data. Why would the MAR model benefit from Spatial-only attention blocks that would have been traded for more Spatio-Temporal blocks for which importance is advocated in the paper?

My concern is partially addressed by the discussion, though the chosen architecture and choice to finetune the Spatio-Temporal attention into a temporal attention could have been ablated against the sole use of Spatio-Temporal attention used by state-of-the-art video generators such as Movie Gen. The paper suffers from complex or unusual architectural choices that are not sufficiently ablated.

Why does the Temporal attention block in the DM not have modulation applied to it? Why does modulation involve shifting? As modulation is applied after RMSNorm operations it would be natural to only consider scale operations. This would come at the benefit of reducing the amount of parameters devoted to modulation by 33%

My concern is partially addressed by the reference. An ablation would have improved the paper.

Review (Rating: 6)

This paper presents a new framework for video generation, MarDini, which integrates masked auto-regression (MAR) and diffusion model (DM). Specifically, it separates the low-res planning and high-res generation loads into a heavy-weight but efficient MAR and a lightweight but expensive DM, effectively balancing the computation of scaling video generation. To stabilize training, the paper also proposes Identity Attention, which preserves the input reference frames without attending to other tokens, and a multi-stage training strategy. The method has been evaluated for video interpolation, image-to-video generation, and video expansion. Visualization is provided in the supplemental materials.
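
One way to read the "Identity Attention" described above is as an attention mask in which tokens of reference frames attend only to themselves, while tokens of frames to be generated attend to the full sequence; a reference token's attention output is then exactly its own value vector, which preserves the reference content through the attention layer. Below is a minimal sketch of this reading; it is an interpretation for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def identity_attention_mask(frame_is_reference, tokens_per_frame):
    # frame_is_reference: (F,) bool; returns an (N, N) bool mask, True = attention allowed
    ref = frame_is_reference.repeat_interleave(tokens_per_frame)     # (N,) per-token reference flag
    n = ref.numel()
    allowed = torch.ones(n, n, dtype=torch.bool)                     # generated tokens: attend everywhere
    allowed[ref] = torch.eye(n, dtype=torch.bool)[ref]               # reference tokens: attend to self only
    return allowed

mask = identity_attention_mask(torch.tensor([True, False, False, True]), tokens_per_frame=4)
q = k = v = torch.randn(1, 1, 16, 32)                                # (batch, heads, tokens, head_dim)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
# each reference token's output row equals its own value vector (softmax over a single allowed key)
```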

Strengths

  1. This paper presents a novel video generation framework that separates low-res temporal planning and high-res spatial generation via asymmetric MAR and DM models, effectively balancing the computation cost in this task. In general, I do agree with the three characteristics/strengths of MarDini listed in the Introduction, i.e., the flexibility, scalability, and efficiency advantages.
  2. The large flexibility of the proposed framework offers huge room for exploration, not only from the architecture side but also the applications (e.g., as shown in the Appendix, the model can potentially be generalized to other tasks such as zero-shot 3D view synthesis and video expansion). This paper can likely inspire future research in many domains.
  3. In-depth analyses are presented to reveal the model’s behavior and justify the design choices, nicely supporting the arguments.
  4. The paper has set new SoTAs on VBench for image-to-video generation and VIDIM-Bench for zero-shot video interpolation, further justifying the effectiveness of the proposed framework.
  5. Overall, I think this paper has been nicely written, methods, results, and visualizations are clearly presented.

Weaknesses

Overall, I am happy with this paper, but considering the scale of experiments (256 H100 GPUs for training), which is very large, there are a few concerns I wish to point out:

  1. Video length. It seems that the method is only evaluated on very short videos, mostly generating 9 or 17 frames at low FPS. It is unclear that the method, in particular the temporal planning, can generalize well to longer videos.
  2. Conditioning. The paper claims that the method does not require image-based pre-training, but I guess this is mainly because the paper has chosen to address the video interpolation, video expansion, and image-to-video generation tasks. As mentioned in Future Works, if we want to enable the model to do text-conditioning, then we might have to introduce abundant image data to learn the rich semantics. It is unclear whether the model can still work well with text conditions (or condition signals in other modalities) and generate long videos, especially when the DM is very lightweight.
  3. Training difficulty. I am happy to see that the paper deeply investigates the training strategy and found feasible solutions. But this also makes me concerned about the stability and training difficulty of such a model, especially since this might be very unfriendly for future works that try to follow or extend this MarDini.
  4. Visualizations. I carefully reviewed all the videos provided in the Supplemental materials, but most of the videos are very short and static, and are scenery or single objects that are relatively easy to generate. For such a large-scale model (3.1B parameters, trained with 256 H100), I expect to see much more complex and successful examples such as high dynamics, complex scenes, or human activities. Additionally, I am not fully satisfied with the Figure 4 videos that compare with/without planning. In this (picked) example, buildings are much better for with planning, but the ferry still has obvious distortion.

Questions

I don’t have any other questions. I hope that the authors can concisely respond to my concerns mentioned above.

Comment

We thank the reviewer for their positive support and are pleased that we were able to address most of the concerns.

To address your question: Yes, the vanilla Shutterstock video dataset does include videos with humans, though these comprise only a very minor portion. We applied post-processing steps to filter out such videos, mainly to address legal and privacy considerations.

For training, we segmented the videos into two-second clips (as also explained in the general comments above) to enhance training efficiency and align with other image animation models [A, B], which also generate two-second videos. This segmentation ensures a fair comparison with these models while enabling larger batch sizes during training. We apologize for this inconsistency and will further clarify it in the updated paper.

[A] Xing, Jinbo, et al. "Dynamicrafter: Animating open-domain images with video diffusion priors." ECCV 2024.

[B] Blattmann, Andreas, et al. "Stable video diffusion: Scaling latent video diffusion models to large datasets." arXiv preprint arXiv:2311.15127 (2023).

Please let us know if there are any remaining concerns we can address.

Comment

Dear Reviewer 7wkP,

We sincerely appreciate your recognition of our work and your thoughtful feedback, which has been invaluable in improving our paper. As the rebuttal period concludes, please let us know if there are any additional quick clarifications that we can provide.

Thank you once again for your time and thoughtful review.

Best regards, The MarDini Team

Comment

Thank you to the authors for addressing my question and providing both a general response and additional materials.

That said, I feel that my key concerns regarding the generation of short and low-dynamic videos, as well as the complexity and instability of the training strategy, have not been fully addressed. I believe that utilizing 256 H100 GPUs and a 3.1B parameter model represents a significant computational effort. Given this scale, it seems reasonable to expect much higher-quality video generation and clearer evidence of scalability and potential for future extensions. Additionally, the challenges with training difficulty and instability make these expectations less certain.

  • One follow-up question: the rebuttal attributes some of the above problems to the training data. However, the paper states that the training leverages the "Shutterstock video dataset with 34 million videos". The dataset does include diverse, long videos and complex motions (and with humans). Could the authors clarify this inconsistency?

After reviewing the feedback and rebuttals from/to other reviewers, I note that others have raised similar concerns. At this time, I am inclined to maintain my current rating.

Comment

We sincerely thank all reviewers for their valuable feedback and constructive criticisms, which have helped improve our work significantly. In response to the new comments, we have further updated our paper with the following changes (marked in orange):

  1. As suggested by Reviewer 7wkP, we clarified the GPU costs and training data in Appendix B.
  2. As suggested by Reviewers pvBR and ppEA, we elaborated in Appendix A on the rationale for using a 300M parameter model for generation.
  3. As suggested by Reviewers pvBR and ppEA, we added a statement in Section 3.3 highlighting the significance of SVD-XL as a DM-only baseline due to its comparable model size to MarDini-S. This comparison serves as a DM-only to DM+MAR ablation study, demonstrating MarDini’s efficiency over diffusion models without compromising generation performance.
  4. As suggested by Reviewer pvBR, we provided a more detailed explanation of the rationale behind using the [NEXT] token in Section 2.3.1, drawing inspiration from previous works.
Comment

In this final comment within this rebuttal period, motivated by reviewer comments (pvBR and ppEA), we conducted an ablation study using generation models (DM models) of varying sizes. Based on recent findings [A], which demonstrated that pretraining loss is closely correlated with final generation performance, we adopt pretraining loss as a proxy metric to evaluate model performance, given the limited budget and time available during the rebuttal period.

In this study, we fixed our pre-trained 3B MAR model as the planning component and trained different DM models using the same filtered Shutterstock dataset. All models were trained under identical configurations, including a batch size of 512, video clips of 4 frames, a resolution of 256, and a mask ratio ranging from 0.65 to 1.00. Training was conducted on 16 H100 GPUs and evaluated at the same selected training steps in each training configuration.

Key Observations:

  1. Consistent with prior studies, larger DM models deliver superior performance but at the cost of increased latency.
  2. The 150M DM model is a bit too small, showing persistently higher loss values even at later stages of training.
  3. The 300M DM model (used in our paper) achieves a balanced trade-off, providing performance comparable to larger models while significantly reducing latency. For applications with a larger inference budget, increasing the DM model size could further improve performance, though with diminishing returns at larger scales.
| Gen. Model | Loss @ 1.5k steps | Loss @ 2.5k | Loss @ 4.5k | Loss @ 20k | Latency (sec/frame) |
| --- | --- | --- | --- | --- | --- |
| 150M | 0.1314 | 0.0908 | 0.0694 | 0.0403 | 0.26 |
| 300M | 0.0968 | 0.0773 | 0.0598 | 0.0360 | 0.48 |
| 600M | 0.0798 | 0.0649 | 0.0540 | 0.0332 | 0.97 |
| 1B | 0.0675 | 0.0587 | 0.0586 | 0.0319 | 2.15 |

We plan to include this table in the final version of our paper and hope it addresses the reviewers' remaining concerns comprehensively.

[A] Liang, Zhengyang, et al. "Scaling laws for diffusion transformers." arXiv preprint arXiv:2410.08184 (2024).

AC Meta-Review

The paper presents an approach for image animation, video interpolation and extrapolation. The paper received two slightly negative scores and two slightly positive ones, suggesting its borderline status. Among the major concerns were the quality of the results, the very short length of the generated videos, and the lack of important ablations and analyses, all at a fairly high computational budget of 256 H100 GPUs. The AC went through the reviews and discussions, checked the supplement, and agrees with the evaluation provided by the reviewers. The AC believes that the quality of the results is quite low compared to current standards accepted in the community (recall the paper is submitted for publication in 2025). The lack of certain ablations makes the manuscript less valuable.

Additional Comments from Reviewer Discussion

There was a very long discussion between the authors and reviewers. The AC found it very fruitful. However, the concerns remained and the scores did not rise to the level at which we can confirm traction with the community.

Final Decision

Reject