Masked Generative Nested Transformers with Decode Time Scaling
Efficient visual generation using a decode time model scaling approach and cached computations, achieving similar performance with significantly reduced compute.
Abstract
Reviews and Discussion
Recent advances in visual generation have improved content quality but face challenges in computational efficiency during inference. Many algorithms require multiple passes over a transformer model while keeping the model size fixed, which leads to high computational costs. This work proposes two strategies to address this: (a) a decode-time model scaling schedule that allocates computational resources more effectively, and (b) caching and reusing computations. These approaches let smaller models handle more tokens while larger models process fewer, without increasing the parameter count thanks to shared parameters. The result is competitive performance at significantly reduced computational cost.
Questions For Authors
No other questions
Claims And Evidence
The claims are supported by references to various paradigms, but would benefit from specific performance metrics to illustrate the improvements. The claims are generally acceptable.
Methods And Evaluation Criteria
The submission introduces MaGNeTS, which employs model size scheduling and KV-caching during the decoding process. This approach logically addresses the identified inefficiencies in parallel decoding and redundancy in computations. The gradual scaling of model size is a sensible strategy to optimize computational resources, making it relevant for high-quality image and video generation.
The use of benchmark datasets like ImageNet, UCF101, and Kinetics600 is appropriate for evaluating the performance of generative models. These datasets are widely recognized in the field and provide a robust basis for comparing the quality of generated outputs.
Theoretical Claims
This paper does not cover theoretical claims.
Experimental Design And Analyses
The submission outlines the use of benchmark datasets (ImageNet, UCF101, and Kinetics600) to evaluate MaGNeTS. This choice is sound as these datasets are well-established and relevant for image and video generation tasks.
Supplementary Material
I reviewed the supporting materials.
Relation To Existing Literature
The caching mechanism proposed for reusing computations is reminiscent of techniques in self-attention models, where key-value pairs are cached to improve efficiency (Gu et al., 2022). This notion of leveraging previously computed results to enhance performance aligns with broader trends in machine learning focused on efficiency.
Essential References Not Discussed
I don't think there is any important related work that has not been discussed.
Other Strengths And Weaknesses
Strengths
By reducing computational costs by 2.5–3.7× while maintaining quality, the method has practical implications for real-time applications and resource-constrained environments. The validation across both image (ImageNet) and video (UCF101, Kinetics600) datasets underscores its versatility.
Weaknesses
The nested architecture might introduce training complexities, such as balancing shared parameters across sub-models.
Generating with the method in this paper incurs a certain degree of degradation in generation quality. Is there a configuration of the model with a lower FID?
Other Comments Or Suggestions
No other suggestions
We thank the reviewer for their valuable time and constructive reviews. We are glad to see that the reviewer appreciates the effectiveness of decode-time model scaling and caching in significantly reducing computational costs in visual generation while maintaining competitive performance, and the practical implications demonstrated across image and video tasks.
Training complexities in nested models
As discussed in the related works section, several methods in the literature have successfully demonstrated the use of nested architectures for different tasks, such as language modeling (MatFormer, Kudugunta et al., 2023, and Flextron, Cai et al., 2024b), discriminative tasks like image/video classification (MoNE, Jain et al., 2024), and representation learning (Matryoshka Learning, Kusupati et al., 2022), along with others discussed from Line 144 onwards (left column). These works show that nestedness does not introduce notable training complexities. However, adding many nested models can constrain the smaller models and degrade their performance. To avoid this, in this work we propose a curriculum-based distillation (Line 266, right column), which improves the performance of the smaller models, as shown in Table 7 of the supplementary material. To further assess the training complexity, we ablate the effect of the number of nested models as suggested by Reviewer fe9H. Please refer to our response on “Impact of the number of nested models” to Reviewer fe9H for more details.
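For intuition only, here is a minimal sketch of how a nested sub-model can reuse a prefix of the full model's weights, so that sub-models share parameters rather than adding new ones. The `NestedLinear` class and the `frac` parameter are hypothetical illustrations, not the MaGNeTS implementation.

```python
import torch
import torch.nn as nn

class NestedLinear(nn.Module):
    """Linear layer whose smaller sub-models reuse a prefix of the full weights."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * d_in ** -0.5)
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, x: torch.Tensor, frac: float = 1.0) -> torch.Tensor:
        # frac selects the nested sub-model, e.g. 1/8, 1/4, 1/2, 1 (a sketch of p-style nesting).
        d = max(1, int(self.weight.shape[0] * frac))
        return nn.functional.linear(x, self.weight[:d], self.bias[:d])

layer = NestedLinear(512, 2048)
x = torch.randn(4, 512)
print(layer(x, frac=0.25).shape)  # torch.Size([4, 512])  -- quarter-width sub-model
print(layer(x, frac=1.0).shape)   # torch.Size([4, 2048]) -- full model
```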
Quality of generated samples
Our work primarily focuses on enhancing the efficiency of a baseline model rather than directly improving its generation quality. That said, the compute-performance trade-off curve in Fig 6 shows that, given a certain compute budget, our method can reach a better FID than the baseline.
In Table 1, we report results for a fixed model schedule of (3, 3, 3, 3), i.e., 3 iterations of each nested model. It takes a small hit in performance (0.6 FID) compared to the baseline while being 2.65× more inference-efficient. To recover this small difference, we can change the model schedule to use only the bigger nested models, e.g., (0, 0, 6, 6), i.e., 6 iterations of each of the two largest nested models. With this, we achieve quality on par with the baseline (FID of 2.6) at 745 GFLOPs, which is a 1.7× compute efficiency. Fig 6 also shows that the compute-performance trade-off is even better at smaller GFLOPs budgets.

| Method | Schedule | FID | # params | # GFLOPs |
|---|---|---|---|---|
| MaskGIT | NA | 6.2 | 303M | 647 |
| MaskGIT++ | NA | 2.5 | 303M | 1.3k |
| MaGNeTS (ours) | (3, 3, 3, 3) | 3.1 | 303M | 490 |
| MaGNeTS (ours) | (0, 0, 6, 6) | 2.6 | 303M | 745 |
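To make the schedule notation concrete, below is a minimal sketch of how such a schedule could drive masked parallel decoding, where each schedule entry specifies how many decode iterations a given nested sub-model runs, smallest to largest. The interface and the confidence-based unmasking rule are assumptions in the spirit of MaskGIT-style decoding, not the authors' code.

```python
import torch

def decode_with_schedule(sub_models, schedule, tokens, masked):
    """Masked parallel decoding driven by a decode-time model schedule.

    sub_models: callables mapping token ids (B, N) -> logits (B, N, V),
                ordered smallest to largest (nested, parameter-shared).
    schedule:   decode iterations per sub-model, e.g. (3, 3, 3, 3) or (0, 0, 6, 6).
    tokens:     (B, N) long tensor; masked positions hold a placeholder id.
    masked:     (B, N) bool tensor, True where a token is still undecoded.
    """
    total, step = sum(schedule), 0
    for model, iters in zip(sub_models, schedule):
        for _ in range(iters):
            step += 1
            logits = model(tokens)
            conf, pred = logits.softmax(dim=-1).max(dim=-1)
            conf = conf.masked_fill(~masked, float("-inf"))  # rank only masked tokens
            # Commit the most confident predictions; everything is unmasked by the last step.
            n_keep = max(1, round(masked.sum(dim=-1).max().item() * step / total))
            top = conf.topk(min(n_keep, masked.size(1)), dim=-1).indices
            chosen = torch.zeros_like(masked)
            chosen.scatter_(-1, top, True)
            chosen &= masked
            tokens = torch.where(chosen, pred, tokens)
            masked = masked & ~chosen
    return tokens

# Hypothetical usage with four nested sub-models m_s, m_m, m_l, m_xl:
# out = decode_with_schedule([m_s, m_m, m_l, m_xl], (3, 3, 3, 3), tokens, masked)
```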
This paper introduces Nested Transformers for efficient image and video generation. The method progressively increases the model size during decoding to reduce computational costs in the early steps. Additionally, KV-caching is employed across decoding steps to further enhance efficiency. Experiments are conducted on both image and video generation tasks.
Questions For Authors
- Can the proposed method be applied to other frameworks, or is it limited to the MaskGiT framework?
- Why was inference time not reported as part of the evaluation to validate the efficiency of the method?
Claims And Evidence
Experimental results support the claim that the proposed method reduces computational cost in terms of FLOPs while maintaining generation quality.
Methods And Evaluation Criteria
Using nested modeling and gradually increasing the model size is a reasonable approach to reducing the computational cost of generative models.
From the evaluation perspective, the paper includes comparisons on both image and video generation tasks, which appropriately validate the proposed method.
Theoretical Claims
This paper does not include any theoretical proofs.
Experimental Design And Analyses
This paper evaluates model efficiency using parameter count and FLOPs. However, it would be beneficial to include inference time in the ablation study as a direct indicator of practical efficiency.
Supplementary Material
I have reviewed the supplementary material regarding the additional results but have not examined the implementation details.
Relation To Existing Literature
This paper presents an efficient approach for generative models, building upon MaskGiT. It improves efficiency by approximately three times while preserving generation quality, which is a significant advancement.
The contributions of this paper are related to distillation-based methods but from a nested modeling perspective, which has not been explored before.
Essential References Not Discussed
I do not notice any other essential references that require further discussion.
Other Strengths And Weaknesses
The design of the proposed method appears to be specifically tailored to the MaskGiT framework and may not generalize well to other generative modeling frameworks, which is a potential limitation.
Other Comments Or Suggestions
typos:
- missing space in Line 42 between "and" and "video".
- unexpected space in Line 238 and Line 309.
We thank the reviewer for their valuable time and constructive reviews. We are glad to see that the reviewer found this to be an efficient approach for image and video generation, experimentally demonstrating reduced computational costs while maintaining generation quality and offering an unexplored perspective compared to existing methods. We answer the reviewer's question below.
Inference Time
We would like to highlight that Table 8 in our Supplementary Material reports practical efficiency, such as the throughput (in images/sec) of our method compared to the corresponding baseline. For completeness, we also report latency below. As shown, the proposed method is 2.5× faster than the baseline in this setting.

| Metric | Baseline (MaskGIT++) | MaGNeTS |
|---|---|---|
| Images/Sec | 22.5 | 56.3 |
| Latency (ms) | 712 | 285 |
Generalization to other generative modeling frameworks
We would like to highlight that the idea of model size scheduling over decoding iterations is generic enough to be applied to other multistep processes like diffusion. The core idea is that some parts of the decoding/denoising process might be easier than others, hence allowing for a step-wise allocation of model capacity. While we explore fixed schedules in this work, the idea can be further extended to input-adaptive schedules, i.e., some images might be easier to generate than others and based on the input we can decide which model to use for a certain step.
To support this broader applicability of our algorithm, we conducted preliminary experiments with model schedules for diffusion. Due to time constraints, we were not able to train a new diffusion model in a nested fashion; instead, we use U-ViT's [A] pretrained checkpoints [B] on ImageNet 64×64. We use two models, U-ViT-L/4 and U-ViT-M/4, to demonstrate the model schedule idea during inference. Key details of the experiment:
- We use the default number of sampling steps = 50 and batch size = 500 in all experiments. We use a single A100 GPU.
- We do not use classifier-free guidance. We do not use any caching for these experiments (due to the continuous nature of the input) and only demonstrate the generalizability of the model scheduling idea.
- Since the initial denoising steps play a crucial role in shaping the final output of the reverse diffusion process, we utilize the L model for these early stages and transition to the M model for the later denoising steps.
- Given that the L model has greater denoising capacity than the M model, we customize the noise schedule with larger denoising step sizes for L and smaller step sizes for M, balancing efficiency and performance.

| Method | FID (50k) | # steps | time (sec/iter) |
|---|---|---|---|
| U-ViT-M/4 | 5.92 | 50 | 17.12 |
| U-ViT-L/4 | 4.21 | 50 | 32.34 |
| Ours (Model Sched) | 4.58 | 50 | 21.10 |
As we can see, with model scheduling alone we achieve ~1.53× inference compute gains with nearly the same performance as the baseline. Exploring better schedules and training the models with nesting and distillation would offer further compute gains. This shows that the proposed model scheduling over a multistep decoding process is generic enough to be applied to different modeling approaches.
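For illustration, a minimal sketch of such an L-to-M handoff is given below. The sampler interface is an assumption (the `ddim_step` update rule is passed in and not shown); this is not the U-ViT code used above.

```python
def sample_with_model_schedule(model_l, model_m, x_t, timesteps, switch_frac, ddim_step):
    """Reverse diffusion where early (high-noise) steps use the large model
    and the remaining steps use the medium model.

    model_l, model_m: noise-prediction networks eps(x_t, t) (e.g. U-ViT-L/4, U-ViT-M/4).
    timesteps:        descending sequence of timesteps (e.g. 50 DDIM steps); the L phase
                      may use coarser spacing and the M phase finer spacing.
    switch_frac:      fraction of steps handled by the large model.
    ddim_step:        assumed update rule (x_t, eps, t, t_prev) -> x_{t_prev}.
    """
    x = x_t
    n_large = int(len(timesteps) * switch_frac)
    for i, t in enumerate(timesteps):
        model = model_l if i < n_large else model_m   # model schedule: L early, M late
        eps = model(x, t)
        t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else 0
        x = ddim_step(x, eps, t, t_prev)
    return x
```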
Typos
Thank you for bringing these typos to our attention. We will fix them in the final revision.
References
[A] F. Bao, S. Nie, K. Xue, Y. Cao, C. Li, H. Su, and J. Zhu. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22669–22679, 2023.
This paper proposes a promising, efficient approach for image/video generation. Specifically, it introduces the concept of model size scheduling during the generation process to significantly reduce compute requirements, and demonstrates that KV caching also works for parallel decoding. Nested modeling is used to realize these ideas. The experimental results show strong performance of the proposed efficient method.
Questions For Authors
Please refer to the "Methods And Evaluation Criteria" and "Other Strengths And Weaknesses".
Claims And Evidence
There are three key parts of the proposed method: 1) Decode Time Model Schedule; 2) Caching and Refresh; and 3) Nested models.
The experiments and ablation study demonstrate the effectiveness of these modules.
Methods And Evaluation Criteria
Method: After reading the first section of the supplementary material, I find the authors' motivation very natural. However, I have a concern about the number of nested models. Is there any experiment analyzing how the number of nested models affects performance?
Evaluation: It would be better to add some examples of generated videos in the supplementary material.
Theoretical Claims
There is no proof for theoretical claims.
Experimental Design And Analyses
The quantitative experiments are well designed, and the ablation study strongly demonstrates the efficiency of the proposed method.
Supplementary Material
I checked the full supplementary material.
Relation To Existing Literature
None
Essential References Not Discussed
None
Other Strengths And Weaknesses
Additional Weaknesses: There is relatively little qualitative analysis; I would have liked to see more visual comparisons.
Other Comments Or Suggestions
This is promising work on the efficiency of visual generative models. I think this paper deserves to be accepted.
We thank the reviewer for their valuable time and constructive reviews. We are glad to see that the reviewer acknowledged the work to be a promising and efficient approach for image/video generation by introducing model size scheduling and demonstrating the effectiveness of KV caching and nested models, supported by strong experimental results and a well-designed ablation study. We answer the reviewer's question below.
Impact of the number of nested models
Thanks for the interesting question. We analyse the effect of the number of nested models on performance. We train four more settings, p=[1, 2] (two models), p=[1, 2, 4] (three models), p=[1, 2, 4, 8, 16] (five models), and p=[1, 2, 4, 8, 16, 32] (six models), in addition to the p=[1, 2, 4, 8] (four models) setting used in the paper. We observe that the biggest model's performance remains the same in all cases. However, the performance of the smaller models degrades by 0.5 FID when we add the fifth nested model, and by a further 0.6 FID when we add the sixth. We hypothesize that this drop stems from their lower representational power: as we add more nested models, the complexity of the shared representation increases and burdens the smaller models. This drop does not affect the performance of model scheduling (MaGNeTS, FID=3.1), as the larger models dominate the final results. Note that all of these results are on top of models trained with distillation (Line 261 onwards, right column), which itself helps retain performance for up to 4 distilled models. This can be seen from Table 7 of the supplementary material, which shows that distillation helps boost the performance of smaller nested models.
Qualitative Analysis
We would like to clarify that we have included additional qualitative results in the Supplementary Material (Fig 10 for image generation, Fig 11 for video generation) along with the main qualitative results in Fig 1. We also present some failure cases in Fig 12. We will add more results to the final supplementary material.
Thanks for the authors' responses, and I will keep my recommendation.
The paper introduces MaGNeTS, an approach to improving the efficiency of visual generative models by dynamically scaling model size during decoding.
Questions For Authors
Regarding artefacts in the generated results: The paper acknowledges that there may be artefacts in the results generated by MaGNeTS. Can the authors provide more specific details about these artefacts?
Claims And Evidence
The claims made in the submission are largely supported by clear and convincing evidence, particularly through extensive experiments on ImageNet256×256, UCF101, and Kinetics600.
However, the authors claim that KV caching improves efficiency without performance loss, but Table 4 shows that caching degrades FID, and only with cache refresh does performance recover.
Methods And Evaluation Criteria
Yes, the proposed methods and evaluation criteria are appropriate for the problem. MaGNeTS is tested on ImageNet256×256, UCF101, and Kinetics600, which are standard benchmarks for image/video generation. Metrics like FID, IS, and FVD effectively measure generation quality. However, the model is compared to older baselines, and while the compute savings (2.5–3.7×) are significant, FID scores degrade slightly, requiring further trade-off analysis.
Theoretical Claims
The paper primarily focuses on algorithmic innovations and empirical results rather than extensive theoretical proofs.
Experimental Design And Analyses
Yes, the experimental design and analyses were reviewed for soundness and validity.
Supplementary Material
I have read the Additional Ablations part.
Relation To Existing Literature
The key contributions of this paper build upon prior research in efficient visual generation, parallel decoding, and nested transformer models, while introducing novel improvements in compute efficiency through decode time scaling and KV caching.
Essential References Not Discussed
I think the paper has good references.
Other Strengths And Weaknesses
Strengths:
The paper introduces decode time model scaling, a novel dynamic compute allocation approach that progressively scales model size, reducing redundant computation.
Weaknesses:
Comparison to Efficient Diffusion Models is Missing
Other Comments Or Suggestions
Refer to Other Strengths And Weaknesses
We thank the reviewer for their valuable time and constructive reviews. We are happy to hear that the reviewer acknowledged the novelty of the dynamic compute allocation approach in our work, which helps to reduce redundant computation, and experimental evidence supporting the claims. We answer the reviewer's question below.
Claim about KV Caching and Refresh for Inference Efficiency
We would like to clarify that the claim in our paper is not about KV caching alone improving efficiency without performance loss. Instead, it is the combination of KV caching with intermittent refresh that makes our approach inference-efficient without degrading performance, as discussed in the paper. All our efficiency claims explicitly include the compute required for cache refresh. For example:
- In Line 109 (left column), we state “KV caching can also be used in parallel decoding, which can effectively reuse computation when refreshed appropriately”.
- In Line 240 (right column) we mention “Caching the key-value pairs for the unmasked tokens helps reduce computation, but it can slightly degrade performance”.
- Then in Line 251 (right column) we mention “To remedy this, we strategically refresh the cache while changing the model size.”
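As a rough sketch of this mechanism (hypothetical interfaces, not the paper's implementation), the cache for already-decoded tokens is reused within a model size and recomputed ("refreshed") whenever the schedule moves to the next nested model; `commit_fn` stands in for the confidence-based token commitment step of MaskGIT-style decoding.

```python
def decode_with_kv_cache(sub_models, schedule, tokens, masked, commit_fn):
    """Parallel decoding that reuses KV pairs for decoded tokens and refreshes
    the cache whenever the nested model size changes.

    Assumed (hypothetical) model interface:
      model(tokens, masked, kv_cache) -> (logits, kv_cache)
    commit_fn: confidence-based token commitment, (logits, tokens, masked) -> (tokens, masked).
    """
    for model, iters in zip(sub_models, schedule):
        kv_cache = None  # cache refresh: recompute keys/values at the new model size
        for _ in range(iters):
            logits, kv_cache = model(tokens, masked, kv_cache)
            tokens, masked = commit_fn(logits, tokens, masked)
    return tokens
```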
Trade-off analysis
It is interesting to note that compute efficiency is better represented as a compute-performance tradeoff curve as shown in Fig 6. This curve illustrates the relationship between compute/latency and model quality:
- Given a fixed compute or latency budget, it shows that the proposed method obtains the best quality.
- Conversely, given a desired quality requirement, it shows that the proposed method is the most inference-efficient one. This tradeoff analysis effectively captures the true essence of compute efficiency.
Comparison to Efficient Diffusion Models
As discussed in the related works section of the paper, efficient diffusion methods fall into two main categories: (1) reducing the number of network calls, and (2) designing better network architectures to reduce the computation of each call. Our method is complementary to both approaches and can be combined with them to further enhance efficiency. That said, we do compare with several efficient diffusion methods in Table 1 of the paper. Moreover, as mentioned in Line 356, several recent diffusion works (e.g., Feng et al. 2024, Lee et al. 2024, Meng et al. 2023, Berthelot et al. 2023, Song et al. 2023, Zheng et al. 2023) only report results on low-resolution ImageNet (64×64), so a direct comparison is not possible, as all our experiments are at 256×256. In addition to the methods in Table 1 of the paper, below we add more comparisons with efficient diffusion methods that report results at image sizes 128 and 256 on ImageNet. As we can observe, while some diffusion models do perform well, they need considerably more steps, and hence FLOPs, than MaGNeTS. We will add these comparisons to the main paper.
| Method | Image Size | FID | Params | Steps | GFLOPs |
|---|---|---|---|---|---|
| DPM-Solver (Lu et al 2022) | 128 | 4.1 | 422M | 12 | >3000 |
| MaGNeTS (Ours) | 128 | 3.9 | 303M | 12 | 236 |
| EDiff [A] | 256 | 2.1 | 450M | 50 | 119k |
| LPDM-ADM [B] | 256 | 2.7 | - | 50 | - |
| MaGNeTS (Ours) | 256 | 3.1 | 303M | 12 | 490 |
Artifacts
We would like to clarify that our approach does not introduce new artifacts, but inherits these properties from the baseline (MaskGIT) on which we apply our model scheduling approach. We discuss these artifacts in Section E of Supplementary Material: Lines 757-761 (right column). These artifacts are also visualized in Figure 12, which show that failure cases (like faces of humans) in the baseline method (MaskGIT++) directly translate to failure cases in our method. However, we reemphasize that improving this aspect of generative modeling is orthogonal and beyond the scope of the current work.
References
[A] Hang, T., Gu, S., Li, C., Bao, J., Chen, D., Hu, H., ... & Guo, B. (2023). Efficient diffusion training via min-snr weighting strategy. CVPR, https://arxiv.org/pdf/2303.09556
[B] Wang, Z., Jiang, Y., Zheng, H., Wang, P., He, P., Wang, Z., ... & Zhou, M. (2023). Patch diffusion: Faster and more data-efficient training of diffusion models. NeurIPS, https://arxiv.org/abs/2304.12526
The paper proposes Masked Generative Nested Transformers with Decode Time Scaling (MaGNeTS), a novel approach for efficient visual generation that dynamically scales model size during decoding and utilizes KV caching with intermittent cache refresh. Through nested transformer architectures and compute-aware scheduling, MaGNeTS significantly reduces inference computational costs without much compromise in generation quality, as shown across image and video benchmarks (ImageNet256, UCF101, Kinetics600). Reviewers particularly appreciate the strong empirical validation, practical impact, and generalizability of the proposed method. Initial concerns regarding KV caching performance trade-offs, qualitative analyses, and generalization beyond MaskGIT were effectively addressed in the authors' thorough rebuttal. Therefore, I recommend acceptance of this paper.