FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion
FPSAttention is a training-aware FP8 quantization and sparsity co-design for video diffusion models that achieves up to 4.96× speedup without quality loss by aligning 3D tile granularity, denoising-step adaptation, and hardware-efficient kernels.
Abstract
Reviews and Discussion
This paper introduces FPSAttention, a novel framework to accelerate video diffusion models by proposing a training-aware co-design of FP8 quantization and structured sparsity. By synergistically optimizing these two techniques with a unified tile-wise granularity and a hardware-optimized kernel, the method achieves a remarkable 4.96x end-to-end speedup without sacrificing generation quality. Overall, the work proposes an effective strategy for acceleration in video generation.
Strengths and Weaknesses
Strengths
- The authors propose an effective approach to accelerating video generation models and achieve an impressive speed gain in their experiments.
Weaknesses
- Unclear Settings: Some key experimental settings (e.g., the hyperparameters introduced in Line 220) are missing from the article, making the results difficult to reproduce.
- Lack of Key Ablation Studies: The authors provide no ablation study quantifying each proposed module's contribution to the speedup, leaving the effectiveness of the individual modules unclear, especially for the Denoising Step-Aware Scheduling and the Hardware-Optimized Kernel Design.
- Generalization to other resolutions and frame counts: The resolutions and frame counts of the generated videos in the paper are fixed (81 frames; 480P and 720P for the two models, respectively), leaving the speedup for other input sizes unknown.
Questions
- The authors conduct experiments with an unclear number of frames (is it 81 frames, consistent with the Wan model?) and a single resolution for each model (480P for the 1.3B model and 720P for the 14B model). The generalizability of the speed gains is not fully explored, especially in the window-dimension ablation study (Table 4). The efficiency of attention acceleration can be highly sensitive to input scale, so it remains unclear how the reported speedup would scale with more frames or higher resolutions. It is recommended to conduct experiments with different input scales or to provide a clear mathematical proof.
- The piecewise schedule for quantization and sparsity divides the denoising process into three parts based on α₁ and α₂. However, the paper provides no strategy for selecting α₁ and α₂, nor the corresponding per-phase quantization granularities and window sizes. Besides, in Line 227, the authors mention that these hyperparameters are selected at inference time and then transferred to the training stage. I wonder whether such a strategy is too sensitive to different validation datasets, and whether it needs fine-tuning for every different validation dataset. In Table 3 and Table 4, the tile size and window size are each a single fixed value rather than one value per phase for the granularities and window sizes. Could the authors clarify this?
- In Section 3.4, the authors mention that a hardware-optimized kernel implementation complements FPSAttention. I wonder how much benefit this implementation brings. Can the authors provide a detailed ablation study?
- In Supp. Table D, the authors state that Wan has 13B parameters and generates 480P videos, which differs from the statements in the main paper (Line 15 and Line 31). Moreover, Wan is described as having 13B parameters throughout the Supplementary Material. Please check these statements.
Limitations
yes
Final Justification
I thank the author for patiently addressing my questions, and I have decided to improve my score.
Formatting Issues
Not found.
Response to Reviewer Sj6K
Thanks to the reviewer for the valuable comments.
Q1: Ablation study on input size and higher resolution
A1.1: Yes, in our experiments we use 81 frames, consistent with the official Wan model.
A1.2: The quadratic complexity O(N²) of attention operations means that computational demands grow dramatically as we increase either the number of frames (temporal dimension) or the resolution (spatial dimensions). Our joint tile-wise FP8 quantization and structured sparsity are specifically designed to address this scaling challenge. As the attention computation becomes more expensive, the relative benefits of our optimizations become more pronounced.
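To make the quadratic-scaling argument concrete, the back-of-the-envelope sketch below estimates per-layer attention FLOPs from a hypothetical latent-token layout; the temporal/spatial compression factors and head dimension are illustrative assumptions, not Wan2.1's exact settings.

```python
# Back-of-the-envelope sketch (illustrative only): attention cost is quadratic in
# the number of latent tokens, which grows with both frames and resolution.
# The compression factors and head dimension below are hypothetical placeholders.

def num_tokens(frames, height, width, t_stride=4, s_stride=16):
    """Latent tokens after assumed temporal/spatial compression."""
    return (frames // t_stride) * (height // s_stride) * (width // s_stride)

def attention_flops(n_tokens, head_dim=128):
    """Two matmuls (QK^T and PV), each ~2 * N^2 * d FLOPs per head slice."""
    return 2 * 2 * n_tokens ** 2 * head_dim

for name, f, h, w in [("41f/480p", 41, 480, 832),
                      ("81f/480p", 81, 480, 832),
                      ("81f/720p", 81, 720, 1280)]:
    n = num_tokens(f, h, w)
    print(f"{name}: {n} tokens, ~{attention_flops(n) / 1e12:.2f} TFLOPs per attention layer")
```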
To empirically validate this scaling behavior, we conducted comprehensive experiments across various video configurations:
| Configuration | Frames | Resolution | Baseline | FPSAttention | Speedup |
|---|---|---|---|---|---|
| Short clip | 41 | 480p | 156s | 36s | 4.33× |
| Standard | 81 | 480p | 271s | 60s | 4.52× |
| Long clip | 121 | 480p | 512s | 107s | 4.78× |
| HD variant | 81 | 720p | 1,301s | 262s | 4.96× |
| Ultra HD | 81 | 1080p | 4,102s | 789s | 5.20× |
These results demonstrate favorable scaling behavior:
Temporal Scaling: Fixing resolution at 480p, we observe progressively better speedups as video duration increases: 4.33× (41 frames) → 4.52× (81 frames) → 4.78× (121 frames). This improvement occurs because longer sequences have more temporal redundancy that our sparse attention patterns can reduce without quality loss.
Spatial Scaling: Fixing duration at 81 frames, speedups improve with resolution: 4.52× (480p) → 4.96× (720p) → 5.20× (1080p). Higher resolutions benefit more because our tile-wise approach maintains consistent hardware utilization while the baseline suffers from increased memory pressure and irregular access patterns.
In summary, FPSAttention's efficiency gains scale favorably with both temporal duration and spatial resolution.
Q2: Missing hyperparameter details (α₁, α₂, g_coarse, etc.)
A2: Our hyperparameter selection follows an empirical approach based on the error analysis shown in Figure 5. Specifically, we observe that error tolerance follows a characteristic U-shaped curve across denoising steps, which guides our three-phase partitioning. For the Wan models evaluated, we empirically determined α₁=0.3 and α₂=0.7, meaning early phase (0-30% of steps), mid phase (30-70%), and late phase (70-100%). The granularity and window sizes were selected through grid search on a held-out validation set, optimizing for the quality-speed Pareto frontier.
Regarding the concern about dataset sensitivity, our approach is designed to be robust across different generation scenarios. The key insight is that the error tolerance patterns (Figure 5) are inherent to the diffusion process itself rather than specific to particular datasets. The noise-dominated early steps and mid-steps exhibit consistent characteristics regardless of the content being generated. We validated this by testing our fixed hyperparameters across diverse VBench prompts without dataset-specific tuning.
The statement about "selected at inference time" (Line 227) may have caused confusion. To clarify: we perform one-shot hyperparameter selection during method development, then use these fixed values for both fine-tuning and inference. There is no need to fine-tune for different validation datasets - the same hyperparameters work across all our experiments.
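For concreteness, below is a minimal sketch of such a piecewise step-aware schedule. Only α₁ = 0.3 and α₂ = 0.7 are taken from the answer above; the per-phase granularity labels and window sizes are hypothetical placeholders, not the grid-searched values used in the paper.

```python
# Minimal sketch of a piecewise denoising step-aware schedule. Only alpha1=0.3
# and alpha2=0.7 come from the answer above; the per-phase granularity labels
# and window sizes are hypothetical placeholders, not the paper's tuned values.

def step_schedule(step, total_steps, alpha1=0.3, alpha2=0.7):
    """Map a denoising step to a (quantization granularity, sparsity window) pair."""
    progress = step / total_steps
    if progress < alpha1:        # early phase: first 30% of steps
        return {"granularity": "coarse", "window": 8}
    elif progress < alpha2:      # mid phase: 30-70% of steps
        return {"granularity": "medium", "window": 12}
    else:                        # late phase: last 30% of steps
        return {"granularity": "fine", "window": 16}

total_steps = 50
for step in (5, 25, 45):
    print(step, step_schedule(step, total_steps))
```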
Q3: Hardware-optimized kernel contribution
A3: Our ablation study demonstrates that fused kernel implementation is crucial for achieving high efficiency. Without proper kernel optimization, theoretical FLOPs reduction poorly translates to actual speedup.
When FP8 quantization and sparsity lack native hardware support—using pseudo-quantization and tile-wise sparsity simulation—the system cannot fully exploit hardware capabilities. Our measurements show: FP8 quantization alone achieves 2.0× theoretical FLOPs reduction but only 1.4× actual speedup (70% efficiency); sparsity alone provides 3.5× theoretical reduction with 1.8× actual speedup (51% efficiency) due to irregular memory patterns.
The efficiency gap widens dramatically with naive combination. Despite 7.0× theoretical FLOPs reduction, naive implementation delivers merely 2.2× actual speedup—31% efficiency. This occurs because separate quantization and sparsity implementations introduce redundant memory operations, prevent data reuse, and fail to utilize specialized Tensor Cores. In contrast, our fused kernel achieves 4.96× actual speedup from the same 7.0× theoretical reduction (71% efficiency)—a 2.26× improvement over naive approach.
This improvement stems from hardware-conscious design that integrates quantization and sparsity within a single kernel. Through training-aware co-design, quantization and sparsity patterns align with hardware compute units. The fused kernel exploits this alignment by coalescing memory accesses, minimizing data movement across GPU hierarchies, utilizing FP8 Tensor Cores on Hopper architectures, and eliminating redundant computations.
| Component | Theoretical FLOPs Reduction | Actual Speedup | Efficiency |
|---|---|---|---|
| FP8 Quantization Only | 2.0× | 1.4× | 70% |
| Sparsity Only | 3.5× | 1.8× | 51% |
| FP8 + Sparsity (Naive) | 7.0× | 2.2× | 31% |
| FP8 + Sparsity + Fused Kernel | 7.0× | 4.96× | 71% |
The kernel design is indeed crucial for translating theoretical computational savings into practical speedups, and we provide a comprehensive ablation study to quantify its contribution in the table above.
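As a sanity check, the efficiency column above is simply the measured speedup divided by the theoretical FLOPs reduction; a short sketch reproducing it:

```python
# Efficiency = actual speedup / theoretical FLOPs reduction (values from the table above).
rows = {
    "FP8 Quantization Only": (2.0, 1.4),
    "Sparsity Only": (3.5, 1.8),
    "FP8 + Sparsity (Naive)": (7.0, 2.2),
    "FP8 + Sparsity + Fused Kernel": (7.0, 4.96),
}
for name, (theoretical, actual) in rows.items():
    print(f"{name}: {actual / theoretical:.0%} efficiency")
```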
Q4: Inconsistent reported model size
A4: We sincerely apologize for this typo and the resulting inconsistency. The correct model sizes are:
- Wan2.1-1.3B: 1.3 billion parameters, typically used for 480p generation
- Wan2.1-14B: 14 billion parameters (not 13B), typically used for both 480p and 720p generation
We will correct all instances of this error in the revised manuscript and ensure consistency throughout the paper and supplementary material.
Q5: Ablation study with different components
A5: Our design philosophy separates concerns between quality preservation (Denoising Step-Aware Scheduling) and speedup optimization (Hardware-Optimized Kernel Design), with each component serving a distinct and complementary purpose.
The denoising step-aware scheduling primarily preserves generation quality under aggressive compression. As shown in the table below, incorporating denoising step-aware scheduling with our core QAT+Sparse components improves VBench scores from 0.8019 to 0.8160 (+1.8% quality improvement) while maintaining comparable speed (2.18× vs 2.20×).
Meanwhile, the hardware-optimized kernel design translates theoretical computational savings into wall-clock speedups. As demonstrated in Section 3.4 and the table below, employing our hardware kernel dramatically accelerates inference from 2.20× to 4.96× while maintaining quality (VBench score 0.8046).
| Components | VBench Score | E2E Inference Speed |
|---|---|---|
| QAT + Sparse (core components) | 0.8019 | 2.20× |
| Core components + Hardware-Optimized Kernel Design | 0.8046 | 4.96× |
| Core components + Denoising Step-Aware Scheduling | 0.8160 | 2.18× |
Thanks for helping us improve our manuscript. We will update these ablation studies in the revised version.
Dear Reviewer Sj6K,
Did we satisfactorily answer your questions? Would you like us to clarify anything further? Feel free to let us know, many thanks.
Best regards,
Authors of #6707
I thank the author for patiently addressing my questions, and I have decided to improve my score.
Dear Reviewer Sj6K,
We are grateful for your thorough review and confirmation that our rebuttal has successfully addressed your concerns.
We deeply appreciate your continued strong support of our work, particularly your willingness to champion the paper. Your endorsement represents a significant vote of confidence in our research.
Thank you for your valuable insights and constructive engagement throughout the review process.
Sincerely,
Authors of Paper #6707
This work introduces FPSAttention, a training-aware co-design of FP8 quantization and sparsity for video diffusion models. The authors propose three techniques to optimize 3D bi-directional attention: (i) a unified 3D tile-wise granularity that supports both quantization and sparsity; (ii) a denoising step-aware strategy that dynamically adjusts quantization granularity and sparsity based on the denoising step, aligning with the model's sensitivity across timesteps; (iii) a hardware-friendly kernel optimized for Hopper GPUs and implemented on top of FlashAttention, enabling efficient execution. Acceleration results are reported for the Wan2.1 1.3B and 14B models, achieving around a 5× overall speedup with minimal quality loss.
Strengths and Weaknesses
Strengths:
- This work is clearly written and easy to follow. The paper's organization and presentation are good.
- This work addresses a timely, high-impact problem in video diffusion efficiency, demonstrating a strong practical contribution with a significant speedup over the state of the art while preserving quality.
- The proposed joint quantization and sparsity is well-supported by the FlashAttention kernel to realize the speedup. The proposed FPSAttention achieves on-par or higher speedup compared to baseline methods (SageAttention) with better quality.
- The authors properly discuss limitations in the supplementary material, including hardware dependency, training costs, and hyperparameter tuning.
Weaknesses:
- The training cost is heavy (64 nodes with 8 H20 GPUs for 7 days), which may limit the broader application of this method compared to training-free baselines.
- As a minor concern, the proposed tiled quantization and sparsity methods are rather engineering-oriented; the novelty is somewhat incremental over prior sparsity and quantization work (e.g., SageAttention, STA). Timestep-aware quantization was also explored in previous work [1].
[1] Xiuyu Li, et al., Q-Diffusion: Quantizing Diffusion Models, ICCV 2023.
Questions
How does FPSAttention generalize to the MMDiT architecture?
Limitations
Yes
Final Justification
Thanks for providing the rebuttal, which addresses my concerns. This work presents a training-aware co-design of FP8 quantization and sparsity to improve the efficiency of video diffusion models, which can benefit research in this domain. I vote for acceptance.
Formatting Issues
NA
Response to Reviewer JYRy
Thanks to the reviewer for the valuable comments.
W1: Fine-tuning costs
WA1: First, our training cost is minimal compared to video generative model pretraining. For context, video generation models require massive pretraining resources: OpenAI's Sora is estimated to need 4,200-10,500 H100 GPUs for one month [1]. Our empirical statistics show that our training overhead accounts for less than 1% of the pre-training cost.
Second, training-free methods cannot achieve effective joint optimization of quantization and sparsity. As shown in Figure 2, naive combination leads to severe quality degradation because "quantization errors can be magnified when combined with sparsity mechanisms" (line 51-52). Our supplementary Table A quantifies this: training-free joint application causes a 21.1% performance drop (0.8019→0.6325), while FPSAttention achieves a 1.8% improvement.
As a byproduct, our training represents a one-time investment that is rapidly amortized through 4.96× end-to-end speedup, translating to around 80% reduction in recurring inference costs. For production models serving millions of requests, this upfront cost becomes negligible compared to the cumulative inference savings.
[1] Plappert, M. Under the hood: How OpenAI's Sora model works. Factorial Funds Blog.
Q1: The proposed methods are engineering-oriented
A1: Our core contribution lies in a training-aware co-design of sparsity and quantization, a paradigm distinct from previous training-free approaches such as SageAttention and STA. We observe that a naive combination of FP8 quantization and sparse attention leads to a catastrophic 21.1% performance degradation (see Appendix B.4). Figure 2 in the paper also shows that the training-free combination of FP8 quantization with STA produces visually corrupted outputs. To make FPSAttention effective, we propose three non-trivial strategies: unified tile-wise operations, denoising step-aware scheduling, and a hardware-optimized kernel design. Regarding timestep-aware approaches like [1], we acknowledge this important prior work. Previous timestep-aware methods focus solely on quantization parameters, missing the critical interplay between precision and sparsity that varies across denoising phases. In contrast, our denoising step-aware strategy adaptively schedules both the quantization granularity g(t) and the sparsity window size W(t) based on the different compression error patterns across denoising steps.
Q2: Generalization to MMDiT
A2: Our FPSAttention targets the general attention module, so it is intuitively applicable to common diffusion Transformer architectures, including MMDiT. Since the main difference between the MMDiT architecture and Wan2.1 is the condition-injection mechanism (cross-attention vs. in-context attention), the adaptation will be straightforward: our implementation already supports arbitrary token configurations, and in-context attention, a pure self-attention module, simply requires adjusting our tile partitioning to accommodate modality boundaries while maintaining the same optimization principles. We anticipate that FPSAttention will deliver comparable or even superior speedups on MMDiT due to the longer sequence lengths from concatenated multimodal tokens, which amplify the benefits of our sparsity and quantization co-design. We will further evaluate FPSAttention on MMDiT-based models such as HunyuanVideo [2].
[2] Kong, Weijie, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong et al. "Hunyuanvideo: A systematic framework for large video generative models." arXiv preprint arXiv:2412.03603 (2024).
Dear Reviewer JYRy,
Did we satisfactorily answer your questions? Would you like us to clarify anything further? Feel free to let us know, many thanks.
Best regards,
Authors of #6707
Thanks for providing the rebuttal which addresses most of my concerns. After reading other reviews, I tend to keep my positive rating.
This paper proposes FPSAttention, a method that leverages FP8 quantization and sparse attention to accelerate video generation. FPSAttention is a training-aware approach that applies quantization and sparsity in a 3D tile-wise manner and introduces a denoising step-aware adaptive strategy to adjust the granularity of both quantization and sparsity. Furthermore, the authors develop a dedicated GPU kernel to support the proposed attention mechanism. As a result, FPSAttention achieves an approximately 5× speedup in video generation without noticeable degradation in output quality.
Strengths and Weaknesses
Strengths:
- The proposed method delivers significant speedup in video generation while maintaining high-quality outputs.
Weaknesses:
- The training overhead required to adapt a video diffusion model to FPSAttention is substantial—it demands 64 nodes with 8 H20 GPUs for 7 days. This raises concerns about the scalability and general applicability of the approach to diverse models and resource settings.
- The paper does not include any comparison with prior quantization methods for video diffusion, such as [1] and [2].
- There is no analysis on the direct combination of quantization and sparsity, which could provide insight into the benefit of the proposed joint method.
[1] Tian, Shilong, et al. "QVD: Post-training Quantization for Video Diffusion Models." Proceedings of the 32nd ACM International Conference on Multimedia. 2024.
[2] Zhao, Tianchen, et al. "Vidit-q: Efficient and accurate quantization of diffusion transformers for image and video generation." arXiv preprint arXiv:2406.02540 (2024).
Questions
- What are the training or calibration costs associated with the prior sparse methods listed in Table 2?
- Can you provide a comparison with existing quantization techniques for video diffusion, particularly those in [1] and [2]?
- What happens if prior quantization techniques are directly combined with sparse attention?
[1] Tian, Shilong, et al. "QVD: Post-training Quantization for Video Diffusion Models." Proceedings of the 32nd ACM International Conference on Multimedia. 2024.
[2] Zhao, Tianchen, et al. "Vidit-q: Efficient and accurate quantization of diffusion transformers for image and video generation." arXiv preprint arXiv:2406.02540 (2024).
Limitations
This paper does not discuss the limitations of the proposed approaches. I would gently suggest including a discussion of limitations in the paper.
Final Justification
This paper presents a comprehensive approach to accelerating video generation by leveraging both FP8 quantization and sparse attention. While it requires an additional training stage, the associated overhead is reasonable given the significant speedup achieved by the proposed method. As my concerns have been addressed, I now lean toward accepting this paper.
Formatting Issues
None
Response to Reviewer jgQE
Thanks to the reviewer for the valuable comments.
Q1: Training and calibration costs compared to baseline sparse attention methods
A1: There is a fundamental difference between training-free sparse methods and our trainable joint optimization method, FPSAttention. The methods in Table 2 are predominantly training-free approaches with zero additional training cost: SageAttention, SpargeAtten, and STA require no training, while Sparse VideoGen only needs minimal online profiling with 64 query tokens. FPSAttention is training-aware but delivers a superior quality-efficiency trade-off at a reasonable fine-tuning cost, for two reasons:
First, our training cost is minimal compared to video generative model pretraining. For context, video generation models require massive pretraining resources: OpenAI's Sora is estimated to need 4,200-10,500 H100 GPUs for one month [1]. Our empirical statistics show that our training overhead accounts for less than 1% of the pre-training cost.
Second, training-free methods cannot achieve effective joint optimization of quantization and sparsity. As shown in Figure 2, naive combination leads to severe quality degradation because "quantization errors can be magnified when combined with sparsity mechanisms" (line 51-52). Our supplementary Table A quantifies this: training-free joint application causes a 21.1% performance drop (0.8019→0.6325), while FPSAttention achieves a 1.8% improvement.
As a byproduct, our training represents a one-time investment that is rapidly amortized through 4.96× end-to-end speedup, translating to around 80% reduction in recurring inference costs. For production models serving millions of requests, this upfront cost becomes negligible compared to the cumulative inference savings.
[1] Plappert, M. (n.d.). Under the hood: How OpenAI's Sora model works. Factorial Funds Blog.
Q2: Comparison with previous methods, including QVD and ViDiT-Q quantization
A2: We acknowledge that comparing with QVD and ViDiT-Q would strengthen our paper and commit to adding these comparisons in our revision.
| Aspect | FPSAttention (Ours) | QVD [2] | ViDiT-Q [3] |
|---|---|---|---|
| Core Problem | Joint Sparsity & Quantization | Quantization Only | Quantization Only |
| Methodology | Training-Aware (Full Fine-tuning) | Post-Training (PTQ) | Dynamic Quantization with timestep awareness |
| Data Type | FP8 (E4M3/E5M2) | W8A8 (INT8) | W8A8/W4A8 (INT) |
| Sparsity | Learned structured sparsity | No sparsity | No sparsity |
| Step-Awareness | Joint scheduling of both S & Q | N/A | Timestep-level quantization |
The fundamental difference lies in our training-aware approach. QVD and ViDiT-Q employ post-training quantization (PTQ), while FPSAttention uses quantization-aware training (QAT). This eliminates the training-inference gap that limits PTQ methods. While QVD achieves near-lossless W8A8 quantization and ViDiT-Q reports 1.4-1.7× speedups, our training-aware joint optimization of quantization AND sparsity achieves 4.96× speedup by learning optimal patterns directly from data.
Our choice of FP8 over INT8 provides critical advantages for attention mechanisms. FP8 better captures the wide dynamic range of attention activations compared to INT8 used by QVD[2]/ViDiT-Q [3]. Moreover, FP8 operations are natively accelerated on modern GPUs (H100/H200), enabling our superior speedups [4]. While INT8 offers broader hardware compatibility, it can struggle with attention's dynamic range, particularly in regions with high sparsity [5].
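To illustrate the dynamic-range argument, the sketch below compares per-tensor FP8 (E4M3) and INT8 round-trip quantization error on a synthetic heavy-tailed tensor standing in for attention activations; it uses standard PyTorch casting and is not the paper's kernel or calibration procedure.

```python
# Illustrative only: FP8 (E4M3) vs INT8 round-trip quantization error on a
# synthetic heavy-tailed tensor (a stand-in for attention activations).
# Requires PyTorch >= 2.1 for torch.float8_e4m3fn; not the paper's kernel.
import torch

torch.manual_seed(0)
x = torch.randn(4096, 128) * torch.rand(4096, 1) ** 4  # wide dynamic range across rows

def int8_roundtrip(t):
    scale = t.abs().max() / 127.0
    q = torch.clamp(torch.round(t / scale), -127, 127)
    return q * scale

def fp8_roundtrip(t):
    scale = t.abs().max() / torch.finfo(torch.float8_e4m3fn).max  # E4M3 max is 448
    q = (t / scale).to(torch.float8_e4m3fn)
    return q.to(torch.float32) * scale

for name, fn in (("INT8", int8_roundtrip), ("FP8-E4M3", fp8_roundtrip)):
    rel_err = (fn(x) - x).abs().mean() / x.abs().mean()
    print(f"{name}: mean relative error {rel_err.item():.4f}")
```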
We are conducting additional experiments comparing FPSAttention with ViDiT-Q. To the best of our knowledge, the QVD code is not open source yet, so we focus our experimental validation on ViDiT-Q. Preliminary results show that while both methods achieve comparable generation quality with timestep-aware strategies, our joint optimization delivers significantly higher speedups. This validates that quantization alone has inherent limits: by jointly optimizing with sparsity and leveraging hardware-native FP8, we push beyond these limits to achieve a superior speed-quality trade-off.
In our revision, we will add: (1) dedicated comparison experiments with ViDiT-Q, (2) analysis of how our denoising step-aware scheduling relates to ViDiT-Q's timestep-level quantization, and (3) discussion of these methods as complementary approaches serving different deployment scenarios.
Table: Comparison with ViDiT-Q on VBench Evaluation Metrics
| Method | Bit-width | Imaging Quality | Aesthetic Quality | Motion Smooth. | Dynamic Degree | BG. Consist. | Subject Consist. | Scene Consist. | Overall Consist. | Speedup |
|---|---|---|---|---|---|---|---|---|---|---|
| ViDiT-Q Baseline | 16/16 | 63.68 | 57.12 | 96.28 | 56.94 | 96.13 | 90.28 | 39.61 | 26.21 | 1.00× |
| ViDiT-Q | 8/8 | 63.48 (-0.20) | 56.95 (-0.17) | 96.14 (-0.14) | 61.11 (+4.17) | 95.84 (-0.29) | 90.24 (-0.04) | 38.22 (-1.39) | 26.06 (-0.15) | 1.40× |
| Wan2.1-14B (Ours) | BF16 | 67.15 | 62.04 | 98.28 | 69.44 | 96.91 | 95.28 | 36.69 | 69.12 | 1.00× |
| FPSAttention | FP8+Sparse | 71.03 (+3.88) | 62.40 (+0.36) | 98.04 (-0.24) | 83.89 (+14.45) | 93.77 (-3.14) | 94.35 (-0.93) | 31.82 (-4.87) | 68.93 (-0.19) | 4.96× |
Note: Values in parentheses show the change from respective baselines. Bold values indicate best performance.
Key Observations:
- ViDiT-Q shows minimal quality degradation (-0.20 to -1.39 on most metrics) while achieving 1.40× speedup
- FPSAttention demonstrates quality improvements on critical metrics (+3.88 Imaging Quality, +14.45 Dynamic Degree) despite aggressive compression
- Our method achieves 3.5× higher speedup (4.96× vs 1.40×) while maintaining or improving quality
This empirical evidence validates that joint training-aware optimization fundamentally improves the speed-quality trade-off, enabling simultaneous quality improvements and dramatic speedups that are impossible with quantization-only approaches.
[2] Tian, Shilong, et al. "QVD: Post-training Quantization for Video Diffusion Models." Proceedings of the 32nd ACM International Conference on Multimedia. 2024.
[3] Zhao, Tianchen, et al. "Vidit-q: Efficient and accurate quantization of diffusion transformers for image and video generation." arXiv preprint arXiv:2406.02540 (2024).
[4] Kuzmin, A., Van Baalen, M., Ren, Y., Nagel, M., Peters, J., & Blankevoort, T. (2022). FP8 Quantization: The Power of the Exponent. arXiv preprint arXiv:2208.09225.
[5] Noune, B., Jones, P., Justus, D., Masters, D., & Luschi, C. (2024). Efficient Post-training Quantization with FP8 Formats. arXiv preprint arXiv:2309.14592.
Q3: Impact of naive sequential combination of quantization and sparsity
A3: Direct combination of prior quantization techniques with sparse attention leads to catastrophic quality degradation. As shown in Figure 2, training-free combination of FP8 quantization with STA produces visually corrupted outputs, while our extensive ablation study (Supplementary Table A) reveals a devastating 21.1% performance drop (0.6325 vs baseline 0.8019), with critical metrics like Human Action collapsing to 0.02 and Multiple Objects to 0.0.
As explained in Lines 49-60 of the paper, "this failure may be caused by sparsity mechanisms that prioritize retaining tokens with high-magnitude attention scores, while quantization disproportionately introduces large errors in these values." This motivates our FPSAttention that jointly optimizes quantization and sparsity patterns through unified tile-wise operations in a training-aware manner. This co-design ensures quantization and sparsity work synergistically, enabling 4.96× speedup while maintaining visual quality - demonstrating that training-aware optimization is not merely beneficial but essential for aggressive compression in video diffusion models.
Q4: Limitations discussion
A4: We thank the reviewer for this suggestion. We would like to clarify that we have included a discussion of limitations in Section B "Limitation and Societal Impacts" in our Supplementary Material. Our limitations discussion addresses several key aspects: our approach works best with modern FP8-capable GPUs (e.g., NVIDIA Hopper architectures) though it can still provide benefits on older hardware; the method requires quantization-aware training which involves moderate additional training time compared to post-training approaches; the denoising step-aware scheduling includes several hyperparameters (α₁, α₂, quantization granularities, window sizes) that need optimization for different architectures; and while our current evaluation focuses on Wan2.1, the approach shows promise for extension to other video diffusion transformers. We will move the limitations discussion in the main paper of the revised version.
Dear Reviewer jgQE,
Did we satisfactorily answer your questions? Would you like us to clarify anything further? Feel free to let us know, many thanks.
Best regards,
Authors of #6707
Thank you for your detailed response. As it has resolved my concerns, I will raise my score.
Dear Reviewer jgQE,
We are grateful for your thorough review and confirmation that our rebuttal has successfully addressed your concerns.
We deeply appreciate your continued strong support of our work, particularly your willingness to champion the paper. Your endorsement represents a significant vote of confidence in our research.
Thank you for your valuable insights and constructive engagement throughout the review process.
Sincerely,
Authors of Paper #6707
This paper introduces FPSAttention, a training-aware framework that jointly optimizes FP8 quantization and structured sparsity to accelerate attention in video diffusion models. FPSAttention operates by applying a unified, tile-wise granularity for both quantization and block sparsity, directly aligned with GPU hardware for efficient execution. The approach further incorporates a denoising step-aware schedule, dynamically adapting compression granularity and sparsity levels throughout the diffusion process based on model sensitivity. FPSAttention is implemented as a fused, hardware-friendly kernel and evaluated on large-scale Wan2.1 video diffusion models. Experiments on the VBench benchmark demonstrate up to 4.96× end-to-end speedup (7.09× kernel speedup) over the BF16 baseline at 720p resolution, without sacrificing generation quality. The method consistently outperforms previous quantization or sparsity-only baselines.
Strengths and Weaknesses
Strengths
- Substantially accelerates video diffusion inference by co-designing quantization and sparsity at training time, avoiding the quality drop seen in naive or training-free methods.
- Aligns both the algorithm and implementation with GPU hardware (tile-wise design, fused kernel), enabling real wall-clock speedups, not just theoretical FLOP reductions.
- Adapts the compression strategy across the diffusion process using a denoising step-aware schedule, balancing efficiency and generation quality.
- Demonstrates strong generalization to large-scale video models and achieves state-of-the-art efficiency and quality on standard benchmarks.
Weaknesses
- Requires retraining or fine-tuning with the FPSAttention kernel; not directly applicable as a drop-in replacement for already-trained models.
- Gains are mainly demonstrated on NVIDIA Hopper/Ada GPUs with FP8 and custom kernel support; potential benefits on other hardware are unclear.
- Introduces new hyperparameters (tile size, sparsity window, schedule) that require tuning; the robustness and automation of these choices are not fully explored.
Questions
1. Can FPSAttention be adapted to already-trained models through fine-tuning, or does it require training from scratch with the co-designed kernel?
2. What are the practical challenges or limitations when deploying FPSAttention on hardware without native FP8 or fused kernel support (e.g., older GPUs, non-NVIDIA platforms), and are there fallback strategies or ablation results for such cases?
3. For models using aggressive step-reduction or other efficient sampling methods, does FPSAttention provide further speedup or quality benefits, or do these alternative approaches outperform it?
Limitations
Yes.
Final Justification
I would like to keep my score of 4. The work provides a promising deployment optimization for video diffusion models, but the extremely long performance-recovery process sets a relatively high bar for individual researchers to dive deeper into the proposed methods.
Formatting Issues
No. The writing quality is good.
Response to Reviewer zqGz
Thanks to the reviewer for the valuable comments.
Q1: Adaptability of FPSAttention to pre-trained models vs. training from scratch
A1: FPSAttention is specifically designed as a post-training adaptation method that fine-tunes already-trained models, not a training-from-scratch approach.
As stated in Section 4, we "implemented our proposed framework on the Wan2.1 architecture (1.3B and 14B variants), preserving the original model structure while introducing FPSAttention." The 7-day training period mentioned represents the fine-tuning phase where we jointly optimize FP8 quantization parameters and sparsity patterns starting from pre-trained checkpoints—this is significantly shorter than the months typically required for training such large models from scratch.
This post-training fine-tuning approach offers crucial advantages over both training-from-scratch and training-free methods. Unlike training-free approaches that suffer 21.1% performance degradation when naively combining quantization and sparsity (Table A in supplementary), our training-aware fine-tuning enables the model to adapt to joint quantization-sparsity constraints, achieving up to 4.96× end-to-end speedup while maintaining generation quality. The joint optimization during fine-tuning allows the model to learn optimal patterns that preserve critical information pathways despite aggressive compression.
Q2: Hardware compatibility and fallback strategies for non-FP8 platforms
A2: We acknowledge that FPSAttention's full performance benefits require modern FP8-capable hardware, but our joint optimization framework remains valuable across diverse platforms. For instance, on AMD MI300 series GPUs, we can perform deep custom development of the corresponding kernel, similar to the ROCm version of FlashAttention [2]. This demonstrates the potential of our methodology for application across different platforms. Looking forward, as newer architectures like NVIDIA's B200 series introduce FP6 capabilities, our training-aware co-design methodology can extend to these lower precisions, potentially achieving even greater speedups while validating the importance of hardware-native quantization design. For legacy hardware without native FP8 support (e.g., A100, V100), our unified optimization framework still provides substantial benefits through a fallback strategy: FP16/BF16 mode where our tile-wise sparsity patterns remain effective, achieving approximately 3× speedup from sparsity alone. Crucially, even without FP8 acceleration, our training-aware joint optimization prevents the catastrophic quality degradation seen in naive combinations (Table A).
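As a concrete picture of the FP16/BF16 fallback, below is a minimal sketch that expresses tile-wise (block) sparsity as an attention mask for PyTorch's scaled_dot_product_attention. It is simplified to a 1D diagonal-local block pattern rather than the paper's 3D tiles, the tile size and window radius are hypothetical placeholders, and a real kernel would skip masked tiles instead of materializing a dense mask.

```python
# Minimal sketch of the FP16/BF16 fallback: tile-wise (block) sparsity expressed
# as a dense attention mask for PyTorch SDPA. Simplified to a 1D diagonal-local
# pattern; a real kernel skips masked tiles rather than materializing the mask.
import torch
import torch.nn.functional as F

def block_local_mask(n_tokens, tile=64, keep_radius=2):
    """True where a query tile may attend to a key tile (within keep_radius tiles)."""
    n_tiles = (n_tokens + tile - 1) // tile
    idx = torch.arange(n_tiles)
    keep_tiles = (idx[:, None] - idx[None, :]).abs() <= keep_radius
    # Expand the tile-level decision to token resolution.
    mask = keep_tiles.repeat_interleave(tile, dim=0).repeat_interleave(tile, dim=1)
    return mask[:n_tokens, :n_tokens]

n_tokens, head_dim = 1024, 64
q = torch.randn(1, 8, n_tokens, head_dim, dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)
mask = block_local_mask(n_tokens)  # bool mask, True = attend
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```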
Q3: FPSAttention performance with step-reduction and alternative sampling methods
A3: We thank the reviewer for this insightful question. It highlights a crucial distinction between two orthogonal axes of diffusion model acceleration: (1) optimizing the sampling schedule to reduce the number of steps (inter-step optimization), and (2) accelerating the computation within each denoising step (intra-step optimization).
Our work, FPSAttention, fundamentally belongs to the second category. It focuses on accelerating the core attention mechanism, which is a major computational bottleneck within every single forward pass of diffusion transformers. Methods like aggressive step-reduction (e.g., DeepCache) or efficient samplers, on the other hand, belong to the first category.
These two approaches are, by design, complementary rather than competitive. To empirically validate this principle, we integrated FPSAttention with DeepCache [3], a representative and state-of-the-art step-reduction method. The results are summarized below:
| Method | Optimization Target | Speedup Factor |
|---|---|---|
| DeepCache | Number of denoising steps | 2.31× |
| FPSAttention | Compute per step | 4.96× |
| Combined | Total generation time | 8.02× |
As shown in the table, DeepCache alone achieves a 2.3× speedup by caching and reusing features across adjacent steps. FPSAttention alone provides a 4.96× speedup by optimizing the expensive attention operations within each step. When combined, they yield a cumulative 8.02× speedup. Our method serves as a foundational enhancement to the attention module, which benefits any sampling process that relies on it. Besides, training-based step-reduction approaches like sCM [4], which require additional GPU hours for finetuning, are also theoretically orthogonal and can be further combined to achieve real-time generation. We believe this can serve as a future direction for the community.
Reference
[1] Plappert, M. (n.d.). Under the hood: How OpenAI's Sora model works. Factorial Funds Blog.
[2] ROCm/flash-attention
[3] Ma, Xinyin, Gongfan Fang, and Xinchao Wang. "Deepcache: Accelerating diffusion models for free." In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 15762-15772. 2024.
[4] Lu, Cheng, and Yang Song. "Simplifying, Stabilizing and Scaling Continuous-time Consistency Models." The Thirteenth International Conference on Learning Representations.
Dear Reviewer zqGz,
Did we satisfactorily answer your questions? Would you like us to clarify anything further? Feel free to let us know, many thanks.
Best regards,
Authors of #6707
Dear authors,
Thanks for your detailed rebuttal.
My concerns have been taken good care of. Overall, it is a solid technical contribution from my perspective, and it should be orthogonal to many other optimizations. The only thing that held me back from assigning a higher score is that the fine-tuning recovery stage for adapting the algorithm to the model is still far more expensive than I expected.
Thus, I tend to keep my score, thank you again.
Thank you again for your time and constructive engagement.
At first glance, the cost of this fine-tuning process appears substantial. However, it is important to contextualize this cost within the scope of fully fine-tuning a 14B model. For comparison, we note that a concurrent study, VSA [1], utilized 32 H200 GPUs for 3,000 steps to fine-tune a much smaller 1.3B model, whereas our method required around 2,000 steps to reach loss convergence.
Notably, our experiments were conducted on H20 GPUs, which possess less than 15% of the computational power of an H200. Furthermore, the H200's memory capacity is 1.5 times that of the H20. Consequently, for fine-tuning the 14B model, our use of a larger number of GPUs was not driven by a need for computational power, but was instead necessitated by memory constraints, which required the implementation of complex N-dimensional parallelization strategies.
Reference
[1]: VSA: Faster Video Diffusion with Trainable Sparse Attention. arXiv preprint arXiv:2505.13389.
This submission was carefully reviewed and received unanimous positive feedback (4545). All the reviewers are impressed by the substantial speedup, which is hard but important to achieve in video generation. This not only has academic impact but is also of great practical value in industry use cases. Moreover, the proposed joint quantization and sparsity solution is compatible with existing FlashAttention, which makes it elegant and easy to plug in. With a detailed and strong rebuttal provided, most concerns were addressed, as explicitly acknowledged by the reviewers. After checking all the reviews and discussions, the AC made a decision of acceptance. The authors are encouraged to add the explanations and experimental results suggested by the reviewers to the final version.