VEnhancer: Generative Space-Time Enhancement for Video Generation
Abstract
Reviews and Discussion
This paper proposes a video super-resolution algorithm called VEnhancer, which enhances video resolution in both spatial and temporal dimensions. The method is based on a pretrained video generation model, with a Spatio-Temporal (ST) Controller trained specifically for the task. The video to be enhanced is provided as a condition input to the ST-Controller. The authors constructed a test dataset, AIGC2023, to validate the effectiveness of the algorithm.
Strengths
- From the visual results provided by the authors, the proposed method enhances the video's spatial resolution and FPS. The details of objects are richer compared to the pre-enhancement version, and the smoothness of the video has also improved to some extent.
- This method can achieve both temporal and spatial video super-resolution simultaneously.
Weaknesses
- The proposed method's overall idea is quite similar to ControlNet: both introduce a control branch by copying the encoder of the UNet model to the control branch and initializing the output layer to zero. The difference lies in their applications—this method uses the framework for video super-resolution, while ControlNet is designed for controllable generation. In this paper, the condition is the video to be super-resolved, whereas in ControlNet, conditions include modalities like depth, edge, and mask. The authors should explicitly discuss how their approach differs from or improves upon ControlNet for the specific task of video enhancement.
- For the video-aware conditioning (section 4.3), the noise augmentation technique proposed by the authors is a common trick that was first introduced by Cascaded Diffusion Models [Jonathan Ho et al., 2021] for multi-stage high-resolution image generation. Essentially, it incrementally upscales from low to high resolution, which is quite similar to this paper's application. Additionally, using the time embedding as a condition is a standard approach in diffusion models. Conditioning on the downscale factor is somewhat analogous to using FPS as a condition in video generation [Make-A-Video, Uriel Singer et al., 2022], allowing the generation process to better incorporate additional conditioning information. This approach isn't particularly novel in itself. The authors should clarify what specific innovations or improvements their approach offers over these existing techniques.
- The proposed method is for video super-resolution and, theoretically, should be applicable to both AI-generated videos and real videos. However, the authors only conducted experiments on AI-generated videos (section 5.3), making this comparison less comprehensive. I suggest that the authors include experiments on real-world videos or explain why their method might not be suitable for such videos if that's the case.
Questions
- In terms of experimental results, the proposed method shows only minor improvements in Aesthetic Quality, Dynamic Degree, and Motion Smoothness, while achieving relatively significant improvements in MUSIQ and DOVER. What causes this discrepancy?
- As shown in Table 3, the improvement of the proposed method on CogVideoX-5B is relatively minor and not as significant as on VideoCrafter-2. Could the authors explain the reasons for this?
- How many videos are there in the AIGC2023 dataset? What are the approximate resolutions and lengths of these videos? Is there any statistical information available? Will the authors release this test set in the future?
RE1: VEnhancer is proposed to achieve frame interpolation, spatial super-resolution, and video refinement in a single model. This is a challenging task: the model must have sufficient generative ability to produce high-quality videos while also preserving the original information well. Thus, we build our model on a fixed pretrained video diffusion prior to supply generative ability, and propose a space-time controller to condition the video generation across the space and time axes for generative enhancement. In the generative enhancement domain, we are the first work to adopt such a design, and we significantly outperform previous methods in generative enhancement for video generation. Besides, we propose space-time data augmentation and video-aware conditioning to achieve spatial and temporal super-resolution and video refinement in a single model, enabling arbitrary upscaling factors, interpolation ratios, and refinement strengths for various user needs. The "copying the encoder of the UNet model to the control branch and initializing the output layer to zero" that you mention is only a small part of our whole design; it is used for weight initialization of our proposed space-time controller.
ControlNet is designed for controllable image generation, with conditions such as depth, edge, etc. It cannot be directly used for video generation tasks, let alone video super-resolution, frame interpolation, and video refinement.
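For readers unfamiliar with this initialization scheme, the sketch below shows, in PyTorch, how a control branch can be built by copying a frozen base encoder and injecting its features through zero-initialized layers. All names, shapes, and the assumption that the encoder returns one feature map per resolution level are ours for illustration; this is not the authors' actual ST-Controller.

```python
import copy
import torch
import torch.nn as nn

def zero_module(module: nn.Module) -> nn.Module:
    """Zero-initialize all parameters so the injected signal is a no-op at the start of training."""
    for p in module.parameters():
        nn.init.zeros_(p)
    return module

class ControlBranch(nn.Module):
    """Hypothetical control branch: a trainable copy of the frozen base encoder whose
    multi-scale outputs pass through zero-initialized 1x1 convolutions before being
    added to the corresponding base UNet features."""
    def __init__(self, base_encoder: nn.Module, feat_channels):
        super().__init__()
        self.encoder = copy.deepcopy(base_encoder)  # weight-copied and trainable
        self.zero_convs = nn.ModuleList(
            [zero_module(nn.Conv2d(c, c, kernel_size=1)) for c in feat_channels]
        )

    def forward(self, condition: torch.Tensor):
        # The copied encoder is assumed to return one feature map per resolution level.
        features = self.encoder(condition)
        return [zc(f) for zc, f in zip(self.zero_convs, features)]
```

Because the zero convolutions output zeros at initialization, the frozen base model's behavior is unchanged at the start of training, which is the convergence benefit the authors refer to.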
RE2: The noise augmentation in our work additionally encodes the augmentation noise level into the ST-Controller. More specifically, we reuse the MLP for the time embedding and add a zero linear layer for efficient encoding. From the table below (video super-resolution on the AIGC2023 test set), we can see that performance drops significantly without our proposed noise augmentation embedding. We carefully designed this technique to improve performance, and it is a genuine contribution that complements the original noise augmentation technique.
| | DOVER | MUSIQ |
|---|---|---|
| VEnhancer | 0.8586 | 59.4474 |
| w/o noise augmentation embedding | 0.7215 | 37.0897 |
Conditioning on the downscaling factor is very different from conditioning on FPS: the former controls the spatial axis, while the latter controls the temporal axis, so comparing the two is not apt. Moreover, our proposed conditioning on the downscaling factor is very effective, as shown in Figure 5 of the manuscript, and we consider it a new contribution to the community.
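To make the video-aware conditioning concrete, here is a minimal sketch of how an augmentation noise level sigma and a downscaling factor s could be embedded by reusing a time-embedding MLP followed by a zero-initialized linear layer, as described in RE2. The function and argument names, the embedding dimensions, and the additive combination of the two embeddings are our assumptions, not the authors' code.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(x: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Standard sinusoidal embedding, as commonly used for diffusion timesteps."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32, device=x.device) / half)
    args = x.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class VideoAwareCondition(nn.Module):
    """Hypothetical sketch: embed sigma and s, reuse the base model's time-embedding MLP,
    and project through a zero-initialized linear layer so the extra condition is
    inactive at the start of training."""
    def __init__(self, time_mlp: nn.Module, emb_dim: int = 256, out_dim: int = 1024):
        super().__init__()
        self.time_mlp = time_mlp            # shared with the base model's time embedding
        self.emb_dim = emb_dim
        self.zero_linear = nn.Linear(out_dim, out_dim)
        nn.init.zeros_(self.zero_linear.weight)
        nn.init.zeros_(self.zero_linear.bias)

    def forward(self, sigma: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        emb = sinusoidal_embedding(sigma, self.emb_dim) + sinusoidal_embedding(s, self.emb_dim)
        return self.zero_linear(self.time_mlp(emb))  # added to the controller's conditioning signal
```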
RE3: Thanks for your suggestion. Our work focuses only on enhancement for AI-generated videos. This is a challenging task, as reconstruction-based video super-resolution and space-time super-resolution methods often fail to remove AI-generated artifacts/distortions and to regenerate faithful content. Moreover, VEnhancer is not just for video super-resolution; it also handles frame interpolation and video refinement, which are very important for producing high-quality videos.
Thank you to the authors for their response. After carefully reading the reply and considering other reviewers' comments, I believe the main concerns remain unresolved:
- Core Design Similarity to ControlNet: The authors argue that "copying the encoder of the UNet model to the control branch and initializing the output layer to zero" is only a minor aspect of their method. However, from my perspective, this design represents the core innovation of ControlNet and is also central to the proposed space-time controller. The only difference is that ControlNet uses an image generation model (SD) with 2D CNNs as the base model, whereas this work employs a video generation model with 3D CNNs. Reviewer LBLJ also raised a similar concern about this point.
- Lack of Novelty in Video-Aware Conditioning: The techniques used in the so-called video-aware conditioning, such as noise augmentation and conditioning on the downscale factor, are borrowed from existing work (see my Weakness 2). These do not, in essence, constitute a novel contribution.
- Limited Applicability of the Super-Resolution Model: As a super-resolution model, its applicability is quite narrow—it can only super-resolve AI-generated videos and cannot handle real-world videos. This significantly limits its practical use.
RE1:
Core design differs from ControlNet significantly
"Copying the encoder of the UNet model to the control branch and initializing the output layer to zero" is not the core design of VEnhancer, as we emphasized in our previous response. We rephrase our argument with three key points:
- VEnhancer is designed to achieve frame interpolation, spatial super-resolution, and video refinement simultaneously in a single video diffusion model. This is a substantial advancement, since previous methods achieve these functionalities separately, that is, they train separate video diffusion models and conduct time-consuming, one-by-one inference.
- VEnhancer adopts a fixed pretrained video diffusion prior to supply generative ability and proposes a trainable space-time controller to condition video generation across the space and time axes for generative enhancement. This approach is novel and differs significantly from previous state-of-the-art methods, which adopt full finetuning or zero-shot refinement.
- We propose novel space-time data augmentation and video-aware conditioning to train the space-time controller, achieving multi-functional space-time enhancement and enabling arbitrary upscaling factors, interpolation ratios, and refinement strengths for various user needs. This is achieved for the first time and is shown to be effective through extensive experiments.
RE2:
Video-Aware Conditioning is Novel and Effective
- Regarding noise augmentation, we do not simply borrow it; we improve it by incorporating our designed noise augmentation embedding, which reuses the MLP of the time embedding and newly introduces a zero linear layer. This strategy has been validated as very effective in the quantitative comparison provided above (table in RE2).
- In addition, we are the first to devise a technique that enables conditioning on the downscaling factor, which proves very effective in controlling the generation of texture details (Figure 5 in the manuscript).
- More importantly, these conditioning techniques are aware of which frames carry such conditions and which do not, an ability that the works you mention lack.
To conclude, they do, in essence, constitute a novel contribution.
RE3:
Strong Applicability for Generative Space-Time Enhancement
Characterizing VEnhancer as merely a super-resolution model is too narrow. VEnhancer is a generative enhancement model that achieves video super-resolution, frame interpolation, and video refinement simultaneously in a single model. Each of these three functions has been important for producing high-definition videos in previous research on video generation, and we now devise a single model that handles all three. Isn't that a powerful and practical toolkit for video generation?
The paper presents a diffusion-based video space-time enhancement model. Specifically, the proposed VEnhancer leverages a control branch to integrate the low-resolution, low-frame-rate video prior into the diffusion model. Besides, a new data augmentation strategy is proposed to carry out video super-resolution and enhancement with different functions, e.g., enhancing videos with different levels of detail. Comparisons with both video super-resolution methods and cascaded super-resolution methods verify the superiority of the proposal.
Strengths
- The proposed approach unifies space-time video super-resolution through re-sampling operations over the time and space axes, achieving multiple functions in a single video diffusion model.
- Both the objective video quality evaluation and the subjective human evaluation show the effectiveness of the proposed approach in video enhancement.
- The paper is well written and the technical description is clear.
Weaknesses
- One of my key concerns about the technical novelty is the UNet-based control branch whose weights are copied from the original I2VGen-XL. The novelty seems limited since the controlling approach was already proposed by ControlNet, and the key frame information integration via feature summation is also intuitive.
- The technical design of the space-time data augmentation should be investigated quantitatively in the main paper. Besides, the motivation for the position embedding encoded from the noise augmentation is unclear. What are the effects of exploiting this parameter in video enhancement?
- There should be more in-depth analysis or discussion with other SOTA approaches to demonstrate the technical contributions of the proposal. Only describing the visual results cannot give readers many insights; it only leads them to think that the proposed approach is more like engineering rather than research.
- The ablation study should be conducted in the main paper instead of the appendix.
- Some format issues and typos: a) The "ramdom" in line 259; b) Remove the period symbol in the section title “Conclusion and Limitation.”
Questions
Please see the weaknesses.
Details of Ethics Concerns
N/A
RE1:
- "Copying the weights from the base model as the control branch" is only a small part of our method. We use this strategy because it provides a better initialization for the space-time controller and helps convergence. Compared to state-of-the-art generative enhancement methods, VEnhancer is novel in its architecture design, since it is based on a fixed video diffusion prior with a trainable control branch, while previous methods directly full-finetune the base model. Our architecture design proves extremely effective in obtaining high-quality videos without sacrificing too much fidelity, leading to superior performance over previous methods.
- The choice of key frame information integration was explored and discovered through experiments; it is not just an intuitive design. We tried another alternative [Jonathan Ho, et al., 2022; Andreas Blattmann, et al., 2023a], that is, repeating the key frame and using the repeated copies as the condition for the predicted frames. However, such an approach fails when dealing with different interpolation ratios: each frame receives either a correct condition (a key frame) or an inaccurate condition (a repeated copy), so the model gets confused during training and produces poor results. Other works [1, 2] concatenate the key frame information for key frames, concatenate zero images for predicted frames, and add a masking instruction for every frame. In contrast, our strategy is extremely simple and elegant: we do nothing for predicted frames and only perform feature summation for key frames (a minimal sketch is given after the references below).
[1] Guo, Yuwei, et al. Sparsectrl: Adding sparse controls to text-to-video diffusion models. European Conference on Computer Vision. Springer, Cham, 2025.
[2] Chen, Xinyuan, et al. Seine: Short-to-long video diffusion model for generative transition and prediction. The Twelfth International Conference on Learning Representations. 2023.
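For concreteness, the following sketch illustrates the key-frame-only feature summation described above; the tensor layout and indexing are assumptions for illustration, not the authors' implementation.

```python
import torch

def inject_keyframe_features(base_feat: torch.Tensor,
                             ctrl_feat: torch.Tensor,
                             key_idx: torch.Tensor) -> torch.Tensor:
    """Add control features only at key-frame positions; predicted (interpolated)
    frames are left untouched.

    base_feat: (B, F, C, H, W) features of all target frames from the frozen UNet
    ctrl_feat: (B, K, C, H, W) features of the K key frames from the controller
    key_idx:   (K,) indices of the key frames within the F target frames
    """
    out = base_feat.clone()
    out[:, key_idx] = out[:, key_idx] + ctrl_feat
    return out

# Usage example (shapes are illustrative): 25 target frames, a key frame every 4th frame.
base = torch.randn(1, 25, 64, 40, 64)
ctrl = torch.randn(1, 7, 64, 40, 64)
idx = torch.arange(0, 25, 4)           # frames 0, 4, 8, ..., 24
fused = inject_keyframe_features(base, ctrl, idx)
```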
RE2: Thanks for your suggestions.
- Space-time augmentation plays a vital role in producing stable and high-quality results. To verify this, we remove the time augmentation part, keep only downscaling and noise augmentation, and train VEnhancer with this reduced data augmentation. The comparison for video super-resolution (4x) on AIGC2023 is presented in the table below.
| | DOVER | MUSIQ |
|---|---|---|
| VEnhancer | 0.8586 | 59.4474 |
| w/o time augmentation | 0.799 | 43.1228 |
| w/o noise augmentation embedding | 0.7215 | 37.0897 |
It is observed that the performance w/o time augmentation is noticeably worse than VEnhancer in DOVER (a video quality assessment metric) and MUSIQ (an image quality assessment metric). This indicates that using key frames as conditions boosts the final video quality. On the one hand, as video frames are usually redundant, using only key frames for conditioning might be enough. On the other hand, randomly masking some frames makes the model generate the masked frames from the generative video diffusion prior, leading to sharper results. (A minimal sketch of this augmentation is given after this list.)
- The position embedding encoded from the noise augmentation is effective.
We remove the noise augmentation embedding and train VEnhancer as a comparison. The quantitative result is presented in the table above. There is a clear and significant performance drop in DOVER and MUSIQ compared with VEnhancer. We checked the visual results and found obvious artifacts and noise. This indicates that the noise embedding helps remove the augmented noise, leading to more stable results.
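For clarity, here is a minimal sketch of the space-time data augmentation discussed in this list: temporal key-frame sub-sampling, spatial downscaling, and noise augmentation. The parameter ranges are assumptions, and for simplicity the noise is added in pixel space here, whereas the paper adds it in latent space.

```python
import random
import torch
import torch.nn.functional as nnf

def space_time_augment(video: torch.Tensor,
                       intervals=(1, 2, 4, 8),
                       scale_range=(1.0, 8.0),
                       sigma_range=(0.0, 0.3)) -> dict:
    """Hypothetical sketch: sub-sample key frames in time, downscale them in space,
    and apply noise augmentation, returning the degraded condition and its parameters.

    video: (F, C, H, W) ground-truth high-resolution, high-frame-rate clip.
    """
    m = random.choice(intervals)                       # temporal key-frame interval
    s = random.uniform(*scale_range)                   # spatial downscaling factor
    sigma = random.uniform(*sigma_range)               # noise augmentation strength

    key_frames = video[::m]                            # keep every m-th frame
    lowres = nnf.interpolate(key_frames, scale_factor=1.0 / s,
                             mode="bilinear", align_corners=False)
    noisy = lowres + sigma * torch.randn_like(lowres)  # noise augmentation (pixel space here)
    return {"condition": noisy, "m": m, "s": s, "sigma": sigma}
```

The controller is then conditioned on the degraded key frames together with m, s, and sigma, which is what allows a single model to cover different interpolation ratios, upscaling factors, and refinement strengths.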
RE3: Thanks for your suggestions. Compared with previous SOTA methods, our method makes the following contributions.
- VEnhancer is designed to achieve frame interpolation, spatial super-resolution, and video refinement simultaneously in a single video diffusion model. This is a substantial advancement, since previous methods achieve these functionalities separately, that is, they train separate video diffusion models and conduct time-consuming, one-by-one inference.
- VEnhancer adopts a fixed pretrained video diffusion prior to supply generative ability and proposes a trainable space-time controller to condition video generation across the space and time axes for generative enhancement. This approach is novel and differs significantly from previous state-of-the-art methods, which adopt full finetuning or zero-shot refinement.
- We propose novel space-time data augmentation and video-aware conditioning to train the space-time controller, achieving multi-functional space-time enhancement and enabling arbitrary upscaling factors, interpolation ratios, and refinement strengths for various user needs. This is achieved for the first time and is shown to be effective through extensive experiments.
RE4&5: Thanks for your suggestions, and we have reorganized our paper and fixed the typos.
The paper introduces VEnhancer, a novel generative method designed to enhance both the spatial and temporal quality of AI-generated videos. VEnhancer increases resolution, adds detail, and reduces artifacts and flickering through a single model, leveraging a space-time controller for conditioning on low-quality inputs.
Contributions
- Unified Model: VEnhancer combines spatial and temporal super-resolution and video refinement capabilities in one model.
- Space-Time Controller: Introduces a mechanism for injecting multi-frame conditions based on a pre-trained generative video prior.
- Performance: Demonstrates superiority over some of the existing state-of-the-art methods in video super-resolution through extensive experiments.
Strengths
- Handles multiple up-scaling factors in both the space and time dimensions, allowing for versatile application scenarios.
- Improves the fidelity of generated videos while maintaining or enhancing detail, demonstrated through rigorous testing against current top methods.
Weaknesses
- As with most diffusion models, the complexity of the model could lead to longer inference times, which may limit its applicability in real-time or low-resource scenarios.
- The model's performance heavily relies on the availability of high-quality training data, which might not always be available or feasible to collect in certain domains.
- For each different text-to-video model, a new VEnhancer needs to be trained to accommodate the different architecture, limiting the use cases of the proposed method. It would be more practical if the proposed method could be a standalone video super-resolution toolkit.
- Other diffusion-based video super-resolution works are not compared in the paper, e.g., Upscale-a-video.
Questions
- It would be helpful if the authors could discuss more about how they collect the data, i.e., how the 350K videos were filtered from Panda-70M.
RE1: Thanks for your suggestion. Here we provide the table that compares VEnhancer with state-of-the-art generative enhancement methods in terms of inference time for video super-resolution (4x). Note that the input shape is 16x3x320x512.
| | I2VGEN-XL (refiner) | LaVie-SR | Upscale-a-video | VEnhancer |
|---|---|---|---|---|
| Inference time (s) | 261.21 | 602.07 | 454.81 | 216.62 |
From the table above, we can see that VEnhancer is already the most efficient among state-of-the-art generative enhancement methods. The main reason is that VEnhancer only requires 15 inference steps, while other methods require at least 30 steps.
RE2: Thanks for your suggestion. In fact, data collection for VEnhancer is not that difficult. The most important requirement for video enhancement training data is high resolution. There are websites (e.g., Pexels) where free videos with 4K/8K resolution can be downloaded.
RE3: Thanks. VEnhancer is able to enhance the results of different state-of-the-art text-to-video models (e.g., VideoCrafter2, CogVideoX, Kling, etc.), so it is already practical and can be regarded as a video super-resolution toolkit.
RE4: We compare Upscale-a-video for video super-resolution (4x) on AIGC2023. The quantitative comparison is illustrated in the following table.
| | DOVER | MUSIQ | Aesthetic Quality | Dynamic Degree | Motion smoothness |
|---|---|---|---|---|---|
| Upscale-a-video | 0.8158 | 44.7839 | 0.6592 | 0.575 | 0.9778 |
| VEnhancer | 0.8586 | 59.4474 | 0.6671 | 0.55 | 0.9781 |
It is observed that our VEnhancer is significantly better than Upscale-a-video in the video quality assessment metric (DOVER), the image quality assessment metric (MUSIQ), and Aesthetic Quality. This comparison demonstrates the superiority of our VEnhancer over state-of-the-art generative enhancement methods.
RE Questions: Regarding dataset filtering, we keep only videos with high resolution (>1080p) and then use video quality assessment for further filtering.
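As a concrete illustration of such a two-stage filtering pipeline, here is a small sketch; the field names, thresholds, and the quality_score placeholder (standing in for any VQA model such as a DOVER-style predictor) are assumptions, not the authors' actual pipeline.

```python
def filter_training_videos(video_infos, min_height=1080, min_quality=0.6):
    """Hypothetical two-stage filter: keep high-resolution clips,
    then threshold them by a video quality assessment (VQA) score."""
    selected = []
    for info in video_infos:
        if info["height"] < min_height:          # stage 1: resolution filter (>1080p)
            continue
        if info["quality_score"] < min_quality:  # stage 2: VQA filter (threshold is an assumption)
            continue
        selected.append(info["path"])
    return selected

# Usage with toy metadata records
videos = [
    {"path": "a.mp4", "height": 2160, "quality_score": 0.8},
    {"path": "b.mp4", "height": 720,  "quality_score": 0.9},
    {"path": "c.mp4", "height": 1080, "quality_score": 0.4},
]
print(filter_training_videos(videos))  # -> ['a.mp4']
```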
The paper titled "VEnhancer: Generative Space-Time Enhancement for Video Generation" introduces a novel method for enhancing AI-generated videos both spatially and temporally using a single model. VEnhancer is capable of up-sampling the resolution and frame rate of low-quality videos at many scales, while adding spatial details and synthesizing detailed motion. It also removes artifacts and flickering, improving upon existing methods by integrating a Space-Time Controller (ST-Controller) trained with space-time data augmentation and video-aware conditioning. The method aims to be end-to-end trainable, enabling multi-function enhancement within one model, and claims to surpass the current state of the art in video super-resolution and space-time super-resolution for AI-generated content.
Strengths
- Novelty: The paper presents a unified approach for generative spatial and temporal super-resolution, which is novel in the field of video generation. The integration of a pretrained generative video prior with an ST-Controller for conditioning is a creative solution that addresses the limitations of cascaded models. The concept of space-time data augmentation and video-aware conditioning is innovative and contributes to the training of the ST-Controller in an end-to-end manner.
- Quality: The paper is well-structured, with a comprehensive presentation of the methodology, experiments, and results. The visual results and quantitative metrics provided are convincing and demonstrate the effectiveness of VEnhancer.
- Significance: The work is relatively significant since it is an important complement to current open-source video diffusion models. The ability to handle many up-sampling scales and to refine videos while maintaining content fidelity is a substantial contribution to the field.
Weaknesses
- Some expressions in the paper are not clear and rigorous enough, and there are certain ambiguities or even errors. Below are the instances:
- Wrong notations: I^{1:m:f} is typically used to denote a sequence starting from 1, ending at f, with a step size of m, which does not match the paper's usage. Similar cases hold for z, t, \sigma, and s. Besides, in Fig. 3, z^{1:m}_t and t^{1:m} should be z^{1:f}_t and t^{1:f}.
- Wrong illustrations: in the Space-Time Data Augmentation part of Fig. 2, both noised videos (with noise strengths t and \sigma) should be in latent space rather than still being natural videos, since the noise is added directly to the latent videos.
- Some critical details are lacking, such as how inference is performed specifically, e.g., the initialization of z_t and z_{s, \sigma} (this would be clearer with a pseudocode algorithm), and the inference time and performance (e.g., VBench results or other metrics) across different SSR and TSR scales.
Questions
Are the shapes of inputs to ST-Controller different because of different interval m? How does the model handle inputs with different m?
RE1: Thanks for your suggestions. The notations and illustrations have been updated in the manuscript according to your comments.
RE2: Thanks for your suggestions. z_t is initialized with Gaussian noise, while z_{s, \sigma} is computed once and kept fixed across all sampling steps.
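Since a pseudocode algorithm was requested, the sketch below spells out this initialization in a minimal sampling loop. The Euler update, the vae_encode helper, and the argument names are our assumptions for illustration, not the authors' exact sampler.

```python
import torch

@torch.no_grad()
def sample(denoiser, vae_encode, low_quality_video, target_shape, sigmas, sigma_aug, s):
    """Minimal sampling-loop sketch: z_t starts from pure Gaussian noise, while the
    condition latent z_{s, sigma} is computed once from the low-quality input and
    kept fixed for every sampling step."""
    z_cond = vae_encode(low_quality_video)                   # latent of the degraded input video
    z_cond = z_cond + sigma_aug * torch.randn_like(z_cond)   # noise augmentation, then frozen

    z_t = torch.randn(target_shape) * sigmas[0]              # Gaussian initialization of z_t
    for i in range(len(sigmas) - 1):
        denoised = denoiser(z_t, sigmas[i], z_cond, sigma_aug, s)
        d = (z_t - denoised) / sigmas[i]                     # Euler step in sigma space (an assumption)
        z_t = z_t + d * (sigmas[i + 1] - sigmas[i])
    return z_t
```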
The inference time and performance across different SSR and TSR scales are shown in the tables below. When evaluating inference time, the input shape is 16x3x320x512. When evaluating different TSR scales, the SSR scale is fixed to 4; when evaluating different SSR scales, the TSR scale is fixed to 1.
| TSR scale | 1x | 2x | 4x | 8x |
|---|---|---|---|---|
| Inference time (s) | 216.62 | 545.95 | 1142.02 | 3003.06 |
| DOVER | 0.8586 | 0.8655 | 0.8609 | 0.8664 |
| MUSIQ | 58.4474 | 54.2732 | 51.294 | 51.0112 |

| SSR scale | 2x | 3x | 4x | 5x |
|---|---|---|---|---|
| Inference time (s) | 58.82 | 108.47 | 216.62 | 370.82 |
| DOVER | 0.8783 | 0.8688 | 0.8586 | 0.83455 |
| MUSIQ | 69.7624 | 63.5066 | 59.4474 | 46.037 |
For different TSR scales, the performance in DOVER (video quality assessment) and MUSIQ (image quality assessment) remains stable. For different SSR scales, the performance decreases slightly as the SSR scale increases, because a larger SSR scale leads to smoother results. We will consider patch-based enhancement to deal with large SSR scales in the future.
For both TSR and SSR, the inference time grows as the scale increases. Here we also provide another table that compares VEnhancer with state-of-the-art generative enhancement methods in terms of inference time for video super-resolution (4x).
| | I2VGEN-XL (refiner) | LaVie-SR | Upscale-a-video | VEnhancer |
|---|---|---|---|---|
| Inference time (s) | 261.21 | 602.07 | 454.81 | 216.62 |
From the table above, we can see that VEnhancer is already the most efficient among state-of-the-art generative enhancement methods. This is because VEnhancer only requires 15 inference steps, while other methods require at least 30 steps.
RE Questions: Yes, the shapes of the inputs to the ST-Controller differ because of the different intervals m. During training, the maximum target frame length is set to a constant (e.g., 25), while during inference the target frame length can be very large with a higher TSR scale (i.e., a larger interval m). Therefore, we split the video into multiple chunks with overlaps, and each chunk is processed independently at each sampling step. We then apply linear blending over the overlapping parts in latent space after each sampling step.
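For illustration, the sketch below shows one way the overlapping latent chunks could be fused with linear blending; the chunking parameters, tensor layout, and weighting scheme are our assumptions rather than the authors' exact implementation.

```python
import torch

def blend_chunks(chunks, starts, total_frames, overlap):
    """Fuse overlapping latent chunks: each chunk is written back with linearly
    ramped weights over the overlap region, then the sum is normalized.

    chunks: list of (F_chunk, C, H, W) latents; starts: their start frame indices."""
    C, H, W = chunks[0].shape[1:]
    out = torch.zeros(total_frames, C, H, W)
    weight = torch.zeros(total_frames, 1, 1, 1)
    for chunk, start in zip(chunks, starts):
        f = chunk.shape[0]
        w = torch.ones(f)
        if start > 0:                                  # ramp up over the leading overlap
            w[:overlap] = torch.linspace(0, 1, overlap)
        if start + f < total_frames:                   # ramp down over the trailing overlap
            w[-overlap:] = torch.linspace(1, 0, overlap)
        w = w.view(f, 1, 1, 1)
        out[start:start + f] += chunk * w
        weight[start:start + f] += w
    return out / weight.clamp_min(1e-8)

# Usage example: a 40-frame latent video split into two 25-frame chunks with a 10-frame overlap.
chunks = [torch.randn(25, 4, 40, 64), torch.randn(25, 4, 40, 64)]
blended = blend_chunks(chunks, starts=[0, 15], total_frames=40, overlap=10)
```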
This paper received mixed scores: two borderline accept, one borderline reject, and one reject. Most reviewers acknowledged the simple and effective solution proposed in the paper. However, some reviewers raised concerns, including the high similarity to existing methods like ControlNet, limited original contributions (as many components for video-aware conditioning were borrowed from previous works), and the limited applicability of the proposed methods. The AC has reviewed the paper, the reviews, and the rebuttal. While super-resolution for AI-generated content is an intriguing and promising idea, the current version lacks a clear, significant contribution that could be recognized as a major advancement. Although two reviewers gave acceptance recommendations, their arguments lacked specific reasons strong enough to overturn the rejection opinions. Therefore, AC agrees that the current version is not ready for acceptance. Nevertheless, the AC strongly encourages the authors to address all the reviewers' concerns and resubmit to a future venue.
Additional Comments on Reviewer Discussion
There was no reviewer discussion.
Reject