PaperHub
Average rating: 4.8 / 10
Decision: Rejected · 4 reviewers
Ratings: 1, 5, 5, 8 (min 1, max 8, std. dev. 2.5)
Average confidence: 3.8
Correctness: 3.0
Contribution: 2.3
Presentation: 2.8
ICLR 2025

UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-02-05
TL;DR

A training-free method is proposed to enable pre-trained text-to-video models for both video motion editing and various appearance editing.

Abstract

Keywords
Diffusion Model, Video Editing

Reviews and Discussion

Official Review
Rating: 1

Recent advances in text-guided video editing have showcased promising results in appearance editing (e.g., stylization). However, video motion editing in the temporal dimension (e.g., from eating to waving), which distinguishes video editing from image editing, is underexplored. In this work, the authors present UniEdit, a tuning-free framework that supports both video motion and appearance editing by harnessing the power of a pre-trained text-to-video generator within an inversion-then-generation framework. However, this paper does not appear to be up to date: some new state-of-the-art approaches are neither included nor compared against.

Strengths

In this work, they present UniEdit, a tuning-free framework that supports both video motion and appearance editing by harnessing the power of a pre-trained text-to-video generator within an inversion-then-generation framework. To realize motion editing while preserving source video content, based on the insights that temporal and spatial self-attention layers encode inter-frame and intra-frame dependency respectively, they introduce auxiliary motion-reference and reconstruction branches to produce text-guided motion and source features respectively. The obtained features are then injected into the main editing path via temporal and spatial self-attention layers.

Weaknesses

  • Novelty. Using a ReferenceNet is not a novel approach in video generation. What is the difference between your approach and previous work?
  • Overclaim. "UniEdit represents a pioneering leap in text-guided, tuning-free video motion editing." Why is this work a pioneering leap?
  • Lack of comparison. Several recent works, such as ReVideo and COVE (both accepted at NeurIPS 2024), are not included in this paper.
  • Long video editing. All results are only 16 frames, which is too short.
  • Motion editing. It seems that all motions are similar. How is motion edited?
  • Inconsistency. For example, in the cat → dog case, the consistency of the head turn is problematic and exhibits sudden changes.

Questions

See weaknesses

Comment

Dear reviewer eXku,

We thank the reviewer for the valuable feedback on our paper. We provide clarifications to the concerns below:

Novelty. Using a ReferenceNet is not a novel approach in video generation. What is the difference between your approach and previous work?

As you mentioned, using a ReferenceNet has been explored previously. However, our approach differs in several key ways:

  1. We address the challenging problem of text-guided motion editing in the temporal dimension, which cannot be achieved by simply adapting existing ReferenceNet-based methods.
  2. We provide insight that “the temporal attention layers of the generator encode the inter-frame dependency”, enabling training-free motion editing. Building upon this, we explore and investigate feature injection in temporal layers, an area that has not been thoroughly explored.
  3. Furthermore, simply performing feature injection on temporal layers results in severe content inconsistency with the source video (Tab. B). In response, we design UniEdit with content preservation and structure control on spatial layers and motion injection on temporal layers.
  4. Previous works in video editing are typically tailored to particular tasks. For instance, Rerender-A-Video [1] excelled in style transfer, while Video-P2P [2] focused on local object editing. In contrast, our proposed method can effectively handle motion editing and various appearance editing tasks, showcasing promising performance both visually (https://uni-edit.github.io/UniEdit/) and quantitatively (Tab. 1).

Thus we believe that UniEdit contributes to the advancement of video editing.
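
For illustration, the feature injection described in points 2 and 3 can be sketched as attention in which the main editing path attends to keys/values taken from an auxiliary branch. The minimal PyTorch-style sketch below is an interpretation for exposition only; the tensor shapes and variable names are assumptions, not the released implementation.

```python
import torch

def cross_branch_self_attention(q_main, k_aux, v_aux):
    """Self-attention where the main editing path attends to keys/values from
    an auxiliary branch: the reconstruction branch in spatial layers (content
    preservation) or the motion-reference branch in temporal layers (motion
    injection). Shapes: (batch * heads, tokens, head_dim)."""
    scale = q_main.shape[-1] ** -0.5
    attn = torch.softmax(q_main @ k_aux.transpose(-2, -1) * scale, dim=-1)
    return attn @ v_aux

# Illustrative use inside one denoising step (names are placeholders):
#   spatial SA:  out_spatial  = cross_branch_self_attention(q_edit, k_recon,  v_recon)
#   temporal SA: out_temporal = cross_branch_self_attention(q_edit, k_motion, v_motion)
```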

Overclaim. "UniEdit represents a pioneering leap in text-guided, tuning-free video motion editing." Why is this work a pioneering leap?

We made this statement because, prior to our work, there was no tuning-free, text-guided motion editing method. We have made the statement more precise in the revised version (Line 94).

Lack of comparison. Several recent works, such as ReVideo and COVE (both accepted at NeurIPS 2024), are not included in this paper.

The decision time for NeurIPS 2024 overlaps with the submission period for ICLR 2025; therefore, we did not compare against the two methods you mentioned in our initial submission. In the revised version, we have included a comparison with ReVideo [1]. The qualitative results are shown in Fig. B (https://uni-edit.github.io/UniEdit/#Fig_B) on the project page. The main difference between the proposed UniEdit and ReVideo is that our method performs text-guided appearance and motion editing, whereas ReVideo edits based on the first frame and a trajectory. We have included the discussion of ReVideo in the related work section of the revised PDF. Since COVE [2] has not released its code, we were unable to include it in our comparison.

Note that the editing results we obtained using ReVideo are based on their open-source code (https://github.com/MC-E/ReVideo). The specific process is as follows: First, we use the first frame edited by UniEdit as the image input for ReVideo. Second, for appearance editing, the trajectory input is obtained by tracking feature points in several initial frames. For motion editing, we draw the trajectory based on the editing results of UniEdit as the input. Finally, we generate the "editing mask" for each case as the input mask for ReVideo. Due to limited time, we were unable to obtain quantitative results for ReVideo. This is because it requires manually drawing trajectories and generating editing masks for hundreds of editing cases to input into ReVideo.

[1] Mou, Chong, et al. "ReVideo: Remake a Video with Motion and Content Control." arXiv preprint arXiv:2405.13865 (2024).

[2] Wang, Jiangshan, et al. "COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing." arXiv preprint arXiv:2406.08850 (2024).

Long video editing. All results are only 16 frames, which is too short.

This limitation is due to the base T2V model, not a flaw in the proposed method. Specifically, the base text-to-video generation model we used can only generate 16-frame videos and performs poorly when synthesizing longer or shorter videos. The proposed method can seamlessly support long video editing by simply replacing the base model with a long video generation model. To demonstrate this, we validated the effectiveness of our method on different base models by deploying it on LaVie [1] and VideoCrafter2 [2], as described in Section 5.2. We believe long video editing is a promising research direction, but it is not the focus of this work.

[1] Wang, Yaohui, et al. "Lavie: High-quality video generation with cascaded latent diffusion models." arXiv preprint arXiv:2309.15103 (2023).

[2] Chen, Haoxin, et al. "Videocrafter2: Overcoming data limitations for high-quality video diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Comment

Motion editing. It seems that all motions are similar. How is motion edited?

In our paper, "motion" refers to inter-frame dependencies (e.g., optical flow) along the time dimension. For appearance editing tasks such as object replacement, the source video and the edited video maintain the same inter-frame dependencies. However, motion editing requires modifying the temporal correspondence. For example, in Figure 1, the action of "playing guitar" in the original video changes to "waving" or "eating an apple" after motion editing, altering the inter-frame dependencies.

We are not entirely sure what you mean by "all motions are similar". We would appreciate it if you could provide further clarification.

Inconsistency. For example, in the cat → dog case, the consistency of the head turn is problematic and exhibits sudden changes.

In the "cat" to "dog" case mentioned in Fig G.1, the temporal inconsistency observed during the dog's head-turning process is due to the base model struggling with synthesizing videos with large dynamics. Similar cases are shown in LaVie's (Gallary)[https://github.com/Vchitect/LaVie/tree/main?tab=readme-ov-file#download-pre-trained-models], where the model can only generate videos with similar or smaller dynamic ranges. Therefore, when the original video has significant motion, the base model sometimes fails to handle it well. This issue can be resolved by using a more powerful base model.

To demonstrate this, we conducted experiments with a more powerful T2V checkpoint with the assistance of the authors of LaVie [1]. The video results, shown in Fig. C (https://uni-edit.github.io/UniEdit/#Fig_C) of the Rebuttal section on the project page, indicate that the inconsistency issue has been resolved, which suggests that the occasional inconsistency is due to the limitations of the base model rather than a flaw in UniEdit. This is further evidenced by 1) the quantitative metrics in Tables 1 & 2 and 2) numerous visually consistent editing results.

[1] Wang, Yaohui, et al. "Lavie: High-quality video generation with cascaded latent diffusion models." arXiv preprint arXiv:2309.15103 (2023).

Comment

After the rebuttal, the authors have not addressed my questions. I still have concerns about the novelty and contribution, so I do not think this paper is ready for publication. Sorry for this decision.

Official Review
Rating: 5

The paper proposes a method to simultaneously edit motion and appearance through text guidance. Unlike previous approaches that required fine-tuning a pretrained generator model for the target video, the proposed UniEdit method is tuning-free. It utilizes the temporal self-attention layer within the generator to naturally express motion. Additionally, the spatial self-attention layer preserves the regions that are not being edited. Furthermore, the method ensures that the spatial structure of the video remains intact during appearance editing.

Strengths

The method can be applied to various Text-to-Video models, clearly stating the intended goal of simultaneously editing the content's motion and appearance. The paper presents results applied to multiple Text-to-Video models. Experiments using various prompts demonstrate that this method is not limited to specific prompts. The video quality available on the project page is excellent, and the experimental results are easy to understand. Videos generated using UniEdit outperformed existing methods in terms of CLIP score, Frame Quality, and Temporal Quality.

Weaknesses

The user study involved only 10 participants, which may lead to highly subjective results. Including such a small-sample subjective evaluation in the main performance table could confuse readers; it would be beneficial either to evaluate with more participants or to clearly indicate near the table that the results may be subjective due to the limited number of participants.

Questions

Please check if it is reasonable to have conducted the experiment with 10 subjects.

Comment

Dear reviewer iKZP,

We thank the reviewer for the valuable feedback! We address the questions below:

Check if it is reasonable to have conducted the experiment with 10 subjects.

Thank you for your suggestion. In addition to the original 10 participants, we have invited an additional 20 participants to conduct the evaluation. We have updated the results in Table 1. We acknowledge that this sample size may not be sufficiently large due to the limited time and resources. Therefore, we have noted in the table that "the results may be subjective due to the limited sample size".

Comment
  1. I appreciate your efforts to address the sample size concern by including 20 additional participants. However, even with the expanded group of 30 participants, the sample size may still be insufficient to ensure the reliability and generalizability of the experimental results.

  2. While I understand the limitation imposed by the base T2V model, I agree that 16 frames are too short to demonstrate practical utility, especially for long video editing tasks. Long video editing is an essential capability for many real-world applications, and addressing this limitation would significantly enhance the impact and applicability of your method.

  3. I agree with Reviewer eXku’s assessment regarding the novelty of the proposed method. While the emphasis on being tuning-free is appreciated, the limitations imposed by the baseline, particularly in extending the method to editability and long video generation, suggest that incorporating a fine-tuned model might be necessary. Additionally, the spatial and temporal attention mechanisms included in the proposed method have become standard in many video editing works. While such approaches were impactful in the early stages of video editing research, the field has progressed significantly. Thus, adopting more advanced, fine-tuning-based methods, such as those exemplified in recent works like ReVideo: Remake a Video with Motion and Content Control (NeurIPS 2024), seems necessary to achieve broader applicability.

Therefore, I believe this paper falls short of the standards required for acceptance at high-quality conferences like ICLR. If I have overlooked any crucial aspect, please let me know.

Comment

Dear reviewer iKZP,

Thanks for the reply. We would like to make the following clarification:

The sample size may still be insufficient to ensure the reliability and generalizability of the experimental results.

  1. A similar number of participants is widely used: 30 participants in Rerender-A-Video [1] (SIGGRAPH Asia 2023), 5 in Tune-A-Video [2] (ICCV 2023), and 20 in FateZero [3] (ICCV 2023 Oral).

  2. In addition to the user study, we quantitatively demonstrated the effectiveness of our method using CLIP scores, MVFD, and VBench scores. We also provided editing results on three models—LaVie, VideoCrafter2, and CogVideoX—to prove the generalizability of our approach. We believe this sufficiently demonstrates the effectiveness of our method.

16 frames are too short to demonstrate practical utility; long video editing is crucial.

  1. The proposed method can edit longer videos (e.g., 49 or 81 frames) by implementing our approach on other T2V models, such as CogVideoX-2b [4]. Please see our results in Fig. E (https://uni-edit.github.io/UniEdit/#Fig_E) on the project page.

  2. The number of frames in the edited video is entirely dependent on the number of frames that the base T2V model can generate, rather than being limited by our method. This is a universal limitation for all video editing approaches based on T2V models, not a unique drawback of our method. For example, as you mentioned with ReVideo [5], it can only edit 14 frames because it is trained based on SVD [6], which only supports generating 14 frames.

  3. While long video editing is indeed significant, we believe it is not aligned with the focus of this paper, ReVideo, and many other related works. This topic is out of the scope of our study.

While the emphasis on being tuning-free is appreciated, the limitations imposed by the baseline, particularly in extending the method to editability and long video generation, suggest that incorporating a fine-tuned model might be necessary.

"A fine-tuned method" also cannot solve the long video editing problem you mentioned, as it is similarly limited by the number of frames the base model can edit, such as in the case of ReVideo.

Additionally, the spatial and temporal attention mechanisms included in the proposed method have become standard in many video editing works. While such approaches were impactful in the early stages of video editing research, the field has progressed significantly.

While the feature injection technique has been adapted to different tasks, how to perform text-guided motion editing and feature injection on temporal attention layers is under-explored. Additionally, we believe the value of UniEdit is further demonstrated by designing a unified framework that supports various editing tasks, which is more flexible and easier to use in practice.

Thus, adopting more advanced, fine-tuning-based methods, such as those exemplified in recent works like ReVideo: Remake a Video with Motion and Content Control (NeurIPS 2024), seems necessary to achieve broader applicability.

We respectfully disagree that fine-tuning-based methods are more advanced than zero-shot methods. Unlike image editing, video editing lacks large-scale paired video data (i.e., [source video, editing prompt, target video]) for extensive training. As a result, zero-shot editing is commonly used in the field. Compared to methods that are trained on a single or a limited set of videos, zero-shot editing in a feed-forward pass does not suffer from overfitting and offers advantages in both speed and flexibility. We believe that both tuning-free editing and fine-tuning-based methods are valuable research topics worth exploring.

We also include a qualitative comparison of the training-based method ReVideo and UniEdit in Fig. B on the project page (https://uni-edit.github.io/UniEdit/#Fig_B). It can be observed that ReVideo sometimes fails to generate edited videos with natural motion.

[1] Yang, Shuai, et al. "Rerender a video: Zero-shot text-guided video-to-video translation." SIGGRAPH Asia 2023.

[2] Wu, Jay Zhangjie, et al. "Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation." ICCV 2023.

[3] Qi, Chenyang, et al. "Fatezero: Fusing attentions for zero-shot text-based video editing." ICCV 2023.

[4] Yang, Zhuoyi, et al. "Cogvideox: Text-to-video diffusion models with an expert transformer." arXiv preprint arXiv:2408.06072 (2024).

[5] Mou, Chong, et al. "ReVideo: Remake a Video with Motion and Content Control." NeurIPS 2024.

[6] Blattmann, Andreas, et al. "Stable video diffusion: Scaling latent video diffusion models to large datasets." arXiv preprint arXiv:2311.15127 (2023).

Comment

Thank you for addressing some of my concerns. While some of my doubts have been resolved, I still find myself questioning the novelty of this paper.

While the paper demonstrates performance improvements, the proposed method feels like an incremental enhancement over existing techniques. Specifically:

  • The approach of injecting motion text seems like a minor modification rather than a significant technical breakthrough.
  • Compared to existing methods that inject optical flow information or motion trajectories, it is unclear if providing a motion prompt offers greater control in practice.

The contribution of this paper appears to rely on manipulating spatial-temporal attention—a concept already widely explored in video editing—using a slightly modified approach. Some examples of related works that adopt similar attention-based techniques for video editing include:

  • Video-P2P: Video Editing with Cross-attention Control
  • Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
  • Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models
  • FateZero: Fusing Attentions for Zero-shot Text-based Video Editing
  • Edit-A-Video: Single Video Editing with Object-Aware Consistency
  • FLATTEN: Optical Flow-guided Attention for Consistent Text-to-Video Editing

Given these existing methods, the incremental nature of the proposed approach raises questions about its originality and its position within the broader landscape of video editing research. Therefore, I will keep my score.

Comment

Dear reviewer iKZP,

Thanks for the reply. We would like to make the following clarification:

All the works you listed focus solely on video appearance editing rather than motion editing, which is more challenging. Our work emphasizes designing novel feature injection techniques on the temporal-attention layers for addressing text-guided motion editing, which is also under-explored. Therefore, our approach significantly differs from existing works in terms of the task and technical implementations. We believe this is not incremental, a point also acknowledged by Reviewer u6DS.

We hope you can reconsider our work.

Comment

Dear Reviewer iKZP,

Does the further response from the authors address your concerns? It would be helpful if you could provide your feedback.

Thank you.

AC

Official Review
Rating: 5

The authors propose UniEdit, addressing the task of video motion and appearance editing using pre-trained video diffusion models (VDMs) in a unified and tuning-free manner. Specifically, they apply the common "inversion-and-generation" paradigm with two innovative auxiliary branches, i.e., the reconstruction branch for content preservation and the motion-reference branch for motion editing, in addition to the main editing path. Appearance editing, motion editing, and structure preservation are realized by swapping attention features (QKV) in specific types and depths of attention layers, as well as at specific denoising steps. UniEdit is compared with previous zero-shot/tuning-free methods and outperforms them both quantitatively and qualitatively.

Strengths

  • The paper demonstrates good clarity in writing, making it easy for readers to follow the flow of ideas.
  • The paper is well-motivated, offering various types of video editing while preserving non-target properties using a unified framework.
  • The design of attention replacement for appearance, structure, and motion editing is technically sound, leveraging the insights and functionality of spatial and temporal self-attention, respectively.
  • The paper and the auxiliary material show impressive editing results produced with their framework with substantial improvement on existing baselines, especially in temporal coherence.

Weaknesses

  • The authors claim in L217 that "better tackle source video with large dynamics". However, most visual examples are with small motion magnitudes. Does the model struggle with appearance and motion editing in videos with larger dynamics? Temporal inconsistencies are noticeable in videos with large motion (see Fig G.1).
  • The authors claim in L312 that "UniEdit is not limited to specific video diffusion models", which is inaccurate. The proposed framework depends on the separate spatial and temporal attention design of video diffusion models, which is incompatible with full 3D attention DiT-based video generators. And 3D full attention is increasingly being proven to be more effective [1].
  • This work still follows the common "inversion-and-generation" paradigm, potentially performing better on generated videos but struggling with out-of-domain real videos.
  • In Fig.4 (bottom-right case), motion-editing results show background changes, such as alterations in the lawn.
  • The user study involved an insufficient number of participants (only 10), making it difficult to support the conclusions effectively with robust experimental results.

[1] CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer, Yang et al., 2024

Minor concerns:

  • [Clarity] Consider including a note below the table or in the caption indicating that higher values are better.

Questions

  • [Ablation study] Why does the content preservation improve textual alignment a lot in Table 2?
  • I believe different methods have their own "success rate" in generation. Is there a way to measure this rate for fair comparison, beyond just showcasing several visual examples? Additionally, how to know if this tuning-free method really works or not (to what degree), compared to potential methods with large-scale training?

Ethics Concerns

N/A

Comment

Dear reviewer vcxP,

We thank the reviewer for the elaborate review! We address your concerns below:

The authors claim in L217 that "better tackle source video with large dynamics". However, most visual examples are with small motion magnitudes. Does the model struggle with appearance and motion editing in videos with larger dynamics? Temporal inconsistencies are noticeable in videos with large motion (see Fig G.1).

In the "cat" to "dog" case mentioned in Fig G.1, the temporal inconsistency observed during the dog's head-turning process is due to the base model struggling with synthesizing videos with large dynamics. Similar cases are shown in LaVie's Gallary (https://github.com/Vchitect/LaVie/tree/main?tab=readme-ov-file#download-pre-trained-models), where the model can only generate videos with similar or smaller dynamic ranges. Therefore, when the original video has significant motion, the base model sometimes fails to handle it well. This issue can be resolved by using a more powerful base model.

To demonstrate this, we conducted experiments with a more powerful T2V checkpoint with the assistance of the authors of LaVie [1]. The video results, shown in Fig. C (https://uni-edit.github.io/UniEdit/#Fig_C) of the Rebuttal section on the project page, indicate that the inconsistency issue has been resolved, which suggests that the occasional inconsistency is due to the limitations of the base model rather than a flaw in UniEdit. This is further evidenced by 1) the quantitative metrics in Tables 1&2 and 2) numerous visually consistent editing results.

[1] Wang, Yaohui, et al. "Lavie: High-quality video generation with cascaded latent diffusion models." arXiv preprint arXiv:2309.15103 (2023).

The authors claim in L312 that "UniEdit is not limited to specific video diffusion models", which is inaccurate. The proposed framework depends on the separate spatial and temporal attention design of video diffusion models, which is incompatible with full 3D attention DiT-based video generators. And 3D full attention is increasingly being proven to be more effective [1].

Thank you for your reminder! We have revised the claim in our paper. The inability to effectively adapt to models that use only 3D full attention [1] is indeed a limitation of our method. However, we also note that many Unet-based [2, 3, 4] and transformer-based [5] video generation models have separate attention layers, demonstrating that UniEdit has a relatively broad range of applications.

[1] Yang, Zhuoyi, et al. "Cogvideox: Text-to-video diffusion models with an expert transformer." arXiv preprint arXiv:2408.06072 (2024).

[2] Blattmann, Andreas, et al. "Stable video diffusion: Scaling latent video diffusion models to large datasets." arXiv preprint arXiv:2311.15127 (2023).

[3] Wang, Yaohui, et al. "Lavie: High-quality video generation with cascaded latent diffusion models." arXiv preprint arXiv:2309.15103 (2023).

[4] Chen, Haoxin, et al. "Videocrafter2: Overcoming data limitations for high-quality video diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[5] Ma, Xin, et al. "Latte: Latent diffusion transformer for video generation." arXiv preprint arXiv:2401.03048 (2024).

This work still follows the common "inversion-and-generation" paradigm, potentially performing better on generated videos but struggling with out-of-domain real videos.

Yes, the "inversion-and-generation" paradigm can exhibit inferior performance on some real-world videos. This is primarily because the pipeline sometimes struggles to faithfully reconstruct the input video, leading to a content mismatch. However, we found that this issue can be effectively addressed by equipping UniEdit with null-text inversion [1]. Specifically, for real-world videos, we first optimize an unconditional embedding at each denoising step tt to ensure the model can accurately reconstruct the original video. Then, during the editing process, we replace the unconditional embedding in all branches with the optimized embedding. This approach ensures that the content of the original video is well-preserved, resulting in better performance.

[1] Mokady, Ron, et al. "Null-text inversion for editing real images using guided diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
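
For context, the per-step null-text optimization described above can be sketched against a generic diffusers-style UNet/scheduler interface as follows. The interface, hyperparameters, and helper names are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def null_text_optimize(unet, scheduler, inv_latents, cond_emb, uncond_emb,
                       guidance_scale=7.5, n_iters=10, lr=1e-2):
    """inv_latents: [z_T, ..., z_0] latents saved during DDIM inversion of the
    source video. Returns one optimized null (unconditional) embedding per
    denoising step, to be reused in all branches during editing."""
    def cfg_step(z, t, uc):
        eps_u = unet(z, t, encoder_hidden_states=uc).sample
        eps_c = unet(z, t, encoder_hidden_states=cond_emb).sample
        eps = eps_u + guidance_scale * (eps_c - eps_u)  # classifier-free guidance
        return scheduler.step(eps, t, z).prev_sample

    optimized, z = [], inv_latents[0]
    uncond = uncond_emb.clone()
    for i, t in enumerate(scheduler.timesteps):
        target = inv_latents[i + 1]  # inversion latent at the next (less noisy) step
        uncond = uncond.detach().requires_grad_(True)
        opt = torch.optim.Adam([uncond], lr=lr)
        for _ in range(n_iters):
            loss = F.mse_loss(cfg_step(z, t, uncond), target)  # match the inversion trajectory
            opt.zero_grad(); loss.backward(); opt.step()
        optimized.append(uncond.detach())
        with torch.no_grad():  # advance the latent with the optimized embedding
            z = cfg_step(z, t, uncond)
    return optimized
```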

Comment

In Fig.4 (bottom-right case), motion-editing results show background changes, such as alterations in the lawn.

Occasionally, UniEdit's output may exhibit minor inconsistencies. These can be effectively addressed using the mask-guided coordination method designed in Section 4.3. Specifically, we can obtain foreground-background masks by either binarizing the attention maps of CA-S modules with a threshold or using SAM [1] segmentation. Then, we apply masked attention as formulated in Eq. 5 and Eq. 6. This approach further enhances the consistency between the edited video and the source video. We have showcased the improved results in Fig. D (https://uni-edit.github.io/UniEdit/#Fig_D) of the Rebuttal section on the project page.

[1] Kirillov, Alexander, et al. "Segment anything." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
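
As a rough illustration of the mask-guided coordination idea: the exact formulation is given by Eq. 5 and Eq. 6 in the paper, and the thresholding and simple feature blend below are a simplified reading, with all names and defaults assumed.

```python
import torch

def foreground_mask_from_attention(cross_attn_map, threshold=0.3):
    """Binarize a cross-attention map associated with the edited token(s) into
    a foreground mask; a SAM segmentation mask could be used instead.
    cross_attn_map: (frames, height, width), non-negative."""
    norm = cross_attn_map / (cross_attn_map.amax(dim=(-2, -1), keepdim=True) + 1e-8)
    return (norm > threshold).float()

def mask_guided_blend(feat_edit, feat_src, mask):
    """Keep features from the editing path in the foreground and features from
    the reconstruction branch in the background, so unedited regions stay
    consistent with the source video."""
    return mask * feat_edit + (1.0 - mask) * feat_src
```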

The user study involved an insufficient number of participants (only 10), making it difficult to support the conclusions effectively with robust experimental results.

Thank you for your suggestion. In addition to the original 10 participants, we have invited an additional 20 participants to conduct the evaluation. We have updated the results in Table 1. We acknowledge that this sample size may not be sufficiently large due to the limited time and resources. Therefore, we have noted in the table that "the results may be subjective due to the limited sample size".

We also use automated metrics, i.e., CLIP scores and VBench [1] metrics in Tab. 1, to further evaluate the effectiveness of UniEdit, and UniEdit outperforms baselines on all metrics. In addition, we include the Fréchet Video Motion Distance (FVMD) [2] to evaluate motion consistency in the revised version.

[1] Huang, Ziqi, et al. "Vbench: Comprehensive benchmark suite for video generative models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[2] Liu, Jiahe, et al. "Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos." First Workshop on Controllable Video Generation@ ICML24.

Consider including a note below the table or in the caption indicating that higher values are better.

Thanks for the advice! We have included an explanation in the caption of Tab. 1.

Why does the content preservation improve textual alignment a lot in Table 2?

The content preservation technique is designed to retain the content of the source video in the edited video. Therefore, content preservation primarily improves the "Frame Similarity" metric in Tab. 2, which is formulated as the average cosine similarity between the source frame embeddings and the edited frame embeddings. On the other hand, we observe a relatively marginal improvement in "Textual Alignment" (~+1.0) compared to "Frame Similarity" (~+7.0). This is because the target prompt may also contain descriptions of the unedited parts, so inheriting content from the source video can improve textual alignment to some extent.
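
For reference, the "Frame Similarity" computation described above amounts to something like the following sketch; treating the frame encoder as a CLIP image encoder is an assumption based on the metric description.

```python
import torch.nn.functional as F

def frame_similarity(src_frame_embeds, edit_frame_embeds):
    """Average per-frame cosine similarity between source and edited frame
    embeddings (e.g., CLIP image features), each of shape (num_frames, dim)."""
    return F.cosine_similarity(src_frame_embeds, edit_frame_embeds, dim=-1).mean()
```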

Comment

I believe different methods have their own "success rate" in generation. Is there a way to measure this rate for fair comparison, beyond just showcasing several visual examples? Additionally, how to know if this tuning-free method really works or not (to what degree), compared to potential methods with large-scale training?

Firstly, we believe that the lack of evaluation regarding the "success rate" is a significant issue in the field, and it is valuable to address. The primary reason for this is that defining success can be quite challenging. For instance, a method might generally produce results that align with the target prompt, but there may be some notable inconsistencies with the source video. In such cases, it is difficult to definitively determine whether it falls within the scope of successful editing.

As for the proposed UniEdit, the "success rate" primarily depends on whether the pre-trained T2V model is capable of generating videos that match the content described by the target prompt. Specifically, we can use the target prompt P_t as a condition to denoise without using UniEdit (i.e., using the vanilla T2V generation model) to obtain the generated result V_t. Though V_t may have a severe content mismatch or inconsistency with the source video, the generated result should at least conform to the description of P_t. In such cases, UniEdit has a high success rate of effectively injecting content features and motion features into the 'main editing path' to achieve an edited result that matches both the original video content and the target prompt.

On the other hand, large-scale training on paired data can indeed increase the "success rate" by improving the text-following ability of the base model, i.e., the base model has a higher probability of generating a video aligned with P_t. However, training-based methods often encounter a degradation in model generation quality. This issue arises because the paired data used for training is typically significantly lower in both quantity and diversity compared to the data used for training the base model, resulting in reduced quality of the editing outcomes. In contrast, training-free methods offer greater flexibility.

Comment

Thanks for the rebuttal with the new experiments. Some of my concerns have been addressed. However, I still lean toward a negative assessment considering its weak generality, inferior paradigm, and less satisfactory results, so I will maintain my original score.

Comment

Dear Reviewer vcxP,

Thanks for your prompt reply! We further address your concerns as follows:

weak generality

We have supplemented our experiments to demonstrate that our method remains effective on CogVideoX [1]. The experimental setup is as follows:

We used CogVideoX-2b as the base text-to-video generation model and applied the feature injection techniques from UniEdit to its 3D-attention layer. Specifically, CogVideoX performs full-attention within text features and visual features in each transformer block, while UniEdit is designed on visual features. Therefore, we extract the visual features with a mask and perform feature injection on the query, key, and value of these features, while keeping the text features unchanged. Additionally, the operations performed by UniEdit on the temporal-attention layer and spatial-attention layer were simplified to be applied on the 3D-attention (full-attention) layer.
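
A minimal sketch of the masked Q/K/V injection on the joint text+video token sequence described above; the [text, video] token ordering and tensor shapes are assumptions for illustration, not the exact implementation.

```python
import torch

def inject_visual_qkv(q, k, v, q_aux, k_aux, v_aux, text_len):
    """Inside a 3D full-attention block over concatenated [text, video] tokens,
    replace Q/K/V only for the video tokens with features from an auxiliary
    branch, leaving the text tokens unchanged. Shapes: (batch, seq_len, dim)."""
    q, k, v = q.clone(), k.clone(), v.clone()
    q[:, text_len:] = q_aux[:, text_len:]
    k[:, text_len:] = k_aux[:, text_len:]
    v[:, text_len:] = v_aux[:, text_len:]
    return q, k, v
```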

The results are shown in Fig. E (https://uni-edit.github.io/UniEdit/#Fig_E) on the project page. It can be observed that UniEdit successfully executed stylization, object replacement, and motion editing. This demonstrates the versatility of UniEdit, showing that it can be used not only on Unet-based generation models like LaVie [2] and VideoCrafter2 [3] but also can be transferred with minimal changes to DiT-based models with 3D full attention design like CogVideoX.

[1] Yang, Zhuoyi, et al. "Cogvideox: Text-to-video diffusion models with an expert transformer." arXiv preprint arXiv:2408.06072 (2024).

[2] Wang, Yaohui, et al. "Lavie: High-quality video generation with cascaded latent diffusion models." arXiv preprint arXiv:2309.15103 (2023).

[3] Chen, Haoxin, et al. "Videocrafter2: Overcoming data limitations for high-quality video diffusion models." CVPR 2024.

inferior paradigm

We would like to explain that the paradigm of zero-shot video editing and inversion-then-generation pipeline is also widely used by recent works [1,2,3,4] and has several advantages:

a. Zero-Shot Video Editing

Unlike image editing, video editing lacks large-scale paired video data (i.e., [source video, editing prompt, target video]) for extensive training. As a result, zero-shot editing is commonly used in the field. Compared to methods that are trained on a single or a limited set of videos, zero-shot editing in a feed-forward pass does not suffer from overfitting and offers advantages in both speed and flexibility.

b. Inversion-then-Generation Pipeline

The inversion-then-generation pipeline is widely adopted in recent works such as COVE [1] at NeurIPS '24 and I2VEdit [2] at SIGGRAPH Asia '24, indicating its effectiveness in video editing.

Moreover, the performance of the inversion-then-generation pipeline used in this paper can be significantly improved through the following methods: 1) In our experiments, we found that null-text inversion [5] can achieve almost perfect reconstruction quality for nearly any input videos, ensuring high editing quality; 2) Using advanced inversion methods (e.g., EDM [6], Direct Inversion [7]) can further enhance the performance of this pipeline.

[1] Wang, Jiangshan, et al. "COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing." NeurIPS 2024.

[2] Ouyang, Wenqi, et al. "I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models." SIGGRAPH Asia 2024.

[3] Geyer, Michal, et al. "TokenFlow: Consistent Diffusion Features for Consistent Video Editing." ICLR 2024.

[4] Cong, Yuren, et al. "FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing." ICLR 2024.

[5] Mokady, Ron, et al. "Null-text inversion for editing real images using guided diffusion models." CVPR 2023.

[6] Karras, Tero, et al. "Elucidating the design space of diffusion-based generative models." NeurIPS 2022.

[7] Ju, Xuan, et al. "Pnp inversion: Boosting diffusion-based editing with 3 lines of code." ICLR 2024.

less satisfactory results

We demonstrate the performance of UniEdit with plenty of results on the project page, and qualitative and quantitative comparison with baseline methods.

To be specific, on our project page (https://uni-edit.github.io/UniEdit/), we provide the editing results of UniEdit with VideoCrafter2 and LaVie in Section A, results of UniEdit with CogVideoX-2b in Figure E of Section Rebuttal, and qualitative comparison with baselines in Figure B of Section Rebuttal. We include quantitative results with baselines in Table 1 and Table 7 in the paper, which surpass all baseline methods in terms of visual quality, temporal consistency, and textual alignment.

Moreover, the editing performance can be further improved with advanced T2V generation models, as evidenced in Figure C, Section Rebuttal.

We hope that our responses and the additional results address your concerns. We would be grateful if you would kindly let us know of any other concerns and if we could further assist in clarifying any other issues.

Comment

Dear Reviewer vcxP,

Does the authors' further response address the concerns and alter the negative assessment?

Thank you.

AC

Comment

Dear Reviewer vcxP:

Thanks again for the constructive comments and the time you dedicate to the paper!

We have supplemented our experiments to demonstrate that our method remains effective on CogVideoX. Since the discussion is about to close tomorrow, we would be grateful if you could let us know if we have addressed your concerns.

Thanks a lot, and with sincerest best wishes

Submission 5877 Authors

Comment

Thank you for your responses, which address some of my concerns. However, the three main concerns I raised remain largely unresolved. Additionally, I partially agree with reviewer iKZP regarding the novelty issue. Lastly, I will maintain my rating.

Official Review
Rating: 8

UniEdit presents a novel approach to text-guided video editing, utilizing a tuning-free framework that leverages the capabilities of pre-trained text-to-video (T2V) diffusion models. The key strength of UniEdit lies in its ability to perform both motion and appearance editing within a unified architecture. This is a significant advancement compared to prior methods, which often focus on either motion or appearance editing, but not both. The core idea behind UniEdit is to invert the source video into the latent space of a pre-trained T2V model, and then guide the generation process using the target text prompt and auxiliary branches that control motion and appearance.

Strengths

  • Unified Framework for Motion and Appearance Editing: UniEdit effectively addresses a key challenge in video editing by providing a single framework that can handle both motion and appearance modifications. This is achieved by leveraging the inherent properties of spatial and temporal self-attention layers in the T2V model. The motion-reference branch, conditioned on the target prompt, injects motion features into the temporal self-attention layers, enabling text-guided motion changes. The reconstruction branch focuses on preserving source content by injecting features into the spatial self-attention layers.

  • Zero-Shot Editing Capability: UniEdit operates in a tuning-free manner, meaning it doesn't require fine-tuning the base T2V model for each new video or editing task. This makes it highly efficient and user-friendly, as it avoids the computational cost and complexity associated with model training.

  • Emphasis on Temporal Consistency: Maintaining temporal consistency is crucial for high-quality video editing. UniEdit tackles this challenge by injecting features from the reconstruction branch into the spatial self-attention layers, which helps preserve the content and structure of the source video. The use of mask-guided coordination further enhances the consistency of unedited areas.

Weaknesses

  1. The problem is challenging, and I believe this paper only partially addresses it. The editing capabilities need to be more clearly defined. For example, are all text prompts representing motion control or appearance control feasible? The scope requires further clarification. The inherent ambiguity of language and the complexity of video editing tasks make it practically impossible to guarantee success for every possible textual input.

Defining Boundaries and Constraints: To enhance clarity regarding its scope, the UniEdit paper should clearly state:

  • Types of Edits Supported: Specify the categories of motion and appearance edits that UniEdit is designed to handle effectively.

  • Limitations in Prompt Complexity: Provide guidelines or examples of text prompts that might pose challenges for UniEdit, particularly those involving complex scenes or multiple simultaneous edits.

  • Known Failure Cases: Discuss specific scenarios where UniEdit is known to perform poorly, offering insights into the reasons for these limitations.

  2. UniEdit's core techniques are indeed inspired by existing approaches:

    • Auxiliary Branches: The concept of using auxiliary branches for specific tasks, such as reconstruction or motion guidance, is not entirely new.
    • Feature Injection via Self-Attention: Injecting features into specific layers of a neural network, especially within attention mechanisms, is a common practice in various image and video editing tasks.
    • Temporal and Spatial Self-Attention: The observation that temporal self-attention layers encode inter-frame dependencies, while spatial self-attention layers capture intra-frame information, is a widely recognized principle reported in other works like I2VEdit (https://i2vedit.github.io/) for diffusion-based methods and SIFA (https://github.com/FuchenUSTC/SIFA) for other video editing tasks.

    While UniEdit's core techniques are not entirely novel, the integration and adaptation of these mechanisms for tuning-free video motion editing demonstrate a degree of innovation.

  3. The improvements, while statistically significant, do not represent a leap in performance but rather a refinement of existing techniques. For instance, UniEdit's performance on Frame Quality and Temporal Quality metrics surpasses competing methods like TokenFlow and Rerender, but the margins are relatively modest (e.g., a 1-2% improvement in some metrics).

  4. As the authors have pointed out, UniEdit has a performance ceiling when performing both motion and appearance editing simultaneously and struggles with complex scenes.

  5. Missing Critical Baseline: ReVideo (https://mc-e.github.io/project/ReVideo/) should be included as a baseline. ReVideo's focus on remaking videos with control over both motion and content aligns well with UniEdit's capabilities. A direct comparison would reveal their strengths and weaknesses in tasks such as object replacement, style transfer, and motion modification.

    Including additional works for comparison will further strengthen this paper.

In addition, a discussion with the following related works is beneficial:

  • AnyV2V: Its compatibility with various image editing methods and emphasis on simplicity and consistency present an interesting comparison point for UniEdit.
  • I2VEdit: Comparing with I2VEdit would reveal the advantages and disadvantages of using an edited first frame versus direct text guidance for video editing. This comparison could be particularly insightful for tasks involving local edits, global style transfer, identity manipulation, and subject customization. The temporal consistency of I2VEdit appears to be quite robust.
  • FLATTEN: Comparing FLATTEN to UniEdit would showcase the differences between explicitly incorporating optical flow information and UniEdit's approach to motion injection.
  6. Metric Coverage: The evaluation relies heavily on CLIP scores and user preferences. While these metrics are valuable, the absence of more comprehensive metrics (e.g., temporal coherence measures beyond flickering, user engagement metrics) limits the robustness of the evaluation.

    • Flow Warping Error (E_warp): This metric can be used to assess visual consistency by comparing the edited video frames warped according to the estimated optical flow of the source video.
    • Frame Variance Metric for Motion Detection (FVMD): This may provide a more objective measure of how well the model preserves motion consistency across frames.
  7. The evaluation dataset could include BalanceCC if possible.

  8. Sensitivity Analysis can be more detailed:

    • Sensitivity to Injection Steps: The quantitative analysis clearly shows that the choice of injection steps influences the quality and consistency of the output.
    • Sensitivity to Blend Layers and Steps: The qualitative exploration indicates a trade-off between stylization and fidelity controlled by the blend layers and steps. This suggests a need for adjustment based on the desired outcome.
  9. The related works are primarily categorized by broad topics (e.g., video generation, video editing) rather than by the specific innovations relevant to UniEdit (e.g., tuning-free frameworks, self-attention manipulation). This top-down organization makes it harder to discern how UniEdit differentiates itself from existing methods.

Questions

Please refer to the weaknesses section for my questions. I may revise my rating based on the authors' rebuttal.

Comment

Dear reviewer u6DS,

We would like to thank you for the very detailed review and the constructive feedback you provided! Below, we address your concerns in detail:

Defining Boundaries and Constraints of the proposed method.

Thank you for your valuable suggestion. We elaborate on the boundaries and constraints of UniEdit below:

Supported Edit Tasks

UniEdit supports several commonly used editing tasks, i.e., style transfer, rigid/non-rigid object replacement, background replacement, and motion editing.

Limitations

The limitations primarily lie in the following areas:

  1. Lack of Support for Non-Text Inputs: UniEdit is a text-guided editing method and does not accommodate user inputs in other forms, such as motion trajectories or skeletons. This limits UniEdit's ability to perform precise editing, such as specifying the trajectory of an object's movement in pixel space.
  2. Text-to-Video Model's Text-Following Capability: Given a target prompt to be edited, the base T2V model must respond effectively to the prompt and generate a video that aligns with the described content. Specifically, we can use the target prompt P_t as a condition to denoise without using UniEdit (i.e., using the vanilla T2V generation model) to obtain the generated result V_t. Though V_t may have a severe content mismatch or inconsistency with the source video, the generated result should at least conform to the description of P_t. In such cases, we can effectively inject content features and motion features into the 'main editing path' to achieve an edited result that matches both the original video content and the target prompt.

Analysis of Failure Cases

We include failure case visualizations and the corresponding analyses below in Appendix B.5:

We exhibit failure cases in Fig. 13. Fig. 13(a) shows a case of editing multiple elements simultaneously, where we observe a relatively large inconsistency with the source video. A naive solution is to perform editing with UniEdit multiple times. Fig. 13(b) visualizes the results when editing a video with a complex scene, where the base T2V model sometimes cannot precisely follow the semantics in the target prompt, resulting in incorrect editing. This may be caused by the base model's limited text-understanding power, as discussed in [1]. It could be alleviated by leveraging the reasoning power of MLLMs [1] or adapting approaches for editing in complex scenarios [2].

[1] Huang, Yuzhou, et al. "Smartedit: Exploring complex instruction-based image editing with multimodal large language models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[2] Mao, Qi, et al. "Mag-edit: Localized image editing in complex scenarios via mask-based attention-adjusted guidance." Proceedings of the 32nd ACM International Conference on Multimedia. 2024.

UniEdit's core techniques are indeed inspired by existing approaches. While UniEdit's core techniques are not entirely novel, the integration and adaptation of these mechanisms for tuning-free video motion editing demonstrate a degree of innovation.

We agree with your comments that the idea of using auxiliary branches and feature injection has been adapted to various scenarios. Additionally, we believe the value of UniEdit is further demonstrated in the following aspects:

  1. Exploring Text-Guided Motion Editing as a Valuable Task: Existing methods primarily control motion through additional supervision signals (e.g., optical flow [1], trajectories [2, 3]). Using text alone is more flexible and worth exploring.
  2. Designing a Unified Tuning-Free Framework that Supports Various Editing Tasks: Many works in the editing field focus on one or a few specific editing tasks. For example, [4, 5] can handle object replacement and stylization tasks but cannot perform motion editing; [6] achieves motion editing with a reference video but does not support other types of tasks. We believe that designing a framework for various editing tasks is more flexible and easier to use in practice.

[1] Yan, Wilson, et al. "Motion-conditioned image animation for video editing." arXiv preprint arXiv:2311.18827.

[2] Wang, Zhouxia, et al. "Motionctrl: A unified and flexible motion controller for video generation." ACM SIGGRAPH 2024 Conference Papers. 2024.

[3] Mou, Chong, et al. "ReVideo: Remake a Video with Motion and Content Control." arXiv preprint arXiv:2405.13865 (2024).

[4] Hertz, Amir, et al. "Prompt-to-Prompt Image Editing with Cross-Attention Control." The Eleventh International Conference on Learning Representations.

[5] Liu, Shaoteng, et al. "Video-p2p: Video editing with cross-attention control." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[6] Tu, Shuyuan, et al. "Motioneditor: Editing video motion via content-aware diffusion." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Comment

The improvements, while statistically significant, do not represent a leap in performance but rather a refinement of existing techniques. For instance, UniEdit's performance on Frame Quality and Temporal Quality metrics surpasses competing methods like TokenFlow and Rerender, but the margins are relatively modest.

We believe that metrics such as "Frame Quality" or "Temporal Quality" primarily reflect the visual quality of the generated video but do not fully capture the editing performance. For example, we observed that when performing motion editing, baseline methods tend to output videos that are almost identical to the original video. Such videos would still score highly in terms of "Frame Quality" and "Temporal Quality".

Another aspect where we surpass the baseline is our ability to handle various tasks. In contrast, baseline methods typically excel in one or a few of these tasks. For instance, Rerender-A-Video [1] specializes in stylization, while MasaCtrl [2] is designed for non-rigid editing tasks.

[1] Yang, Shuai, et al. "Rerender a video: Zero-shot text-guided video-to-video translation." SIGGRAPH Asia 2023 Conference Papers. 2023.

[2] Cao, Mingdeng, et al. "Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

UniEdit has a performance ceiling when performing both motion and appearance editing simultaneously and struggles with complex scenes.

Simultaneously performing multiple editing tasks and editing in complex scenarios is indeed a very challenging task. The capability of the base model also significantly affects video editing performance. For example, in the case shown in Fig. 13(b) where "edit one of the goldfish into a shark", the failure is mainly due to the model's inability to accurately follow the target prompt, resulting in the incorrect editing of all goldfish. Therefore, this could be improved by using more powerful base text-to-video generation models.

Furthermore, the performance of our method in these challenging scenarios can be enhanced by the mask-guided coordination proposed in section 4.3. We have included the improved results in Fig. A on the project page (https://uni-edit.github.io/UniEdit/#Fig_A). It can be observed that, after using mask-guided coordination, the background consistency in the "panda waving" example is improved. Meanwhile, the segmentation mask can effectively guide the model to correctly edit in the "goldfish to shark" example.

Missing Critical Baseline ReVideo. Including additional works for comparison will further strengthen this paper.

Thank you for your reminder! We have included qualitative comparison results in Fig. B on the project page (https://uni-edit.github.io/UniEdit/#Fig_B). It can be observed that though ReVideo [1] can accurately make the edited video follow the user's input trajectory, achieving more precise editing, it may sometimes fail to generate more natural edited videos. The main difference between the proposed UniEdit and ReVideo is that our method is text-guided appearance and motion editing, whereas ReVideo is based on editing the first frame and the trajectory.

Note that the editing results we obtained using ReVideo are based on their open-source code (https://github.com/MC-E/ReVideo). The specific process is as follows: First, we use the first frame edited by UniEdit as the image input for ReVideo. Second, for appearance editing, the trajectory input is obtained by tracking feature points in several initial frames. For motion editing, we draw the trajectory based on the editing results of UniEdit as the input. Finally, we generate the "editing mask" for each case as the input mask for ReVideo.

Due to limited time, we were unable to obtain quantitative results for ReVideo. This is because it requires manually drawing trajectories and generating editing masks for hundreds of editing cases to input into ReVideo.

Additionally, we have included the discussion about ReVideo in the related work section.

[1] Mou, Chong, et al. "ReVideo: Remake a Video with Motion and Content Control." arXiv preprint arXiv:2405.13865 (2024).

Comment

A discussion with AnyV2V, I2VEdit, FLATTEN is beneficial.

Thanks for the valuable suggestion! We have integrated the discussion below into the 'Related Works' section in the revised version:

AnyV2V [1] innovatively decomposes the video editing task into two sub-tasks: image editing and video-referenced image-to-video generation. The designed framework supports various editing tasks by replacing the image editing tool. The primary difference with UniEdit is that UniEdit employs an end-to-end editing pipeline.

I2VEdit [2] also leverages image editing approaches to improve video editing performance and elaborately designs a 'motion alignment' training strategy to enhance temporal consistency. However, the technique that aligns the motion of the edited video with the original video, is inherently incompatible with motion editing.

FLATTEN [3] extracts optical flow from the source video and designs flow-guided attention to improve visual consistency. Although FLATTEN effectively enhances consistency in appearance editing, it is not suitable for motion editing, where the optical flow of the edited video should not match that of the source video.

In comparison, the proposed UniEdit performs video editing in an end-to-end manner. To support motion editing, we do not rely on motion-alignment training or optical-flow guidance; instead, we perform motion editing via temporal feature injection and maintain temporal consistency by leveraging a pre-trained text-to-video generation model.
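To illustrate the idea of injecting features from an auxiliary branch into the main editing path, here is a minimal sketch based on PyTorch forward hooks. This is not the actual UniEdit code: the keyword used to locate temporal attention modules, the class names, and the assumption that the cached features have matching shapes are all placeholders that depend on the base T2V model.

```python
import torch
import torch.nn as nn

class FeatureInjector:
    """Caches temporal self-attention outputs from a reference branch and
    substitutes them for the corresponding outputs in the editing branch."""

    def __init__(self):
        self.cache = {}

    def record(self, name):
        def hook(module, inputs, output):
            # Store the reference-branch feature for this layer.
            self.cache[name] = output.detach()
        return hook

    def replace(self, name):
        def hook(module, inputs, output):
            # Returning a tensor from a forward hook overrides the output.
            # Shapes are assumed to match between the two branches.
            return self.cache.get(name, output)
        return hook

def register_hooks(ref_unet: nn.Module, edit_unet: nn.Module,
                   injector: FeatureInjector, keyword: str = "temporal_attn"):
    # Assumes temporal attention modules are identifiable by name; the
    # matching keyword depends on the base model's implementation.
    for name, module in ref_unet.named_modules():
        if keyword in name:
            module.register_forward_hook(injector.record(name))
    for name, module in edit_unet.named_modules():
        if keyword in name:
            module.register_forward_hook(injector.replace(name))
```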

[1] Ku, Max, et al. "Anyv2v: A plug-and-play framework for any video-to-video editing tasks." arXiv preprint arXiv:2403.14468 (2024).

[2] Ouyang, Wenqi, et al. "I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models." arXiv preprint arXiv:2405.16537 (2024).

[3] Cong, Yuren, et al. "FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing." The Twelfth International Conference on Learning Representations.

Metric Coverage on Flow Warping Error and Frame Variance Metric for Motion Detection. The evaluation dataset could include BalanceCC if possible.

Thank you for your suggestion! We have conducted additional experiments on 50 samples from miniBalanceCC [1] and included the Fréchet Video Motion Distance (FVMD) [2] as a measure of motion consistency. The results are shown in the table below:

| Method | FVMD↓ | Aesthetic Quality↑ | Imaging Quality↑ | Subject Consistency↑ | Motion Smoothness↑ | Temporal Flickering↑ |
| --- | --- | --- | --- | --- | --- | --- |
| TAV | 20602 | 55.95 | 59.59 | 88.94 | 91.84 | 89.20 |
| MasaCtrl∗ | 16230 | 54.33 | 61.47 | 92.47 | 97.88 | 95.39 |
| FateZero | 24339 | 53.07 | 64.27 | 89.81 | 94.71 | 92.11 |
| Rerender | 21503 | 51.72 | 57.80 | 89.53 | 96.64 | 94.75 |
| TokenFlow | 23798 | 54.86 | 66.78 | 92.21 | 95.64 | 93.77 |
| UniEdit | 14569 | 56.09 | 67.85 | 95.74 | 98.07 | 96.62 |

It can be observed that UniEdit achieves superior performance in terms of motion consistency, frame quality, and temporal quality. We have integrated these results into Tab. 7 in the appendix of the revised version.

Note that we did not compute the Flow Warping Error (proposed in [3]) because the code for this metric is not open-sourced and no detailed description of its computation is provided.
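For reference, FVMD compares motion features extracted from the two video sets via a Fréchet distance. Below is a minimal sketch of that distance computation; the motion-feature extractor described in [2] is assumed to be given and is not shown here.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: (num_samples, feature_dim) motion-feature matrices,
    e.g., produced by the extractor of [2] (assumed given).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    sigma_a = np.cov(feats_a, rowvar=False)
    sigma_b = np.cov(feats_b, rowvar=False)

    diff = mu_a - mu_b
    covmean, _ = linalg.sqrtm(sigma_a @ sigma_b, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    return float(diff @ diff + np.trace(sigma_a + sigma_b - 2.0 * covmean))
```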

[1] Feng, Ruoyu, et al. "CCEdit: Creative and controllable video editing via diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[2] Liu, Jiahe, et al. "Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos." First Workshop on Controllable Video Generation @ ICML 2024.

[3] Geyer, Michal, et al. "TokenFlow: Consistent Diffusion Features for Consistent Video Editing." The Twelfth International Conference on Learning Representations.

Sensitivity Analysis can be more detailed:

Thanks for the advice! We provide a detailed sensitivity analysis in Section C on the project page (https://uni-edit.github.io/UniEdit/#sectionC).

Comment

The related works are primarily categorized by broad topics (e.g., video generation, video editing) rather than by the specific innovations relevant to UniEdit (e.g., tuning-free frameworks, self-attention manipulation). This top-down organization makes it harder to discern how UniEdit differentiates itself from existing methods.

Thanks for the advice! In the revised PDF of the main text, we have reorganized the related works section by categorizing video editing techniques into "tuning-free appearance editing", "training-based appearance editing", and "motion editing". We have also added discussions of more recent works, including AnyV2V, I2VEdit, FLATTEN, and ReVideo.

Comment

Thanks for the rebuttal; the majority of my concerns have been addressed, so I have raised my rating to 8.

Comment

Thanks again for your detailed review and your appreciation of our work! We are committed to continuously improving our work and will incorporate these results into the final version.

Comment

We thank all the reviewers for their time and thoughtful feedback. We have revised our manuscript and submitted an updated version. We have also added the video results corresponding to this rebuttal to our anonymous project page (https://uni-edit.github.io/UniEdit/#sectionRebuttal) for reference.

Project Page

  1. Figure A: Adding enhanced results of UniEdit with mask-guided coordination on challenging scenarios (Reviewer u6DS).
  2. Figure B: Adding comparison with ReVideo [1] (Reviewer u6DS, Reviewer eXku).
  3. Figure C: Adding enhanced results of UniEdit with improved base T2V generation model (Reviewer vcxP, Reviewer eXku).
  4. Figure D: Adding enhanced results of UniEdit with mask-guided coordination (Reviewer vcxP).

Main Paper

  1. We update the user study results with more participants (Reviewer iKZP, Reviewer vcxP).
  2. We reorganize and revise the related works section to make it clearer and add a detailed discussion of recent works (Reviewer u6DS).
  3. We fix some typos.

Appendix

  1. We include the Fréchet Video Motion Distance (FVMD) [2] in Tab. 7 to evaluate motion consistency (Reviewer u6DS).

[1] Mou, Chong, et al. "ReVideo: Remake a Video with Motion and Content Control." arXiv preprint arXiv:2405.13865 (2024).

[2] Liu, Jiahe, et al. "Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos." First Workshop on Controllable Video Generation@ ICML24.

Comment

Dear Reviewers,

Thanks again for the constructive comments and the time you have dedicated to the paper! Many improvements have been made according to your suggestions and the other reviewers' comments. We have also updated the manuscript and summarized the changes below. We hope that our responses and the revision address your concerns.

Since the discussion period is about to close, we would be grateful if you could let us know of any remaining concerns and whether we can further assist in clarifying any issues. We are glad to follow up on any further comments.

Thanks again, and with our sincerest best wishes,

Submission 5877 Authors

AC Meta-Review

The paper presents a tuning-free framework for video motion and appearance editing with a pre-trained text-to-video base model. A primary objective of the editing process is to change the motion while preserving source video content. This goal is accomplished by adding auxiliary branches for motion reference and reconstruction, which produce text-guided motion and source features. These features are then integrated into the temporal and spatial self-attention layers to control inter-frame and intra-frame dependencies.

The paper initially received mixed and somewhat negative ratings: 6, 6, 5, and 3. After the rebuttal, the ratings became more varied with scores of 8, 5, 5, and 1. Despite the authors' efforts to address the reviewers' comments and provide additional results, three of the reviewers continued to express concerns regarding the novelty and technical contributions of the work. The area chair reviewed the paper and found the reviewers' arguments somewhat convincing. Consequently, the area chair did not identify strong reasons to overrule the reviewers' decision and regretfully recommended rejecting the paper.

Additional Comments from the Reviewer Discussion

  • It should be noted that the area chair did not consider the lack of a reference to the NeurIPS'24 paper "ReVideo" to be an issue, since the ICLR submission deadline was earlier.

  • In the original submission, the authors made some claims that were too broad, such as stating that they could "better tackle source video with large dynamics", that "UniEdit represents a pioneering leap in text-guided, tuning-free video motion editing", and that "UniEdit is not limited to specific video diffusion models." While the authors attempted to clarify these points in their rebuttal, more detailed analysis and evaluation are needed to verify their explanations. If the effectiveness of the base model is a concern, conducting additional experiments across various settings would be beneficial (in addition to the qualitative results using VideoCrafter2 presented in Fig. 9 and those with the more powerful LaVie checkpoint shown on the project page).

  • The so-called "motion editing" is somewhat limited: it is closer to modifying the "action" and, given that the video clips are short, may be viewed as an animation style. A more general or practical objective of motion editing would be to control movement or trajectories.

Final Decision

Reject