PaperHub
Average rating: 4.7 / 10 (withdrawn) · 3 reviewers
Individual ratings: 6, 5, 3 (min 3, max 6, std 1.2)
Confidence: 2.7 · Correctness: 2.3 · Contribution: 2.3 · Presentation: 2.3
ICLR 2025

VideoHandles: Editing 3D Object Compositions in Videos Using Video Generative Priors

OpenReview · PDF
Submitted: 2024-09-13 · Updated: 2024-11-15
TL;DR

We propose a method for editing 3D object compositions in videos of static scenes with camera motion using video generative model priors.

Abstract

Keywords
Video Editing · Video Diffusion Models · DiT · 3D Transformation

Reviews and Discussion

Official Review
Rating: 6

This paper proposes a method to edit 3D object compositions in a video of a static scene. The editing is performed on the 3D reconstruction by lifting frame features and then projecting them back to each frame.

Strengths

  • The method is simple and the results look quite good.
  • The spatial attention maps capture not only the edited objects but also the semantic effects produced by the edits, indicating effective use of the video generative prior.

Weaknesses

  • The task is limited to videos of static scenes, whereas most real-world videos contain dynamic scenes. The method also appears to handle only very simple motion edits.
  • This work uses DUSt3R to obtain point clouds. However, DUSt3R directly stitches together the points corresponding to every pixel in every frame, so memory usage grows quickly with the number of frames (see the sketch after this list). Can the current method handle longer videos?
  • Since the approach uses a latent video model, each pixel in the noise corresponds to a patch in the original image. When the regions that need inpainting are randomly re-initialized in the noise, will this affect the image pixels at the edges of the mask?
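
To make the memory concern above concrete, here is a minimal back-of-the-envelope sketch (not from the paper) of how a DUSt3R-style per-pixel point cloud grows with the number of frames; the resolution, the xyz+rgb float32 layout, and the absence of any downsampling or deduplication are assumptions.

```python
def pointmap_memory_gb(num_frames: int, height: int = 512, width: int = 512,
                       values_per_point: int = 6, bytes_per_value: int = 4) -> float:
    """Rough memory estimate for a per-pixel point cloud.

    Assumes one 3D point per pixel per frame, each storing xyz + rgb
    as float32, with no deduplication or downsampling (assumption).
    """
    num_points = num_frames * height * width
    return num_points * values_per_point * bytes_per_value / 1024 ** 3

# Memory grows linearly with the number of frames:
for frames in (16, 64, 256, 1024):
    print(f"{frames:5d} frames -> {pointmap_memory_gb(frames):.2f} GB")
```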

Questions

The current results appear to be tested at low resolution. Can the method handle higher resolutions (1080p) effectively?

Official Review
Rating: 5

This paper proposes VideoHandles as a method for editing 3D object compositions in videos of static scenes with camera motion.

Strengths

  1. This work represents the first known generative approach specifically designed to edit object compositions in videos, filling a notable gap in the field and enabling greater flexibility and control in video object editing tasks.

  2. The use of self-attention-map-based weighting, along with null-text prediction in foreground regions, is a significant enhancement. These techniques appear to effectively improve the quality and precision of edits, allowing for finer detail and better contextual integration in complex scenes (an illustrative sketch follows this list).
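
To illustrate the kind of mechanism described above, here is a minimal sketch (not the paper's exact formulation) of blending a warped feature map with a null-text prediction using a normalized self-attention map as a per-location weight; the tensor shapes, the normalization, and the blending rule are assumptions.

```python
import torch

def attention_weighted_blend(warped_feat: torch.Tensor,
                             null_feat: torch.Tensor,
                             attn_map: torch.Tensor,
                             eps: float = 1e-6) -> torch.Tensor:
    """Blend warped features with a null-text prediction.

    warped_feat, null_feat: (B, C, H, W) intermediate features.
    attn_map: (B, H, W) self-attention saliency for the edited object.
    High-attention locations keep the warped (edited) features; low-attention
    locations fall back to the null-text prediction. Illustrative only.
    """
    w = attn_map - attn_map.amin(dim=(1, 2), keepdim=True)
    w = w / (w.amax(dim=(1, 2), keepdim=True) + eps)   # normalize to [0, 1]
    w = w.unsqueeze(1)                                  # (B, 1, H, W)
    return w * warped_feat + (1.0 - w) * null_feat
```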

Weaknesses

Scalability to Multi-Object and Complex Background Scenarios
The examples presented primarily involve single objects with slight viewpoint variations. It would be valuable to see how the proposed method performs with multiple objects in complex backgrounds. Clarifications on handling object interactions and background consistency in such scenarios would enhance the assessment of the method's robustness.

Applicability in In-the-Wild Environments
Most of the experiments are conducted on synthetic data (27 generated videos and only 2 real-world videos). To assess the generalization capabilities of the proposed method, it would be helpful to see additional evaluations in more diverse, in-the-wild scenes. Real-world settings typically introduce challenges like lighting variations, occlusion, and dynamic backgrounds, so further experimentation in these contexts could strengthen the evaluation.

Comparative Analysis with Drag-Based Image Editing Approaches
There are alternative approaches in the literature, such as extending drag-based image editing [1] to video contexts by integrating temporal-consistency modules (e.g., techniques like AnimateDiff [2]). A comparison between these methods and the proposed approach could provide insights into their respective strengths, particularly in terms of temporal coherence and frame consistency. Including such comparisons would offer a broader perspective on the landscape of video object editing methods.

[1] Shi, Yujun, et al. "InstaDrag: Lightning Fast and Accurate Drag-based Image Editing Emerging from Videos." arXiv preprint arXiv:2405.13722 (2024).

[2] Guo, Yuwei, et al. "Animatediff: Animate your personalized text-to-image diffusion models without specific tuning." arXiv preprint arXiv:2307.04725 (2023).

Questions

Previous methods have utilized compositional 3D scene generation for generating and editing videos [1, 2]. It would be insightful to discuss the advantages and limitations of this approach compared to the proposed pipeline. Specifically, a comparison on factors such as video quality, viewpoint variation, and adaptability to different types of scenarios would provide a clearer understanding of each method's strengths and trade-offs. This analysis could help identify scenarios where the proposed pipeline excels and areas where compositional 3D generation might offer more flexibility or straightforward application.

[1] Xu Y, Chai M, Shi Z, et al. Discoscene: Spatially disentangled generative radiance fields for controllable 3d-aware scene synthesis[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023: 4402-4412.

[2] Zhang X, Zheng Z, Gao D, et al. Multi-view consistent generative adversarial networks for compositional 3d-aware image synthesis[J]. International Journal of Computer Vision, 2023, 131(8): 2219-2242.

Official Review
Rating: 3

This paper studies video editing, specifically 3D transformation of objects in videos, and focuses only on object composition in static scenes. The key idea is to lift the generated video to a 3D representation, warp the intermediate features with the 3D transformation, and then re-render the video. The method is simple and training-free. The video results are promising and reasonable, and user studies and ablation studies are also conducted.
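
As a reading aid, the pipeline summarized above (lift, warp, re-render) can be sketched roughly as follows; all callables and the sampling loop are hypothetical stand-ins for the paper's actual components, not its implementation.

```python
import torch

def edit_video_composition(video, transform_3d, *, reconstruct_3d,
                           lift, warp, project, denoise_step, num_steps=50):
    """Sketch of the training-free editing pipeline summarized above.

    All callables (reconstruct_3d, lift, warp, project, denoise_step) are
    hypothetical stand-ins for the actual components, e.g. a DUSt3R-style
    reconstruction and the layers of a video diffusion model.
    """
    # 1. Lift the input video to a 3D representation (point cloud).
    points = reconstruct_3d(video)

    # 2. Run the diffusion sampler; at each step, lift the intermediate
    #    features to 3D, apply the user's 3D transform, and project the
    #    warped features back to every frame to guide generation.
    latents = torch.randn(video.shape)
    for t in reversed(range(num_steps)):
        feats = lift(latents, points)            # per-frame features -> 3D
        warped = warp(feats, transform_3d)       # apply the 3D transformation
        guidance = project(warped, points)       # 3D features -> per-frame
        latents = denoise_step(latents, t, guidance)
    return latents
```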

Strengths

  • The proposed method is conceptually straightforward yet robust, leveraging a generative model's intermediate features to achieve 3D object composition.
  • A notable advantage of this method is that it is training-free, making it adaptable to various mainstream video generative models without additional computational overhead.
  • The qualitative video results are promising, and the proposed method outperforms several baselines.
  • The user studies and ablation experiments provide valuable insights into the method’s effectiveness, highlighting its advantages with concrete evidence.
  • Writing is mostly clear and easy to follow.

Weaknesses

  • While the training-free nature is appealing for simplicity and efficiency, I am uncertain whether this is optimal given the current generation quality. Introducing trainable components could potentially improve the results without significantly increasing inference latency, which might ultimately be more desirable.
  • The visual results reveal noticeable artifacts in generated objects, particularly during 3D rotations. Both the realism and 3D consistency are limited, suggesting that further refinement is necessary for achieving high-fidelity, artifact-free edits.
  • The results primarily showcase edits on relatively simple objects with basic colors and textures. It remains unclear how well the proposed method generalizes to more complex objects with intricate details or textures. Examples with varied object complexity would be beneficial to assess generalization capabilities.
  • The related work section largely focuses on recent publications from 2023 and 2024. However, image and video editing have a rich history with foundational contributions across various fine-grained editing tasks. Including classic works could provide a more comprehensive context, benefiting readers’ understanding of the field's evolution and relevant advancements.
  • It is unclear to me how this method deals with consistency issues after the objects are transformed: for example, how it ensures that holes are filled in a consistent way, and that the unseen side of the object is consistent with the visible side.
  • The effectiveness of this approach under conditions of rapid camera movement is unclear. Evaluating the method’s performance in scenarios with fast camera motion would provide insight into its robustness and usability in diverse video settings.

Questions

  • It is unclear why the pre-trained OpenSora model was fine-tuned on the RealEstate10K dataset, given that the fine-tuning appears to degrade quality rather than enhance it. From the results, the original OpenSora model seems better suited for generating complex objects and scenes. Testing the proposed method with the original OpenSora model on more varied objects and scenes could provide a clearer assessment of its capabilities.
  • The statement in L265, “We empirically found that the features from the temporal self-attention layers tend to produce global changes,” lacks supporting experimental evidence. Simply stating this finding without providing experimental results is insufficient. Including relevant quantitative or visual evidence would substantiate this observation and strengthen the paper’s claims.
  • Since the intermediate representation is central to the paper's contributions, there may be potential to improve the way these features are combined. Rather than relying on simple concatenation, exploring more sophisticated approaches (e.g., learnable fusion mechanisms) could enhance the method's effectiveness and make better use of the generative model's intermediate features (a small illustrative sketch follows this list).
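
To illustrate the suggestion above, here is a minimal sketch of a learnable gated-fusion module that could replace simple concatenation of two intermediate feature maps; the module design and shapes are assumptions, not part of the paper.

```python
import torch
import torch.nn as nn

class GatedFeatureFusion(nn.Module):
    """Learnable alternative to concatenating two feature maps.

    Predicts a per-location gate from both inputs and blends them
    instead of stacking channels. Purely illustrative; not the
    paper's architecture.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([feat_a, feat_b], dim=1))  # (B, C, H, W) in [0, 1]
        return g * feat_a + (1.0 - g) * feat_b

# Usage: fuse = GatedFeatureFusion(channels=320)
#        fused = fuse(warped_features, original_features)
```
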
Withdrawal Notice

We thank the reviewers for their valuable comments. After careful consideration, we think our paper is inappropriate for ICLR and decided to withdraw our paper.