PaperHub
Overall rating: 5.5 / 10
Poster · 4 reviewers
Reviewer scores: 3, 2, 3, 4 (min 2, max 4, std 0.7)
ICML 2025

Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We present Cavia, the first framework that enables users to generate multiple videos of the same scene with precise control over camera motion, while simultaneously preserving object motion.

Abstract

Keywords
Video Generation · Camera Control · 3D Consistency

Reviews and Discussion

Review (Rating: 3)

The authors propose a novel framework for image-to-multi-view video generation with controllable cameras. To achieve this, they design a flexible multi-frame/multi-view attention module, allowing for joint training on static scene videos, monocular videos, and multi-view dynamic scene videos. Experiments on monocular and multi-view video generation validate the 3D consistency and temporal consistency of the proposed method.

Questions for Authors

The training strategy for static scene videos could be explained in more detail.

Claims and Evidence

Not all.

  1. The authors claim they focus on multi-view video generation; however, most of the multi-view experiments are conducted on 2 views with small camera angle changes, which is confusing. 4-view experiments are provided in the supplementary material but only qualitatively, and the camera angle change there is still not evident.
  2. The joint training strategy for static videos is confusing. The authors expand them with F-1 frames, but their architecture design allows for only one-frame training. I don't know whether the authors mean that the V cameras move separately during the F-1 frames, or that they simply repeat the first frame for F-1 frames.
  3. Lack of comparison with state-of-the-art methods for camera controllable video generation and multi-view video generation.

Methods and Evaluation Criteria

Not all; please see the experiments section.

Theoretical Claims

N/A

Experimental Design and Analysis

  1. Monocular video generation: for 3D consistency, it is suggested to compare with ViewCrafter, following its evaluation pipeline.
  2. Multi-view video generation: 2-view video generation with limited camera change is not convincing. Does the training data also contain such small camera changes? The authors are also encouraged to increase the number of views and conduct more experiments. In addition, the compared methods, MotionCtrl and CameraCtrl, are not designed for multi-view video generation, so the comparison is unfair. It is suggested to compare with SynCamMaster.
  3. It is suggested to move the comparison with CVD to the main text and conduct quantitative experiments under its evaluation pipeline.

Supplementary Material

Yes, I read the appendix and supplementary html.

Relation to Prior Work

N/A

Missing Essential References

N/A

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

I'm willing to increase my score if the authors address my concerns in response.

Author Response

We thank the reviewer T9u5 for their detailed comments and constructive suggestions. However, we respectfully disagree with the reviewer's concerns regarding our contribution. In response, we have added additional comparisons against concurrent works ViewCrafter and CVD, which are available on our anonymous webpage: https://cavia2025.github.io/. These new results clearly demonstrate that our proposed method, Cavia, significantly outperforms both ViewCrafter (Fig. A) and CVD (Fig. B). We address the reviewer's questions in detail below.

Q1. Multi-view experiments are mainly conducted on 2 views with small camera angle change, and the camera angle change in the 4-view experiments is still not evident.

A1. We emphasize that camera control, the factor the reviewer prioritizes, is precisely where Cavia excels. Cavia is the first to enable multi-video generation with precise camera motion control while simultaneously preserving object motion. Concurrent works cited by the reviewer fail to achieve comparable quality. Specifically, ViewCrafter only generates static scenes, SynCamMaster is restricted to fixed-viewpoint videos, and CVD suffers from poor pixel quality with severe morphing artifacts.

Q2. 2-view video generation with limited camera change is not convincing. Does the training data also contain such small camera changes? MotionCtrl and CameraCtrl are not designed for multi-view video generation, so the comparison is unfair. It is suggested to compare with SynCamMaster.

A2. As we explain above, SynCamMaster only supports video generation with fixed viewpoints, allowing no camera movement. In contrast, Cavia enables precise camera control while preserving object motion, a challenging and underexplored capability. While MotionCtrl and CameraCtrl are monocular methods, they are the closest relevant baselines as they also target precise camera control. Additionally, Cavia builds upon SVD, which supports video generation up to 14 frames, leading to naturally limited camera ranges. Extending this to larger ranges with a more powerful long-context base video generation model is an exciting direction for future work.

We provide a detailed comparison with SynCamMaster in our response to reviewer vi5f’s Q2 (Tab. A). However, we note that SynCamMaster's official GitHub repository contains only boilerplate code, and the absence of essential components such as the model checkpoint and architecture makes it impossible to compare with SynCamMaster fairly.

Q3. The joint training strategy for static videos is confusing. I don't know whether the authors mean that the V cameras move separately during the F-1 frames, or that they simply repeat the first frame for F-1 frames.

A3. We do not perform single-frame training. Our "static" videos refer to sequences of static scenes that contain no dynamic objects. During training, our Cavia framework consistently accepts F frames, each with a distinct camera matrix.
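To make the per-frame camera annotation concrete, below is a minimal, hypothetical sketch (in Python/NumPy) of how such a training sample could be assembled; the function `make_training_sample` and its tensor layout are illustrative assumptions, not the authors' actual data pipeline. The point is that a static-scene clip uses exactly the same layout as a dynamic one: V views, F frames, and one camera matrix per view and frame.

```python
import numpy as np

def make_training_sample(clips, extrinsics, intrinsics):
    """Hypothetical sample layout: V clips of F frames with per-frame poses.

    clips:      list of V arrays, each of shape (F, H, W, 3), RGB frames
    extrinsics: array of shape (V, F, 4, 4), one camera-to-world matrix per
                view and frame (static-scene clips differ only in content)
    intrinsics: array of shape (V, 3, 3)
    """
    video = np.stack([np.asarray(c, dtype=np.float32) for c in clips])  # (V, F, H, W, 3)
    assert video.shape[:2] == extrinsics.shape[:2], "one pose per view and frame"
    return {"video": video, "extrinsics": extrinsics, "intrinsics": intrinsics}
```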

Q4. For 3D consistency of monocular video generation, it is suggested to compare with ViewCrafter.

A4. We have included comparisons against ViewCrafter in Fig. A on https://cavia2025.github.io/#FigA . ViewCrafter primarily generates videos of static 3D scenes, likely due to its reliance on 3D point clouds. We evaluate ViewCrafter using test images and trajectories from the RealEstate10K dataset and observe that it often produces frames with noticeable color and lighting artifacts, as well as geometric distortions. In contrast, Cavia achieves visually pleasing results with accurate geometry. Furthermore, ViewCrafter requires approximately 4 minutes to generate a video sequence, whereas Cavia produces a set of 2-view videos in just 14 seconds.

Q5. It is suggested to move the comparison with CVD to the main text and conduct quantitative experiments under its evaluation pipeline.

A5. Thank you for the suggestion. We initially placed CVD’s results in the appendix due to space constraints but have now moved them to the main text. We have also provided additional comparisons using CVD’s recently released official codebase. The results can be found in Fig. B on https://cavia2025.github.io/#FigB . Our comparisons show that CVD suffers from severe morphing artifacts and unnatural object motion. It also fails to follow text prompt instructions and overlooks important details. In contrast, Cavia produces outputs with greater geometric consistency and more natural object motion.

For quantitative evaluation, we not only adopt SuperGlue, following CameraCtrl and CVD, but also incorporate COLMAP, a widely used tool in the 3D reconstruction community for estimating camera poses. COLMAP shares the same purpose as SuperGlue by measuring camera pose accuracy but is more effective in identifying morphing artifacts, as it focuses on global geometry rather than just local feature matching. Our extensive quantitative results in Tab. 1 and 2 demonstrate that Cavia significantly outperforms existing camera-control video generation methods.

Reviewer Comment

Thanks for the effort of the authors. My main concern still lies in experiments. Could the authors provide quantitative comparisons with ViewCrafter/SyncCamMaster/CVD under their evaluation settings? Now only qualitative comparisons are provided.

Author Comment

Thank you for your comments. We’re glad that our rebuttal has resolved your other concerns.

As suggested, we conducted additional quantitative comparisons with CVD using their evaluation protocol. We reached out to the authors of CVD, who kindly shared the implementation details of their evaluation setup. The table below presents results on 100 samples, using the same metrics as in the CVD paper. We use SuperGlue to assess the pose accuracy of each generated frame relative to the first frame. For all metrics, higher values indicate better performance.

As shown, Cavia outperforms CVD by a significant margin. These results are consistent with our qualitative comparisons (Fig. 9 and Fig. B) and 3D reconstruction analysis (Fig. 10).

| Method | AUC-Rot@5 ↑ | AUC-Rot@10 ↑ | AUC-Rot@20 ↑ | AUC-Trans@5 ↑ | AUC-Trans@10 ↑ | AUC-Trans@20 ↑ | Prec ↑ | MScore ↑ |
|---|---|---|---|---|---|---|---|---|
| Cavia | 6.36 | 17.11 | 37.56 | 4.13 | 8.42 | 18.18 | 10.19 | 6.70 |
| CVD | 7.17 | 16.76 | 33.31 | 2.59 | 4.93 | 9.86 | 5.68 | 3.29 |
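The AUC values above follow the pose-AUC protocol popularized by SuperGlue (area under the cumulative pose-error curve up to an angular threshold). Below is a minimal sketch of that metric, assuming per-frame rotation or translation angular errors (in degrees, relative to the first frame) have already been estimated; the function name and aggregation details are our assumptions, not necessarily CVD's released evaluation code.

```python
import numpy as np

def pose_auc(errors_deg, thresholds=(5, 10, 20)):
    """Area under the cumulative pose-error curve, per threshold (in %)."""
    errors = np.sort(np.asarray(errors_deg, dtype=np.float64))
    recall = (np.arange(len(errors)) + 1) / len(errors)   # cumulative fraction of samples
    errors = np.concatenate(([0.0], errors))
    recall = np.concatenate(([0.0], recall))
    aucs = []
    for t in thresholds:
        last = np.searchsorted(errors, t)                  # errors[0] = 0, so last >= 1
        r = np.concatenate((recall[:last], [recall[last - 1]]))
        e = np.concatenate((errors[:last], [t]))
        # trapezoidal area under recall(error), normalized by the threshold
        area = np.sum(0.5 * (r[1:] + r[:-1]) * (e[1:] - e[:-1]))
        aucs.append(100.0 * area / t)
    return aucs

# Example: pose_auc([2.0, 7.5, 30.0]) returns one AUC value each for 5, 10, 20 degrees.
```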
Review (Rating: 2)

This paper introduced a multi-view video diffusion model enhanced by view-integrated attention, called Cavia. Specifically, Cavia used cross-view attention and cross-frame attention to ensure multi-view and temporal consistency, respectively. This model design also enabled Cavia to train jointly with diverse datasets. The paper also included detailed data curation workflows and ablation studies to support re-implementation and demonstrate effectiveness.

Questions for Authors

The questions mainly concern the camera pose metrics, the discussion of differences from SynCamMaster, and the capacity of Cavia to address large viewpoint changes, as mentioned above.

Claims and Evidence

Cavia claims that it can generate multi-view videos with precise camera control and object motion, but all results in the paper (especially the multi-view videos) show only small camera motion changes. It is still questionable whether Cavia can generate multi-view videos with large viewpoint changes like SynCamMaster.

Methods and Evaluation Criteria

Most experiments make sense. Some questions remain regarding the camera metric. The authors claim that they normalize the camera pose scales in this paper, unlike previous work (CameraCtrl). However, the widely used camera metrics (absolute pose error (APE) and relative pose error (RPE)) already include Umeyama alignment to handle the normalization of camera pose scale. Why not include these metrics?

Theoretical Claims

No theoretical claims are in this paper.

Experimental Design and Analysis

Yes.

Supplementary Material

Yes. Ablation and other extensive experiments parts.

Relation to Prior Work

The key contributions are joint training and cross-view/cross-frame attention. However, the novelty of these components is limited, as they have already been proposed by SynCamMaster [ICLR 2025].

Missing Essential References

This paper did not discuss a very related work, SynCamMaster, published at ICLR 2025. Considering that SynCamMaster's publication date is very close to the ICML deadline, the authors are under no obligation to compare against it. But this paper should still discuss and clarify the differences from SynCamMaster.

Other Strengths and Weaknesses

In my opinion, this paper should discuss and clarify its distinct contributions compared to SynCamMaster, which also includes similar view attention, 3D attention, and joint training on multi-view data, multi-view videos, and general videos. Another concern is the limitation of Cavia in addressing multi-view video generation with large viewpoint changes.

Other Comments or Suggestions

Some words are repeated in the abstract, for example, "to our best knowledge" and "to the best of our knowledge".

Author Response

We thank the reviewer vi5f for their valuable effort and for recognizing the strength of our experimental results. However, we respectfully disagree with the concerns regarding "large viewpoint changes like SynCamMaster." We would like to clarify that the concurrent work SynCamMaster is limited to generating fixed-viewpoint videos, whereas our proposed method, Cavia, enables precise camera control for each individual frame under a multi-view video generation setting.

Q1. Why not include absolute pose error (APE) and relative pose error (RPE) metrics?

A1. APE and RPE combine translation and rotation errors into a single value. However, following common practices in camera-controllable video generation (e.g., CameraCtrl, Collaborative Video Diffusion), we evaluate translation and rotation errors separately. Our chosen metrics align with the spirit of RPE, as they also measure relative differences between camera matrices but allow for a more detailed assessment.
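For completeness, below is a minimal sketch of how rotation and translation errors between two camera matrices can be measured separately; the helper and its conventions (camera-to-world matrices, translation compared by direction so that the result is scale-invariant) are our own illustration, not the paper's exact implementation.

```python
import numpy as np

def relative_pose_errors(pred_c2w, gt_c2w):
    """Rotation error (deg) and translation-direction error (deg) between a
    predicted and a ground-truth camera-to-world matrix (4x4 each)."""
    # Rotation: geodesic angle of the relative rotation.
    R_rel = pred_c2w[:3, :3].T @ gt_c2w[:3, :3]
    cos_r = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    rot_err = np.degrees(np.arccos(cos_r))

    # Translation: angle between translation directions (scale-invariant).
    t_pred, t_gt = pred_c2w[:3, 3], gt_c2w[:3, 3]
    denom = np.linalg.norm(t_pred) * np.linalg.norm(t_gt)
    cos_t = 1.0 if denom < 1e-8 else np.clip(float(t_pred @ t_gt) / denom, -1.0, 1.0)
    trans_err = np.degrees(np.arccos(cos_t))
    return rot_err, trans_err
```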

Q2. The paper should discuss and clarify the differences from SynCamMaster [ICLR 2025]. It is still questionable whether Cavia can generate multi-view videos with large viewpoint changes like SynCamMaster.

A2. We appreciate the reviewer's suggestion and will cite and discuss SynCamMaster accordingly. However, SynCamMaster is designed for generating videos with fixed camera viewpoints and does not support dynamic viewpoint changes. In contrast, Cavia provides precise per-frame camera control in multi-video generation scenarios.

The table below (Tab. A) provides a detailed comparison:

| Aspect | SynCamMaster | Cavia |
|---|---|---|
| Task | Text-to-video generation | Image-to-video generation |
| Control | Generates multi-view videos with fixed cameras; each video is viewpoint-frozen | Supports precise camera control for each frame across multiple videos |
| Base Model | Internal text-to-video model (KLing's team, unpublished) | Uses SVD; achieves comparable object motion to SVD (see Fig. 6) |
| Method | Focuses on static-camera videos; trains only cross-view synchronization modules | Supports frame-wise camera control; trains both cross-view and cross-frame attention modules to ensure spatial-temporal consistency with dynamic camera control |
| Data | Both SynCamMaster and Cavia use data from 4D synthetic assets, 3D static scenes, and monocular videos; SynCamMaster collects static-camera videos for monocular training | Employs a curated pipeline to collect high-quality monocular videos with accurate camera pose annotations, facilitating precise per-frame camera control |
| Joint Training Strategy | Requires copying each monocular video V times and setting the same camera parameters across views | More compute-efficient; avoids data copying; flexible cross-view attention enables training/inference on arbitrary numbers of views |
| Viewpoint Changes | Fixed viewpoint; cannot handle viewpoint changes | Allows precise viewpoint control for each frame in multi-video generation |

Q3. Limitation of Cavia to address multi-view video generation with large viewpoint changes.

A3. The current performance of Cavia is constrained by the limited context length of SVD (14 frames). However, Cavia's framework is general and scalable, enabling the generation of longer videos with larger viewpoint variations when paired with a stronger foundation model. It is worth emphasizing that SynCamMaster entirely lacks the ability to handle viewpoint changes. In contrast, Cavia is explicitly designed to address this limitation, directly responding to the reviewer's primary concern.

Reviewer Comment

Thank you for the rebuttal. I appreciate that some of my concerns have been addressed. However, I still have reservations regarding Cavia's capacity to handle significant viewpoint changes. Notably, the camera pose movements in all qualitative results presented in the paper appear quite subtle.

While the authors attribute this limitation to the constraints of SVD, the generalization and scalability of the proposed methods have not been rigorously validated within this study. Therefore, the per-frame camera control character of Cavia, as mentioned in Table A, may potentially restrict its ability to generalize to larger viewpoint changes.

Author Comment

Thank you for your comments. We’re thrilled that our rebuttal has resolved your concerns.

We’d like to kindly remind you that though the tested camera pose movements may not appear significant yet, they are complex and compositional. More importantly, no other work achieves comparable performance. In particular, the concurrent work you highlighted, SynCamMaster, is not able to produce any viewpoint change. In contrast, we have systematically evaluated the generalization capabilities of our method both quantitatively (Tab. 1, 2, 3) and qualitatively (Fig. 2, 3) on testing images and camera trajectories that are unseen during training.

Regarding your thoughts on the importance of camera control, we completely agree. Precise camera pose control is exactly the focus of our work. Due to the limited sequence length of SVD (14 frames), we prioritized precision over range, resulting in the most accurate camera control framework currently available. We show in Tab. 1, 2, and 3 that our framework consistently outperforms all existing approaches aimed at precise camera control.

As you noted, "the per-frame camera control character of Cavia may potentially restrict larger viewpoint changes." While this is true, SynCamMaster, even with in-house video models, is restricted to fixed viewpoints with no ability to handle any viewpoint changes. Our proposed modules in Cavia are general and would benefit greatly from stronger foundation models. If we had access to powerful video models like Kling, the performance could be further improved. Please note that when we conducted the Cavia project in June 2024, SVD was the only publicly available image-to-video generation model.

In summary, Cavia represents the state of the art for the challenging task of per-frame precise camera control. As agreed by other reviewers, no existing method matches its ability to generate multiple videos of the same scene with accurate control over camera motion, while consistently preserving object motion.

Review (Rating: 3)

This paper proposed a novel framework for camera-controlled multi-view video generation.

Based on SVD, it proposes to use 3D attention in both frame attention and view attention to ensure spatio-temporal consistency. In addition, it curated a mixed dataset from a lot of real and synthetic datasets for the model training. The experiments demonstrate the superior performance of the proposed methods.

Questions for Authors

Will the authors release the model and the curated datasets?

Claims and Evidence

The claims are supported by convincing evidence.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria make sense.

Theoretical Claims

No theoretical claim.

Experimental Design and Analysis

I checked the experiments and ablative experiments. I found some baseline details are missing, e.g.,

  1. When comparing to SVD in Table 1 and 2, how did the authors achieve camera pose controllability in SVD?
  2. When comparing to MotionCtrl and CameraCtrl in Table 2, how did the authors implement 2-view video generation? Just run each of them twice separately?

Supplementary Material

I reviewed the website provided in the supplementary material.

Relation to Prior Work

This work contributes a lot to the multi-view video generation with camera control by proposing 3D attention for both spatial and temporal attention and mixed datasets for training.

Missing Essential References

No

Other Strengths and Weaknesses

Strengths:

  1. The idea of using 3D attention in both spatial and temporal attention is interesting
  2. The curated mixed datasets are beneficial for the following works

Weakness:

  1. As discussed in Experimental Designs Or Analyses, it would be better if more baseline details were provided.
  2. Using 3D attention in the spatial and temporal modules could increase the runtime and memory requirements compared to baselines. A detailed speed analysis and memory requirement report are needed.
  3. In the provided results, I find the performance is still limited, partially due to the highly challenging task. Some visual issues are: a. The motion magnitude in the results is still very small. b. Camera trajectories are simple. Not sure if the model can support 360-degree views.

update after rebuttal

Thanks for the rebuttal and the additional experiments. Overall, the primary concern remains the method’s ability to handle large viewpoint changes in dynamic scenes. While direct comparisons with ViewCrafter and SyncCamMaster could be informative, I understand that these works are concurrent or not officially published, so such comparisons are not required.

Taking into account both the strengths and limitations of the paper, I will maintain my original score.

Other Comments or Suggestions

Some concurrent works can be discussed:

Sun, W., Chen, S., Liu, F., Chen, Z., Duan, Y., Zhang, J., & Wang, Y. (2024). Dimensionx: Create any 3d and 4d scenes from a single image with controllable video diffusion. arXiv preprint arXiv:2411.04928.

Wu, R., Gao, R., Poole, B., Trevithick, A., Zheng, C., Barron, J. T., & Holynski, A. (2024). Cat4d: Create anything in 4d with multi-view video diffusion models. arXiv preprint arXiv:2411.18613.

Bai, J., Xia, M., Wang, X., Yuan, Z., Fu, X., Liu, Z., ... & Zhang, D. (2024). SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints. arXiv preprint arXiv:2412.07760.

Zhao, Y., Lin, C. C., Lin, K., Yan, Z., Li, L., Yang, Z., ... & Wang, L. (2024). Genxd: Generating any 3d and 4d scenes. arXiv preprint arXiv:2411.02319.

Author Response

We thank reviewer KtKL for the detailed comments and for recognizing our strong performance compared to existing works. Below are our detailed responses to the questions raised.

Q1. Baseline details are missing: 1) How did SVD achieve camera pose? 2) How are MotionCtrl and CameraCtrl implemented for 2-view video generation?

A1. Thank you for this insightful question. We clarify that the SVD results in our experiments are obtained from the vanilla SVD model without any modification; thus, SVD does not offer camera pose controllability. We include SVD's results solely to showcase the base model's motion generation capability, with additional visualizations provided in Fig. 6.

For 2-view comparisons involving MotionCtrl and CameraCtrl, we independently run each view with the corresponding camera poses, as these methods are originally designed for monocular video generation.

Q2. Speed and memory report on the 3D attention is needed.

A2. Indeed, the proposed 3D attention introduces additional computational overhead due to the extended sequence length. However, thanks to the 8× compression of SVD's latent space, our attention operates on latent features of size 14×32×32 for a 256×256 video. As shown in the table below, the increased cost in speed and memory remains acceptable.

| Cross-View Attention | Cross-Frame Attention | Max sequence length in attention | Inference time / step | Training speed | Training memory |
|---|---|---|---|---|---|
| No | No | 32×32 | 0.16 s | 1.68 it/s | 32.36 GB |
| Yes | No | 2×32×32 | 0.18 s | 1.66 it/s | 32.37 GB |
| No | Yes | 14×32×32 | 0.26 s | 1.62 it/s | 33.85 GB |
| Yes | Yes | 14×32×32 | 0.28 s | 1.52 it/s | 34.14 GB |
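The sequence lengths in the table above follow directly from how the two attention modules group latent tokens. The PyTorch sketch below illustrates that grouping under our own assumptions about the tensor layout (a latent of shape (B, V, F, C, H, W) and standard multi-head self-attention); it is not the paper's actual implementation. Cross-view attention attends over V·H·W tokens per frame (e.g., 2×32×32), while cross-frame attention attends over F·H·W tokens per view (e.g., 14×32×32).

```python
import torch
import torch.nn as nn

class ViewIntegratedAttentionSketch(nn.Module):
    """Illustrative only: self-attention across views, then across frames."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.cross_view = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_frame = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, V, F, C, H, W)
        B, V, F, C, H, W = x.shape

        # Cross-view: tokens of the same frame from all views attend to each
        # other -> sequence length V*H*W (e.g., 2*32*32 for 2-view videos).
        t = x.permute(0, 2, 1, 4, 5, 3).reshape(B * F, V * H * W, C)
        t = t + self.cross_view(t, t, t)[0]
        x = t.reshape(B, F, V, H, W, C).permute(0, 2, 1, 5, 3, 4)

        # Cross-frame: tokens of the same view from all frames attend to each
        # other -> sequence length F*H*W (e.g., 14*32*32 for 14-frame SVD).
        t = x.permute(0, 1, 2, 4, 5, 3).reshape(B * V, F * H * W, C)
        t = t + self.cross_frame(t, t, t)[0]
        return t.reshape(B, V, F, H, W, C).permute(0, 1, 2, 5, 3, 4)
```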

Q3. Performance is limited for this highly challenging task. Some visual issues are: a. The motion magnitude in the results is still very small. b. Camera trajectories are simple. Not sure if the model can support 360-degree views.

A3. We wish to emphasize that the factors you prioritize most, such as object motion strength and camera trajectory complexity, are precisely the areas in which Cavia significantly outperforms prior works.

  • Object motion strength: SVD's results serve as a reference for motion generation capabilities, with additional visualizations in Fig. 6. More importantly, existing camera-controllable video generation methods, such as MotionCtrl and CameraCtrl, are limited to generating videos of static scenes. In contrast, Cavia retains the ability to generate object motion comparable to the base SVD model.
  • Camera trajectory complexity: Extensive experiments in Fig. 3 and Tab. 1 demonstrate Cavia's superior camera control compared to state-of-the-art methods. However, the current version does not support 360-degree view generation due to the limited context length of SVD (14 frames). We plan to explore longer video generation with more capable base models in future work.

Q4. Some concurrent works can be discussed: DimensionX (arXiv:2411.04928), CAT4D (arXiv:2411.18613), SynCamMaster (arXiv:2412.07760), GenXD (arXiv:2411.02319).

A4. We will include and discuss these concurrent works in our revision.

  • DimensionX: Utilizes separate S-director and T-director modules to generate time-frozen and viewpoint-frozen videos independently. Cavia, instead, generates multi-view videos simultaneously, ensuring better spatial-temporal consistency.
  • CAT4D: Converts monocular videos into 4D scenes. In contrast, Cavia generates multi-view videos from a single image input.
  • SynCamMaster: Generates multi-view videos but only supports fixed cameras, resulting in viewpoint-frozen videos. Cavia enables precise camera control for each frame of multiple generated videos.
  • GenXD: Employs a masked video diffusion framework for camera-controllable monocular video generation. Cavia, however, targets multi-view video generation, which is inherently more challenging.

While these works are commendable, none address camera-controllable multi-view video generation. Cavia is the first framework to generate multiple videos of the same scene with precise per-frame camera control, while maintaining object motion quality comparable to SVD.

Q5. Will the authors release the model and the curated datasets?

A5. We are currently awaiting legal approval for the public release of the model and curated datasets. Meanwhile, we have provided sufficient details to facilitate the reproducibility of our model and dataset.

Reviewer Comment

I thank the author's rebuttal and additional results. I have read the reviews from other reviewers.

While this paper may lack some quantitative comparisons to methods such as ViewCrafter, SyncCamMaster, and CVD, I agree with Reviewer vi5f that ViewCrafter (arXiv) and SyncCamMaster (ICLR 2025) can be considered concurrent works, and thus the authors are not obligated to include direct comparisons.

However, to further strengthen the validation of the proposed method, the authors could consider adding quantitative comparisons with CVD (NeurIPS 2024, code released).

Author Comment

Thank you for supporting our submission and acknowledging that ViewCrafter (arXiv) and SynCamMaster (ICLR 2025) are concurrent works. We’d like to clarify that ViewCrafter only generates static scenes, while SynCamMaster produces videos from fixed viewpoints. Neither focuses on precise camera control for general image-to-video generation, which is the core of Cavia. In this sense, they are less directly related to Cavia than other works we’ve already compared, such as MotionCtrl, CameraCtrl, and CVD.

As suggested, we conducted additional quantitative comparisons between Cavia and CVD using CVD’s evaluation metrics. The results show that Cavia significantly outperforms CVD in camera pose accuracy, consistent with our earlier qualitative comparisons (Fig. 9 and Fig. B) and 3D reconstruction comparisons (Fig. 10).

| Method | AUC-Rot@5 ↑ | AUC-Rot@10 ↑ | AUC-Rot@20 ↑ | AUC-Trans@5 ↑ | AUC-Trans@10 ↑ | AUC-Trans@20 ↑ | Prec ↑ | MScore ↑ |
|---|---|---|---|---|---|---|---|---|
| Cavia | 6.36 | 17.11 | 37.56 | 4.13 | 8.42 | 18.18 | 10.19 | 6.70 |
| CVD | 7.17 | 16.76 | 33.31 | 2.59 | 4.93 | 9.86 | 5.68 | 3.29 |
Review (Rating: 4)

This paper introduces a novel framework named Cavia for generating multi-view videos with camera controllability.

The primary contributions consist of two key components: 1) Cross-view and cross-frame 3D attention mechanisms designed to enhance consistency across different viewpoints and temporal frames. 2) A joint training strategy that effectively utilizes a carefully curated combination of static, monocular dynamic, and multi-view dynamic videos to ensure geometric consistency, realistic object motion, and background preservation.

The experimental results demonstrate superior geometric accuracy and perceptual quality compared to existing baseline methods.

Questions for Authors

Please see the weaknesses and suggestions sections.

Claims and Evidence

The claims presented in the paper are intuitively sound. However, the experimental results are insufficient to comprehensively validate the individual contributions of each component of the proposed method, including the cross-view 3D attention mechanism, the cross-frame 3D attention mechanism, and the joint training strategy.

Methods and Evaluation Criteria

Yes. The effectiveness of the proposed method is primarily validated using the evaluation metrics and benchmark datasets presented in Table 1 and Table 2.

Theoretical Claims

There is no theoretical claim.

Experimental Design and Analysis

Yes. There are no obvious issues in the experimental design and analysis.

Supplementary Material

Yes. The Supplementary Material includes videos that qualitatively compare the baseline methods with the proposed method, as well as ablation studies for each component of the method. The provided video demonstrations illustrate smoother video consistency compared to other methods.

Relation to Prior Work

The key contributions are related to multi-view image generation methods such as V3D [1], IM-3D [2], SV3D [3], and Vivid-1-to-3 [4].

The primary distinction lies in that these methods focus on generating static 3D objects or scenes, whereas this work introduces vivid object motion into multi-view dynamic video generation within complex scenes.

[1] V3D: Video Diffusion Models are Effective 3D Generators, arXiv 2024.

[2] IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation, ICML 2024.

[3] SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image Using Latent Video Diffusion, arXiv 2024.

[4] Vivid-1-to-3: Novel View Synthesis with Video Diffusion Models, CVPR 2024.

Missing Essential References

There are no essential references that were not discussed.

Other Strengths and Weaknesses

Strengths:

  1. This paper proposes two novel cross-attention mechanisms, namely cross-view and cross-frame 3D attention, to enhance the multi-view consistency of generated videos. The evaluation results presented in the paper appear promising.
  2. The supplementary materials are sufficient to demonstrate the effectiveness of the proposed method.

Weaknesses:

  1. The presentation of the method requires improvement. Specifically, there is a lack of formal mathematical equations to clarify how the two types of cross-attention mechanisms are calculated.
  2. Additional quantitative ablation studies should be included in the main text to better illustrate the role and contribution of each component in the method’s design.

Other Comments or Suggestions

A List of Issues:

  • Abstract (Lines L033 and L036): There is a repetition of the phrase "To our best knowledge, ..." in both lines. This redundancy should be addressed for conciseness and clarity.

  • Figure 1(c): As I understand it, the attention mechanism represents four dimensions: V (View), F (Frame), H (Height), and W (Width). However, the figure lacks textual annotations to indicate each dimension, making it difficult to interpret. Additionally, the use of different colors is confusing, and there is no legend to explain their meanings. This information should be clearly labeled to improve readability.

  • Figure 1(a): The figure is unclear and lacks sufficient detail. Specifically:

  1. The inputs to the cross-attention mechanism are not clearly indicated.
  2. The roles of the key, query, and value are not labeled or explained.
  3. There are no formal equations provided to illustrate how the two types of cross-attention (cross-view and cross-frame) are calculated.

---post-rebuttal comments---

I have no doubt about the contribution and technical merit of this work. The proposed Cavia framework addresses an interesting multi-view image-to-video generation under precise camera control. Regarding the discussion on viewpoint changes, I find the authors’ clarification reasonable. They provided evidence that the demonstrated camera motions are substantial and aligned with community standards.

Author Response

We thank the reviewer RKXa for their positive evaluation of the technical novelty and superior performance of our method. As suggested, we will enhance the presentation of our method and figures in the revised draft. Formal mathematical equations will be added to clarify the computation of the proposed attention modules, and the typos in the abstract will be corrected as recommended. Below, we provide detailed responses to the additional comments:

Q1. Additional quantitative ablation studies should be included in the main text.

A1. Due to space limitations, we initially placed the ablation studies in the supplementary material. However, we are happy to incorporate these studies into the main manuscript as suggested.

Q2. Figure 1(c) lacks textual annotations.

A2. Thank you for pointing this out. Your understanding is correct: V (View), F (Frame), H (Height), and W (Width) denote the respective dimensions. We used purple and orange to indicate corresponding blocks in Fig. 1(a). We apologize for any confusion and will update Fig. 1 with clear textual annotations and a legend to improve readability.

Q3. Figure 1(a) lacks sufficient detail. Specifically:(1) The inputs to the cross-attention mechanism are not clearly indicated. (2) The roles of the key, query, and value are not labeled or explained. (3) There are no formal equations provided to illustrate how the two types of cross-attention (cross-view and cross-frame) are calculated.

A3. Thank you for raising these important points. Our cross-view and cross-frame attention modules are both self-attention mechanisms, where the key, query, and value are derived from the same input features. We will clarify this in the revised draft and will also include formal equations to illustrate the computation of both cross-view and cross-frame attention modules.
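As an indication of what such equations might look like (our notation only, assuming standard scaled dot-product self-attention over latent tokens z of shape V×F×H×W×C; the paper's final formulation may differ):

```latex
% Illustrative only: shared self-attention operator over a token matrix X.
\operatorname{SelfAttn}(X) \;=\; \operatorname{softmax}\!\left(\frac{(XW_Q)(XW_K)^{\top}}{\sqrt{d}}\right) XW_V

% Cross-view attention (applied per frame f):
%   X = z_{:, f, :, :, :} reshaped to a (V \cdot H \cdot W) \times C matrix
% Cross-frame attention (applied per view v):
%   X = z_{v, :, :, :, :} reshaped to a (F \cdot H \cdot W) \times C matrix
```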

Reviewer Comment

Thank you for the rebuttal. In the revision, I recommend improving the presentation and moving additional comparison results from the supplementary material to the main text to enhance clarity and impact. Since there are no essential issues with the paper’s core contributions, I maintain my original rating.

Author Comment

Thank you again for your positive assessment, for highlighting our contributions, and for your valuable suggestions. We have made the changes in our revised draft as suggested by the reviewer.

Final Decision

This paper proposes a multi-view video diffusion model enhanced by view-integrated attention mechanisms. By incorporating both cross-view and cross-frame attention, it ensures multi-view and temporal consistency, enabling joint training across diverse datasets. The paper includes detailed data curation steps and comprehensive ablation studies, making it both reproducible and technically solid.

The authors and reviewers engaged in a highly active discussion. The final debate centered on whether the demonstrated camera motion was sufficient to support the paper’s claimed contributions. The authors provided a thorough and well-structured response. They clarified that Cavia actually achieves substantial camera motions, including both significant translations and rotations beyond 20 degrees, comparable to or exceeding several cited baselines. More importantly, they argued that Cavia differs fundamentally from some of the referenced methods by generating multi-view, dynamic videos in a single pass, without stitching or post-processing, thus preserving temporal and spatial coherence. They also emphasized that Cavia addresses a distinct and challenging task: multi-view image-to-video generation under precise per-frame camera control, which is not tackled by the cited works.

Overall, I find the authors’ rebuttal to be convincing. They address the concerns with specific comparisons and clarify misconceptions about both their method and others. Given the novelty of the task, the strength of the technical contributions, and the clarity of the response, I support acceptance of this paper.