Reconstruct, Inpaint, Test-Time Finetune: Dynamic Novel-view Synthesis from Monocular Videos
Abstract
Reviews and Discussion
The paper proposes a framework for novel view synthesis given a monocular video. The proposed method utilizes rasterized point clouds in the novel views to condition a video diffusion model, which is trained to fill in the missing information. The main novelty of the paper is to design a training algorithm to utilize only monocular videos (without concurrent multiviews) to train the video diffusion model, utilizing covisibility. Test-time fine-tuning is also conducted and important to improve the quality of the results.
Strengths and Weaknesses
Strengths
- The method is simple and makes sense. Utilizing the known information (reconstructed point clouds from structure from motion) and filling in the missing information with a generative model is intuitive and effective.
- The paper provides results on various types of scenes (real, synthetic, with/without scene dynamics).
Weaknesses:
- The paper may benefit from more discussion of why using co-visibility, when the actual novel-view image and depth map are unknown, can simulate the mask at the source view that would arise if the novel view were available. Also, under what conditions does this hold? If my understanding is correct, the proposed way of creating the mask does not consider whether the novel view contains an occluder that additionally blocks point clouds from the source view -- e.g., imagine the novel view is partially behind a corner but the corner is not visible in the source view. Also, the proposed mask-creation method seems to assume a pinhole camera model -- if the camera has a sizable aperture, more information may be available (through defocus blur). Since the self-supervised training by simulating masks is the main novelty of the paper, the discussion feels a bit light.
- The paper does not answer a few questions:
(1) Does the method produce consistent multiview videos? If I sample the generative model twice using the same input monocular video but two different camera trajectories, will the content of the unobserved areas in the two trajectories be consistent with each other? Does the test-time finetuning improve the consistency between multiple views?
(2) How long does the test-time finetuning take, and how many GPUs does it need?
(3) What is the performance of the model before test-time finetuning on DyCheck?
(4) How long do the comparison methods take in Table 1 and Table 2 (including test-time optimization and rendering)?
- While utilizing rasterized point clouds in the novel view is effective, it also makes the system depend heavily on the quality of the structure from motion (i.e., depth maps, camera poses and intrinsics). The difference in quality can be observed easily between the Kubric and in-the-wild results. The in-the-wild results (e.g., Superman and Friends) seem to have camera breathing that makes the characters’ faces look unnatural. In comparison, when using ground-truth depth and camera parameters, the Kubric results do not have this problem.
- The writing can be slightly improved. When I was reading the paper, I was not sure whether the method requires concurrent multiview videos or not until I finished reading the method part (e.g., Line 39). It was also not clear in Lines 108 and 114 whether g_src contains the entire information about the scene (i.e., full 3D) or not. Line 122 seems to have a typo (V_src or V_novel).
- It seems from Figure 5 that the finetuning reduces the contrast of the results. Any reason why?
- Would the proposed method to generate camera trajectories create cameras inside objects?
Questions
I asked the questions above; as an outline:
- Would having occlusion / defocus effects in the novel views and source videos affect the masks?
- How long does the test-time finetuning take (on how many GPUs) and how does it compare to baselines?
- Does the proposed method produce consistent multiview videos when sampling the model multiple times with different trajectories, and does the test-time finetuning help?
- What is the performance of the model before finetuning on DyCheck?
Limitations
yes
Final Justification
I like the proposed method and find it interesting. The rebuttal addressed my questions. I recommend the authors to include the discussions in the rebuttal to the revision.
Formatting Issues
No
We appreciate your comment that our proposed method for dynamic novel-view synthesis is "simple" and the "test-time finetuning is important". We also show "results on various types of scenes (real, synthetic, with/without scene dynamics)". We address your concerns below.
Q1: Consistent multi-view generation
Every time CogNVS is run with a new camera trajectory, it can produce a different plausible output. Having said that, we observe that test-time finetuning does increase the consistency of the inpainted regions with respect to the content of the visible scene. If 3D multi-view consistency across samples is desired, we suggest the following: CogNVS should be used iteratively by inferring and fusing the point clouds from all previous novel-view generations (a minimal illustrative sketch of this strategy is included after the references below). This is a fairly straightforward and common strategy adopted by many generative models [R1, R2, R3]. We have verified this on our end by generating a panorama trajectory in chunks for an outdoor scene, which looks spatiotemporally consistent. Unfortunately, visuals cannot be uploaded as part of this rebuttal.
[R1] Text2room: Extracting textured 3d meshes from 2d text-to-image models, ICCV23.
[R2] Wonderjourney: Going from anywhere to everywhere, CVPR24.
[R3] 4DiM: Controlling space and time with diffusion models, ICLR25.
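For concreteness, below is a minimal sketch of the iterative generate-and-fuse strategy described above. The callables `reconstruct`, `render`, `inpaint`, and `fuse` are hypothetical placeholders for the corresponding stages (e.g., MegaSAM-based reconstruction, point-cloud rasterization, CogNVS inference, and point-cloud merging); this is an illustration of the loop structure, not our released implementation.

```python
def generate_consistent_views(input_video, trajectories,
                              reconstruct, render, inpaint, fuse):
    """Sketch of iterative novel-view generation with point-cloud fusion.

    `reconstruct`, `render`, `inpaint`, and `fuse` are hypothetical callables
    standing in for dynamic 3D reconstruction (e.g., MegaSAM), point-cloud
    rasterization, CogNVS inpainting, and point-cloud merging, respectively.
    """
    points = reconstruct(input_video)          # dynamic point cloud from the input video
    outputs = []
    for cameras in trajectories:               # one camera trajectory (or chunk) at a time
        partial = render(points, cameras)      # rasterize known geometry into the novel views
        novel = inpaint(partial)               # CogNVS fills the disoccluded regions
        outputs.append(novel)
        # Lift the newly generated frames back to 3D and fuse them, so that
        # later trajectories are conditioned on previously generated content.
        points = fuse(points, reconstruct(novel))
    return outputs
```

Because each chunk is rendered from the fused point cloud, content hallucinated in earlier generations is carried forward as a hard constraint for later ones.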
Q2: Performance on DyCheck before test-time finetuning
We include the performance on DyCheck before test-time finetuning below. For completeness, we also include a comparison on Kubric 4D and ParallelDomain 4D. Across all datasets, the takeaway is that test-time finetuning benefits all metrics by adapting the weights of CogNVS to suit the test scene’s appearance, texture, lighting and 3D cues. This is one of our primary contributions in this work.
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FID ↓ | KID ↓ |
|---|---|---|---|---|---|
| w/o test-time finetuning | 15.00 | 0.375 | 0.664 | 172.02 | 0.073 |
| w/ test-time finetuning | 15.18 | 0.382 | 0.622 | 94.48 | 0.030 |
Table A. Test-time finetuning improves performance on DyCheck.
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FID ↓ | KID ↓ |
|---|---|---|---|---|---|
| w/o test-time finetuning | 20.14 | 0.654 | 0.260 | 114.00 | 0.012 |
| w/ test-time finetuning | 22.63 | 0.760 | 0.232 | 102.47 | 0.008 |
Table B. Test-time finetuning improves performance on Kubric-4D.
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FID ↓ | KID ↓ |
|---|---|---|---|---|---|
| w/o test-time finetuning | 23.16 | 0.741 | 0.320 | 108.58 | 0.038 |
| w/ test-time finetuning | 24.34 | 0.797 | 0.302 | 102.43 | 0.033 |
Table C. Test-time finetuning improves performance on ParallelDomain-4D.
Q3: Duration of test-time finetuning
CogNVS takes 70/140 mins for test-time finetuning on short/long videos and another 5 mins for final rendering. A detailed analysis of average runtimes using 8 A6000 GPUs, for our method and all baselines on the DyCheck evaluation set with 400-frame sequences, is included below. In general, our optimization is faster than some test-time optimization approaches, with the additional benefit of being able to inpaint/hallucinate unknown regions in a spatiotemporally consistent manner. Additionally, our inference is on par with other feed-forward methods.
| Method | Optimization | Rendering / Inference |
|---|---|---|
| MegaSAM (CVPR '25) | 9 min | Real-time |
| MoSca (CVPR '25) | 66 min | Real-time |
| 4DGS (CVPR '24) | 151 min | Real-time |
| Shape of Motion (ICCV '25) | 237 min | Real-time |
| GCD (ECCV '24) | - | 2 min |
| TrajectoryCrafter (ICCV '25) | - | 5 min |
| CogNVS | 140 min | 5 min |
Table D. Runtime analysis of our method and the baselines.
Note that per the graph in Fig. 8 (left) of the supplement, our method achieves 96% of its performance in just half the number of finetuning steps. This duration can always be traded off against the available compute. This has recently been referred to as “inference-time scaling” for diffusion models [R4].
[R4] Inference-time scaling for diffusion models beyond scaling denoising steps, arXiv 2501.09732.
Q4: Effect of occlusion / defocus in novel-views on mask
During mask creation, we do not explicitly hallucinate new occluders in the novel view to reject additional points from the point cloud during reprojection. However, this is not a bottleneck for training or inference, as the random nature of our training mask creation, in expectation, covers cases where an occluder would be visible in the novel view. Even if this were not the case, inference with CogNVS would be unaffected, as all CogNVS does is pixel-level inpainting given a set of masked pixels. The resulting inpainted region can hallucinate new objects which act as occluders in the novel view. Note that all occlusion scenarios except “containment” are covered in the generated training data. As you correctly note, for the scope of this paper we only model a pinhole camera, since the methods we use for preprocessing (e.g., MegaSAM) are also trained under a pinhole camera assumption. We will gladly add this discussion to the main paper.
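To make the kind of covisibility check being discussed concrete, here is a minimal, self-contained NumPy sketch assuming a pinhole camera and a z-buffered reprojection test. The conventions used (`depth_src` as per-pixel z-depth, `T_src2nov` as a 4x4 source-to-novel camera transform, `occlusion_tol` as a depth tolerance) are our own illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def covisibility_mask(depth_src, K, T_src2nov, occlusion_tol=0.05):
    """Marks which source pixels remain visible after reprojection into a
    sampled novel view (pinhole model, z-buffer occlusion test).
    Returns a boolean (H, W) mask; True = covisible, False = masked out."""
    h, w = depth_src.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)

    # Back-project source pixels into 3D (source-camera coordinates).
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    pts_src = (pix @ np.linalg.inv(K).T) * depth_src.reshape(-1, 1)

    # Transform into the sampled novel view and project with the same intrinsics.
    pts_h = np.concatenate([pts_src, np.ones((pts_src.shape[0], 1))], axis=1)
    pts_nov = (pts_h @ T_src2nov.T)[:, :3]
    z = pts_nov[:, 2]
    uv = (pts_nov @ K.T)[:, :2] / np.clip(z, 1e-6, None)[:, None]

    # Points behind the camera or outside the novel image plane are not covisible.
    inside = (
        (z > 0)
        & (uv[:, 0] >= 0) & (uv[:, 0] < w)
        & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    )

    # Z-buffer in the novel view: only the nearest point per target pixel survives.
    zbuf = np.full((h, w), np.inf)
    ui = uv[inside, 0].astype(int)
    vi = uv[inside, 1].astype(int)
    np.minimum.at(zbuf, (vi, ui), z[inside])

    visible = np.zeros(h * w, dtype=bool)
    visible[np.flatnonzero(inside)] = z[inside] <= zbuf[vi, ui] + occlusion_tol
    return visible.reshape(h, w)
```

As discussed above, such a check only rejects points that fall outside the novel frustum or are occluded by other reprojected points; it does not simulate occluders that exist only in the novel view.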
Q5: Dependent on quality of structure-from-motion
This is correct. Since our method takes as input a partial rendering of the input scene from the novel view, its performance will increase with better structure-from-motion approaches which can give us better depth maps, camera poses and intrinsics. As you correctly note, our generations are better when ground-truth depth and cameras are available, e.g., in Kubric 4D.
Q6: Improved writing
Thanks for your suggestion. g_src only contains information about the scene from the source view visible in V_src (i.e., not full 3D). Line 122 in fact does not have a typo: it should be V_src because we use the source video as ground truth during finetuning. Nevertheless, we will make the writing clearer as suggested.
Q7: Contrast is reduced after finetuning
This can happen in some cases when the scene is out-of-distribution (e.g., the textured ground plane in this case) and finetuning is not enough for adaptation. We will highlight this limitation of our method more clearly in the main text.
Q8: Cameras inside objects
Our current setup does not allow generating cameras inside objects because the training datasets we consider do not cover these cases.
I thank the authors for replying to my questions. I would recommend including these discussions in the future manuscript, including incorporating the runtime results (Q3) into the main table.
Thank you for your valuable comments. We will incorporate the additional discussions and results in our revised manuscript.
This paper proposes a warping-then-inpainting paradigm for dynamic novel view synthesis from monocular video. Beginning with a reconstructed dynamic scene (e.g., from MegaSAM), novel views can be rendered along different camera trajectories. The main challenge lies in inpainting disoccluded regions in the rendered novel views. To address this, the authors fine-tune the video generation model CogVideo for video inpainting, using constructed training pairs of masked and source images. Additionally, they further fine-tune the model on test scenes to reduce domain gaps and enhance inpainting quality. Experimental results demonstrate that the proposed method achieves superior performance in most cases.
Strengths and Weaknesses
Strengths: (1) This paper proposes a simple pipeline for dynamic novel view synthesis from a monocular video, yielding good results. (2) This paper fine-tunes a video inpainting model, suitable for filling in disoccluded regions in novel view synthesis tasks. (3) The paper is well organized and easy to follow.
Weaknesses: (1) Since the warping-then-inpainting pipeline is widely used in view synthesis and 3D scene creation (e.g., Text2Room, etc.), it is not suitable to claim it as a contribution. (2) The generated content lacks 3D consistency. The output of this method is a camera-controlled video (it cannot support free-view exploration), which means that disoccluded regions can be generated with different content when conditioned on a new camera trajectory in a second generation. (3) The technical contributions seem limited: fine-tuning a video generation model to inpaint the disoccluded regions in novel views, without insightful designs during training. Other components in this paper are built upon existing approaches.
Questions
(1) Do the performance improvements come from more advanced 4D initialization and video generation models? As mentioned in the paper, the quality of the reconstructed scene influences the performance. Given the pipeline’s similarity to TrajCrafter (warping + novel view rendering + finetuned video inpainting), how does the proposed method quantitatively and qualitatively outperform it? Are gains attributable to the choice of CogVideo, warping strategy, fine-tuning approach, and reconstruction methods?
(2) By warping the monocular video into novel views (some pixels are occluded and discarded) and then warping it back, we can also create the training pairs. This approach has been widely used in TrajCrafter and previous methods, such as [1]. I believe the underlying motivation for the warp-back strategy is the same as the design of this paper. Or what are the advantages of this paper's mask creation design?
(3) About per-scene optimization. Will the finetuning on a specific scene cause the overfitting problem?
[1] Single-View View Synthesis in the Wild with Learned Adaptive Multiplane Images
Limitations
yes
Final Justification
My concerns were not resolved.
Formatting Issues
No
We appreciate your comment that we propose a "simple pipeline" yielding "good results". We address your concerns below.
Q1: Claiming inpainting as a contribution
Thank you for your comment. We want to highlight that we do not claim "warping then inpainting" as a contribution (L52-55). In fact, we sufficiently cite other works which employ a similar inpainting strategy (c.f. MegaScenes, NVS-Solver, Gen3C, TrajectoryCrafter). We apologize for the oversight on Text2Room. We will gladly cite it and clarify these details further in the manuscript.
Q2: Multimodal generations lack 3D consistency
It is correct that every time CogNVS is run with a new camera trajectory, it would result in a different plausible output. Having said that, we observe that test-time finetuning does increase the consistency of the inpainted regions with respect to the content of the visible scene. If 3D multi-view consistency across samples is desired, we suggest the following: CogNVS should be used iteratively by inferring and fusing the point clouds from all previous novel-view generations. This is a fairly straightforward and common strategy adopted by many generative models [R1, R2, R3]. We have verified this on our end by generating a panorama trajectory in chunks for an outdoor scene which looks spatiotemporally consistent. Unfortunately, visuals cannot be uploaded as part of this rebuttal.
[R1] Text2room: Extracting textured 3d meshes from 2d text-to-image models, ICCV23.
[R2] Wonderjourney: Going from anywhere to everywhere, CVPR24.
[R3] 4DiM: Controlling space and time with diffusion models, ICLR25.
Q3: Limited technical contribution
As elucidated in the main paper, we would like to reiterate that our main technical contributions are (1) training a dynamic novel-view synthesis method purely on 2D videos, and (2) self-supervised finetuning on input videos at test time using our data generation strategy. These help our method achieve state-of-the-art performance on dynamic novel-view synthesis. Reviewer N7JL regards these contributions as "key differences" that are methodologically novel compared to prior works, and to the best of our knowledge, these contributions have not appeared in the literature before.
TrajectoryCrafter is one similar and concurrent work (to be published at ICCV in October). While both TrajectoryCrafter and CogNVS finetune a video generation model into an inpainting model, TrajectoryCrafter’s contribution is a “multiview” model that requires the input view as a conditioning. Our architecture is simpler as it performs inpainting only. Because of this, we are able to finetune CogNVS at test time, while it is impossible to do so with TrajectoryCrafter because one cannot have access to the ground-truth novel view at test time. Empirically, this test-time finetuning enabled by our architecture design manifests in improved performance metrics. See below for a quantitative comparison.
Q4: Primary source of performance improvement
Please note that, as stated above, our primary source of improvement comes from the last stage of training: self-supervised finetuning of CogNVS at test time on a test video. For example, on Kubric 4D, TrajectoryCrafter’s PSNR↑/FID↓ is 20.9/130.2. Our pretrained CogNVS without test-time finetuning (with a CogVideoX backbone, warping strategy and ground-truth point cloud similar to TrajectoryCrafter) achieves a PSNR↑/FID↓ of 20.1/114.0, which is on par with the performance of TrajectoryCrafter out of the box. Upon test-time finetuning, CogNVS achieves a PSNR↑/FID↓ of 22.6/102.5, which surpasses the performance of TrajectoryCrafter. As we cover in the main text, test-time finetuning helps CogNVS adjust to the target scene of interest, and adapt the weights to suit the scene’s appearance, texture, lighting and 3D cues.
Q5: Mask creation design
The underlying motivation for the warp-back strategy is indeed the same as TrajectoryCrafter and [1]. That said, two distinctions of our work with respect to TrajectoryCrafter (as touched upon in Q3) are that (1) we train CogNVS purely on 2D videos, and (2) we use a self-supervised test-time finetuning phase to adapt CogNVS to the test video, which is not possible with TrajectoryCrafter's architecture design. Finally, we want to point out that TrajectoryCrafter will be published at ICCV in October and should therefore be regarded as concurrent work.
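To illustrate how such training pairs can be built from 2D videos alone, here is a minimal sketch of constructing one pair from a monocular clip. `estimate_depth_and_poses`, `sample_novel_camera`, and `covisibility_mask` are hypothetical helpers standing in for off-the-shelf reconstruction (e.g., MegaSAM), random trajectory sampling, and a reprojection-based visibility check; the conventions (4x4 camera-to-world poses, HxWx3 frames) are our own illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def make_training_pair(frames, estimate_depth_and_poses,
                       sample_novel_camera, covisibility_mask):
    """Sketch of self-supervised pair construction from a single 2D video.

    `estimate_depth_and_poses`, `sample_novel_camera`, and `covisibility_mask`
    are hypothetical helpers: per-frame depth/camera estimation (e.g., MegaSAM),
    random novel-trajectory sampling, and a reprojection-based visibility check.
    Poses are assumed to be 4x4 camera-to-world matrices.
    """
    depths, poses, K = estimate_depth_and_poses(frames)
    T_novel = sample_novel_camera(poses)            # sampled novel camera (4x4, camera-to-world)

    masked_frames = []
    for frame, depth, pose in zip(frames, depths, poses):
        T_src2nov = np.linalg.inv(T_novel) @ pose   # source camera -> novel camera
        mask = covisibility_mask(depth, K, T_src2nov)
        masked_frames.append(frame * mask[..., None])  # zero out pixels not visible in the novel view

    # Model input: frames with simulated disocclusions; training target: the original frames.
    return masked_frames, frames
```

Note that no novel-view ground truth or reference-video conditioning appears anywhere in this sketch, which is what makes the same recipe usable for test-time finetuning on the input video itself.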
Q6: Would finetuning cause overfitting?
Yes, overfitting to the target scene is the intention with test-time finetuning. This is the same concept by which NeRFs or Gaussian Splats are overfit to a single scene at test-time. In practice, we do not find that this overfitting is detrimental to novel-view synthesis in any way. Regardless, the duration of test-time finetuning can always be adjusted to favor generating a more diverse scene at the cost of lower adherence to the target scene.
Thanks for the efforts. I have thoroughly read the rebuttal, but still have some concerns about contributions.
"(1) We decompose dynamic view synthesis into three stages of reconstruction, inpainting and test-time finetuning." The author claims the whole pipeline as a contribution. However, reconstruction, warping, and then inpainting have been widely explored, such as [1]. I acknowledge the new per-scene fine-tuning scheme, which is a small contribution. Additionally, I recommend including a graph showing the relationship between performance and fine-tuning iterations to verify overfitting issues.
"(2) ) we use a large corpus of only 2D videos for training CogNVS" Fine-tuning the inpainting model from monocular input is not new. Except for TrajectoryCrafter, previous methods [2] [3] employ a warp then warp-back strategy to construct the training pairs from monocular input. Moreover, the authors also agree "The underlying motivation for the warp-back strategy is indeed the same as TrajectoryCrafter and [2]". Therefore, the second contribution is weak.
TrajectoryCrafter can be fine-tuned. The authors claim that TrajectoryCrafter cannot be fine-tuned because we cannot have access to the ground-truth novel view at test time. If I am right, TrajectoryCrafter is trained on two types of datasets: self-constructed (warp and warp-back) pairs and multi-view datasets. The fine-tuning can be implemented on the self-constructed training pairs. Another question: TrajectoryCrafter additionally adds the input video as a condition, but this paper removes this condition. What is the motivation for this design?
[1] Text2room: Extracting textured 3d meshes from 2d text-to-image models [2] Single-View View Synthesis in the Wild with Learned Adaptive Multiplane Images [3] SpatialDreamer: Self-supervised Stereo Video Synthesis from Monocular Input
Thank you for your thoughtful comments in this discussion. We address your concerns below.
Q1: Can TrajectoryCrafter be test-time finetuned? Why do we not condition on the input video?
TrajectoryCrafter’s full architecture cannot be used for test-time finetuning (i.e., per-scene finetuning), as it requires the input video as a conditioning. TrajectoryCrafter is in practice trained with triplets of data (input video conditioning, target-view render, and target-view ground truth), which it has access to from multi-view static datasets (cf. “only the multi-view static data provides the required triplets” in their Sec. 3.4 Training scheme).
If we were to test-time finetune the full architecture of TrajectoryCrafter with “self-constructed video pairs”, we would only have access to the target view render and the target view ground-truth (i.e., the input video conditioning would not be available). This is why TrajectoryCrafter cannot be test-time finetuned.
On the other hand, we do not use the input video conditioning in our architecture and therefore, we can easily train on only 2D videos and test-time finetune on any video at inference.
Q2: Contribution of test-time finetuning is “small”
We would like to clarify that the contribution of test-time finetuning (i.e., per-scene finetuning) is important. As you noted earlier, our simple pipeline achieves “superior performance in most cases,” and one major contributor to our model’s state-of-the-art performance is indeed the proposed test-time finetuning (we covered this in our previous response in Q4 above). This operation consistently improves performance quantitatively and qualitatively on all evaluation datasets. For instance, as shown in the table below, on Kubric-4D, CogNVS performs comparably to TrajectoryCrafter before test-time finetuning, but significantly outperforms it across all metrics afterward.
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FID ↓ | KID ↓ |
|---|---|---|---|---|---|
| TrajectoryCrafter | 20.93 | 0.730 | 0.257 | 130.20 | 0.024 |
| CogNVS w/o test-time finetuning | 20.14 | 0.654 | 0.260 | 114.00 | 0.012 |
| CogNVS w/ test-time finetuning | 22.63 | 0.760 | 0.232 | 102.47 | 0.008 |
Table A: Test-time finetuning improves performance on Kubric-4D.
We also appreciate you pointing out relevant prior works that we had previously overlooked. We will cite them and elaborate on the similarities and distinctions in the revised manuscript. We understand that some of the phrasing in the current paper may suggest we are claiming every component of the "reconstruct, inpaint, finetune" pipeline as a novel contribution, but that is not our intention. We will revise the wording to more clearly separate prior work from our core contributions.
Q3: Include a graph showing the relationship between performance and test-time finetuning iterations
We included a graph illustrating the relationship between performance and test-time finetuning iterations in Fig. 8 (left) in supplement. The results show that within our default finetuning iterations, performance consistently improves, suggesting that such “overfitting” is beneficial. We will include more detailed plots across additional datasets in the revised version.
Thanks for the authors' responses.
(1) The way to fine-tune the TrajectoryCrafter.
Given an input video with 20 frames, we can randomly split it into two sequences (e.g., frames 1-10 and 11-20). The first sequence can be used as a condition, and the second sequence is used to construct the training pairs with the warp and warp-back strategy.
(2) About contribution. I acknowledge that fine-tuning can boost the performance. However, if fine-tuning alone is the paper's greatest contribution, I think it falls short of the bar set by NeurIPS, just from my perspective. The second contribution of training an inpainting model with only monocular input is not new, and has been explored in previous works I mentioned.
Thanks for your comment.
Q1: Splitting the video into two sequences to finetune TrajectoryCrafter
TrajectoryCrafter adopts this strategy to create triplets of data for training on static multi-view datasets. However, such a data curation strategy cannot be used for videos of dynamic scenes! Specifically, when we split a dynamic video into two sequences (e.g., frames 1–10 and 11–20) as source videos (i.e., input video conditioning) and target videos, the content at each time step between the two sequences is misaligned due to differences in their dynamic foreground. Therefore, TrajectoryCrafter still cannot be test-time finetuned on dynamic videos.
Q2: Contribution of our work
We understand that the assessment of contributions is subjective – we will leave the final decision to the AC. As a final comment, some of the papers being discussed (e.g., SpatialDreamer, TrajectoryCrafter) were not published at the time of our submission to NeurIPS or have not been published yet, and therefore are concurrent work. It is natural for independently developed methods to share common ideas with each other, especially when these ideas are fundamentally the most intuitive next steps for the community.
Apart from this, we still believe our work would contribute meaningfully to the dynamic novel-view synthesis community, and set a strong baseline that potentially inspires future work in this direction.
Thanks for the response.
"TrajectoryCrafter still cannot be test-time fine-tuned on dynamic videos due to differences in the two sequences' dynamic foreground". This is a performance issue, not an inability of TrajectoryCrafter to be fine-tuned. You need experimental results to support your claim, not just your guess. The worst situation is that these conditions do not help target view generation, which degenerates to your setting. Moreover, other alternative split schemes can provide a better alignment of dynamic content, for example, every 2nd frame is used as the condition, and the other frames are used as target views.
Why discuss the recent work TrajectoryCrafter? First, this paper includes results from TrajectoryCrafter, and other reviewers have also mentioned the similarities between the two works. Even if we consider TrajectoryCrafter as concurrent work, we cannot use the unverified claim that TrajectoryCrafter cannot be fine-tuned to convince the reviewers that the proposed method is superior to TrajectoryCrafter.
I would like to clarify why I think the key contributions are weak. The first contribution is a "reconstruct + inpaint + finetune" pipeline. This pipeline has been widely explored since Text2Room [1]. The second contribution is an inpainting model trained from monocular input. However, constructing training pairs from monocular input has been proposed in previous works. I named both previous and recent works that use the warp and warp-back strategy to construct training pairs and train the inpainting model from monocular input. However, the authors argue that these papers (SpatialDreamer and TrajectoryCrafter) should be concurrent works without mentioning more previous work [2] in their final comment. Moreover, the authors also agree "The underlying motivation for the warp-back strategy is indeed the same as TrajectoryCrafter and [2]" in the first response.
As the authors clearly indicate that this is their final comment, my concerns will clearly remain unresolved. I have decided to lower my score.
[1]Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models. ICCV 2023 [2] Single-View View Synthesis in the Wild with Learned Adaptive Multiplane Images. Siggraph 2022
Sorry for the continued confusion and thanks for engaging in a thorough discussion!
First of all, we see where you’re coming from regarding finetuning TrajectoryCrafter. The main issue lies in their architecture’s conditioning on a reference / input video. It seems we all agree that, using our self-supervised data generation strategy, it is not possible to finetune TrajectoryCrafter, which is why the discussion is now pivoting to a new data generation strategy of splitting a video sequence into source and target clips. This way of finetuning TrajectoryCrafter is possible but futile because, as we already mentioned, the data to train on (holding out every 2nd frame, as per your new suggestion) is misaligned in terms of its dynamics. This naturally does not follow the task of dynamic novel view synthesis, where novel views have to be generated for the timesteps already observed in the input. Rather than this ill-posed way of finetuning, one easy fix to TrajectoryCrafter’s architecture to make it more scalable with self-supervised finetuning on 2D videos is to remove the input conditioning, which amounts to an architecture similar to our method. We already show in this work that such an architecture for inpainting, followed by the proposed test-time finetuning, is a powerful approach that surpasses the performance of TrajectoryCrafter and other prior art.
We hope this long discussion on adapting diffusion models to train scalably on 2D videos points to the fact that this is a hard problem and a valuable one to solve, as we do in this work. While certain components of our approach could have appeared before in the literature, independently or concurrently, it is important for the community to be able to access a well-built, thoroughly tested, and state-of-the-art pipeline for the task of dynamic novel view synthesis. We believe our work does exactly this, and we hope the community can build on our method in the future.
This paper considers novel view synthesis from monocular videos. The method can be summarized as first creating a 3D reconstruction which is rendered to the target views, followed by a fine-tuned video diffusion model which refines the rendering and in-paints the missing regions. The diffusion model is based on a pre-trained CogVideoX model and then fine-tuned on a large set of videos, for which the authors present a method to create plausible occlusion masks from monocular videos which the diffusion model is conditioned on. Furthermore, in addition to the general pre-training, the diffusion model is refined in the same way for each video at test time. Experimental results show excellent qualitative results and quantitative results outperforming the previous state-of-the-art for the task.
Strengths and Weaknesses
Strengths:
- The main strength of the paper is that the results look very good qualitatively, and better than the previous state-of-the-art to which the method is compared.
- The key methodological difference compared to prior work such as NVSsolver [84] or Gen3C (which both use 3D reconstruction + inpainting with video diffusion models for NVS from monocular videos) is to fine-tune for each test video. This is shown quantitatively to greatly improve the performance.
- The data generation strategy is well designed to resemble point cloud renderings by using co-visible regions in reconstructed videos, and it is shown to improve metrics compared to training with simpler masking heuristics. It should be noted though that the data curation method shares many similarities with TrajectoryCrafter [85], something which should be highlighted more clearly in the paper.
Weaknesses:
- The method shares many similarities with TrajectoryCrafter [85], in the sense that a 3D reconstruction is created and a video diffusion model is conditioned on renderings of the 3D reconstruction in the target poses, and the data curation strategy used to create data for pre-training is similar. An interesting experiment would be to use TrajectoryCrafter with the proposed test-time fine-tuning. This would be interesting since the difference between CogNVS and TrajCrafter in Table 1 is in line with the difference between using test-time fine-tuning or not in the ablation in Table 3.
- While the results are improved by the fine-tuning of the video diffusion model per test video, it also significantly increases the run-time. It would have been good to also include run-time comparisons of the different methods. The run time of the test-time fine-tuning is not stated directly but can be calculated. To my understanding, the diffusion model sampling takes 5 minutes per video, while the per-video diffusion model fine-tuning takes about 1.2/2.4 hours (short vs long videos) with 8 GPUs for each video, so about 10-20 GPU hours for a single video. (Inferred from lines 188-192 in the paper: 200 or 400 steps of test-time fine-tuning vs 12,000 steps for the pre-training, which takes 3 days with 8 A6000 GPUs. Please correct in the rebuttal if something is misunderstood or incorrect.)
Questions
- Please see weaknesses.
- To my understanding, the partial novel-view rendering is encoded by the VAE encoder into the conditioning latent, where pixels that are not co-visible are simply zeros. Are missing regions simply treated as zero values in this conditioning, and were alternatives such as providing masks as input to the diffusion model tested? The diffusion model denoises in the latent space, which has a reduced spatial resolution, e.g. 1/8th. How are 8x8 patches in the image treated when there are only a few missing values? Are they treated simply as black pixels among the co-visible pixels in the 8x8 patch and encoded as such by the VAE?
Limitations
yes
Final Justification
The authors provided detailed answers to all questions. I keep my borderline accept rating and expect the authors to include all run-time results in the main paper.
Formatting Issues
no
We appreciate your comment that our data generation strategy is "well-designed", our proposed self-supervised finetuning of test videos is a "key difference methodologically" and that our method’s "results look very good qualitatively, and better than the previous state-of-the-art". We address your concerns below.
Q1: Test-time finetuning with TrajectoryCrafter
We understand the similarity between the data generation strategy of our method and TrajectoryCrafter, but we want to clarify that unlike TrajectoryCrafter we train our model with only 2D videos. While both methods finetune a video generation model into an inpainting model, TrajectoryCrafter’s contribution is a “multiview” model that requires the input view as a conditioning. Our architecture is simpler as it performs inpainting only. Because of this, we are able to finetune CogNVS at test time, while it is impossible to do so with TrajectoryCrafter because one cannot have access to the ground-truth novel view at test time. Finally, we want to point out that TrajectoryCrafter will be published at ICCV in October and should therefore be regarded as concurrent work.
Q2: Runtime analysis with baselines
Your analysis of the runtime is correct: CogNVS takes 70/140 mins for test-time finetuning on short/long videos. However, per the graph in Fig. 8 (left) of the supplement, note that our method achieves 96% of its performance in just half the number of finetuning steps. This duration can always be traded off against the available compute. This has recently been referred to as “inference-time scaling” for diffusion models [R4].
Nevertheless, we provide a more detailed analysis of optimization and rendering/inference runtime in comparison to baselines below. These are average runtimes for the DyCheck evaluation set. We run finetuning on 8 A6000 GPUs, and for a 400-frame sequence, our method takes ~2 hrs for test-time finetuning and another 5 mins for final rendering. In general, our optimization is faster than some test-time optimization approaches with the additional benefit of being able to inpaint/hallucinate unknown regions in a spatiotemporally consistent manner. Additionally, our inference is on par with other feed-forward methods.
| Method | Optimization | Rendering / Inference |
|---|---|---|
| MegaSAM (CVPR '25) | 9 min | Real-time |
| MoSca (CVPR '25) | 66 min | Real-time |
| 4DGS (CVPR '24) | 151 min | Real-time |
| Shape of Motion (ICCV '25) | 237 min | Real-time |
| GCD (ECCV '24) | - | 2 min |
| TrajectoryCrafter (ICCV '25) | - | 5 min |
| CogNVS | 140 min | 5 min |
Table A. Runtime analysis of our method and the baselines.
[R4] Inference-time scaling for diffusion models beyond scaling denoising steps, arXiv 2501.09732.
Q3: VAE encoding of missing pixels
Yes, we simply treat missing regions as zeros in image space. While there are other ways to encode missing pixels, e.g., providing explicit masks as you suggest (which TrajectoryCrafter also uses), it is not clear what advantage this brings. For instance, we find that TrajectoryCrafter is unable to inpaint unknown regions in some Kubric 4D examples even though they have been explicitly marked as unknown in its input masks. We instead take a data-driven approach: we rely on the model to infer where and how to inpaint missing regions based on context.
The latent space of the VAE can capture semantic information, including the context provided by neighboring pixels. Therefore, black pixels or missing pixels (even though both are represented by 0s) should have different latent embeddings. Similarly, 8x8 patches with few or many missing pixels should have different latent embeddings. In practice, we empirically observe that such patches are handled well by CogNVS without explicit masking.
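For clarity, a minimal sketch of how such a zero-filled conditioning could be encoded, assuming a diffusers-style video VAE interface (e.g., an `encode(...).latent_dist` API as exposed by CogVideoX-style autoencoders). This is an illustrative sketch of the design described above, not our exact training code; the tensor layout is an assumption.

```python
import torch

def encode_condition(vae, partial_render):
    """Encode the partial novel-view render as the conditioning latent.

    `partial_render`: (B, C, T, H, W) tensor in [-1, 1], with missing / non-covisible
    pixels already set to zero; no explicit mask channel is concatenated.
    `vae` is assumed to expose a diffusers-style `encode(...).latent_dist`.
    """
    with torch.no_grad():
        latent = vae.encode(partial_render).latent_dist.sample()
    return latent  # used as conditioning for the diffusion model
```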
The authors provided detailed answers to all questions. I keep my borderline accept rating and expect the authors to include all run-time results in the main paper.
Thank you for your comments. We will include all run-time results in the revised manuscript.
The paper tackles the important task of dynamic novel-view synthesis. It introduces CogNVS, a video diffusion framework to inpaint hidden pixels in novel views and is able to generalize to novel videos via test-time finetuning. Experiments show state-of-the-art results across multiple benchmarks (Kubric-4D, ParallelDomain-4D, DyCheck) without requiring multi-view training data or ground-truth geometry.
Strengths and Weaknesses
Strengths
- I like the idea of constructing self-supervised learning pairs on 2D videos to avoid costly 3D geometric data supervision. It is easier to scale up and learn the scene dynamics in a data-driven way.
- Qualitative results show improved temporal consistency and 3D plausibility over baseline methods.
Weaknesses
- Almost all human and animal characters in the generated videos look somewhat distorted (see the in-the-wild bear case, superman case, and man hiking case on the webpage). The faces and bodies feel flattened. Is this a result of inaccurate MegaSAM estimation?
- Given the randomly constructed pairs for training, what is the maximum rotation angle or the maximum proportion of unseen pixels for which this model can generate robust outputs?
Questions
- Almost all human and animal characters in the generated videos look somewhat distorted (see the in-the-wild bear case, superman case, and man hiking case on the webpage). The faces and bodies feel flattened. Is this a result of inaccurate MegaSAM estimation?
- Given the randomly constructed pairs for training, what is the maximum rotation angle or the maximum proportion of unseen pixels for which this model can generate robust outputs?
Limitations
yes
Final Justification
The rebuttal addresses all of my concerns, therefore I raise my rating to accept.
Formatting Issues
none
We appreciate your comment that we solve an "important" task of dynamic novel-view synthesis by "avoiding costly 3D geometric data", building a pipeline that is "easier to scale up" and "improves temporal consistency and 3D plausibility over the baselines". We address your concerns below.
Q1: Distorted foreground objects
This is correct. The distortion visible in the in-the-wild examples (e.g., the left leg of the man hiking) is caused by distortion in the MegaSAM outputs. We will include visualizations of the renders from MegaSAM that are input to CogNVS to make this clear. Additionally, we note that there is no such distortion in the Kubric 4D objects, as ground-truth metric depth is available in this case.
Q2: Maximum rotation angle
As we show in the Kubric 4D evaluation and some in-the-wild examples, CogNVS can plausibly generate novel views with up to 90 degrees of variation (up, down, left, or right). We also attempted to generate panorama views of scenes (outward-facing 360-degree views) by dividing the camera trajectory into 49-frame chunks and progressively generating the new views conditioned on the past generations, a fairly straightforward and common strategy adopted by many generative models [R1, R2, R3]. Qualitative analysis suggests that CogNVS can still generate 3D-plausible and temporally coherent novel views even in such extreme cases. Unfortunately, visuals cannot be uploaded as part of this rebuttal, but we have verified this on our end.
Additionally, we want to point out that our method is not able to generate tabletop views of object-centric scenes (inward-facing 360-degree views) beyond 90 degrees of camera deviation, likely because such training data was not seen by the model during pretraining.
[R1] Text2room: Extracting textured 3d meshes from 2d text-to-image models, ICCV23.
[R2] Wonderjourney: Going from anywhere to everywhere, CVPR24.
[R3] 4DiM: Controlling space and time with diffusion models, ICLR25.
Thanks for the detailed response that addresses my concerns. Therefore I raise my score. Please add the promised visualizations and discussions in the final version.
Thanks for raising the score. We will incorporate the promised visualizations and discussions in the final version.
This paper introduces a method for dynamic novel view synthesis from monocular video. It received one accept, two borderline accepts, and one reject. Reviewers praise the good qualitative results, the self-supervised learning paradigm that uses only 2D videos to avoid costly 3D geometric supervision, and the well-designed test-time finetuning strategy. However, Reviewer Tsgf, who participated extensively in the discussion, worries about the novelty of the whole pipeline and suggests that fine-tuning an inpainting model from monocular input is not new. Both Reviewer N7JL and Reviewer Tsgf suggest the experiment of fine-tuning TrajectoryCrafter. The authors explained in the rebuttal and discussion why TrajectoryCrafter cannot be test-time finetuned on dynamic videos and the motivation for removing the input video as a condition. The ACs had an in-depth discussion on this point and consider the concern a minor issue, but the authors are urged to describe the unique aspects of the proposed research compared to TrajectoryCrafter in the final version.