Imagine360: Immersive 360 Video Generation from Perspective Anchor
Imagine360 creates high-quality, immersive 360 videos from perspective video anchors.
Abstract
Reviews and Discussion
This paper presents a novel method, Imagine360, to lift standard perspective video into 360° video. The authors first design a dual-branch denoising U-Net equipped with cross-domain spherical attention to capture both global context and local details. Additionally, they introduce an antipodal mask to model long-range motion dependencies. To handle diverse perspective video inputs, a rotation-aware data sampling strategy is proposed. Experimental results demonstrate the effectiveness of the proposed approach.
Strengths and Weaknesses
Strengths:
- The paper is well-written and easy to follow.
- The use of a dual-branch U-Net to capture both global and local correlations is an interesting and effective design choice.
- The demo is engaging and clearly demonstrates the practical applications of the proposed method.
Weaknesses:
- The visual quality of the generated videos is relatively poor. Although the proposed method achieves state-of-the-art performance compared to the baselines, the results still exhibit noticeable artifacts, such as blurring and temporal inconsistencies between adjacent frames. Given that recent video generation models (e.g., CogVideo and Wan-2.1) are capable of producing higher-quality outputs, it would be better if the authors could further explain these limitations.
- It is unclear whether the baseline models were trained using the same dataset collected by the authors. If not, the comparisons may not be entirely fair, and this should be clarified.
- The proposed method involves a relatively complex pipeline. It would be helpful if the authors provided a comparison of training and inference costs to better assess the model’s efficiency and practical applicability.
Questions
Please refer to the Strengths And Weaknesses section.
Limitations
Yes.
Formatting Concerns
Figure 3 overlaps with surrounding text, causing readability issues. The authors should revise the layout to ensure all content is clearly visible.
Response for Reviewer jUcp
Thanks for the positive feedback and thoughtful suggestions. We will address the concerns and questions as below.
Q.1 The visual quality is relatively poor. Given that recent video generation models are capable of producing higher-quality outputs, it would be better if the authors could further explain these limitations.
Thanks for the comment. The visual artifacts primarily stem from our backbone, AnimateDiff, which is known to produce temporal flicker in many cases. Despite this limitation, our method improves upon AnimateDiff in both visual quality and temporal consistency, as shown in Table 1. To further address this issue, we plan to upgrade the backbone to a more advanced video generation model (Wan2.1/2.2), with the technical modifications required to migrate from the UNet backbone to a DiT architecture. This update is expected to significantly reduce visual artifacts and improve overall temporal coherence. We will update our results with the upgraded backbone in the final version.
Q.2 It is unclear whether the baseline models were trained using the same dataset collected by the authors. If not, the comparisons may not be entirely fair, and this should be clarified.
Thank you for the thoughtful suggestion. The baseline models reported in Table 1 were not trained on YouTube360. However, we have included the performance of 360DVD trained on YouTube360 in Table C of the supplementary material.
We agree that it is important to clarify the training data used for each model and to also report baselines trained on the same dataset. We add the results of baselines+LoRA trained on our data (marked with *) in the table below. Most of the metrics improve with our training data, especially EPE, which reflects 360 motion correctness. We will move the results from Table C to the main paper, clarify the training data source in Table 1, and include additional results of other retrained baselines in the final version.
| Method | IQ | AQ | MS | SC | EPE |
|---|---|---|---|---|---|
| FYC | 0.6978 | 0.6432 | 0.9771 | 0.9529 | 3.1767 |
| AnimateDiff | 0.7159 | 0.5840 | 0.9780 | 0.9338 | 3.5393 |
| 360DVD | 0.5537 | 0.4745 | 0.9701 | 0.9629 | 3.1904 |
| FYC+LoRA* | 0.7075 | 0.6345 | 0.9805 | 0.9656 | 2.9330 |
| AnimateDiff+LoRA* | 0.6945 | 0.6019 | 0.9795 | 0.9237 | 3.0068 |
| 360DVD* | 0.6991 | 0.5381 | 0.9739 | 0.9696 | 3.1466 |
| Ours | 0.7487 | 0.6439 | 0.9806 | 0.9710 | 2.5583 |
Q.3 Add comparison of training and inference costs to better assess the model’s efficiency and practical applicability.
Thanks for the valuable advice.
We provide an inference comparison table in terms of runtime and VRAM below. The experiments are conducted on a single NVIDIA A800 GPU. We report the average runtime based on tests using 16-frame video clips. The memory footprint is measured with nvidia-smi.
From the inference cost comparison, we observe that compared to the other V2V model, Follow-Your-Canvas (FYC), our model is faster at inference and consumes less VRAM. The main reason is that FYC requires multiple rounds of outpainting to ensure outpainting quality, and it also trains and runs inference in float32. Thanks to our dual-branch structure, Imagine360 achieves good visual quality in a single inference pass.
AnimateDiff and 360DVD have faster inference speeds, which is within expectation, as they are text-guided and image-guided frameworks. In addition, by employing multiple memory optimization techniques (bfloat16, VAE slicing, torch.cuda.empty_cache(), intermediate tensor cleanup, etc.), the results show that the dual-branch structure incurs only a small and affordable increase in memory usage, while achieving better generation quality and fast inference speed compared to the baselines.
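For concreteness, a minimal sketch of these memory optimizations on a diffusers-style video pipeline; the pipeline class and checkpoint path are illustrative placeholders, not our released code:

```python
import torch
from diffusers import AnimateDiffPipeline  # illustrative; other diffusers video pipelines work similarly

# Load weights in bfloat16 to roughly halve weight/activation memory.
pipe = AnimateDiffPipeline.from_pretrained(
    "path/to/checkpoint",            # placeholder checkpoint path
    torch_dtype=torch.bfloat16,
).to("cuda")

# Decode latents slice by slice instead of in one large batch.
pipe.enable_vae_slicing()

with torch.inference_mode():
    frames = pipe(prompt="a 360 panorama of a mountain valley", num_frames=16).frames

# Free cached blocks and intermediate tensors between runs.
torch.cuda.empty_cache()
```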
In terms of training cost, AnimateDiff is trained on WebVid10M dataset using 16 A100 GPUs for 5 epochs. Follow-Your-Canvas (FYC) is trained on Panda70M using 8 A800 GPUs for 50K steps. 360DVD is trained on WEB360 (which includes both RGB and optical flow videos) using a single GPU with ~50GB of VRAM for 100K steps. In contrast, our model is trained on a significantly smaller dataset, containing only ~9K panoramic RGB videos, using 8 A100 GPUs for 20K steps. Despite requiring generally less training data and compute, our method achieves superior 360 video generation quality compared to all baselines, demonstrating both data and compute efficiency.
| | AnimateDiff (I2V) | 360DVD (T2V) | Follow-Your-Canvas (V2V) | Ours (V2V) |
|---|---|---|---|---|
| Inference Runtime (sec) | 122.65 | 93.18 | 381.89 | 182.76 |
| Max VRAM (GB) | 20.83 | 18.06 | 27.99 | 25.59 |
Q.4 Figure 3 overlaps with surrounding text, causing readability issues.
Thanks for pointing this out! We will revise the paper and fix the formatting issues in the final version.
This paper develops a pipeline for generating panoramic videos from a monocular perspective video. Specifically, the input perspective video is projected onto a panoramic canvas, where unobserved regions are outpainted by a finetuned dual-branch video inpainting model. The method comprises three key designs: dual-branch video outpainting, an antipodal mask, and a rotation-aware design. The dual-branch design synergistically outpaints videos in the panoramic and perspective domains, ensuring high-quality global and local visual and motion patterns. The antipodal mask is designed to enhance the motion connection between front and back regions. To align with the pose changes of the input perspective video, the rotation-aware design is employed to augment the training data. Finally, experimental results demonstrate the superior performance of the proposed method over existing video outpainting and text-to-panorama video generation methods.
Strengths and Weaknesses
Strengths: (1) This paper develops a pipeline that can convert an input monocular perspective video into a panoramic 360 video. (2) Antipodal mask explicitly connects regions with motion relevance. (3) This paper is well-organized and easy to follow.
Weaknesses: (1) The technical contribution is limited. This paper is a simple combination of [1-3] that additionally adds a perspective video as input. As shown in Table 2, the most significant design is the dual-branch denoising outpainting, which has already been proposed by [1]. And the rotation-aware design is essentially camera pose augmentation.
(2) The antipodal mask design needs further verification. In Table 2, using the antipodal mask design brings minor improvements. Although Fig. 7 demonstrates the efficacy of the antipodal mask, a single case cannot demonstrate generalization capability. Moreover, I notice that the video in the supplementary (a white SUV driving down a dirt road in a forested mountain valley) also shows strange motion patterns (around the image boundary), which do not yield inverse motion effects. I suggest providing two perspective views, including both the input video and its antipodal view, for better comparison.
[1] Taming stable diffusion for text to 360 panorama image generation [2] 360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model [3] AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Questions
(1) Since MonST3R is employed to estimate camera poses during inference, why not directly use this model to predict camera trajectories from a large collection of monocular videos, and use these predicted camera poses to replace the rotation-aware design during training? By doing this, we can reduce the domain gap, and the predicted camera poses are more realistic.
(2) Please clarify the contributions of this paper.
Limitations
No
Final Justification
The technical contributions are weak, but I raised my score because this is a new task, from input monocular video to 360 video.
Formatting Concerns
No
Response to Reviewer opro
Thanks for the constructive comments and inspiring suggestions. We sincerely apologize for not including more qualitative examples due to the NeurIPS rebuttal policy, but we will add them in the final version of the paper. The comments are addressed in detail as follows.
Q.1 The technical contribution is limited. This paper is a simple combination of [1-3] that additionally adds a perspective video as input. As shown in Table 2, the most significant design is the dual-branch denoising outpainting, which has been proposed by [1]. And the rotation-aware design is essentially camera pose augmentation.
Thanks for the comment. We respectfully disagree with the claim that our method is a simple combination of [1–3] with an added perspective video input. These works address different tasks: [1] is text-to-image, [2] is text-to-360° video, and [3] is image-to-video. None of them solves the task of generating 360 videos from perspective video signals. Moreover, handling video-conditioned 360 video generation is nontrivial and requires careful modeling of dynamic input camera motion and 360 motion structure. To address these challenges, we introduce two new components: (1) the rotation-aware designs to improve robustness to changing camera trajectories, and (2) antipodal attention masking to model reversed camera motion in 360 views. As shown in Table 2, both modules lead to clear gains, especially in 360 motion metrics like EPE and OmniFVD.
While we follow [1] in adopting a dual-branch denoising structure, our pipeline is a smart, cohesive integration of relevant generative components, tailored for perspective-to-360 video generation. The proposed modules are key to addressing the unique challenges of this new setup. Additionally, we contribute a clean and high-quality 360 video dataset with temporal interval annotations, which we believe will benefit the community.
We already included a side-by-side comparison with PanFusion in Appendix C, highlighting our differences in task domain, pipeline design, and temporal modeling. We will move this comparison to the main paper for a clearer comparison. We hope the reviewers and AC will consider the novelty of the task, the framework design, and the dataset contribution when evaluating our work.
Q.2 The antipodal mask design needs further verification. In Table 2, using the antipodal mask brings minor improvements. Also, the SUV case in the supplementary contains strange motion patterns. Suggest providing input and antipodal views for better comparison.
Thanks for the suggestion.
We would like to clarify that the antipodal mask is designed to address a unique challenge in 360 video generation: antipodal regions should undergo reverse motion under camera translation. The purpose of this module is not to directly improve image quality or video smoothness, but rather to provide a geometry-aware bias that facilitates global motion modeling on the sphere. In Table 2, we see modest gains on VBench IQ and MS, which is as expected, since these metrics capture overall appearance and temporal smoothness. We also see non-trivial gains on OmniFVD (257.6 → 204.0) and EPE (2.7169 → 2.5583), which reflect 360 motion correctness, the intended benefit of antipodal modeling, indicating improved motion for 360 video generation.
We also acknowledge that antipodal motion remains a challenging issue. While our method does not fully solve it (SUV case), it provides clear improvements. This is supported by quantitative gains on large evaluation data (Table 2). We strongly agree that visualizing the input and antipodal views is a good idea, and we will include more qualitative video results in the final version to further illustrate the effectiveness of this module.
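For reference, the antipodal correspondence itself is a fixed index remapping on the equirectangular grid (longitude shifted by 180°, latitude mirrored about the equator). Below is a minimal sketch assuming row-major (H, W) latent indexing; the mask construction is illustrative, not our exact implementation:

```python
import torch

def antipodal_indices(h: int, w: int) -> torch.Tensor:
    """For each pixel on an H x W equirectangular grid, return the flat index
    of its antipodal pixel (longitude + 180 deg, latitude negated)."""
    v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    u_ant = (u + w // 2) % w   # shift longitude by half the panorama width
    v_ant = (h - 1) - v        # mirror latitude about the equator
    return (v_ant * w + u_ant).reshape(-1)

# Illustrative attention bias: let each token attend to its antipodal counterpart.
h, w = 32, 64
idx = antipodal_indices(h, w)
mask = torch.zeros(h * w, h * w, dtype=torch.bool)
mask[torch.arange(h * w), idx] = True
```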
Q.3 Why not directly use MonST3R to predict camera trajectories from a large collection of monocular videos, and use these predicted camera poses to replace the rotation-aware design during training? By doing this, we can reduce the domain gap, and the predicted camera poses are more realistic.
Thanks for the insightful suggestion. Our current approach employs random trajectory sampling during training to expose our model to diverse trajectory patterns and improve robustness. Sampling from a camera trajectory library collected by MonST3R on massive real videos seems a promising alternative, as it better reflects the real-world camera motion distribution and may reduce the domain gap. While collecting large-scale real videos and extracting camera trajectories from them is hard to accomplish with our current resources in a short time, we will continue to implement and verify this solution through experiments. The results will be reported in the final version.
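For illustration, the random trajectory sampling we refer to can be sketched as below; the step ranges and clipping bounds are placeholder values for exposition, not the exact settings used in training:

```python
import numpy as np

def sample_rotation_trajectory(num_frames=16, max_yaw_step=2.0, max_pitch_step=1.0, seed=None):
    """Sample a smooth random (yaw, pitch) trajectory in degrees by accumulating
    small per-frame angular steps, so the virtual perspective camera drifts
    gradually across the panorama."""
    rng = np.random.default_rng(seed)
    yaw = np.cumsum(rng.uniform(-max_yaw_step, max_yaw_step, size=num_frames))
    pitch = np.clip(
        np.cumsum(rng.uniform(-max_pitch_step, max_pitch_step, size=num_frames)),
        -30.0, 30.0,  # keep pitch within a plausible range
    )
    return yaw, pitch
```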
Thanks for the response. It addresses most of my concerns. It is hoped that the final version could include more discussions of limitations (antipodal motion) and more qualitative comparisons (MonST3R's trajectory, input video + inverse antipodal view).
Thanks for your constructive comments. We'll incorporate the discussed qualitative results and experiments in the final version.
If you have further concerns or suggestions, please feel free to comment and we're happy to address them.
This paper presents Imagine360, a new method for generating 360° panoramic videos from standard perspective video inputs. The core contributions include a dual-branch denoising architecture (for perspective and panoramic views), an antipodal attention mask to model spherical motion consistency, and rotation-aware designs to handle diverse camera poses. The authors also introduce a new curated dataset, YouTube360, consisting of 10K high-quality 360° video clips. Extensive experiments (including comparisons against SOTA methods and ablation studies) demonstrate clear improvements in both perceptual quality and motion coherence.
Strengths and Weaknesses
Strengths
- The task of perspective-to-360° video generation is both new and practically relevant. The paper clearly motivates the gap between current methods (which rely on panoramic guidance) and the proposed task.
- The dual-branch video denoising framework, along with cross-domain attention and the antipodal mask, is a clever adaptation of recent generative modeling ideas to the spherical video domain. The use of antipodal attention for capturing global motion dependencies is well-motivated. The proposed rotation-aware training and inference make the method more robust on general inputs.
- The experimental results are extensive. In Table 1, the proposed method outperforms all baselines across nearly every metric. The ablations in Table 2 show that each proposed module contributes meaningfully to the performance of the full method. Qualitative comparisons are also provided to show the improvement of the proposed method.
- New dataset. This paper introduces a new dataset, YouTube360, which addresses the training data bottleneck in this domain. The dataset is a valuable resource, especially given the lack of clean, high-quality 360° training videos for generative models.
Weaknesses
- The dual-branch design mostly follows PanFusion, which makes the contribution of this part a bit weak.
- The dual-branch architecture and rotation-aware modules likely add computational overhead. A discussion or table showing runtime comparisons or inference speed would be valuable for understanding the practical trade-offs.
- While the new dataset is helpful for training, it remains unclear how the model performs on in-the-wild or low-quality user-generated videos. A discussion or stress test would make the paper’s robustness claims more convincing.
Questions
Please refer to the questions in Weaknesses. In addition,
- The generated videos appear to flicker when viewed closely; a discussion of why this happens and how it could be improved would be beneficial.
- Why does this paper choose MonST3R for camera pose estimation? How robust is it, and what about using more robust models such as MegaSAM?
Limitations
yes
Final Justification
The rebuttal has addressed my previous questions and I would like to vote for accept.
Formatting Concerns
No
Response to Reviewer vJmB
We thank the reviewer for acknowledging our pipeline as "a clever adaptation of recent generative modeling ideas to the spherical video domain" and our data contribution as one that "addresses the training data bottleneck" and is "valuable". The concerns and questions are addressed in detail as follows.
Q.1 The dual-branch design mostly follows PanFusion, which makes the contribution of this part a bit weak.
Thanks for the comment. As detailed in Appendix C, we provide a side-by-side comparison between our work and PanFusion. Our work differs significantly in several key aspects, including the task domain, the overall pipeline design, and the introduction of temporal modeling.
While we adopt a dual-branch denoising structure, our method goes beyond PanFusion through a cohesive integration of antipodal-aware attention and rotation-aware camera modeling. These designs are specifically tailored for the novel task of perspective-to-360° video generation, which has not been systematically explored before. The effectiveness of each component is validated through ablations (Table 2), and qualitative results demonstrate good generalization to in-the-wild scenarios.
In addition to the method, we also contribute a clean and high-quality 360 video dataset with temporal interval annotations, which we believe will be valuable for future research.
We will move the PanFusion comparison to the main paper to make the differences more prominent. We hope the reviewers and AC will consider the novelty of the task, the thoughtful framework design choices, and the data contribution when evaluating the overall contribution of our work.
Q.2 The dual-branch architecture and rotation-aware modules likely add computational overhead. Add a discussion or table showing runtime comparisons or inference speed for understanding the practical trade-offs.
Thanks for the advice. We provide a runtime breakdown table and an efficiency comparison table in terms of runtime and VRAM consumption below. All experiments are conducted on a single NVIDIA A800 GPU. We report the average runtime based on tests using 16-frame video clips. GPU memory usage is monitored using nvidia-smi.
In the runtime breakdown table, we observe that the external camera estimation module from the rotation-aware designs indeed takes up a considerable amount of time. While optimizing the inference speed of the off-the-shelf camera pose estimator is beyond the scope of this work, we will seek to upgrade this module with a faster and more robust camera estimator.
The dual-branch architecture, on the other hand, eliminates the need for multi-round inpainting as required in Follow-Your-Canvas (FYC), thus reducing overall inference time. With memory optimization techniques applied, our method achieves a peak VRAM usage of 25.59 GB, which is lower than FYC’s 27.99 GB.
| | Model Load | Input IO | Output IO | Camera Estimation | Inference |
|---|---|---|---|---|---|
| Runtime (sec) | 13.29 | 0.36 | 1.74 | 47.88 | 182.76 |
| | AnimateDiff (I2V) | 360DVD (T2V) | Follow-Your-Canvas (V2V) | Ours (V2V) |
|---|---|---|---|---|
| Inference Runtime (sec) | 122.65 | 93.18 | 381.89 | 182.76 |
| Max VRAM (GB) | 20.83 | 18.06 | 27.99 | 25.59 |
Q.3 Add a discussion or stress test to strengthen the model's robustness claims.
Thanks for the constructive suggestion. We agree that demonstrating robustness under challenging, real-world cases is important. Our current quantitative evaluation (Table 1) already includes a wide range of diverse, in-the-wild scenarios, such as randomly sampled videos from RE10K and 360-1M, as well as videos generated by CogVideoX using random prompts from GPT-4o. To further strengthen our robustness claims, we will curate a set of harder corner cases and include a dedicated stress test and analysis in the final version.
Q.4 A discussion on video flickering and how this could be improved.
Thanks for the comment. The video flickering primarily stems from our backbone, AnimateDiff, which is known to produce temporal flicker in many cases. Despite this limitation, our method improves upon AnimateDiff in both visual quality and temporal consistency, as shown in Table 1. To further address this issue, we plan to upgrade the backbone to a more advanced video generation model (Wan2.1/2.2), with the technical modifications required to migrate from the UNet backbone to a DiT architecture. This update is expected to significantly reduce flickering and improve overall temporal coherence. We will update our results with the upgraded backbone in the final version.
Q.5 Why choose MonST3R for camera pose estimation? How robust is it, and what about using more robust models such as MegaSAM?
Thanks for the comment. We chose MonST3R because it was one of the most robust camera pose estimation models when we initially developed our method. Empirically, we find that it produces accurate and stable pose estimates in most test cases. As camera pose estimation continues to advance, MonST3R can be seamlessly replaced with more robust models, such as MegaSAM, helping our method handle even more in-the-wild scenarios. We will upgrade MonST3R to MegaSAM and provide updated results and codebase in the final version.
I thank the authors for the rebuttal, which has addressed my previous concerns. Thus, I raise my score. Please incorporate the promised discussions and results in the revised version of the paper and codebase.
A minor comment: using nvidia-smi to measure GPU memory usage might not be very precise; it would be more reliable to use the built-in CUDA memory management utilities provided by PyTorch (if the code was written in PyTorch).
Thank you for the valuable comment!
Indeed, we noticed that the built-in PyTorch utilities, such as torch.cuda.memory_allocated() and torch.cuda.memory_reserved(), report lower memory usage compared to nvidia-smi. This is because nvidia-smi reports additional overhead beyond tensor allocations, including CUDA context initialization, cuDNN/cuBLAS workspace buffers, and other runtime memory usage that PyTorch does not explicitly track.
In practice, when loading large models, users often rely on the nvidia-smi reported memory usage, as it more accurately reflects the total GPU memory usage and helps prevent OOM errors. Therefore, we chose to report the nvidia-smi values in our results.
That said, we also measured the memory usage using torch.cuda.memory_allocated() as reported in the table below, and we will include these numbers in the final version for reference.
| Max VRAM (GB) | AnimateDiff (I2V) | 360DVD (T2V) | Follow-Your-Canvas (V2V) | Ours (V2V) |
|---|---|---|---|---|
| nvidia-smi | 20.83 | 18.06 | 27.99 | 25.59 |
| torch.cuda.memory_allocated() | 18.08 | 14.68 | 24.93 | 20.86 |
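A minimal sketch of how both readings can be obtained (the helper names below are illustrative):

```python
import subprocess
import torch

def torch_allocated_gb() -> float:
    """Tensor memory currently tracked by PyTorch's caching allocator."""
    return torch.cuda.memory_allocated() / 1024**3

def nvidia_smi_used_gb(gpu_index: int = 0) -> float:
    """Total memory in use on the device as reported by nvidia-smi, including
    the CUDA context and library workspaces that PyTorch does not track."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits", "-i", str(gpu_index)]
    )
    return float(out.decode().strip()) / 1024  # MiB -> GiB

# Example usage after running inference:
# print(torch_allocated_gb(), nvidia_smi_used_gb())
```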
Thanks again for the insightful feedback!
This paper introduces a novel framework that converts standard videos into immersive 360° content. Key innovations include a dual-branch denoising architecture with spherical attention, an antipodal mask for motion modeling, and rotation-aware processing. Besides, the authors also introduce a new 360° video dataset. Experiments demonstrate superior performance over baselines in metrics and user studies, with ablations validating each component.
Strengths and Weaknesses
Strengths:
- The proposed task of perspective-to-360° video generation is new, bridging accessibility with high-end content creation for immersive media. The method demonstrates generalization to in-the-wild videos, significantly boosting its practical utility.
- Using a cross-domain spherical attention to model reversed motion patterns sounds new and is shown to be effective by the ablations.
- The rotation-aware modules for in-the-wild videos with different camera poses make sense and enhance generation robustness, as shown in the ablations.
- Data contribution: provides a high-quality 360° video dataset with highlight time intervals to the community.
Weaknesses:
- While inference speed is mentioned, no clear comparison to baseline efficiency or resource requirements is available.
- Missing details of metric computation: EPE is originally computed in depth estimation. How is it used here to evaluate motion quality? Also, how exactly is the OmniFVD extended from OmniFID here?
- References [41] and [42] both point to the same entry in the reference section.
Questions
Questions:
- When MonST3R’s camera pose estimation fails or is inaccurate, does it manifest as noticeable artifacts? Can you discuss possible solutions and illustrate the robustness of the system?
- How does the inference speed compare to baseline efficiency and resource requirements?
- Missing details of metric computation: EPE is originally computed in depth estimation. How is it used here to evaluate motion quality? Also, how exactly is the OmniFVD extended from OmniFID here?
Limitations
yes
Final Justification
I appreciate the authors for the rebuttal, which has addressed my concerns. I will keep my score as positive.
Formatting Concerns
no formatting concerns
Responses to Reviewer 6m1K
We thank the reviewer for the positive feedback and constructive suggestions. We address the comments as below:
Q.1 No clear comparison to baselines in terms of efficiency and resource requirements.
Thanks for the advice. We add a runtime efficiency breakdown table below and an efficiency comparison table in terms of runtime and VRAM consumption. All experiments are conducted on a single NVIDIA A800 GPU. We report the average runtime based on tests using 16-frame video clips. For all methods listed in the comparison table, only inference time is measured. GPU memory usage is monitored using nvidia-smi.
From the efficiency comparison, we observe that compared to the other V2V model, Follow-Your-Canvas (FYC), our model is faster at inference and consumes less VRAM. The main reason is that FYC requires multiple rounds of outpainting to ensure outpainting quality, and it also trains and runs inference in float32. Thanks to our dual-branch structure, Imagine360 achieves good visual quality in a single inference pass.
AnimateDiff and 360DVD have faster inference speeds, which is within expectation, as they are text-guided and image-guided frameworks. In addition, by employing multiple memory optimization techniques (bfloat16, VAE slicing, torch.cuda.empty_cache(), intermediate tensor cleanup, etc.), our dual-branch structure only modestly increases VRAM usage.
In the runtime breakdown table of our method, we observe that the external camera estimation module takes up a considerable amount of time. Although optimizing the inference speed of the off-the-shelf camera pose estimator is beyond the scope of this work, we will seek to upgrade this module with a faster and more robust camera estimator. Both efficiency tables will be added to the final version with a detailed discussion.
| | Model Load | Input IO | Output IO | Camera Estimation | Inference |
|---|---|---|---|---|---|
| Runtime (sec) | 13.29 | 0.36 | 1.74 | 47.88 | 182.76 |
| | AnimateDiff (I2V) | 360DVD (T2V) | Follow-Your-Canvas (V2V) | Ours (V2V) |
|---|---|---|---|---|
| Inference Runtime (sec) | 122.65 | 93.18 | 381.89 | 182.76 |
| Max VRAM (GB) | 20.83 | 18.06 | 27.99 | 25.59 |
Q.2 Missing details of EPE and OmniFVD metric computation.
Sorry for the confusion, and thanks for pointing this out. The EPE metric is adapted from optical flow evaluation to assess motion accuracy. We compute optical flow for both the generated video and the ground truth using PanoFlow, and define EPE as the mean pixel-wise Euclidean distance between the two flow fields: $\mathrm{EPE} = \frac{1}{N}\sum_{i=1}^{N}\lVert \mathbf{f}_i^{\mathrm{gen}} - \mathbf{f}_i^{\mathrm{gt}} \rVert_2$, where $N$ is the number of pixels.
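A minimal sketch of this computation, assuming flow tensors of shape (T, H, W, 2) already extracted with PanoFlow:

```python
import torch

def epe(flow_gen: torch.Tensor, flow_gt: torch.Tensor) -> torch.Tensor:
    """End-point error: mean Euclidean distance between two optical-flow fields.
    Both tensors are expected to share a shape ending in 2, e.g. (T, H, W, 2)."""
    return torch.linalg.norm(flow_gen - flow_gt, dim=-1).mean()
```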
For OmniFVD, we follow OmniFID by projecting the 360 video into six cubemap views, compute FVD on each view, and report the average as the final OmniFVD score. We will clarify these details in the final version.
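A schematic sketch of this protocol; `equirect_to_cubemap` and `fvd` are placeholders for the actual projection routine and FVD implementation:

```python
import numpy as np

FACES = ["front", "back", "left", "right", "up", "down"]

def omni_fvd(videos_gen, videos_gt, equirect_to_cubemap, fvd):
    """OmniFVD: project each 360 video to six cubemap face videos, compute FVD
    per face between the generated and ground-truth sets, then average.

    equirect_to_cubemap(video) is assumed to return a dict keyed by FACES;
    fvd(set_a, set_b) is any standard FVD implementation."""
    gen_faces = [equirect_to_cubemap(v) for v in videos_gen]
    gt_faces = [equirect_to_cubemap(v) for v in videos_gt]
    per_face = [
        fvd([v[f] for v in gen_faces], [v[f] for v in gt_faces]) for f in FACES
    ]
    return float(np.mean(per_face))
```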
Q.3 When MonST3R’s camera pose estimation fails or is inaccurate, does it manifest as noticeable artifacts? Can you discuss possible solutions and illustrate the robustness of the system?
Thanks for the comment. Empirically, we find that MonST3R produces accurate and stable pose estimations in most test cases. In rare and challenging corner cases, estimating precise camera poses remains an open problem and is beyond the scope of this paper. Nevertheless, as camera pose estimation continues to advance, MonST3R can be seamlessly replaced with more powerful models, enabling our method to handle even more in-the-wild scenarios robustly.
Q.4 The [41] and [42] both refer to the same citation in the reference section.
Thanks for pointing this out. We will revise the paper and fix this in the final version.
I appreciate the authors for the rebuttal, which has addressed my concerns. I will keep my score as positive.
Thanks for the positive feedback. We'll add the discussed comparison and discussions in the revised version.
Dear Reviewers,
Thank you again for your thoughtful and constructive reviews.
The authors have now submitted their rebuttals. I encourage you to read the responses carefully and engage in discussion. Please aim to complete the discussions by August 6th.
Your AC
All reviewers agree that the idea of perspective-to-360° video generation is novel and that the technical contribution and evaluation are mostly solid. The AC recommends accepting the paper, but asks the authors to carefully address the concerns raised by the reviewers in the camera-ready version, including the computation and memory usage comparison, the discussion of video flickering, the new results, etc.