PaperHub
Score: 7.8 / 10
Poster · 4 reviewers
Ratings: 5, 5, 5, 4 (min 4, max 5, std 0.4)
Confidence: 4.0
Novelty: 2.8 · Quality: 2.8 · Clarity: 3.5 · Significance: 3.0
NeurIPS 2025

PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis

OpenReview | PDF
Submitted: 2025-05-09 · Updated: 2025-10-29

Abstract

Keywords
Pose estimation; Video generation; Diffusion model

Reviews and Discussion

Review
Rating: 5

This paper proposes PoseCrafter, a framework designed for extreme pose estimation where input image pairs exhibit small or no overlap—a scenario in which existing feature-based and learning-based methods often fail due to the lack of reliable correspondences.

PoseCrafter follows the general strategy of InterPose by leveraging pretrained video generation models to synthesize intermediate frames for downstream pose estimation. Beyond standard video interpolation, it further incorporates a pose-conditioned novel view synthesis model to generate more geometrically plausible intermediate frames. To ensure consistency with the input views, the framework introduces a Feature Matching Selector (FMS) that identifies the most informative synthesized frames based on RANSAC inlier counts.

Extensive experiments on four benchmarks—Cambridge Landmarks, ScanNet, DL3DV-10K, and NAVI—demonstrate that PoseCrafter consistently outperforms previous state-of-the-art methods in extreme pose estimation, without requiring additional training or supervision.

Strengths and Weaknesses

Strengths

  • The use of a pose-conditioned video generation model, combined with feature matching for frame selection, is both intuitive and effective.
  • The training-free nature of the approach makes it highly practical and accessible for a wide range of users.
  • The method is evaluated on four diverse datasets with yaw ranges selected to represent small and no-overlap scenarios. PoseCrafter consistently shows significant improvements over strong baselines such as DUSt3R and InterPose. The ablation studies are thorough and well-structured.
  • The paper is clearly written and easy to follow.

Weaknesses

  • Limited novelty in algorithmic components: While the overall system design is effective, the core components—video interpolation, pose-conditioned view synthesis, and RANSAC-based frame selection—are based on existing techniques.
  • Computational cost not addressed: The pipeline involves several sequential stages, including video interpolation, novel view synthesis, feature extraction, and RANSAC-based filtering, which may lead to high computational overhead. A runtime analysis or efficiency comparison would help clarify the method’s practicality for real-world use.

Questions

A runtime analysis or comparison would help assess its practical applicability.

Limitations

Yes

Final Justification

This paper is technically solid and well validated. In terms of algorithmic novelty, I believe it is above the acceptance bar. Overall, I find it to be a valuable contribution that the community would be interested in reading.

Formatting Issues

No

Author Response

We thank the reviewer for acknowledging the effectiveness and practicality of our approach. Following the reviewer's suggestions, we provide detailed responses to each concern below.

Q1: Clarification about novelty.

Although PoseCrafter builds on existing modules such as video interpolation, pose-conditioned novel view synthesis, and RANSAC filtering, its novelty does not lie in any single component. Our main contribution is the adaptation and integration of these components to holistically address the under-explored and challenging problem of pose estimation with small or no visual overlap.

  • We are the first to propose a hybrid video generation (HVG) module that combines video interpolation and pose-conditioned novel view synthesis to generate high-quality intermediate frames for subsequent pose estimation.
  • We introduce a Feature Matching Selector (FMS) that deterministically selects the most informative frames suitable for pose estimation. This eliminates the need for expensive statistical self-consistency scoring in existing works.
  • Our method requires no additional training or fine-tuning, yet achieves significant performance improvements over existing methods. This not only greatly reduces training costs but also makes our approach a flexible framework, where components can be potentially replaced with more effective alternatives in the future.

We believe that the overall system design and its demonstrated impact across four benchmarks reflect a meaningful and practical innovation, which should be beneficial for the community and related researchers.

Q2: Memory Cost and Runtime Discussion.

We thank the reviewer for their important points regarding runtime and computational efficiency. To address this concern, we provide a detailed comparison of the runtime and memory usage of the different stages of our pipeline and of InterPose. For a fair evaluation, we divide each pipeline into two stages: video generation and pose estimation. The video synthesis stage in both our method and InterPose is referred to as video generation. The subsequent process of obtaining the final pose, using the self-consistency score in InterPose or the Feature Matching Selector in our method, is referred to as pose estimation.

Since the original implementation of InterPose is not publicly available, we reimplemented its pipeline according to the details described in the paper. We refer to this reproduction as InterPose‡ in the subsequent comparison. The results are summarized in the table below.

We can see that our approach significantly reduces the pose estimation time while achieving comparable or even higher pose accuracy. This improvement comes from the fact that InterPose generates 8 video sequences for each image pair and samples 11 frame subsets from each video to compute the self-consistency score. In contrast, our approach employs a leaner and more efficient pipeline: we generate only a single hybrid video and apply a deterministic feature-based frame selection strategy.

Although our approach has a slightly higher memory footprint during the video generation phase, the pose estimation phase remains lightweight and practical in terms of memory and speed.

We will include this analysis in the revised manuscript to better demonstrate the computational efficiency and scalability of our approach.

| Method | Video Generation Runtime | Pose Estimation Runtime | Video Generation Memory Cost | Pose Estimation Memory Cost |
| --- | --- | --- | --- | --- |
| InterPose‡ | 3.2 min | 20.29 min | 14.6 GB | 3.1 GB |
| Ours | 3.8 min | 0.18 min | 22.8 GB | 3.6 GB |
Comment

I appreciate the authors for addressing my questions.

While I still believe the novelty of the paper is somewhat limited, I will maintain my rating as accept. The experiments are well-structured and solid, and more importantly, the paper demonstrates that a straightforward composition method, when carefully integrated, can lead to significant improvements—something I believe the community will appreciate.

Comment

We sincerely thank you for your thoughtful and constructive feedback and positive assessment of our work. Although our pipeline is simple, it is effective, and we are pleased that it may benefit the community. We look forward to expanding and improving our framework in future studies. Thank you again for your valuable support.

Comment

Dear Reviewer hdsC,

The authors have replied to your comments. Please let us know if this response addresses your concern about Novelty.

Best,

AC

Comment

Dear AC and the authors,

This paper is technically solid and well validated. In terms of algorithmic novelty, I believe it is above the acceptance bar. Overall, I find it to be a valuable contribution that the community would be interested in reading. As I mentioned in the previous comment, I am keeping my final score as ACCEPT.

Thank you.

Best regards, hdsC

Comment

Dear Reviewer,

The authors have replied to your concerns, and it would be very helpful if you could engage in a discussion to clarify your concerns.

Best,

AC

Review
Rating: 5

The paper presents a novel method for pose estimation under extreme viewpoint changes. It builds upon a recent state-of-the-art method that addresses the wide-baseline problem by using generative models to synthesize novel views that close the viewpoint gap between the query images. It introduces a two-step method that uses DynamiCrafter as a prior on the interpolated trajectory, and then refines the generated views with ViewCrafter and a keyframe selection scheme to obtain high-fidelity images. Lastly, a robust feature matching scheme is used to select only reliable images before applying DUSt3R for the final prediction.

Strengths and Weaknesses

Strengths:

  • The experimental evaluation is comprehensive and it shows convincing results with nice improvements over SOTA
  • The paper is well written and easy to follow.
  • The proposed method is simple, meant in a positive way. There is no substantial novelty and it is a nicely engineered pipeline, but it shows remarkable results and is a nice application of generative models for closing the wide-baseline gap in pose estimation tasks.

Weaknesses:

  • The method relies on the video generated by DynamiCrafter which, being a diffusion-based model, is subject to variability across plausible hypotheses and might become an early point of failure of the pipeline.
  • Given that one of the claims "against" the prior work InterPose is the cost of generating multiple video hypotheses, the theme of efficiency should be discussed. Since the paper uses multiple foundation models for video generation plus a refinement scheme, as a reader I am unable to estimate the runtime or memory requirements. Including such a discussion is fundamental, in my opinion. Even if it takes minutes per image pair, that is fine, but it should be mentioned.
  • The ablation on the choice of feature matcher does not cover a comprehensive set of methods.

Questions

  • Although I acknowledge it is concurrent work, I would be very curious about the performance of the new VGGT [3] model in this context. Supposedly it improves substantially upon the DUSt3R baseline. I think the authors should consider adding it either as a replacement for DUSt3R in the proposed pipeline or as a standalone competitor. I do not think it is mandatory, but it could make the paper more interesting.
  • I am curious about the choice of ORB descriptors. Did the authors experiment with a dense comparison from a backbone as in [2], or with a pairwise matcher such as RoMa [4] or LoFTR [5]?
  • Some related works should be cited. [1, 2] employ a similar render-and-compare framework, although not for the wide-baseline task. In particular, the feature matching algorithm is very similar to the one employed in [2]. [1] Zhang, Zichao, Torsten Sattler, and Davide Scaramuzza. "Reference pose generation for long-term visual localization via learned features and view synthesis." International Journal of Computer Vision 129.4 (2021): 821-844. [2] Trivigno, Gabriele, et al. "The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. [3] Wang, Jianyuan, et al. "VGGT: Visual Geometry Grounded Transformer." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025. [4] Edstedt, Johan, et al. "RoMa: Robust Dense Feature Matching." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. [5] Sun, Jiaming, et al. "LoFTR: Detector-Free Local Feature Matching with Transformers." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

Limitations

Yes

Final Justification

I did not find any major concerns in the paper in the first place. The additional ablations and experiments on feature matching and inference time/memory footprint are useful for readers to better understand the problem and the proposed solution, strengthening the paper. I think the contributions are solid and experimental results convincing, thus I raise my score to 5.

Formatting Issues

Author Response

We thank the reviewer for the thoughtful and constructive feedback. We give our response to each concern in the following:

Q1: Would the variability of the diffusion-based DynamiCrafter cause an early point of failure?

We thank the reviewer for raising this important concern. To evaluate the impact of generative stochasticity in DynamiCrafter, we conducted experiments on the Cambridge Landmarks dataset under yaw changes of [50°–65°], using 5 different random seeds to generate intermediate video sequences. The results are shown in the following table. We find that our method consistently outperforms the other methods, with only a small range of variation across seeds. This indicates that while some variability is inherent to the diffusion-based generative process, our pipeline remains robust and stable across different runs. We will include this analysis in the revised version to better illustrate the reliability of our method.

| Method | Input | MRE↓ | R@5° | R@15° | R@30° | AUC₃₀↑ |
| --- | --- | --- | --- | --- | --- | --- |
| DUSt3R | Pair | 18.14 | 40.34 | 71.25 | 82.99 | 61.98 |
| InterPose‡ (w/o SCS) | DynamiCrafter | 16.11 | 42.70 | 75.70 | 87.35 | 65.72 |
| InterPose‡ | DynamiCrafter | 13.61 | 51.81 | 81.50 | 83.30 | 70.47 |
| Ours (w/o FMS) | Hybrid Video | 13.37 ± 0.29 | 54.57 ± 0.35 | 88.43 ± 0.76 | 92.34 ± 0.56 | 74.65 ± 1.32 |
| Ours | Hybrid Video | 11.46 ± 0.03 | 55.38 ± 0.54 | 90.94 ± 1.11 | 94.54 ± 0.86 | 77.49 ± 0.51 |
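As an illustration of this protocol, a minimal sketch is given below; `run_pipeline` is a hypothetical entry point that returns the metrics for a single seed, not our released code.

```python
import numpy as np

def seed_robustness(image_pairs, seeds=(0, 1, 2, 3, 4)):
    """Run the full pipeline once per random seed and report mean/std of each
    metric, mirroring the seed-averaging protocol used in the table above.
    `run_pipeline` is a hypothetical placeholder for the actual pipeline."""
    per_seed = [run_pipeline(image_pairs, seed=s) for s in seeds]  # e.g. {"MRE": ..., "AUC30": ...}
    return {k: (float(np.mean([m[k] for m in per_seed])),
                float(np.std([m[k] for m in per_seed])))
            for k in per_seed[0]}
```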

Q2: Runtime and Memory Cost Discussion.

We thank the reviewer for highlighting the importance of efficiency. Whereas InterPose generates 8 video sequences for each image pair and samples 11 frame subsets from each video to compute the self-consistency score, our approach adopts a more streamlined and efficient pipeline. Specifically, we generate only a single hybrid video and apply a deterministic and efficient frame selection strategy based on feature matching.

To address the reviewer's concerns, we provide a detailed comparison of runtime and memory usage at different stages. For fair evaluation, we divide the pipeline into two stages: video generation and pose estimation. The video synthesis stage in both our method and InterPose is referred to as video generation. The subsequent process using the self-consistency score in InterPose or the feature matching selector in our method is referred to as pose estimation. Since the original implementation of InterPose is not publicly available, we re-implemented its pipeline based on the details provided in the paper. We refer to this reproduction as InterPose‡ for all subsequent comparisons. As shown in the table below, our approach significantly reduces pose estimation time while achieving comparable or even higher pose accuracy. Although our approach incurs a slightly higher memory cost during video generation, it remains affordable for practical usage. We will include this analysis in the revised version to clarify the computational efficiency and scalability of our approach.

| Method | Video Generation Runtime | Pose Estimation Runtime | Video Generation Memory Cost | Pose Estimation Memory Cost |
| --- | --- | --- | --- | --- |
| InterPose‡ | 3.2 min | 20.29 min | 14.6 GB | 3.1 GB |
| Ours | 3.8 min | 0.18 min | 22.8 GB | 3.6 GB |

Q3: Evaluation with RoMa [4] and LoFTR [5].

To evaluate the impact of different feature matching strategies, we conducted experiments on the Cambridge Landmarks dataset under yaw changes of [50°–65°], using RoMa and LoFTR as alternatives to ORB in our Feature Matching Selector (FMS).

Interestingly, we observed that each matcher led to the selection of a different subset of intermediate frames, ultimately affecting the pose estimation outcome. As shown in the table below, both RoMa and LoFTR yield inferior results compared to ORB. We believe the likely reason lies in the difference in robustness across methods. Learning-based methods such as RoMa and LoFTR may perform less reliably on our synthesized intermediate views, since they are trained on high-quality natural images. In contrast, ORB is more appropriate in our setting due to its strong robustness and higher efficiency. We will include these comparative results and analysis in the revised version to further demonstrate the robustness and efficiency of our frame selection strategy.

| Method | MRE↓ | R@5° | R@15° | R@30° | AUC₃₀↑ |
| --- | --- | --- | --- | --- | --- |
| RoMa | 13.89 | 53.13 | 88.89 | 91.67 | 74.76 |
| LoFTR | 12.41 | 54.51 | 89.24 | 93.40 | 76.30 |
| ORB | 11.40 | 55.21 | 89.93 | 93.75 | 77.41 |
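For concreteness, a minimal sketch of this ORB-plus-RANSAC selection idea using OpenCV is given below; the exact scoring rule (here, the smaller of the inlier counts against the two input views) and the `top_k` value are illustrative assumptions rather than our exact implementation.

```python
import cv2
import numpy as np

def ransac_inliers(img_a, img_b, orb, matcher):
    """Count RANSAC-verified ORB matches between two images."""
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0
    matches = matcher.match(des_a, des_b)
    if len(matches) < 8:
        return 0
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])
    # Keep only matches consistent with a RANSAC-estimated fundamental matrix.
    _, mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, 3.0, 0.99)
    return int(mask.sum()) if mask is not None else 0

def select_frames(start_img, end_img, generated_frames, top_k=4):
    """Deterministically keep the generated frames most consistent with BOTH
    input views (illustrative scoring: min of the two inlier counts)."""
    orb = cv2.ORB_create(nfeatures=2000)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    scores = [min(ransac_inliers(start_img, f, orb, matcher),
                  ransac_inliers(end_img, f, orb, matcher))
              for f in generated_frames]
    keep = sorted(np.argsort(scores)[::-1][:top_k])
    return [generated_frames[i] for i in keep]
```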

Q4: How would VGGT [3] perform in our context?

We sincerely thank the reviewer for the valuable suggestions. To evaluate the performance of VGGT in our pipeline, we conducted comparative experiments on the Cambridge Landmarks dataset under yaw changes of [50°–65°].

We first evaluated the two models in challenging scenarios where only image pairs with small or no overlap are given as input. As shown in the second and fourth rows of the table below (counting the header row), DUSt3R consistently outperforms VGGT on such small- or no-overlap data.

We then integrate these two models into our pipeline and evaluate their performance. As shown in the third and fifth rows, both configurations achieve significant improvements over their directly estimated counterparts. The version using DUSt3R achieves a higher accuracy than the version using VGGT. These results demonstrate that our method is compatible with different pose estimators and consistently enhances their performance.

Compared to VGGT, DUSt3R is better suited for pose estimation on small or non-overlapping image pairs. Furthermore, our method also yields more noticeable improvements when applied to DUSt3R. Although we believe that fine-tuning VGGT on small-overlap data or adjusting the hyperparameters of our pipeline could further improve the performance of VGGT, these efforts are beyond the scope of this work and rebuttal. We plan to explore them in future work.

| Method | Input | MRE↓ | R@5° | R@15° | R@30° | AUC₃₀↑ |
| --- | --- | --- | --- | --- | --- | --- |
| DUSt3R | Pair | 18.14 | 40.34 | 71.25 | 82.99 | 61.98 |
| Ours (with DUSt3R) | Hybrid Video | 11.40 | 55.21 | 89.93 | 93.75 | 77.41 |
| VGGT | Pair | 20.17 | 40.00 | 70.17 | 82.29 | 60.54 |
| Ours (with VGGT) | Hybrid Video | 17.88 | 42.43 | 84.40 | 85.76 | 65.15 |

Q5: Related works and citations.

We thank the reviewer for pointing out the relevant work [1, 2, 3, 4, 5]. Among them, [1, 2, 4, 5] explore different feature matching strategies and are highly relevant to our Feature Matching Selector module. We will include proper citations and discussion of these works in the revised version to better contextualize our contribution. Furthermore, we have already conducted experimental analyses of the matching methods proposed in [4] (RoMa) and [5] (LoFTR), as detailed in our response to Q3. These empirical comparisons will also be incorporated into the main paper. VGGT [3] is a recently proposed method designed for camera pose estimation and is also highly relevant to our task. We have included comparative experiments using VGGT in our response to Q4, and we will add the corresponding results and citations in the revised version.

[1] Zhang, Zichao, Torsten Sattler, and Davide Scaramuzza. "Reference pose generation for long-term visual localization via learned features and view synthesis." International Journal of Computer Vision 129.4 (2021): 821-844.

[2] Trivigno, Gabriele, et al. "The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[3] Wang, Jianyuan, et al. "Vggt: Visual geometry grounded transformer." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.

[4] Edstedt, Johan, et al. "RoMa: Robust dense feature matching." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[5] Sun, Jiaming, et al. "LoFTR: Detector-free local feature matching with transformers." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.

Comment

I thank the authors for the reply, for the effort of experimenting with novel architectures such as VGGT, and for providing additional ablations on the feature matching models. Although I find it counterintuitive that ORB still outperforms learned features on synthetic images, I do not think this diminishes the value of the paper.

Comment

We sincerely thank the reviewer for the positive and thoughtful feedback, and we appreciate your recognition of our efforts. We will include the suggested ablations and analyses in the revised version.

Regarding the observation that ORB outperforms learned features on synthetic data, we agree that this is a somewhat counterintuitive yet interesting result. A possible explanation is that there may still be a distribution gap between the training data of learned features and the synthesized images used in our experiments. Due to the limited time available during the rebuttal phase, we were unable to conduct a deeper analysis of this phenomenon, but we plan to explore it further in future work.

Thank you again for your constructive insights.

Comment

Dear Reviewer 6p5A,

Today is the last day to engage in the discussion and the authors have replied to your concerns.

Best,

AC

Review
Rating: 5

This paper introduces PoseCrafter, a novel approach to improve extreme-view pose estimation using video models. Given a pair of input images with minimal visual overlap, PoseCrafter combines a video interpolation model and a pose-conditioned novel view synthesis model to provide informative context for pose estimation methods, e.g., DUSt3R. To select the most informative generated video frames, PoseCrafter uses feature matching between the generated frames and the input images; the frames with the most inlier matches are selected for pose estimation. The proposed method is verified on multiple benchmarks and compares favorably with the prior method, e.g., InterPose.

Strengths and Weaknesses

Strengths:

  1. The proposed method is intuitive and straightforward. It leverages the recent progress in video generative models to improve pose estimation under extreme viewpoints.
  2. The writing is clear and easy to understand.
  3. The proposed method compares favorably with the prior method on four datasets.

Weaknesses:

  1. Incomplete evaluation: (a) The proposed method is only tested on a single video interpolation model, i.e., DynamiCrafter, which may make the results biased. Could authors provide experiments on more (recent) models? (b) In terms of feature matching methods, could authors provide results on more recent dense matching methods, e.g., RoMa [1]? (c) The authors should provide an analysis of the inference time for frame selection in both the proposed method and the baseline, InterPose.
  2. An easy alternative to the proposed feature matching approach would be to simply run DUSt3R again on the generated hybrid video and keep the frames with high confidence. The reviewer is curious about how this approach would perform in practice (in terms of both accuracy and inference time).

[1] Edstedt et al., RoMa: Robust Dense Feature Matching, CVPR 2024.

Questions

  1. The improvements achieved by the proposed method appear to be marginal on NAVI and DL3DV compared to the other datasets. Do the authors have any intuition as to why this might be the case?

Limitations

The authors discuss the technical limitations of the proposed method; however, its potential negative societal impacts are not addressed. Since the method uses video generative models to synthesize intermediate frames, users should be cautious about the content of the generated videos, as they may reflect biases inherent in the underlying video models.

Final Justification

The additional experiments provided by the authors during the rebuttal, including inference time analysis, alternative video models, and frame selection strategies, highlight the effectiveness of the proposed method and resolve my concerns. I therefore raise my score to 5.

Formatting Issues

No.

Author Response

We thank the reviewer for the thoughtful and constructive feedback. Our responses to the concerns raised by the reviewer are as follows:

Q1: The proposed method is only tested on a single video interpolation model. Could authors provide experiments on more (recent) models?

Since our task focuses on input image pairs with small or no overlap, we require video interpolation models that can handle large motion or significant viewpoint changes.

To validate the generality of our pipeline, we additionally experimented with ToonCrafter [1]. This recent model is known for its efficiency in handling large-motion interpolation. We evaluated the model on the Cambridge Landmarks benchmark under yaw changes of [50°–65°]. As shown in the table below, ToonCrafter [1] performs slightly worse than DynamiCrafter. This is likely because ToonCrafter is primarily designed for cartoon-style data. However, it still achieves clear improvements over directly using only the input image pair. These results confirm that our hybrid pose estimation framework is compatible with a variety of video generation backbones. We will include them in the revised version to emphasize the flexibility and robustness of our approach.

| Method | Video Generation Model | MRE↓ | R@5° | R@15° | R@30° | AUC₃₀↑ |
| --- | --- | --- | --- | --- | --- | --- |
| DUSt3R | None | 18.14 | 40.34 | 71.25 | 82.99 | 61.98 |
| Ours | DynamiCrafter | 11.40 | 55.21 | 89.93 | 93.75 | 77.41 |
| Ours | ToonCrafter | 12.08 | 53.99 | 87.16 | 89.19 | 73.11 |

Q2: Evaluation on more recent feature matching methods such as RoMa[2].

We followed the reviewer's recommendation to incorporate RoMa as an alternative feature matching method within our Feature Matching Selector and evaluated it on the Cambridge Landmarks benchmark under yaw changes of [50°–65°].

As shown in the table below, RoMa selected a different set of intermediate frames compared to ORB, but its performance falls behind that of ORB. One possible reason is that RoMa may not be robust enough to handle our synthesized intermediate views, as it was trained primarily on natural, high-quality images. In contrast, ORB demonstrates greater robustness in our scenarios and offers a significant advantage in computational efficiency. For these reasons, we adopt ORB for the Feature Matching Selector (FMS) in this work. We will include these comparative results and analysis in the revised version.

| Feature Matching Method | MRE↓ | R@5° | R@15° | R@30° | AUC₃₀↑ |
| --- | --- | --- | --- | --- | --- |
| RoMa | 13.89 | 53.13 | 88.89 | 91.67 | 74.76 |
| ORB | 11.40 | 55.21 | 89.93 | 93.75 | 77.41 |

Q3: Inference time analysis.

We thank the reviewer for the valuable suggestion and have added a comparison of the inference time required for frame selection between our method and InterPose. For a fair evaluation, we define the pose estimation time as the duration taken to obtain the final pose. For InterPose, this includes the time spent computing the Self-Consistency Score; for our method, it includes the time spent running the Feature Matching Selector. We then compare the two methods based on this definition. Since the original implementation of InterPose is not publicly available, we re-implemented its pipeline based on the details provided in the paper. We refer to this reproduction as InterPose‡ in the subsequent comparisons.

As shown in the table below, our method offers a substantial advantage in computational efficiency. In contrast, InterPose requires the generation of 8 video sequences for each input image pair. From each sequence, it samples m = 11 frame subsets (10 random samples and 1 uniformly spaced sample). It then computes pose estimates for each subset, i.e., 8 × 11 = 88 subsets per image pair, each requiring a separate pose estimate. A self-consistency score is evaluated across these subsets to determine the final pose. This process incurs significant computational overhead due to repeated inference and scoring steps.

In contrast, our method leverages a simple yet effective deterministic frame selection strategy based on feature matching. Features are extracted only once, and the most informative intermediate frames are selected directly for pose estimation. This significantly reduces time costs while maintaining highly competitive accuracy. These results highlight the practical benefits of our pipeline for large-scale applications. We will include this analysis in the revised version.

| Method | Frame Selection Strategy | Pose Estimation Time |
| --- | --- | --- |
| Ours | Feature Matching Selector | 0.18 min |
| InterPose‡ | Self-Consistency Score | 20.29 min |

Q4: Alternative: Using DUSt3R confidence to rank video frames.

We thank the reviewer for suggesting the use of DUSt3R confidence scores as an alternative strategy for video frame selection. To evaluate its effectiveness, we conducted experiments by selecting top-ranked frames from hybrid videos based on four confidence thresholds (20%, 40%, 60%, and 80%). The corresponding results are shown in the table below.

Analysis shows that setting the threshold to 20% or 40% results in decreased performance of pose estimation compared to using the hybrid video sequence. Increasing the threshold to 60% or 80% improves the results, but the accuracy still lags behind our proposed feature matching selector (FMS).

In terms of computational efficiency, confidence-based selection incurs considerable overhead. Specifically, all video frames must first be processed by DUSt3R to compute the confidence map before any selection can be made. This step alone incurs significant time and memory costs. Furthermore, as the threshold increases, the number of selected frames increases, and the overall time continues to grow. In contrast, our FMS achieves superior efficiency and accuracy with only a single feature extraction step to immediately select the most informative frames. We will incorporate this analysis into the revised version to highlight the practical advantages of our approach in terms of computational efficiency and pose estimation performance.

| Method | MRE↓ | R@5° | R@15° | R@30° | AUC₃₀↑ | Pose Estimation Time |
| --- | --- | --- | --- | --- | --- | --- |
| Conf (20%) | 14.66 | 54.17 | 85.07 | 89.24 | 72.45 | 2.79 min |
| Conf (40%) | 14.36 | 57.30 | 87.15 | 90.97 | 74.24 | 2.91 min |
| Conf (60%) | 12.35 | 54.17 | 90.28 | 93.06 | 76.60 | 3.47 min |
| Conf (80%) | 12.41 | 53.47 | 90.63 | 93.06 | 76.46 | 4.13 min |
| Ours (w/o FMS) | 13.24 | 54.51 | 89.24 | 92.71 | 76.13 | 2.56 min |
| Ours | 11.40 | 55.21 | 89.93 | 93.75 | 77.41 | 0.18 min |
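For reference, a minimal sketch of this confidence-based alternative is given below; it assumes a per-frame confidence score has already been obtained by running DUSt3R over the hybrid video, reads Conf(p%) as keeping the top p% of frames by confidence, and uses placeholder names rather than actual DUSt3R API calls.

```python
import numpy as np

def select_by_confidence(frames, frame_confidence, keep_percent=60):
    """Keep the top `keep_percent` percent of frames ranked by a per-frame
    confidence score (e.g. the mean confidence of that frame).
    Illustrative reading of the Conf(p%) settings in the table above."""
    conf = np.asarray(frame_confidence, dtype=float)
    threshold = np.percentile(conf, 100 - keep_percent)
    return [f for f, c in zip(frames, conf) if c >= threshold]
```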

Q5: Smaller gain on NAVI and DL3DV-10K.

We appreciate the reviewer’s observation. Our method is specifically designed for the challenging setting where the input images have small or no overlap. As a result, its improvements are particularly significant on outward-facing datasets such as Cambridge Landmarks and ScanNet, where the overlapping regions between widely spaced views are minimal. In contrast, center-facing datasets such as DL3DV-10K and NAVI present a different scenario. Although the camera angles vary significantly, the captured objects remain largely centered. This results in substantial visual overlap between frames. Consequently, baseline methods already perform relatively well on these datasets with little room for improvement. The smaller performance gains in these cases are therefore expected and consistent with the nature of the problem being addressed. We will include this explanation in the revised version to clarify the performance differences between dataset types and to better highlight the strength of our method in small-overlap scenarios.

[1] Xing J, Liu H, Xia M, et al. Tooncrafter: Generative cartoon interpolation[J]. ACM Transactions on Graphics (TOG), 2024, 43(6): 1-11.

[2] Edstedt J, Sun Q, Bökman G, et al. Roma: Robust dense feature matching[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 19790-19800.

Comment

I appreciate the authors’ efforts in the rebuttal, and my concerns have been resolved. I have decided to raise my score to 5. I hope the authors will incorporate the new results and analysis in the revised version.

Comment

We also deeply appreciate the reviewer's highly constructive feedback. Your detailed comments have significantly contributed to improving the quality of our work. We will incorporate the new results and analyses in the final version accordingly.

Review
Rating: 4

The paper PoseCrafter introduces a method for camera pose estimation in scenarios with extremely sparse visual input, specifically when input images have little to no overlap. The proposed pipeline begins by generating intermediate "relay" frames between the sparse input views using a pretrained video generation model, DynamiCrafter. From the generated sequence, the frames adjacent to the original start and end images are selected as the most reliable relay candidates. These are further refined using ViewCrafter, resulting in a higher-quality intermediate video sequence.

To identify the most informative synthesized views for downstream pose estimation, the method employs a Feature Matching Selector. For each generated frame, local features are extracted and compared with those from the original start and end frames. The top-K frames with the highest feature match scores are selected. These selected frames, along with the original inputs, are then passed to DUSt3R for final pose estimation.

Strengths and Weaknesses

Strengths

  • The paper is well written and easy to follow
  • Experiments show promising improvement over previous methods.

Weaknesses

  1. The design choice for selecting the "most reliable relay frames" was based on an ablation study on the Cambridge Landmarks dataset, which is also one of the evaluation benchmarks (Table 1). This raises a concern about generalization: the heuristic might be overly tuned to this dataset and not transferable to other settings with different scene dynamics or motion characteristics. In addition, the selection rule is deterministic and simple, yet not justified theoretically; it could be too heuristic and ad hoc, lacking a principled foundation.
  2. The method looks like a combination between different off-the-shelf models. The contribution lies mainly in how these components are connected and the heuristics used (e.g., frame selection), which may be seen as an engineering effort rather than a fundamentally new approach.
  3. The description of the ablation (Table 1) is unclear. It does not clearly specify how intermediate frames beyond 4 are sampled (e.g., does #Frames=6 mean {I₀, I₁, I₂, I_{T−2}, I_{T−1}, I_T}?)

Minor Weaknesses

The method mainly combines existing models: DynamiCrafter + ViewCrafter + DUSt3R.

Questions

  1. Is the relay frame selection strategy ({I₀, I₁, I_{T−1}, I_T}) generalizable beyond the Cambridge Landmarks dataset?
  2. Please clarify the frame sampling setup in Table 1.

Limitations

Yes

Final Justification

I appreciate the detailed author response and I don't have further questions. I'd like to keep my original positive score.

Formatting Issues

None

Author Response

We thank the reviewer for their positive feedback regarding the clarity of our paper and the performance improvements demonstrated in our experiments. We provide our response to all concerns raised by the reviewer as follows:

Q1: Concerns about generalization and the heuristic nature of the relay frame selection strategy.

We thank the reviewer for raising this important point. We would like to clarify that although our relay frame selection strategy was initially inspired by empirical observations on the Cambridge Landmarks dataset, it does transfer to other datasets with different settings.

We have extended our evaluation beyond the Cambridge Landmarks dataset to include ScanNet, DL3DV-10K, and NAVI. These datasets feature diverse scene types and motion patterns. The updated results shown in the table below indicate that selecting only the start and end frames along with their immediate neighbors {I₀, I₁, I_{T−1}, I_T} as the four relay frames consistently produces the lowest mean rotation error and the highest stability. This trend holds across different datasets and settings. It supports our intuition that using more frames may introduce geometric inconsistencies, and too few relay frames can lack sufficient information for accurate pose estimation.

Although the relay frame selection technique is simple and deterministic, it has proven surprisingly robust and effective across datasets with varying settings. We plan to explore a more principled justification in future work. We will also include these new experimental results and additional clarification in the revised version.

Mean rotation error (↓) for different numbers of relay frames (#Frames = n):

| Dataset | n=2 | n=4 | n=6 | n=8 | n=16 |
| --- | --- | --- | --- | --- | --- |
| Cambridge Landmarks | 20.56 | 14.47 | 16.66 | 16.87 | 17.83 |
| ScanNet | 19.67 | 16.23 | 17.03 | 17.16 | 18.56 |
| DL3DV-10K | 15.22 | 14.27 | 14.40 | 14.73 | 14.52 |
| NAVI | 7.78 | 6.94 | 7.18 | 9.64 | 10.92 |

Q2: The method appears to be a composition of off-the-shelf components.

Despite building on existing components, our method addresses a fundamentally challenging and under-explored problem: camera pose estimation under small or non-overlapping images. Most existing methods fail under these conditions due to the lack of reliable visual correspondences.

We further clarify our core contributions in the following:

  • We couple the video interpolation model DynamiCrafter with the pose-conditioned NVS model ViewCrafter. This combination forms our Hybrid Video Generation (HVG) framework. Our HVG synthesizes high-quality intermediate frames to support more accurate subsequent pose estimation.
  • We incorporate robust feature descriptors to propose the Feature Matching Selector (FMS) module. Our FMS deterministically selects the most informative frames for pose estimation. This eliminates the need for expensive statistical self-consistency scoring used in existing methods.
  • By integrating these components, we propose a simple yet effective pipeline for reliable pose estimation between image pairs with small or even no overlap. Experiments confirm that our method consistently outperforms existing approaches in these challenging scenarios.

Our work goes beyond a pure engineering effort. It reshapes the role of generative models in supporting geometric tasks and extends the applicability of existing pose estimators. We believe that our contributions to this work can benefit the community and inspire further research in related areas.
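To illustrate how these components connect, a schematic sketch is given below; `interpolate_video`, `refine_views`, `select_frames`, and `estimate_poses` are placeholder wrappers standing in for DynamiCrafter, ViewCrafter, the FMS, and DUSt3R respectively, not our released code.

```python
def posecrafter_pipeline(img_start, img_end, top_k=4):
    """Schematic flow only; all called functions are placeholder wrappers."""
    # 1) Hybrid Video Generation: interpolate a trajectory between the two
    #    inputs, then refine the intermediate views with a pose-conditioned
    #    novel view synthesis model.
    coarse_frames = interpolate_video(img_start, img_end)    # DynamiCrafter-style
    hybrid_frames = refine_views(coarse_frames)               # ViewCrafter-style

    # 2) Feature Matching Selector: keep the frames most consistent with
    #    both input views.
    relay_frames = select_frames(img_start, img_end, hybrid_frames, top_k)

    # 3) Final pose estimation on the inputs plus the selected relay frames.
    return estimate_poses([img_start, *relay_frames, img_end])  # DUSt3R-style
```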

Q3: The description of the ablation (Table 1) is unclear.

We thank the reviewer for pointing out the ambiguity. We apologize for the lack of clarity in Table 1 and will revise the table caption and description in the main text for better readability. Specifically, the setting #Frames=6 corresponds to {I₀, I₁, I₂, I_{T−2}, I_{T−1}, I_T}, and #Frames=8 corresponds to {I₀, I₁, I₂, I₃, I_{T−3}, I_{T−2}, I_{T−1}, I_T}.
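A tiny sketch of this sampling convention (one reading of the examples above, assuming n is even and the generated frames are indexed 0..T):

```python
def relay_indices(T, n):
    """#Frames = n keeps the first n/2 and last n/2 frames of a sequence
    indexed 0..T (assumes n is even)."""
    half = n // 2
    return list(range(half)) + list(range(T - half + 1, T + 1))

# relay_indices(T=15, n=4) -> [0, 1, 14, 15]           i.e. {I0, I1, I_{T-1}, I_T}
# relay_indices(T=15, n=6) -> [0, 1, 2, 13, 14, 15]    i.e. {I0, I1, I2, I_{T-2}, I_{T-1}, I_T}
```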

Comment

I appreciate the detailed response, and it addresses my concerns. I'd like to keep my original positive score.

Comment

We sincerely appreciate your positive response and your decision to maintain the positive score. We are pleased that our clarifications addressed your concerns. If you have any further questions or suggestions, we would be happy to discuss.

Comment

Dear Reviewer,

The authors have replied to your concerns, and it would be very helpful if you could engage in a discussion to clarify your concerns.

Best,

AC

Final Decision

This paper introduces Hybrid Video Generation (HVG), which combines video interpolation with pose-conditioned novel view synthesis to generate clearer intermediate frames for challenging image pairs with minimal overlap. It also proposes a Feature Matching Selector (FMS) that selects frames optimized for pose estimation based on feature correspondences. Together, these innovations—implemented in the PoseCrafter system—improve pairwise camera pose estimation performance across multiple benchmarks, especially in low-overlap scenarios.

The concerns raised in the reviews and that were discussed can be summarized as follows:

  1. Unjustified design choices of the proposed HVG (R-7zgU);
  2. An ad-hoc solution with limited novelty (R-7zgU, R-8hif, R-6p5A, R-hdsC);
  3. Insufficient evaluations or experiments (R-8hif, R-6p5A); and
  4. Computational cost (R-hdsC)

After the reviewer and author discussion, the reviewers were satisfied with the clarifications and the additional experiments. As suggested by many reviewers in their final recommendations, I encourage the authors to include the clarifications discussed with the reviewers, as well as the additional experiments and evidence, in the final paper.