PaperHub
Overall score: 7.8 / 10 · Spotlight · 4 reviewers
Ratings: 5, 6, 4, 4 (min 4, max 6, std 0.8)
Confidence: 3.3
Novelty: 3.3 · Quality: 3.3 · Clarity: 3.0 · Significance: 3.3
NeurIPS 2025

RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Although COLMAP has long remained the predominant method for camera parameter optimization in static scenes, it is constrained by its lengthy runtime and reliance on ground truth (GT) motion masks for application to dynamic scenes. Many efforts attempted to improve it by incorporating more priors as supervision such as GT focal length, motion masks, 3D point clouds, camera poses, and metric depth, which, however, are typically unavailable in casually captured RGB videos. In this paper, we propose a novel method for more accurate and efficient camera parameter optimization in dynamic scenes solely supervised by a single RGB video, dubbed $ROS-Cam$. Our method consists of three key components: (1) Patch-wise Tracking Filters, to establish robust and maximally sparse hinge-like relations across the RGB video. (2) Outlier-aware Joint Optimization, for efficient camera parameter optimization by adaptive down-weighting of moving outliers, without reliance on motion priors. (3) A Two-stage Optimization Strategy, to enhance stability and optimization speed by a trade-off between the Softplus limits and convex minima in losses. We visually and numerically evaluate our camera estimates. To further validate accuracy, we feed the camera estimates into a 4D reconstruction method and assess the resulting 3D scenes, and rendered 2D RGB and depth maps. We perform experiments on 4 real-world datasets (NeRF-DS, DAVIS, iPhone, and TUM-dynamics) and 1 synthetic dataset (MPI-Sintel), demonstrating that our method estimates camera parameters more efficiently and accurately with a single RGB video as the only supervision.
Keywords
4D reconstruction; camera parameter optimization; visual localization

Reviews and Discussion

Review
Rating: 5

This paper addresses the challenging task of estimating camera parameters in dynamic scenes. Unlike previous methods that rely on ground-truth labels or require ground-truth masks of moving objects, the proposed approach is built entirely on a point tracking model and performs joint optimization based solely on the extracted point tracks. From my perspective, this method can be viewed as a form of bundle adjustment that is robust to dynamic objects. The results are impressive on real-world videos, even surpassing some supervised methods and commercial software such as that used on iPhones.

Strengths and Weaknesses

Strengths

  1. The results are impressive—the proposed method outperforms many supervised approaches on real-world videos, despite relying solely on RGB inputs without any ground-truth supervision.

  2. The paper provides a clear and structured comparison with existing methods, particularly by categorizing them based on the type of supervision they use. This helps position the contribution of the proposed method within the broader literature.

Weaknesses:

  1. While the technical content is sufficient, the paper is not easy to read. It would benefit from a more global introduction and broader background context, rather than focusing almost entirely on the method itself. This would make the paper more accessible and easier to follow.

  2. A key component of the proposed method—the point tracking (PT) model—is not clearly introduced or described. In particular, the paper lacks details about the PT model’s input, output, and working mechanism. Since this module is crucial for enabling joint optimization from RGB inputs, a brief explanation—perhaps as a preliminary section—would be helpful. As a reader, it was unclear how joint optimization could be performed using only raw RGB pixels until it became evident that the method relies on extracted point correspondences.

  3. The proposed method targets dynamic videos, but it should also apply to static scenes. I am curious how it performs on standard static video benchmarks such as 7-Scenes, Tanks and Temples, or ScanNet. While it is not necessarily expected to outperform methods specifically designed for static scenes, such a comparison would serve as a useful reference and help assess the method’s generalizability.

  4. It would be helpful to include the number of frames per scene when reporting runtime comparisons in Table 9 and other related tables. This would provide better context for evaluating the computational efficiency across different scenarios.

Questions

Do other point trackers work as well?

Limitations

NA

Justification for Final Rating

My concerns have been addressed in the rebuttal, and I have decided to maintain my original rating.

Paper Formatting Concerns

NA

Author Response

Response to Reviewer 9gnd:

Thank you for your acknowledgment of our work. We highly appreciate your time and feedback, and address your concerns below.

About Weakness 1, ‘Broader background context for a more readable paper’

Thank you for your valuable suggestions. We appreciate your recognition of the technical content in our paper. As the current submission cannot be modified, we plan to enhance the background information in the camera-ready version to facilitate a clearer understanding of our work. In particular, we will provide a more comprehensive introduction to the research field at the beginning of our paper. Furthermore, regarding the pre-trained PT model that serves as the foundation for our proposed patch-wise tracking filters, we will include a more detailed explanation in the main paper—or, if permitted by the page limit, in a dedicated preliminary section.

About Weakness 2, ‘Introduction about the pre-trained PT model’

Thank you for pointing this out. We plan to add a more detailed explanation of the pre-trained PT model to the main paper in the camera-ready version. The default pre-trained PT model on which our proposed method is built is CoTracker, a transformer-based model for dense 2D point tracking in long video sequences. Unlike other related works, it tracks the 2D points jointly, so that dependencies between points improve the accuracy and robustness of the model. During inference, the vanilla CoTracker takes a set of 2D query points and an RGB video as inputs and directly outputs the 2D location and visibility of each query point in each frame. Although it significantly improves PT performance compared with existing works, it still suffers from tracking-location drift across frames, especially when a point leaves the camera frustum and later reappears in the field of view, as shown in Figure 9 in the Appendix. Such observations motivate our patch-wise tracking filters, which extract only robust, reliable, and sparse tracking trajectories as our pseudo-supervision.
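To make the interface concrete, below is a minimal usage sketch of the feed-forward contract described above (query points in, per-frame locations and visibility out). The torch.hub entry point, model name, and tensor shapes follow the public CoTracker release as we understand it and should be treated as assumptions; they are not part of our proposed pipeline.

```python
import torch

# Minimal sketch of the feed-forward point-tracking interface described above.
# The hub entry point ("cotracker2") and tensor layouts follow the public
# CoTracker release as we understand it -- treat them as assumptions.
video = torch.rand(1, 50, 3, 384, 512) * 255       # (batch, frames, channels, H, W) RGB video
queries = torch.tensor([[[0., 100., 150.],         # (batch, num_queries, 3): (start frame, x, y)
                         [0., 200.,  60.]]])

cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker2")
pred_tracks, pred_visibility = cotracker(video, queries=queries)
# pred_tracks:     (1, 50, 2, 2) -- 2D location of each query point in every frame
# pred_visibility: (1, 50, 2)    -- per-frame visibility, used by our patch-wise filters
```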

About Weakness 3, ‘Performance on static scene dataset’

Thank you for your suggestions. We agree that, 'although it is not necessarily expected to outperform methods specifically designed for static scenes, such a comparison would serve as a useful reference and help assess the method’s generalizability'. We conducted additional experiments on the Tanks and Temples dataset (Table 1 below). Due to the character limit, for other baselines, please refer to Table 2 in CF‑3DGS. Although our approach does not surpass CF‑3DGS (a SOTA static-scene method that uses GT intrinsics) or other methods specialized for static scenes with extra supervision, the consistently low errors still support the generalizability of our method.

Table 1: Camera Pose Evaluation on the static-scene Tanks and Temples dataset (ATE↓ / RPE trans↓ / RPE rot↓)

| Scenes | Ballroom | Barn | Church | Family |
| Ours | 0.005/0.003/0.526 | 0.0188/0.006/0.915 | 0.037/0.003/0.301 | 0.021/0.002/0.440 |
| Scenes | Francis | Horse | Ignatius | Museum |
| Ours | 0.063/0.010/0.883 | 0.012/0.007/0.823 | 0.028/0.008/0.628 | 0.008/0.020/0.551 |

We summarize below the reasons why meaningful comparisons with existing works on the 7‑Scenes and ScanNet datasets are not feasible:

  1. Access to the ScanNet dataset requires an application process and approval, which takes additional time.
  2. After rechecking, we found that the existing 7‑Scenes evaluations (e.g., DUSt3R / Fast3R) are based on unordered images and pairwise pose metrics. In contrast, our method fundamentally relies on long, continuous video sequences and temporal consistency. Since the official protocol merges multiple short sequences into unordered inputs and does not provide a single long sequence, a fair or meaningful comparison on this dataset is not possible.

About Weakness 4, ‘include the number of frames per scene when reporting runtime comparisons in Table 9 and other related tables’

Thank you for your helpful suggestions. Since the experimental section of our paper already demonstrates the high efficiency of our method, and due to the space limits of the rebuttal, we present here only the runtime of our method on different datasets along with the corresponding number of frames, as shown in Tables 2, 3, and 4 below. We plan to incorporate these additional results into the runtime tables in the camera-ready version to make them more complete.

Table 2: Quantitative Runtime Results on DAVIS with Corresponding Frame Count

| Metric | Camel | Bear | Breakdance-flare | Car-roundabout | Car-shadow | Car-turn | Cows | Dog | Dog-agility |
| Cam (min) | 1.57 | 3.15 | 1.73 | 4.97 | 0.93 | 2.97 | 2.85 | 1.60 | 0.67 |
| Frames | 90 | 82 | 71 | 75 | 40 | 80 | 104 | 60 | 25 |
| Metric | Goat | Hike | Horsejump-high | Lucia | Motorbike | Parkour | Rollerblade | Tennis | Train |
| Cam (min) | 2.10 | 4.20 | 1.97 | 1.97 | 2.08 | 9.07 | 1.25 | 3.47 | 1.90 |
| Frames | 90 | 80 | 50 | 70 | 43 | 100 | 35 | 70 | 80 |

Table 3: Quantitative Runtime Results on iPhone with Corresponding Frame Count

| Metric | Apple | Paper-windmill | Space-out | Backpack | Block | Creeper | Handwavy |
| Cam (min) | 33 | 15 | 23 | 7 | 27 | 27 | 15 |
| Frames | 475 | 277 | 429 | 180 | 447 | 360 | 303 |
| Metric | Haru-sit | Mochi-high-five | Pillow | Spin | Sriracha-tree | Teddy | Wheel |
| Cam (min) | 10 | 6 | 21 | 30 | 15 | 31 | 27 |
| Frames | 200 | 180 | 330 | 426 | 302 | 472 | 437 |

Table 4: Quantitative Runtime Results on NeRF-DS with Corresponding Frame Count

| Metric | Bell | As | Basin | Plate | Press | Cup | Sieve |
| Cam (h) | 1.05 | 0.95 | 0.75 | 0.53 | 0.68 | 1.02 | 0.78 |
| Frames | 881 | 846 | 668 | 424 | 487 | 807 | 584 |

About Question 1, ‘Do other point trackers work as well?’

Thank you for your question. As stated in Lines 141–143 of the main paper, although our proposed patch-wise tracking filters run an iterative process until there are $B$ tracked points in each frame, these filters do not conduct any repeated refinement or optimization of point tracking on the same tracking trajectory. This is the main reason why our runtime grows almost linearly with frame count, as shown in Figure 4 of the main paper. Thus, the pre-trained PT model on which our proposed filters are based must be a feed-forward prediction model that outputs point locations and visibility; we will clarify this further in the camera-ready version.

In Table 8 of the main paper, we show ablation experiments on 2 different popular pre-trained PT models, CoTracker and CoTracker3. The results show that our proposed method is nearly independent of the particular PT model it is built on, as long as these requirements are met.

Besides, Table 5 below summarizes all baseline PT methods reported in the CoTracker and CoTracker3 papers. CoTracker and CoTracker3 are the only two methods satisfying our requirements. Thus, other PT models cannot be incorporated in our proposed method due to architectural incompatibility.

Table 5: Properties of All Baseline Methods Reported in CoTracker and CoTracker3.

| Method | Type | Output Format |
| TAP-Net | Feed-forward | Point locations only |
| OmniMotion | Optimization-based | Point locations only |
| PIPs | Iterative (sliding window) | Point locations only |
| MFT | Feed-forward | Point locations only |
| PIPs++ | Iterative refinement | Point locations only |
| TAPIR | Recurrent | Point locations + Visibility |
| CoTracker | Feed-forward | Point locations + Visibility |
| TAPTR | Recurrent/transformer | Point locations + Visibility |
| LocoTrack | Recurrent | Point locations + Visibility |
| CoTracker3 | Feed-forward | Point locations + Visibility |
| BootsTAPIR | Recurrent | Point locations + Visibility |

We are also happy to discuss any other proper pre-trained PT models to explore potential improvements on our current proposed method. Please feel free to let us know.

Comment

We hope our rebuttal has addressed your questions and resolved your concerns regarding our work. Please let us know if there is anything else we can clarify. We would greatly appreciate a brief reply to share your thoughts. Thank you very much for your time.

Comment

Most of my concerns have been addressed. Additionally, the 7-Scenes dataset provides sequences, each containing 500–1000 frames. Is that too short for the proposed method? I hope the authors can address this and attach static-scene results in the camera-ready version.

Comment

Thank you for your response! We are so glad to see our answers have addressed your concerns.

About attaching the static scene results in the camera-ready version.

Thank you for your suggestion. We will attach the static scene results in the camera-ready version.

About whether 500-1000 frames are too short for the proposed method.

Regarding the concern about video length, we believe that sequences with 500–1000 frames sufficiently demonstrate the effectiveness of our proposed method. As in our answer to Reviewer nBqP (About Question 3, ‘Testing on extremely long videos (>2000 frames)’), we explored the maximum number of frames our method could handle on our device (a single NVIDIA A100 GPU (40 GB) with 256 GB of system RAM). Specifically, we selected two dynamic-scene sequences exceeding 2000 frames (footlab_3rd and scene_j716_3rd) from the Point Odyssey dataset for this evaluation. However, even after trimming the sequences to 1500 frames, our model still encounters RAM capacity limitations on this setup. We plan to enhance our model’s capacity to handle a greater number of frames on our current hardware.

Thank you again!

Comment

Thank you once again for your valuable time and commitment. We would like to kindly note that the discussion period will conclude in approximately 48 hours. If any points require further clarification, please don’t hesitate to reach out. If there is no remaining problem, we are looking forward to your final rating. Thank you!

Comment

Thank you once again for your valuable feedback. Although the deadline was extended, we would like to kindly note that the discussion period will conclude in approximately 48 hours. We would like to try our best to address all your concerns. If there is no remaining problem, we are looking forward to your final rating. Thank you!

Comment

My concerns have been addressed in the rebuttal, and I will keep my original rating unchanged.

Comment

Thank you for your positive rating! We would like to kindly remind you to please submit the Mandatory Acknowledgment at your earliest convenience. Thank you again!

Comment

Thank you for your acknowledgment of our paper. The discussion period will end in several hours. We would like to kindly remind you to submit the Mandatory Acknowledgment. Thank you!

Review
Rating: 6

The paper proposes an RGB-only camera-parameter optimisation pipeline for casually captured dynamic-scene videos. Key contributions are: 1) Patch-wise Tracking Filters, 2) Outlier-aware Joint Optimisation 3) A Two-Stage Optimisation Strategy. Extensive experiments on 5 public datasets (NeRF-DS, DAVIS, iPhone, MPI-Sintel, TUM-dynamics) show better accuracy and up to 12× faster runtime than previous RGB-only baselines (casualSAM, COLMAP w/o mask) and, on some videos, even LiDAR-assisted Record3D. Ablations confirm each design choice.

Strengths and Weaknesses

Strengths: (1) Well-designed optimisation objective; thorough ablations and evaluation on five datasets. (2) Method is laid out step-by-step with helpful figures; loss derivations in appendix; datasets & metrics clearly listed. (3) Addresses a practical bottleneck for 4D reconstruction; the runtime improvement is compelling for real use. (4) Novel coupling of sparse hinge-like tracks with Cauchy uncertainty in bundle adjustment; the two-stage soft-plus trick is neat.

Weaknesses

All designs are well motivated, and the overall presentation is of high quality. There are only minor weaknesses, such as the abrupt introduction of COLMAP and visual odometry at the beginning of the paper, which confused my understanding of the target problem; I suggest improving this.

Questions

You state in line 273 that the performance is “nearly independent” of the PT backbone (Table 8). Please add results with a much weaker tracker to support the claim, or consider improving the statement for rigorous discussion.

Limitations

yes

Justification for Final Rating

I have no further questions. The author’s explanation makes sense and does not affect my original rating.

Paper Formatting Concerns

no particular Paper Formatting Concerns

Author Response

Response to Reviewer izms:

Thank you for your acknowledgment. We really appreciate your time and feedback, and address your concerns below.

About Weakness 1, ‘Abrupt introduction on COLMAP and Visual Odometry’

Thank you for your suggestions on the writing. We plan to add more background information at the beginning of the paper to make the introduction of visual odometry and the state-of-the-art COLMAP more fluent.

About Question 1, ‘Results with a much weaker tracker’

Thank you for your question. As stated in Lines 141–143 of the main paper, although our proposed patch-wise tracking filters run an iterative process until there are $B$ tracked points in each frame, these filters do not conduct any repeated refinement or optimization of point tracking on the same point trajectory. This is also the main reason why our runtime grows almost linearly as the frame count increases, as shown in Figure 4 of the main paper. Thus, the pre-trained PT model on which our proposed filters are based must be a feed-forward prediction model that outputs point locations and visibility; we will clarify this further in the camera-ready version.

In Table 8 of the main paper, we show ablation experiments on 2 different popular pre-trained PT models, CoTracker and CoTracker3. The results show that our proposed method is nearly independent of the particular PT model it is built on, as long as these requirements are met.

Besides, Table 1 below summarizes all baseline PT methods reported in the CoTracker and CoTracker3 papers. CoTracker and CoTracker3 are the only two methods satisfying our requirements. Thus, other PT models cannot be incorporated in our proposed method due to architectural incompatibility.

Table 1: Properties of all baseline methods reported in the CoTracker and CoTracker3 papers.

| Method | Type | Output Format |
| TAP-Net | Feed-forward | Point locations only |
| OmniMotion | Optimization-based | Point locations only |
| PIPs | Iterative (sliding window) | Point locations only |
| MFT | Feed-forward | Point locations only |
| PIPs++ | Iterative refinement | Point locations only |
| TAPIR | Recurrent | Point locations + Visibility |
| CoTracker | Feed-forward | Point locations + Visibility |
| TAPTR | Recurrent/transformer | Point locations + Visibility |
| LocoTrack | Recurrent | Point locations + Visibility |
| CoTracker3 | Feed-forward | Point locations + Visibility |
| BootsTAPIR | Recurrent | Point locations + Visibility |

We are also happy to discuss any other proper pre-trained PT models to explore potential improvements on our current proposed method. Please feel free to let us know.

Comment

I have no further questions. The author’s explanation makes sense and does not affect my final score.

Comment

Thank you for your acknowledgment of our paper! Thank you again!

Comment

We hope our rebuttal has addressed your questions and resolved your concerns regarding our work. Please let us know if there is anything else we can clarify. We would greatly appreciate a brief reply to share your thoughts. Thank you very much for your time.

Comment

Thank you once again for your time and thoughtful feedback. As the discussion period concludes in three days, we remain fully available to address any remaining concerns or questions regarding our work. Please don’t hesitate to let us know if any further clarification is needed. We sincerely appreciate your engagement.

Comment

Thank you once again for your valuable time and commitment. We would like to kindly note that the discussion period will conclude in approximately 48 hours. We greatly appreciate the opportunity to address any remaining questions or concerns you may have regarding our work. If any points require further clarification, please don’t hesitate to reach out. Thank you very much.

Review
Rating: 4

This paper proposes an SfM method that estimates the camera trajectory from an RGB video of a dynamic scene. Unlike COLMAP and its successors, the approach doesn’t require any supervision and estimates camera intrinsics and extrinsics only from the RGB frames of the video. The method introduces three contributions: 1) a patch-wise keypoint filter, which removes dynamic points and extracts stable points from the entire sequence to be used as a supervision signal; 2) an outlier handling mechanism integrated into the camera pose optimization procedure, focusing on suppressing points coming from dynamic objects; 3) a two-step optimization strategy, which enhances the speed of the optimization. The method is evaluated on several datasets, demonstrating improved performance and greater efficiency compared to related approaches.

Strengths and Weaknesses

Strengths: The main strength of the proposed method is its computational efficiency, achieved thanks to the two-step optimization strategy. Moreover, the paper achieves reasonable performance given that it doesn’t use any GT supervision. The experimental evaluation is very detailed and well done. The quality of the writing could be improved.

Weaknesses: The overall novelty looks fairly standard, containing steps used in SfM methods. The choice of steps is reasonable, but despite the ablation study it is hard to assess how critical each step is in different dynamic scenes. For example, it would be good to comment on how many correct point trajectories are extracted after the first step in different dynamic scenes, ranging from low to high amounts of moving outliers. Further, the question is what amount of outliers can be handled by the proposed outlier handling strategy. The proposed outlier handling strategy is similar to a standard reweighted least squares procedure combined with robust estimators, e.g., the Tukey estimator; it would be good to compare the performance with other robust estimators. In the experimental evaluation it is evident that methods with mask supervision or other GT supervision signals perform better than the proposed method. The question is how the proposed method would perform with such supervision. It would be interesting to assess whether the proposed optimization and outlier handling strategy would outperform the other methods if GT supervision was used.

Questions

$\Gamma_{raw}$ seems to be an important parameter, and its choice has not been experimentally analyzed in the main paper. It would be good to perform such an analysis and back up the claims made in the two-stage optimization procedure.

The qualitative results presenting camera trajectories in Fig. 6 are good, but I would appreciate seeing more examples in different scenes, especially ranked by level of difficulty, e.g., from small to large numbers of moving outliers (low-dynamic to high-dynamic scenes).

It is strongly recommended to include and discuss failure cases both quantitatively and qualitatively (examples of recovered trajectories).

Limitations

N/A

Justification for Final Rating

I checked the authors' replies to my questions and I appreciate their answers and effort they put in the explanations. I also checked other reviews and answers to them. Authors are strongly encouraged to include ablations on parameter choice and include a discussion about failure cases. I am happy about the rebuttal and will increase the rating to borderline accept.

Paper Formatting Concerns

No major concerns

Author Response

Response to Reviewer 79BF:

Thank you for your acknowledgment of our paper. We sincerely appreciate your time and feedback, and address your concerns below. As your concerns were not explicitly numbered or listed, we summarize them as follows.

About Weakness 1, ‘how many correct point trajectories..xxx..starting from a low to high amounts of moving outliers.’

Thank you for your question. We believe there has been a misunderstanding, which we will clear up in the camera-ready version with the following clarifications.

Firstly, we do not define point trajectories by correctness. Both inlier and outlier trajectories can be correct or incorrect, and these datasets lack GT optical flow or tracking labels, so such a definition would be meaningless. Instead, we classify trajectories into inliers and outliers, which directly motivates our proposed outlier-aware joint optimization.

Secondly, the number of moving outliers does not affect trajectory accuracy. Our patch-wise tracking filters are built on CoTracker, which can robustly track trajectories regardless of the number of moving points, as discussed in the CoTracker paper. For further analysis on our model’s capacity to handle different amounts of outliers, please see our next response.

About Weakness 2, ‘What amount of outliers can be handled?’

Thank you for the question. By “amount of outliers,” we assume you mean the percentage of outliers among all trajectories. We want to clarify that model performance depends not only on this percentage but also on the appearance pattern of outliers. To illustrate, we provide a table comparing performance, outlier percentage, and appearance patterns across failure cases and one success case (motion masks for MPI-Sintel are manually annotated). As shown in Table 1 (below) and Fig. 21 (Appendix), a high outlier percentage (e.g., Ambush_6) can still yield good results, while lower percentages can fail when the appearance pattern is extreme. Thus, there is no strict threshold for the outlier percentage that leads to failure.

Table 1: Failure Case Analysis by Outlier Percentage and Appearance Pattern on MPI-Sintel dataset.

| Case Type | Scene | Outlier Percentage | ATE | RPE trans | Appearance Pattern |
| Failure | cave_2 | 35.14% | 1.609 | 0.226 | Cover / sweep across the frame |
| Failure | cave_4 | 39.36% | 0.125 | 0.035 | Dominates screen across the frame |
| Failure | market_5 | 34.31% | 2.027 | 0.617 | Dominates screen across the frame |
| Failure | temple_3 | 31.92% | 0.782 | 0.1467 | Cover / sweep across the frame |
| Sampled Success | ambush_6 | 44.30% | 0.080 | 0.129 | Dominates half screen across the frame |

About Weakness 2, ‘Why not the Tukey estimator? Other estimator?’

Thank you for asking this question. We believe that by ‘the Tukey estimator’ you mean Tukey’s biweight estimator, whose weighting function is

$$
w(u) = \begin{cases}
\left(1 - \left(\frac{u}{c}\right)^2\right)^2, & \text{if } |u| < c \\
0, & \text{if } |u| \geq c
\end{cases}
$$

Tukey’s biweight completely discards residuals beyond the cutoff $c$, a hard threshold that must be tuned for each video. In contrast, our Cauchy weighting learns all parameters, allowing our outlier-aware joint optimization to generalize across videos.
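For concreteness, the sketch below contrasts the two weighting functions in a generic reweighted least-squares setting; the single learnable Cauchy scale here is purely illustrative and does not reproduce the exact parameterization of our outlier-aware joint optimization.

```python
import torch
import torch.nn.functional as F

def tukey_biweight(u: torch.Tensor, c: float) -> torch.Tensor:
    """Tukey's biweight: residuals beyond the hard cutoff c receive zero weight."""
    w = (1.0 - (u / c) ** 2) ** 2
    return torch.where(u.abs() < c, w, torch.zeros_like(u))

def cauchy_weight(u: torch.Tensor, raw_scale: torch.Tensor) -> torch.Tensor:
    """Cauchy weight: smooth down-weighting with a learnable (always positive) scale."""
    c = F.softplus(raw_scale)            # keep the scale positive; no hard cutoff to tune
    return 1.0 / (1.0 + (u / c) ** 2)

# Illustrative per-track reprojection residuals; large values mimic moving outliers.
residuals = torch.tensor([0.1, 0.3, 2.5, 8.0])
raw_scale = torch.nn.Parameter(torch.zeros(()))  # would be optimized jointly with the cameras

w_tukey  = tukey_biweight(residuals, c=1.0)      # hard zeros beyond c; c must be tuned per video
w_cauchy = cauchy_weight(residuals, raw_scale)   # small but non-zero weights for the outliers
```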

We are also happy to discuss other estimators to explore potential improvements on our current proposed method. Please feel free to let us know.

About Weakness 3, ‘How would the proposed method perform with more GT supervisions?’

Thank you for your question. More GT supervision (depth, camera poses, motion masks, 3D point clouds, camera intrinsics) would change the architecture of our pipeline, which is specifically designed for RGB-only supervision, so our model would need to be carefully redesigned under such circumstances. However, like most other models, we believe our method would theoretically perform better with more GT supervision. We leave this as a future research path.

Furthermore, we would like to reiterate that our main contribution is the capability to accurately and efficiently estimate camera parameters in dynamic scenes solely supervised by a single RGB video, without any other GT supervision.

About Question 1, ‘Analysis on $\Gamma^{raw}$’

Thank you for your suggestion. As shown in Table 7 of the main paper, replacing our two-stage strategy with a randomly initialized and learnable $\Gamma^{raw}$ leads to a clear performance drop. To address the request for a deeper analysis, we conducted additional ablations on $\Gamma^{raw}$ (see the table below). 'Meaningless' indicates that when $\Gamma^{raw}$ is frozen in Stage 1, other value settings behave the same as setting $\Gamma^{raw}=1$, so ablations 2 and 3 duplicate the settings of ablation 1 and the full model.

Table 2: Analysis on $\Gamma^{raw}$ on the NeRF-DS dataset. (Stage 1: 200 iterations; Stage 2: 50 iterations.)

| Model | Stage 1 Initialization | Stage 2 Initialization | PSNR | SSIM | LPIPS |
| Full model | $\Gamma^{raw} = 1$ (freeze) | $\Gamma^{raw} = \mathbb{E}^{ACP}_{stage1}$ (learn) | 33.55 | 0.9381 | 0.1182 |
| More ablation 1 | $\Gamma^{raw} = 1$ (freeze) | $\Gamma^{raw} = 1$ (learn) | 31.81 | 0.9073 | 0.1403 |
| Existing Ablation | $\Gamma^{raw}$ = Random (learn) | - (learn) | 25.95 | 0.8100 | 0.2668 |
| More ablation 2 | $\Gamma^{raw}$ = Random (freeze) | - (learn) | Meaningless | | |
| More ablation 3 | $\Gamma^{raw}$ = Random (freeze) | $\Gamma^{raw} = \mathbb{E}^{ACP}_{stage1}$ (learn) | Meaningless | | |

Such comparisons show that our initialization strategy for $\Gamma^{raw}$ in the different stages is very important; otherwise, the performance drops dramatically.

About Question 2, ‘more examples in different scenes, ranked by the level of difficulties, e.g., ..xxx... (low dynamic to high dynamic scenes)’

Thank you for your questions. To clarify, the difficulty of camera evaluation depends on multiple factors (outliers, camera motion, appearance pattern, scene dynamics, etc.); an analysis of some of these factors is given in our answer to About Weakness 2, ‘What amount of outliers can be handled?’. To our knowledge, there are no explicit quantitative or qualitative metrics to rank difficulty. We will clarify this in the camera-ready version.

Unlike the related methods (Table 1 in the main paper), which are mostly supported by experiments on 30–50 videos, our results are supported by experiments on 5 popular public camera-evaluation datasets covering about 70 videos across different difficulty levels. To make this clearer, we summarize the dataset characteristics (Section C of the Appendix) in the following table and will include it in the camera-ready version.

Table 3: Dataset Features

| Dataset | Scene Dynamics Level | Scene Dynamics Pattern | Camera Motion Level | Camera Motion Pattern |
| NeRF-DS | Large | Shaking movement of reflective objects | Large | Circle-like movement w/ large rotation change |
| DAVIS | Large | Regular movement of daily common objects | Large | Curve-like movement w/ proper rotation change |
| MPI-Sintel | Large & Complex | Cartoon characters’ irregular movement | Large | Straight-line-like movement w/ large rotation change |
| iPhone | Large & Complex | Irregular movement of daily common objects | Large & Complex | Fluctuating movement w/ large rotation change |
| TUM-Dyn | Large & Complex | Irregular movement of humans | Large & Complex | Fluctuating movement w/ large rotation change |

Due to the character limit, please refer to our response to Reviewer  9gnd, About Weakness 3, ‘Performance on static scene dataset,’ for a further investigation of the generalization ability of our method. Thank you.

About Question 3, ‘More qualitative and quantitative analysis on failure cases’

Thank you for your advice. We will add the following discussions to the paper in the camera-ready version. As shown in Fig. 21 (Appendix, L586–589), our method fails when videos are dominated by large, fast-moving objects (e.g., humans, dragons). We attribute this to two main reasons: (1) a high proportion of outliers and (2) extreme outlier patterns where large objects sweep across and cover the screen. These factors hinder the extraction of robust, sparse hinge-like relations for accurate pseudo-supervision. More details about the reasons can be found in our answers to About Weakness 2, ‘What amount of outliers can be handled?’.

Comment

I am satisfied with your answers and do not have further questions.

Comment

Thank you for your response! We sincerely appreciate your positive acknowledgment of our paper after considering our rebuttal. It is gratifying to know that our responses have successfully addressed your concerns.

Comment

We hope our rebuttal has addressed your questions and resolved your concerns regarding our work. Please let us know if there is anything else we can clarify. We would greatly appreciate a brief reply to share your thoughts. Thank you very much for your time.

Comment

Thank you once again for your time and thoughtful feedback. As the discussion period concludes in three days, we remain fully available to address any remaining concerns or questions regarding our work. Please don’t hesitate to let us know if any further clarification is needed. We sincerely appreciate your engagement.

Comment

Thank you once again for your valuable time and commitment. We would like to kindly note that the discussion period will conclude in approximately 48 hours. We greatly appreciate the opportunity to address any remaining questions or concerns you may have regarding our work. If any points require further clarification, please don’t hesitate to reach out. Thank you very much.

Comment

Thank you for your valuable feedback. Although the deadline has been somewhat extended, we would like to kindly remind you that the discussion period will end in approximately 48 hours. If you have any more questions, please let us know as soon as possible so that we have enough time to address your concerns. If you believe our rebuttal has addressed your concerns, we sincerely hope you will consider raising your score for the final rating. Thank you!

PS: Please do not forget to submit the Mandatory Acknowledgement.

Comment

Thank you for your reviews! We would like to kindly remind you that the discussion period will end in approximately 24 hours. If there are any more concerns, please let us know so that we can address them as soon as possible. We also sincerely invite you to read our discussions with the other reviewers if you share similar concerns. We sincerely hope you will consider raising your score for the final rating. Thank you again!

Review
Rating: 4

This paper solves an important problem in dynamic scenes “solely supervised by RGB video”. To solve this problem, three key components are proposed as the main contributions of this paper.

Strengths and Weaknesses

Strengths:

  1. Well-structured methodology: The paper presents a comprehensive approach with three well-motivated components that address specific challenges in RGB-only camera parameter optimization for dynamic scenes.

  2. Extensive experiments: The authors conduct thorough experiments across 5 datasets with both quantitative metrics and qualitative visualizations, I appreciate authors' hard work.

  3. Efficiency improvements: The method achieves substantial runtime reductions (e.g., 55% of the second-fastest method on NeRF-DS) while maintaining or improving accuracy, which is practically valuable.

  4. Ablation studies: The paper provides detailed ablation studies validating each component's contribution to the overall performance.

Weaknesses:

  1. It seems all the words on the figures are blurred, which reduces readability and aesthetics. Could the authors make the words on the figures clearer? For example, by changing all of the figures to vector format.

  2. The claim of being the "first RGB-only supervised" method (Contribution 1, line 61) contradicts Table 1, which clearly shows existing RGB-only methods like Robust-CVD and casualSAM. This should be corrected to accurately position the work's contributions.

  3. The patch-wise tracking filters appear to be incremental improvements over existing point tracking methods (CoTracker). I think there should be deeper analysis like:

  • How patch size $w$ affects performance across different scene types
  • Sensitivity analysis of the texture threshold $\tau_{var}$

  4. I encourage the authors to present the limitations in the main text. I think there should be more discussion of when and why the method fails.

Questions

  1. How did you determine w=12/24 and τ_var=0.1? Can you provide sensitivity analysis for these parameters? What happens when there are insufficient high-texture patches?

  2. You mention the method fails with large moving objects. Can you quantify this - what percentage of moving pixels causes failure? Is there a way to predict these failure cases?

  3. Have you tested sequences longer than 900 frames? What about the stability? For example, 2000 frames, since COLMAP occasionally fails.

Limitations

yes

Justification for Final Rating

Overall, I hold a positive evaluation of this paper, and the rebuttal phase has successfully addressed all of my concerns. I agree with reviewer 79BF's point that "The overall novelty looks fairly standard containing steps used in SfM methods," but this paper indeed tackles a rather challenging problem: "RGB-only supervised camera parameter optimization in dynamic scenes." The authors have conducted thorough comparisons with previous work and demonstrated the superiority of their method.

What I am not entirely satisfied with in this paper is that the numbers in the figures are too blurry (due to being saved in non-vector format), which causes some reading difficulties and diminishes the overall aesthetic quality of the paper. Therefore, I maintain my weak accept recommendation and suggest that the authors revise the figures.

Paper Formatting Concerns

no

Author Response

Response to Reviewer nBqP:

Thank you for your acknowledgment of our paper. We sincerely appreciate your time and feedback, and address your concerns below.

About Weakness 1, ‘The blurry words in the figures’

Thanks for your suggestions. After re-checking, we found that some figures were not correctly saved as vector graphics. Since the article cannot be modified at this stage, we plan to correct it in the camera-ready version.

About Weakness 2, ‘Contradictory claim of being the first RGB-only supervised method’

Thank you for pointing out this misunderstanding. What we meant is that our method is the first RGB-only supervised approach for dynamic scenes that is both accurate and efficient, as existing RGB-only methods are either slow or inaccurate. We will remove the word “first” in the camera-ready version.

About Weakness 3, ‘How patch size affects performance across different scene types?’

Thank you for raising this question. We would like to clarify that although patch size can influence performance, the scene type is not a factor in this process; thus, we do not adjust patch size based on scene types. Instead, as noted in Line 200 of our main paper, the patch size $w$ is determined solely by the frame size.

Specifically, when the input aspect ratio (long/short side) exceeds 1.5, we consider the image elongated. Camera motion in such cases causes obvious uneven scene changes across axes, so we adopt a smaller patch size to capture more valid high-texture patches. Otherwise, we use a larger patch size. The value of $w$ is chosen as a trade-off between memory cost and efficiency, but it should definitely be adjusted for extremely small or large frame sizes. We will add such discussions to the camera-ready version. Further sensitivity analysis on $w$ is provided in our response to About Question 1, ‘Sensitivity analysis on $w$ and $\tau_{var}$’.

About Weakness 3, ‘Sensitivity analysis of the texture threshold’

Thank you for your advice. Please see our subsequent answers to About Question 1, ‘Sensitivity analysis on $w$ and $\tau_{var}$’, for a detailed discussion of the sensitivity analysis of the texture threshold.

About Weakness 4, ‘The position of limitations’

Thank you for your suggestions. We will move the limitation section to the main paper in the camera-ready version.

About Weakness 4, ‘More discussion on when and why the method fails?’

Thank you for your advice. We will add more discussion in the camera-ready version. As shown in Fig. 21 (Appendix, L586–589), our method fails when videos are dominated by large, fast-moving objects (e.g., humans, dragons). We attribute this to two main reasons: 1) a high outlier percentage; 2) extreme outlier patterns where large objects sweep across and cover the screen. These factors hinder the extraction of robust, sparse hinge-like relations as accurate pseudo-supervision. More details can be seen in our answers to About Question 2, ‘What percentage of moving pixels causes failure? ..xxx.. failure cases?’.

About Question 1, ‘How did you determine $w$=12/24 and $\tau_{var}$=0.1?’

Thank you for your question. Please refer to our answers to About Weakness 3, ‘How patch size affects performance across different scene types?’ for a thorough analysis of $w$.

We sincerely apologize for the typo in the paper. The value “$\tau_{var}$=0.1” should be corrected to “$\tau_{var}$=0.01”. We will correct this typo in the camera-ready version.

We empirically set $\tau_{var}=0.01$ and use this fixed value for all videos. Eq. 1 in the main paper selects relatively high-texture patches by keeping those whose intensity variance exceeds $0.01 \times \sigma^*$ ($\sigma^*$ is the maximum patch variance in a frame). This removes textureless regions (e.g., plain backgrounds, white walls; see Fig. 9 in the Appendix) while retaining useful information, as only patches with variance below 1% of the maximum are discarded.
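As a purely illustrative sketch of this selection rule (not the paper's exact implementation of Eq. 1), the following assumes square, non-overlapping patches extracted with `unfold` from a grayscale frame:

```python
import torch

def keep_high_texture_patches(gray: torch.Tensor, w: int = 12, tau_var: float = 0.01) -> torch.Tensor:
    """Boolean mask over non-overlapping w x w patches of a grayscale frame.

    A patch is kept when its intensity variance exceeds tau_var * sigma_max, where
    sigma_max is the maximum patch variance in the frame. The unfold-based patch
    extraction is an assumption for illustration, not the paper's implementation.
    """
    H, W = gray.shape
    gray = gray[: H - H % w, : W - W % w]                    # crop so the frame tiles exactly
    patches = gray.unfold(0, w, w).unfold(1, w, w).reshape(-1, w * w)
    var = patches.var(dim=1)                                 # per-patch intensity variance
    return var > tau_var * var.max()                         # keep relatively high-texture patches

frame = torch.rand(480, 640)                # placeholder grayscale frame in [0, 1]
mask = keep_high_texture_patches(frame)     # True for patches worth feeding to the tracker
```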

About Question 1, ‘Sensitivity analysis on $w$ and $\tau_{var}$’

Thank you for asking this question. We conduct the sensitivity analysis on these two parameters, as shown in the subsequent two tables with detailed discussions. We plan to add the following two tables to the ablation study section of our main paper in the camera-ready version to make our paper more complete.

Table 1: Sensitivity Analysis on Patch Size $w$ on the NeRF-DS dataset

| Scene Optimization | Camera Optimization | PSNR | SSIM | LPIPS |
| 4DGS | Ours ($w$=12, default) | 33.55 | 0.9381 | 0.1182 |
| 4DGS | Ours ($w$=16) | 32.23 | 0.9085 | 0.1411 |
| 4DGS | Ours ($w$=24) | 24.71 | 0.7722 | 0.3084 |

Table 2: Sensitivity Analysis on Variance Threshold $\tau_{var}$ on the NeRF-DS dataset

| Scene Optimization | Camera Optimization | PSNR | SSIM | LPIPS |
| 4DGS | Ours ($\tau_{var}$ = 0.01, default) | 33.55 | 0.9381 | 0.1182 |
| 4DGS | Ours ($\tau_{var}$ = 0.04) | 32.45 | 0.9136 | 0.1367 |
| 4DGS | Ours ($\tau_{var}$ = 0.08) | 32.14 | 0.9016 | 0.1408 |
| 4DGS | Ours ($\tau_{var}$ = 0.16) | Fail | | |

Tables 1 and 2 show that our model is more sensitive to $w$ than to $\tau_{var}$. These ablations also support our chosen default values for the two parameters. However, an overly large $\tau_{var}$ makes the model fail due to a lack of useful information.

About Question 1, ‘What happens when there are insufficient high-texture patches?’

Thank you for asking this question. To clarify, by ‘high-texture patches’ we mean relatively high-texture patches; we will make this clearer in the camera-ready version. As in our answer to About Question 1, ‘How did you determine w=12/24 and τ_var=0.1?’ above, in each frame we only discard the relatively low-texture patches whose intensity variance is less than 1% of the maximum intensity variance. This efficiently avoids losing useful high-texture patches. However, we may share limitations with other methods in certain extreme scenarios, such as videos that predominantly feature plain walls with extremely limited texture. We will leave this as a future research topic.

About Question 2, ‘What percentage of moving pixels causes failure? Is there a way to predict these failure cases?’

Thank you for asking this question. We would like to clarify that there is no strict threshold for the percentage of moving pixels that leads to failure, because the model’s ability to handle outliers is also affected by the appearance pattern of the outliers. Table 3 below compares the performance, outlier percentage, and appearance pattern for all failure cases and one sampled success case. We plan to add this table to the main paper in the camera-ready version. Since the MPI-Sintel dataset does not provide GT motion masks, we manually annotated the motion masks.

Table 3: Failure Case Analysis by Outlier Percentage and Appearance Pattern

| Case Type | Scene | Percentage of Outliers | ATE | RPE trans | Appearance Pattern |
| Failure case | cave_2 | 35.14% | 1.609 | 0.226 | Object briefly sweeps across the frame, temporarily covering it |
| Failure case | cave_4 | 39.36% | 0.125 | 0.035 | Object persistently dominates most of the frame |
| Failure case | market_5 | 34.31% | 2.027 | 0.617 | Object persistently dominates most of the frame |
| Failure case | temple_3 | 31.92% | 0.782 | 0.1467 | Object briefly sweeps across the frame, temporarily covering it |
| Sampled Success | ambush_6 | 44.30% | 0.080 | 0.129 | Object persistently dominates about half of the frame |

As shown in the table above and Figure 21 of the Appendix, the model’s performance is not strictly determined by the proportion of outliers: in ambush_6, the model performs well even with a large proportion of outliers, while in the other scenes performance degrades despite a smaller proportion. The appearance pattern also affects the performance of the model.

Currently, there is no quantitative indicator available to predict whether the method will succeed or fail, but we suggest forecasting based on the appearance pattern.

About Question 3, ‘Testing on extremely long videos (>2000 frames)’

Thank you for your questions. To our knowledge, existing test datasets and prior methods do not provide extremely long videos (>2000 frames), nor do they report experiments under such conditions. We therefore selected two sequences exceeding 2000 frames (footlab_3rd and scene_j716_3rd) from the Point Odyssey dataset. However, even after trimming the sequences to 1500 frames, our model still encounters RAM capacity limitations on our setup, where all experiments are conducted on a single NVIDIA A100 GPU (40 GB) with 256 GB of system RAM. The longest video in our current experiments contains 881 frames (bell in NeRF-DS). We plan to investigate methods for reducing GPU and RAM consumption as part of our future work.

Comment

We hope our rebuttal has addressed your questions and resolved your concerns regarding our work. Please let us know if there is anything else we can clarify. We would greatly appreciate a brief reply to share your thoughts. Thank you very much for your time.

Comment

Thank you once again for your time and thoughtful feedback. As the discussion period concludes in three days, we remain fully available to address any remaining concerns or questions regarding our work. Please don’t hesitate to let us know if any further clarification is needed. We sincerely appreciate your engagement.

Comment

Thank you once again for your valuable time and commitment. We would like to kindly note that the discussion period will conclude in approximately 48 hours. We greatly appreciate the opportunity to address any remaining questions or concerns you may have regarding our work. If any points require further clarification, please don’t hesitate to reach out. Thank you very much.

Comment

Thank you for your valuable feedback. Although the deadline has been somewhat extended, we would like to kindly remind you that the discussion period will end in approximately 48 hours. If you have any more questions, please let us know as soon as possible so that we have enough time to address your concerns. If you believe our rebuttal has addressed your concerns, we look forward to your final rating. Thank you!

PS: Please do not forget to submit the Mandatory Acknowledgement.

Comment

Thank you for the authors' detailed response. I have carefully read the authors' replies as well as their responses to other reviewers' questions, and I have no further questions. This paper's contributions have certain practical significance, therefore I will maintain my rating and strongly recommend the authors to improve the presentation of qualitative results for readability and aesthetics.

Comment

Thank you for your reply and for acknowledging our paper. It is our pleasure to address all of your concerns, and we look forward to receiving your final positive rating. Additionally, we would like to kindly remind you to submit the Mandatory Acknowledgment. Thank you again!

Comment

Thank you for your acknowledgment of our paper. The discussion period will end in several hours. We would like to kindly remind you to submit both the final rating and the Mandatory Acknowledgment. Thank you!

Final Decision

This paper introduces an RGB-only supervised method for camera parameter estimation in dynamic scenes, combining patch-wise tracking filters, outlier-aware optimization, and a two-stage strategy. Reviewers praised the clear methodology, thorough ablations, and strong runtime and accuracy across ~70 videos, with some noting novelty is moderate and presentation could improve. The rebuttal addressed concerns by clarifying claims, adding sensitivity and static-scene experiments, and committing to improved figures and explanations. Overall, the work is solid, practical, and broadly relevant. I recommend acceptance.