Rig3R: Rig-Aware Conditioning and Discovery for 3D Reconstruction
Feed-forward 3D reconstruction with flexible use of rig constraints
Abstract
Reviews and Discussion
This paper introduces Rig3R, a generalization of prior multi-view reconstruction models that incorporates rig structure when available, and learns to infer it when not. Rig3R jointly predicts pointmaps and two types of raymaps: a pose raymap relative to a global frame, and a rig raymap relative to a rig-centric frame consistent across time.
Strengths and Weaknesses
Strengths: This paper proposes the first learned method that leverages rig constraints to improve 3D reconstruction and pose estimation, while generalizing to inputs with partial or missing metadata, including camera ID, timestamp, and rig poses. Moreover, a novel output representation based on global and rig-relative raymaps is introduced to enable closed-form pose estimation and rig structure discovery from unordered image inputs.
Weaknesses:
- Is it feasible to perform reconstruction with two input images that have no overlapping regions? Are there any limitations on the number of input images, and what is the maximum number of frames that can be processed in a single inference pass?
- What is the GPU memory consumption during a single inference operation of this method?
- How does Rig3R compare with VGGT in terms of reconstruction quality and computational efficiency?
Questions
Based on the analysis above, there are three questions: (1) Can Rig3R perform reconstruction with non-overlapping image pairs, and what are the input constraints regarding frame count? (2) What is the GPU memory consumption during a single inference operation of this method? (3) How does Rig3R perform against VGGT in terms of both reconstruction quality metrics and computational performance?
Limitations
yes
Justification for Final Rating
The responses have addressed most of my concerns and I keep my initial score of Accept.
Formatting Concerns
No Formatting Concerns
We thank the reviewer for their encouraging and thoughtful feedback. We appreciate the recognition of our contributions and the insightful questions, which help clarify key aspects of the work. Below we address each point in detail and will revise the final version accordingly.
- Reconstruction from Non-Overlapping Image Pairs and Input Frame Constraints
Yes, Rig3R can indeed reconstruct non-overlapping image pairs, particularly when the rig is calibrated. The rig metadata, such as rig extrinsics and image timestamps, provides strong constraints that allow the model to pose the images even with no visual correspondences. In practice, most rigs have little overlap in their fields of view, yet Rig3R still successfully reconstructs the captured scene. In terms of input constraints, Rig3R is trained with frame ID interpolation similar to Fast3R, enabling generalization to variable-length sequences at inference. While most evaluations use 24-frame inputs for consistency, we have successfully run inference on 50+ frame sequences on a single A100 GPU. The theoretical limit is 100 frames, the upper bound of our frame ID interpolation pool. We will include these findings in the final version for clarification.
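For readers curious how such frame ID interpolation can work, below is a minimal, hypothetical sketch of the Fast3R-style trick described above: frame IDs for each training sample are drawn from a larger learned embedding pool (size 100 here), so sequences of any length up to the pool size map onto embeddings the model has seen during training. The function and tensor names are illustrative, not the actual implementation.

```python
import torch

def sample_frame_id_embeddings(embed_pool: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Hypothetical Fast3R-style frame-ID interpolation.

    embed_pool: (pool_size, dim) learned frame-ID embeddings (pool_size = 100 here).
    Draw `num_frames` distinct IDs, keep them in temporal order, and look up their
    embeddings; at inference, any sequence length <= pool_size is handled the same way.
    """
    pool_size = embed_pool.shape[0]
    assert num_frames <= pool_size, "sequence length exceeds the embedding pool"
    ids = torch.randperm(pool_size)[:num_frames].sort().values
    return embed_pool[ids]  # (num_frames, dim)
```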
- GPU Memory Consumption at Inference
The model itself takes up 10 GB of VRAM, allowing it to fit on many standard GPUs. Each input frame, processed at 512x512 resolution, requires an additional 0.85 GB of VRAM. So a 24-frame input requires roughly 30.5 GB of VRAM in total, and an 80 GB A100 GPU can support roughly 82 frames per forward pass. This memory footprint is comparable to other learned feedforward reconstruction methods. We will add this resource-usage breakdown to the final paper to improve reproducibility and clarify hardware requirements.
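As a back-of-the-envelope check of the numbers above (the constants below are simply the figures quoted in this reply, not measurements from released code):

```python
def rig3r_vram_gb(num_frames: int, base_gb: float = 10.0, per_frame_gb: float = 0.85) -> float:
    """Rough VRAM estimate for 512x512 inputs: fixed model cost plus a per-frame cost."""
    return base_gb + per_frame_gb * num_frames

print(rig3r_vram_gb(24))           # ~30.4 GB, matching the ~30.5 GB quoted above
print(int((80.0 - 10.0) / 0.85))   # ~82 frames fit in an 80 GB A100
```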
- Comparison with VGGT
We experimented and found that on in-distribution data such as Waymo, Rig3R_unstr and VGGT perform similarly, and Rig3R with the rig metadata largely outperforms VGGT on both pose estimation and pointmap metrics. On out-of-distribution data such as WayveScenes101, we find that VGGT outperforms the base Rig3R_unstr, but Rig3R with the rig metadata still strongly outperforms VGGT. We find this to be strong support for the central claim: embedding rig metadata provides a strong constraint for reconstruction and pose estimation and allows for better generalization.
In terms of efficiency, Rig3R is a single-pass feedforward model with comparable inference time and memory footprint to VGGT. Once again, we thank the reviewer for their thoughtful and balanced review. Your comments helped us clarify key aspects of our method and improve the presentation. We believe the planned revisions will significantly strengthen the final version.
Thank you for your effort in the rebuttal. The responses have addressed most of my concerns and I keep my initial score of Accept.
The paper extends the 3R-series to handle structure-from-motion given multi-camera rigs. The method can leverage the rig prior if available or infer it when not. The paper is a timely extension with promising accuracy.
Strengths and Weaknesses
Strengths:
- The paper presents for the first time a feedforward solution that specifically handles images captured from multi-camera rigs. The results indicate promising accuracy compared to relevant methods.
- The results indicate promising generalization of the trained model. The method can handle different rig configurations (different numbers of cameras) with similar accuracy.
- The paper is well-organized with clear demonstrations. The novelty of the proposed method and the implementation details are clearly presented.
Weaknesses:
- It seems that the method can handle different numbers of frames, but the training and the testing are restricted to a 24-frame setup. The generalization to different numbers of frames (as opposed to different numbers of cameras per rig) is unclear.
- As mentioned in L189, WayveScenes101 uses COLMAP reconstruction as ground truth. Why do the COLMAP baselines perform badly (even worse than other feedforward baselines) in Tab. 1?
- Why is COLMAP allowed to process the full sequence (L196) instead of conducting the same setup as done by other feedforward baselines?
- The arguments in L213-216 do not coincide with Tab. 1. Rig3R_unstr does not outperform other feedforward baselines given the RRA metric (it is only better than MV-DUSt3R). Given the gap between Rig3R_Calib and Rig3R_unstr on WayveScenes101, it can hardly be regarded as 'robust without rig metadata'.
- Why can the inferior performance regarding Chamfer distance be taken as a key advantage of the proposed method? (L229)
Minor issues:
- The embedding dropout (L174) seems to be the key to nice generalization that allows the model to adapt to different rig configurations. However, little analysis is conducted regarding this solution. The reviewer is aware that the training is computationally expensive, but simply masking the metadata with 50% probability provides little information regarding the generalization capability.
Questions
- Multi-camera rigs provide a unique constraint: the rig-relative raymap should remain constant. However, the loss function in L153 does not take this strong consistency into account and merely treats the training as a supervised SfM problem. Did the authors try such a solution and find that it failed, or is the consistency simply ignored?
- Why are there missing frustums in Fig. 4 (the left one given 5 cameras and the right one given 7 cameras)?
Limitations
yes
Formatting Concerns
No
We thank the reviewer for their encouraging and thoughtful feedback. We appreciate the recognition of our contributions and the insightful questions, which help clarify key aspects of the work. Below we address each point in detail and will revise the final version accordingly.
- Support for Variable Numbers of Frames
The reviewer is correct: our model supports a variable number of frames. During training, 24 frames was a fixed parameter, and we sampled frame IDs from an ID pool, similar to Fast3R, to generalize to more frames. While the main experiments use a fixed 24-frame input for consistency and comparability across baselines, we have validated the model's generalization to different sequence lengths.
Specifically, we tested sequences from 1 frame (monocular depth) up to 82 frames, constrained only by GPU memory. The model retains stable performance across this range. The theoretical maximum is currently 100 frames, based on the size of the learned embedding pool.
That said, longer sequences can pose challenges, particularly for aligning distant views to the first frame, due to accumulated pose drift. This is a known issue in transformer-based sequence modeling, but not the focus of our current work. We will clarify this behavior and include supporting empirical results in the revision.
- COLMAP Baseline Setup and Performance
The COLMAP ground truth used in WayveScenes101 was generated from full, dense sequences, which benefit from abundant overlap and context. In contrast, our evaluations focus on more challenging settings involving strided sequences with large baseline shifts, where keypoint overlap is sparser. COLMAP suffers in these cases due to limited tracking opportunities, even though it benefits from jointly optimizing over many frames. To stay consistent with the learned baselines and our training, we process the data in 24-frame chunks; feedforward models are limited in that they treat each chunk as an independent sample. To avoid artificially handicapping COLMAP, we allow it to process and optimize the full strided sequence as a whole rather than in independent chunks. We will clarify this setup more explicitly in the final version.
- Clarification on Rig3R_Unstr Performance
Thank you for catching this discrepancy. We agree that Rig3R_unstr does not outperform all feedforward baselines and will revise our wording accordingly: the revised section will state that Rig3R_unstr performs comparably to other feedforward methods, but does not outperform them across all metrics. It exhibits a clear drop in performance relative to Rig3R_calib, highlighting the importance of rig priors when available. Nevertheless, it remains a viable zero-shot option when rig metadata is missing, and this flexibility is what we intended to show.
- Chamfer Distance and Pose Accuracy
Our intention was not to frame lower Chamfer as a direct advantage. Rather, we observed that Fast3R achieves higher Chamfer despite poor pose accuracy, suggesting that using intermediate pointmaps to infer pose (e.g., via PnP) may not guarantee consistent geometry. The key advantage is that we directly optimize for pose via regression on our raymap output representation, without relying on pointmap quality, yielding more consistent multi-view reasoning.
- Metadata Dropout and Generalization
The reviewer is correct in noting that dropout during training is a key factor in enabling generalization, both to settings with missing rig metadata and to varying rig configurations. While we do not tune this as a hyperparameter, we did experiment with different dropout levels to assess its impact. As expected, training with no dropout leads to poor performance when metadata is absent, while full dropout reduces the model to Rig3R_unstr, removing the benefit of rig constraints. We found that at scale, training with intermediate dropout naturally exposes the model to a wide spectrum of metadata availability, making performance relatively insensitive to the specific dropout rate. We use 50% as a standard, untuned setting that balances exposure to both structured and unstructured regimes. We will clarify and give more context to this in the final version.
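As an illustration of the training-time masking discussed above, here is a minimal sketch assuming each metadata modality is zeroed out independently per sample with probability p = 0.5; whether the dropout is applied per modality or jointly, and the exact tensor shapes, are our assumptions rather than details from the paper.

```python
import torch

def apply_metadata_dropout(metadata_embeds: dict, p: float = 0.5) -> dict:
    """Illustrative metadata dropout (names and shapes are assumptions).

    metadata_embeds: e.g. {'cam_id': (B, F, D), 'timestamp': (B, F, D),
    'rig_pose': (B, F, D)}. Each modality is dropped per training sample with
    probability p, exposing the model to every regime from fully calibrated
    to fully unstructured.
    """
    out = {}
    for name, emb in metadata_embeds.items():
        keep = (torch.rand(emb.shape[0], 1, 1, device=emb.device) > p).to(emb.dtype)
        out[name] = emb * keep  # zeroed metadata behaves like "metadata missing"
    return out
```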
- Loss Function and Rig-Relative Consistency
We did not include an explicit rig-relative consistency loss. However, the rig-relative raymaps are naturally aligned for cameras on the same rig and encouraged to remain consistent via shared supervision targets and metadata when available. We agree this is an interesting direction and will note it as future work.
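To make this future-work idea concrete, one hypothetical formulation (not used in Rig3R) would penalize variation of each camera's rig-relative raymap across timesteps. The 6-channel raymap layout (per-pixel ray direction plus camera center) and the function name below are our assumptions for illustration only.

```python
import torch

def rig_consistency_loss(rig_raymaps: torch.Tensor, cam_ids: torch.Tensor) -> torch.Tensor:
    """Hypothetical consistency term: a camera's rig-relative raymap should not
    change over time.

    rig_raymaps: (F, H, W, 6) per-frame rig-relative raymaps (direction + center).
    cam_ids:     (F,) integer camera ID for each frame.
    """
    unique_ids = cam_ids.unique()
    loss = rig_raymaps.new_zeros(())
    for cid in unique_ids:
        group = rig_raymaps[cam_ids == cid]  # all timesteps of this physical camera
        loss = loss + (group - group.mean(dim=0, keepdim=True)).abs().mean()
    return loss / unique_ids.numel()
```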
- Missing Frustums in Fig. 4
The missing frustums are a product of using 24-frame inputs with a variable number of rig cameras. To remain consistent with the main table's setup, our qualitative outputs also use 24-frame samples. In configurations with more cameras (e.g., 7-camera rigs), this corresponds to fewer timesteps, with the final timestep including only a subset of cameras (e.g., 3 out of 7). This is an intentional choice for the figure and highlights Rig3R's ability to handle partial rigs and missing views, illustrating our model's flexibility to operate even when the rig is incomplete. Thank you for this note; we will be sure to clarify this in the figure caption!
Once again, we thank the reviewer for their thoughtful and balanced review. Your comments helped us clarify key aspects of our method and improve the presentation. We believe the planned revisions will significantly strengthen the final version.
This paper proposes Rig3R, a novel transformer-based model for multi-view 3D reconstruction and camera pose estimation that explicitly incorporates multi-camera rig metadata—camera ID, timestamps, and rig-relative pose information—into the learning process. When metadata is missing, Rig3R is also able to infer rig structure from images alone. Key components include: (1) Rig-Aware Conditioning: Using optional rig metadata as embeddings to guide a ViT-based multiview reasoning architecture. (2) Raymap Representation: Outputting per-pixel ray directions and camera centers that enable closed-form camera pose recovery. (3) Rig Discovery: When metadata is unavailable, Rig3R can recover rig configuration directly from images. (4) Single Forward Pass: Unlike optimization-heavy pipelines, Rig3R produces results in one pass with no post-processing. The model achieves state-of-the-art performance on benchmarks including Waymo, WayveScenes101, and other real-world driving datasets across tasks like pose estimation, pointmap reconstruction, and rig discovery.
Strengths and Weaknesses
Strengths (1) Strong Practical Relevance: Multi-camera rigs are common in autonomous driving and robotics, and Rig3R’s rig-aware design directly addresses this. (2) Generalization Capability: Rig3R works well with full, partial, or missing metadata, showcasing flexibility. (3) First to Perform Rig Discovery: Rig3R is the first method to discover rig calibration from unstructured images—this is both novel and practically useful. (4) Unified Representation: The use of raymaps for both global and rig-relative coordinates is a compelling design, enabling interpretable and consistent camera pose reasoning. (5) Robust Empirical Results: The model consistently outperforms classical and learned baselines across pose and 3D structure metrics, in both seen and unseen rig settings.
Weaknesses (1) High Computational Cost: Training uses 128 H100 GPUs over 5 days, making it inaccessible to many researchers. (2) Limited Qualitative Analysis: While the paper is rich in quantitative benchmarks, more detailed qualitative visualizations (e.g., failure cases, rig discovery dynamics) would strengthen understanding. (3) Ablation Experiments Are Downscaled: Some ablation studies are conducted at reduced batch and data scale, which may limit the strength of conclusions drawn. (4) Lack of Open-Sourcing: At the time of submission, code and data are not publicly available, limiting reproducibility.
Questions
(1) Training Cost and Accessibility The full-scale Rig3R model was trained on 128× H100 GPUs over five days, which is extremely resource-intensive. Can the authors provide a more detailed discussion on training cost vs. performance trade-off, or offer a smaller model variant that retains most of the performance? (2) Ablation Study on Rig Metadata Modalities The paper proposes multiple rig-aware signals (camera ID, timestamps, rig-relative poses). However, it's unclear which modalities contribute most to performance and whether any are redundant. Can the authors provide an ablation isolating the contribution of each rig metadata signal to clarify their individual importance? (3) Rig Discovery Evaluation and Interpretability The proposed "rig discovery" is a novel and exciting contribution, but the evaluation is largely quantitative. Can the authors provide more qualitative visualizations of the discovered rig configurations (e.g., 3D layout, consistency across scenes)? Additionally, are there failure modes or cases where the rig discovery fails or produces degenerate configurations? (4) Generalization Beyond Driving Data The method is tested primarily on autonomous driving datasets. While this is the core use case, the approach may be applicable to other domains (e.g., indoor robot rigs, AR/VR capture systems). Have the authors tested Rig3R or a variant on other rigged camera setups outside the driving context?
Limitations
Please see the above questions and Strengths And Weaknesses.
Formatting Concerns
None
We thank the reviewer for their encouraging and thoughtful feedback. We appreciate the recognition of our contributions and the insightful questions, which help clarify key aspects of the work. Below we address each point in detail and will revise the final version accordingly.
- Training Cost and Accessibility
This is an important concern, and we appreciate the opportunity to clarify. We trained the full-scale Rig3R model using 128 H100 GPUs to maximize training diversity and batch size, which we found to be beneficial for generalization to unstructured and unseen rig configurations (e.g., WayveScenes101). However, the model architecture itself is lightweight, requiring approximately 10 GB of memory at inference, and can run efficiently on standard single-GPU setups. To address accessibility, we also conducted ablations at reduced scale (32 H100 GPUs over 2 days), which is already smaller than comparable works such as Fast3R. These smaller models still demonstrate strong performance, particularly when rig metadata is provided, as shown in Tables 3 and 4.
- Ablation of Rig Metadata Modalities
We thank the reviewer for highlighting this; we agree that understanding the contribution of each modality is critical. Table 3 reports performance when only a single metadata input (camera ID, timestamp, or rig pose) is provided to our final model, showing the isolated effect of each metadata signal on performance. We observe that rig-relative pose information provides the most direct geometric constraint and is especially helpful for inferring consistent camera poses. However, camera IDs and temporal indices still yield performance gains, particularly when used together, as they capture useful ordering and grouping priors. We will make this analysis clearer in the final version and highlight any observed redundancy or complementarity between signals. Table 4 compares structured, unstructured, and downscaled training settings, providing insight into how the full metadata combination performs. Due to the computationally expensive training of the full-scale model, these ablations were performed on fewer GPUs with a smaller batch size.
- Qualitative Rig Discovery Evaluation and Failure Cases
We agree that qualitative insights would greatly aid interpretability, and we will aim to include more qualitative results on rig discovery in the final version of the paper. In the supplementary material, we currently include visualizations of 3D reconstructions under different rig metadata conditions, along with color-coded raymaps and camera centers, illustrating discovered rig configurations. We will expand this further to include 3D layouts of inferred rig geometry, showing consistency across scenes and highlighting discovery dynamics. As for failure modes, the most common involve unusual or novel rig layouts that diverge significantly from the training distribution when rig calibration is not provided, as well as highly dynamic scenes. We will document representative failure cases in the revision and supplement.
- Generalization Beyond Autonomous Driving
We agree that exploring non-driving scenarios involving multi-camera rigs, such as indoor robot rigs or AR/VR capture systems, would be a valuable extension of this work and an exciting direction for future research. Our current focus is on the challenges specific to real-world rig distributions, where cameras often have minimal field-of-view overlap, making it especially important to leverage rig constraints. To highlight these effects, we evaluate on publicly available rigged datasets, which are predominantly from the autonomous driving domain due to their scale and accessibility. We will look to explore additional domains and non-driving datasets in future work.
Once again, we thank the reviewer for their thoughtful and balanced review. Your comments helped us clarify key aspects of our method and improve the presentation. We believe the planned revisions will significantly strengthen the final version.
Thank you for clarifying the scalability and accessibility of your approach. While the full-scale training setup is indeed resource-intensive, I appreciate the ablation results showing that reduced-scale models still perform strongly. Highlighting the feasibility of single-GPU inference and comparisons to Fast3R help address concerns about accessibility and reproducibility.
The ablation study in Table 3 is informative. I agree that rig-relative pose provides strong geometric grounding, and the synergy between temporal indices and camera IDs is interesting. Clarifying how these signals complement or overlap will strengthen the final version.
I appreciate the commitment to expand qualitative results and document failure cases. These additions will significantly improve interpretability and provide valuable context for real-world deployment. Highlighting rig consistency across scenes and visualizing failure modes will make the approach more transparent.
I agree that broader applicability to other rigged systems (e.g., AR/VR, indoor robotics) would be a compelling future direction. It's understandable that autonomous driving datasets currently dominate due to availability, and I encourage the authors to explore and demonstrate transferability in future iterations.
Overall, the rebuttal addresses the main concerns I had. I appreciate the authors’ effort to clarify both the scalability and technical contributions of the work. I look forward to seeing the final version with the planned revisions and expanded analyses.
Thanks to the authors for the rebuttal. It has resolved all of my concerns. I'll keep my rating as borderline accept.
This paper proposes a feed-forward multi-view 3D reconstruction method (in the vein of Dust3r) that additionally accounts for multi-camera rigs. It does so by:
- Incorporating potentially known rig information (camera IDs, temporal IDs, rig calibration) as inputs
- Inferring both global point maps as well as pose and rig information as outputs
- Using a raymap representation in addition to the standard pointmaps, and showing that raymaps are more robust for camera calibration
The paper evaluates the method on Waymo and WayveScenes101, autonomous driving datasets captured with camera rigs. On these datasets, Rig3R outperforms previous methods (both learning-based and COLMAP) when rig calibration is available, and is able to perform well and discover rig calibration when it is not given.
Strengths and Weaknesses
Strengths:
- Tackles an important use case of reconstruction from camera rigs (common in autonomous driving and other applications) by applying domain information in a novel way. In particular, it extends the emerging trend of feed-forward reconstruction models to this scenario and demonstrates improved reconstruction quality.
- Makes a number of interesting (though small) technical contributions:
- the use of raymaps for both specifying rig inputs and inferring pose and rig information,
- the use of camera metadata as input (via dropout) thus handling both the unstructured and structured capture use cases,
- better camera calibration from raymaps instead of using pointmaps
- State-of-the-art performance on multi-camera reconstruction (on the Waymo and WayveScenes101 datasets) with fast inference times.
- Well-written paper with extensive comparisons with previous methods and analyses of the method. The comparisons with the unstructured and calibrated variants of the method in particular are quite interesting.
Weaknesses:
- No major weaknesses. The paper is overall a good fit for NeurIPS. Some additional questions/clarifications are below.
Questions
- I have some confusion about the outputs of the model. My understanding is these are dense global pointmaps (in a reference camera coordinate system?) as well as dense per-frame pose raymaps and a single per-frame global rig raymap. If this is the case, why are the pointmaps predicted with a DPT head and the dense pose raymaps with an MLP (and not a DPT head)?
- How would you say Rig3R_{unstr} is functionally different from methods like Fast3R (beyond of course the outputs including raymaps)? Is it that training with rig constraints changes what the model learns even when this information is not available? I'm trying to understand why this would lead to such different performance. E.g., on Waymo Rig3R_{unstr} is significantly better than Fast3R. Surprisingly, on WayveScenes101 it is worse.
- Relatedly, why is rotation performance so much worse but translation so much better than Fast3R? Is this a function of the training datasets (the autonomous driving datasets used here prioritize translation over rotation)?
- It would be good to evaluate on non-autonomous-driving datasets.
- My understanding is that inputs correspond to 48s time extents and 120 frames. Is this correct?
- The ablations try only camera and time IDs. It would have been useful to try with both together. This would be a Rig3R_{uncalib} variant where the fact that a camera rig was used for capture is known without calibrating the rig.
Limitations
Yes
Justification for Final Rating
The rebuttal has addressed all my questions and I will retain my accept rating for this nice work.
Formatting Concerns
None
We thank the reviewer for their encouraging and thoughtful feedback. We appreciate the recognition of our contributions and the insightful questions, which help clarify key aspects of the work. Below we address each point in detail and will revise the final version accordingly.
- Clarification of Model Outputs and Prediction Heads
The model outputs include: (1) dense pointmaps in the coordinate frame of the reference image, (2) dense pose raymaps in the same frame, and (3) rig-centric raymaps in the coordinate frame of the reference camera.
The choice of prediction head is based on the type of information rather than output density. Pointmaps require fine-grained, pixel-level, spatially localized features and benefit from the multi-scale structure of the DPT head. Raymaps, on the other hand, encode global pose information per frame and do not require such detail. We therefore use lightweight MLP heads to avoid unnecessary complexity. We will clarify this design choice in the paper.
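Since closed-form pose recovery from raymaps comes up throughout the reviews, here is a generic sketch (orthogonal Procrustes / Kabsch alignment) of how a camera-to-world rotation can be obtained by aligning predicted world-frame ray directions with the corresponding camera-frame bearings; the camera center is read directly off the raymap. This is a standard recipe under the stated assumptions (known or estimated per-pixel bearings), not necessarily the exact procedure used in the paper.

```python
import numpy as np

def rotation_from_raymap(dirs_world: np.ndarray, dirs_cam: np.ndarray) -> np.ndarray:
    """Closed-form camera-to-world rotation from ray directions (Kabsch algorithm).

    dirs_world: (N, 3) unit ray directions predicted in the reference frame.
    dirs_cam:   (N, 3) unit bearings of the same pixels in the camera frame,
                e.g. normalize(K^-1 @ [u, v, 1]) for known or estimated intrinsics K.
    Returns R such that dirs_world ≈ dirs_cam @ R.T.
    """
    H = dirs_cam.T @ dirs_world                                   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # guard against reflections
    return Vt.T @ D @ U.T
```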
- Differences Between Rig3R_unstr and Fast3R
Rig3R_unstr operates without rig metadata at inference and is functionally similar to Fast3R in that sense. However, our training includes metadata dropout and rig-aware output heads, which encourages the model to learn more consistent internal representations. This is particularly true for rig configurations seen in training, leading to the higher performance of Rig3R_unstr on Waymo. This implicit structure could also help improve spatial coherence, potentially explaining the stronger mAA and RTA scores observed on WayveScenes101.
- Rotation vs Translation Metrics
The reviewer is correct in noting that Rig3R performs better on translation. This is indeed a function of our training mix, which largely consists of time-strided driving data. This distribution encourages the model to develop a strong understanding of vehicle dynamics and temporal structure, particularly when coupled with rig poses and timestamps during training. As a result, the model may implicitly prioritize spatial consistency in translation. We would like to study this behavior further in future work.
- Evaluation Beyond Driving Datasets
We agree with the reviewer that it would be valuable to explore non-driving scenarios with multi-camera rigs, such as multi-view drones or robotics platforms, and we consider this a promising direction for future work. Our paper and model focus on the challenges posed by real-world rig distributions, where cameras often have minimal field-of-view overlap, and on how this makes leveraging known rig constraints particularly beneficial. To demonstrate these effects, we evaluate on public, real-world rigged datasets, most of which are driving datasets.
- Temporal Coverage and Input Setup
Our main evaluations use 24-frame input sequences, corresponding to 5 distinct timesteps (with a 2-second stride) across 5 cameras, covering approximately 10 seconds. Section 4.3 explores different camera counts and the associated trade-offs in the number of timesteps. While we fix the input to 24 frames for consistency in evaluation, the model supports and has been tested with more or fewer frames. However, this flexibility is not a primary focus of the paper.
- Using Both Camera and Time IDs (Uncalibrated Rig Setting):
We agree this is a meaningful intermediate case, where the presence of a rig and timestamps is known but calibration is unavailable. We expect its performance to fall between the fully unstructured and calibrated variants, and plan to include experiments with this configuration in the final version.
Once again, we thank the reviewer for their thoughtful and balanced review. Your comments helped us clarify key aspects of our method and improve the presentation. We believe the planned revisions will significantly strengthen the final version.
Thank you for clarifying my doubts. I am happy to retain my accept rating.
Following the recent trend of developing feed-forward models for 3D reconstruction, this paper proposes a feed-forward multi-view 3D reconstruction method that additionally accounts for multi-camera rigs. It incorporates potentially known rig information (camera IDs, temporal IDs, rig calibration) as inputs, infers both global pointmaps and pose and rig information as outputs, and uses a raymap representation in addition to the pointmaps, which offers more robust camera calibration. The paper evaluates the method on Waymo and WayveScenes101, autonomous driving datasets captured with camera rigs. The results are state of the art.
All reviewers were positive before rebuttal and remain positive after rebuttal. Minor questions on paper presentation and experiments were properly addressed. Therefore, the AC agrees with accepting the paper.