Rooms from Motion: Un-posed Indoor 3D Object Detection as Localization and Mapping
3D object detection (with or without known camera pose) can be considered mapping and localization with objects as 3D oriented boxes.
Abstract
Reviews and Discussion
The main idea is an object-based SfM, using object cuboids and their descriptors in a way analogous to points and their descriptors in point-based SfM. Cuboids are detected and feature descriptors are extracted using CuTR; pairwise image matching is performed using an extension of SuperGlue; pairwise pose is solved; and global pose is solved using rotation/translation averaging. The matching and pose estimation steps require some innovation to deal with the cuboid candidate matches, e.g. a second SuperGlue model is trained to produce corner points that correspond better.
Experiments on ScanNet++ and CA-1M demonstrate good results for multiview 3D object detection (compared to methods that rely on some form of point/dense fusion) and pose estimation (compared to DROID-SLAM and CUT3R).
Strengths and Weaknesses
The paper is creative and has plenty of little innovations, as well as a good proof-of-concept evaluation. The main weakness is that it does a poor job explaining the relationship to object-based SLAM and SfM (particularly global SfM). Also, it would have been nice to compare the pose estimation to the global SfM methods that are more analogous, such as GLOMAP or 1DSfM, and to incremental methods such as COLMAP. Also, the speed and scalability of the method aren't analyzed. It feels half-baked.
Strengths
- Revisiting SfM using class-agnostic object cuboids is an interesting idea
- The paper offers many little innovations needed to make it work
- Results in both 3D object detection and pose estimation are good compared to recent methods
- The paper is mostly clear about the approach and results
Weaknesses
- The related work is much too cursory, basically saying that there is a bunch of work on object-centric SLAM and 3D object detection, without explaining or citing most of that work. The paper and figures generally feel unpolished, so it seems like this is due to running out of time rather than well-considered space prioritization.
- Does not compare to SfM approaches (either incremental or global), while the proposed method is closer in approach to them than it is to SLAM
- Does not provide any analysis of timings
Recommendation: Weak accept. The paper is interesting, provides good proof-of-concept experiments, and could inspire future work. However, the paper feels thrown together, with its barely-there related work section and conclusion, small text in figures, and questionable choices of figure and text space allocation. It feels half-baked and, while worth reading in its current form, the authors should consider a revise-and-resubmit to maximize impact.
Questions
Comments
- For DROID-SLAM, it would have been nice to compare with monocular depth
- "Rooms from Motion" isn't the best title because it's not really about estimating rooms. "Object-based SfM" or "Objects from Motion" may fit better
- After reading the paper, I'm still not sure of the practical use, e.g. does this have potential to be part of a well-engineered object mapping system, would it be used in conjunction with points or as a post-process to SfM, etc.
- It's not clear how the method is adjusted when you have ground truth depth or pose
Limitations
OK
Final Justification
I disagree with reviewer n63D's statement "interesting application of structure-from-motion for objects, but there's little additional insights".
The paper does have significant deficiencies in its positioning with respect to related work and in clearly stating its goals and contributions.
That said, I stick with my original assessment that while the paper could do a better job of positioning itself, it's an interesting paper that is above bar in its current state.
Formatting Issues
n/a
Thank you. We will elaborate on the set of related work in a future revision. We agree "Objects from Motion" is an appropriate name and will work to clarify this.
Regarding the cited weaknesses:
Does not compare to SfM: We emphasize that the main goal of this paper is un-posed 3D object detection. We provide localization comparisons in order to calibrate its ability. Additionally, our method is not easily comparable to SfM, even if it is inspired by it. In particular, we do matching in a metric 3D space and expect to output metric camera poses and 3D objects/boxes. Traditional SfM techniques do not generally output metric quantities and must rely on some mechanism of external scaling.
Does not provide any analysis of timings: We provide some basic timings for a video (100 frames) running on a V100:
- Detection of objects: ~65 ms per frame (640x480 image)
- Average time to "add" a frame to the view graph (perform matching, relative pose estimation to all other frames): 700ms
- ~10 seconds to "finalize" the graph (averaging/filtering)
The optional bundle adjustment runs on CPU using PyCeres and takes approximately 3-5 minutes (depending on the number of objects).
As is the case for global SfM methods, our primary goal is to run offline and a streaming version (closer to SLAM) is an interesting area of future research. We will add these timings to an updated revision.
Regarding the provided comments:
nice to compare with monocular depth: We do not compare with DROID-SLAM using monocular depth since, in that scenario, it will only produce camera poses up to scale, which is incompatible with our setting of metric camera localization.
not sure of the practical use: We believe our framework can immediately replace traditional 3D object detection systems which rely on point clouds and other machinery like 3D sparse convolution when depth is present. Furthermore, we believe that our 3D object detection and/or localization can be applied on systems which lack the ability to capture images at a rate necessary for traditional SLAM or which lack depth sensors. Generally, our framework makes 3D object detection readily applicable in a wider set of scenarios than previously thought possible.
how the method is adjusted: We will clarify this in an updated revision. To incorporate depth, we adopt the RGB-D variant of CuTR, which takes an RGB image alongside a metric depth map as input. To incorporate pose, we use the ground-truth relative pose rather than the estimated relative pose. Additionally, we use the ground-truth absolute poses, which are used to lift individual 3D object tracks into world space.
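For concreteness, a minimal numpy sketch (illustrative only, not code from the paper) of that last step, lifting a camera-frame 3D box into world space given an absolute camera-to-world pose:

```python
import numpy as np

def lift_box_to_world(corners_cam, R_wc, t_wc):
    """Transform metric 3D box corners from the camera frame to the world frame.

    corners_cam: (8, 3) box corners in the camera frame.
    R_wc: (3, 3) rotation, t_wc: (3,) translation of the camera-to-world pose.
    """
    return corners_cam @ R_wc.T + t_wc

# Sanity check: an identity pose leaves the box unchanged.
corners = np.random.rand(8, 3)
assert np.allclose(lift_box_to_world(corners, np.eye(3), np.zeros(3)), corners)
```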
Thanks for the response.
The paper is currently positioned as "localization and mapping within indoor spaces (i.e., rooms) using 3D objects as the fundamental primitive without the need for point-based representations." This makes it sound like camera pose estimation is the main objective and object detection is a nice side effect. The rebuttal is saying that pose estimation is not the central aim.
The paper needs revision to be more clear about its goals/contributions and positioning in related work.
I still think that for camera pose estimation, it would be good to include other SfM/SLAM approaches for comparison, even if you need to scale them and add an asterisk noting that they need to be scaled.
Despite these issues, I still think the paper is worth accepting for its originality. So basically, my position is not changed from the original review, and I still recommend accept with reservations.
Hi,
Thank you. We appreciate the feedback. The primary intent of our work is 3D object detection from un-posed images (e.g., "Rooms from Motion: Un-posed 3D object detection..."), which has not generally been explored before, so we do hope it is not seen as a side effect. However, our approach does treat it as a localization and mapping problem where the mapping results in 3D object detection. Building consistent object tracks and performing the optimization (partial BA) on 3D boxes benefits 3D object detection, not the localization. Nonetheless, we see localization as a (good) side effect. After all, in our approach, one cannot expect good 3D object detection results without good localization. But theoretical approaches may exist which can produce 3D object detection results from un-posed images without explicit localization.
As suggested, we will work to add more comparisons to point-based localization methods to ensure the usefulness as a localization pipeline can be better realized. For instance, we add colmap results below. These results should be compared to the RGB-D case since we use Umeyama alignment with scaling against the GT trajectory (as computed by evo) to overcome the lack of metric scale. We hope these results emphasize that the localization abilities of RfM are compelling and results on the given datasets are non-trivial. All results are computed on 100 frames per video (the same setting as RfM).
CA-1M:
| Method | ARE (deg) | ATE (cm) | Registered |
|---|---|---|---|
| colmap (scaled) | 18.5 (50.1) | 38.5 (68.0) | 100% |
ScanNet++:
| Method | ARE (deg) | ATE (cm) | Registered |
|---|---|---|---|
| colmap (scaled) | 9.2 (33.3) | 28.8 (61.0) | 100% |
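For reference, a minimal numpy sketch of the kind of Umeyama alignment with scaling used for this evaluation, followed by an ATE computation (illustrative only; the numbers above were produced with evo, whose implementation and reported statistic may differ):

```python
import numpy as np

def umeyama_alignment(est, gt, with_scale=True):
    """Closed-form similarity transform (s, R, t) aligning estimated camera
    positions to ground truth: gt ~= s * R @ est + t. est, gt: (N, 3)."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    U, D, Vt = np.linalg.svd(G.T @ E / len(est))
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                         # guard against a reflection
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / E.var(0).sum() if with_scale else 1.0
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt):
    """RMSE absolute trajectory error after scale-corrected alignment."""
    s, R, t = umeyama_alignment(est, gt)
    aligned = (s * (R @ est.T)).T + t
    return np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean())
```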
The paper introduces Rooms from Motion (RfM), a method that treats the 3D bounding boxes of objects in a scene as first-class citizens and generates both camera poses and a map of 3D bounding-box positions of a scene from unordered images.
RfM uses a multistep pipeline. In a first step, it uses the prior work Cubify Transformer (CuTR) to detect 3D bounding boxes of objects in each image individually. RfM then matches both objects and object corners across image pairs, thereby generating point matches between images. Using some filtering heuristics to suppress wrong matches, these derived point matches are then used to estimate camera poses and global object poses. CuTR can process metric depth maps for improved performance, and metric depth as well as ground-truth camera poses can also be used in the last step of the pipeline to increase accuracy.
RfM shows improved 3D object detection performance on both the CA-1M and ScanNet++ datasets. It also outperforms DROID-SLAM and the recent CUT3R on the task of camera pose estimation.
Strengths and Weaknesses
- Strengths
- new, and intriguing idea to use only object-bounding-boxes as scene representation (map)
- I can imagine that by focussing on objects as "first-class citizens", there might be a bit more robustness in the matching, though at the expense of fewer / less accurate individual point matches
- The main advantage of this approach I think is in using the object-centric scene representation / detected objects for downstream applications like scene understanding / robotic planning. From this perspective however, not modelling the background geometry (floor, walls, ceiling) seems like a limitation (though one that it should be possible to overcome)
- straightforward pipeline, well-explained
- strong results in a number of settings and different tasks
- Weaknesses
- Limited Baselines for Camera Pose Estimation.
- would like to see something like COLMAP with strong pixel matchers, and perhaps VGG-SfM and/or VGG-T as baselines
- The bounding-box-focussed scene representations might be less appropriate in outdoor scenes / object-sparse scenes (beaches, highways) / non-human environments (e.g. forests)
Questions
- Could you add some additional strong baselines for the Camera Pose Estimation task? (see Weaknesses)
- I would like to see a diagram of the box matching network and the box corner matching network, even if only in the appendix.
Limitations
yes
Final Justification
I agree with other reviewers that the camera localisation experiments should compare against some stronger baselines, either ones that require ground-truth depth (which the authors use in some of their experiments already and which can be cheaply obtained with phone-based LiDAR sensors) or point-based methods that do output metric predictions, such as MASt3R.
I suspect that such baselines will, perhaps significantly, outperform the present method in camera pose estimation, but my rating does not depend on the method's performance on a single application. I believe the presented approach is novel and promising and has the potential to spark future fruitful work in exploring the best domains and applications for it. I therefore maintain my rating.
Formatting Issues
No
Thank you.
Regarding the cited weaknesses:
Limited Baselines for Camera Pose Estimation: We would like to add additional RGB only baselines, as suggested, however this is actually quite difficult given that a fair comparison requires a method capable of outputting metric camera poses from only RGB images. Methods like colmap, VGG-T, Mast3r-SLAM do not output metric information (only up to scale) and require explicit scaling by some means to achieve this. We found CUT3R to be the only appropriate baseline here. As a result, we do emphasize in the paper that these results are meant to calibrate and inspect our method and less so prove that it is a superior localization method. The main purpose of the paper is to perform un-posed 3D object detection, which has not generally been explored before. We emphasize that 3D object detection is inherently a metric task which helps inform our approach.
diagram of: We will add additional diagrams of the box matching and corner matching networks in a future revision (notably, due to the rebuttal format, we cannot upload them here).
I thank the authors for their response.
I would like to note that point-based methods whose outputs are only up to scale can have their scale fixed using ground-truth depth, which the paper's experiments in Table 1 already use, so it is reasonable to also evaluate camera pose estimation in that setting. Furthermore, there are point-based reconstruction and camera pose estimation methods that do produce metric estimates, such as MASt3R (https://arxiv.org/pdf/2406.09756).
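As one concrete way to implement this suggestion (an assumed protocol, not one prescribed by any particular baseline), the scale of an up-to-scale estimate can be fixed per scene with a robust median ratio against ground-truth depth:

```python
import numpy as np

def metric_scale_from_gt_depth(pred_depth, gt_depth, min_depth=0.1):
    """Estimate a single per-scene scale factor mapping an up-to-scale
    reconstruction into metric units via a median depth ratio.

    pred_depth, gt_depth: (H, W) arrays; pixels with invalid GT are <= 0.
    """
    valid = (gt_depth > min_depth) & (pred_depth > 0)
    return float(np.median(gt_depth[valid] / pred_depth[valid]))

# The resulting factor would then scale the estimated camera translations
# (and any reconstructed geometry) before computing metric ATE.
```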
I continue to believe this approach of object-detection-based SLAM is an intriguing research direction, even if the downstream applications or domains that most benefit from this approach remain to be conclusively determined in future work.
Thank you. We agree that adding scaled results may be useful to the reader. We have attached initial colmap results using scaling below. We are committed to including results from Mast3r and Mast3r-SLAM. Mast3r results are currently being computed; however, these will take some time: over 15 minutes per video (50 frames) and 60 minutes (100 frames) using SparseGA. We do emphasize that results from CUT3R (initialized from a pre-trained dust3r) are already included in the paper.
CA-1M:
| Method | ARE (deg) | ATE (cm) | Registered |
|---|---|---|---|
| colmap (scaled) | 18.5 (50.1) | 38.5 (68.0) | 100% |
ScanNet++:
| Method | ARE (deg) | ATE (cm) | Registered |
|---|---|---|---|
| colmap (scaled) | 9.2 (33.3) | 28.8 (61.0) | 100% |
This paper proposes Rooms from Motion (RfM), an object-centric framework for 3D object mapping and camera localization from unposed RGB(-D) images, aiming to solve the problem of indoor 3D scene understanding without relying on known metric camera poses.
RfM replaces traditional keypoint-based structure-from-motion (SfM) with an object-level pipeline: it detects metrically scaled 3D bounding boxes using a pretrained detector (CuTR), matches objects and their 3D box corners across images, estimates relative and global camera poses, and constructs global 3D object tracks, which are further refined through bundle adjustment.
Experiments on CA-1M and ScanNet++ demonstrate that RfM achieves competitive or superior performance in 3D object detection compared to point- and volume-based baselines, even without access to ground-truth poses or depth. In the localization task, RfM shows strong results, outperforming state-of-the-art methods such as DROID-SLAM and CUT3R in both RGB-D and RGB-only settings.
Overall, the paper presents a pose-free, object-centric approach to 3D detection and localization that offers a novel alternative to conventional volume-based 3D detectors and SfM methods.
Strengths and Weaknesses
Strengths
- The paper introduces a shift from traditional point-based SfM to an object-centric formulation. By treating 3D bounding boxes as the fundamental geometric primitive, the approach provides a sparse yet semantically meaningful representation for both mapping and localization.
- The method is evaluated thoroughly on two large-scale datasets (CA-1M and ScanNet++), across multiple settings (posed/unposed, RGB/RGB-D). Results show consistent improvements over both point-based and voxel-based baselines, in both 3D object detection and camera localization.
Weaknesses
- The paper does not clearly specify the training procedure for key components, particularly the object- and corner-matching networks. It is strongly implied that they are trained using ground-truth object correspondences in CA-1M and possibly ScanNet++, but this is not stated explicitly, nor are hyperparameters or architectures described.
- In the camera pose estimation experiments, RfM appears to be evaluated on the same datasets it is (likely) trained on, whereas competing methods like DROID-SLAM and CUT3R are used off-the-shelf without domain adaptation. This setup favors RfM and may overstate its performance advantage. A fairer comparison would involve either evaluating all methods in cross-domain settings or including additional baselines with access to similar supervision.
- The paper does not report any timing statistics for detection, matching, pose estimation, or optimization. This makes it difficult to evaluate the method's scalability or real-time potential, especially considering the potentially expensive steps like pairwise matching, graph-based pose averaging, and object-level bundle adjustment.
- All experiments are conducted on indoor RGB(-D) datasets (CA-1M and ScanNet++), both of which are highly curated and structurally similar. There is no discussion or empirical evidence regarding generalization to more diverse environments (e.g., outdoor scenes, clutter, motion blur, or different camera types). This raises questions about the robustness of the approach outside the training distribution.
- The framework relies on successful object detection and matching to drive both mapping and localization. It is unclear how the method performs in sparsely populated environments or in scenes with limited object overlap between views. In such cases, the view graph could become disconnected, but no quantitative analysis of failure modes is provided.
Questions
- Training details: Could the authors clarify the training procedure for the object- and corner-matching networks? Specifically, what datasets were used, and what architectures and hyperparameters were used?
- Fairness of localization comparisons: Were the RfM matchers trained on the same datasets (CA-1M or ScanNet++) used for evaluation in the localization experiments? If so, how do the authors justify the fairness of comparing to methods like DROID-SLAM and CUT3R, which were not trained or adapted to these domains? Have the authors considered evaluating RfM in cross-dataset settings to assess generalization?
- Runtime and scalability: Can the authors provide timing statistics for the major components of the pipeline (e.g., CuTR inference, object matching, pose averaging, track-level optimization)? This would help assess the method’s practicality in large-scale or time-sensitive applications.
- Generalization to other domains: Have the authors considered applying RfM to domains beyond indoor scenes, such as outdoor environments or videos with different camera characteristics (e.g., lower resolution, motion blur)? If not, can the authors comment on the expected limitations or required adaptations?
- Robustness in sparse or low-overlap scenes: How does RfM perform in cases where object density is low or where object overlap between frames is minimal? Does the view graph often become fragmented in such settings, and are there any mechanisms to recover from such cases? Quantitative or qualitative analysis of these failure modes would be helpful.
Limitations
yes
Final Justification
Most concerns have been well addressed. However, I remain unconvinced about the fairness of the comparison to DROID-SLAM, as it was not finetuned on the ScanNet++ or ARKitScenes datasets. Finetuning DROID-SLAM on these datasets and then conducting the comparison would provide a more convincing evaluation for readers.
While I suggest the authors include the finetuned DROID-SLAM results to strengthen the narrative, the experiments presented in the main paper are sufficient to demonstrate the effectiveness of the proposed method. Thus, I will maintain my rating as accept, especially since the direction on object-level localization and mapping offers an interesting perspective.
Formatting Issues
no
Thank you.
Regarding the cited questions:
Training details: We will clarify training procedures and hyperparameters in an update to part B of our appendix. Some aspects of note:
- We train models on the same datasets (CA-1M or ScanNet++) as used for evaluation and use the train and validation splits respectively.
- We generally adopt the same model/training hyperparameters as the original Cubify Transformer paper.
- The additional Cubify Match components are trained end-to-end using the double softmax loss from LightGlue.
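For readers unfamiliar with this objective, a generic PyTorch sketch of a dual-softmax ("double softmax") matching loss of the kind referenced above (illustrative only; the exact LightGlue formulation also involves matchability/dustbin terms and is not reproduced here):

```python
import torch
import torch.nn.functional as F

def dual_softmax_loss(desc0, desc1, gt_matches, temperature=0.1):
    """Dual-softmax matching loss over two sets of per-object descriptors.

    desc0: (N, D), desc1: (M, D) L2-normalized descriptors.
    gt_matches: (K, 2) long tensor of ground-truth (i, j) correspondences.
    """
    sim = desc0 @ desc1.t() / temperature                # (N, M) similarities
    # Log-probability that i matches j under both row- and column-wise softmax.
    log_p = F.log_softmax(sim, dim=1) + F.log_softmax(sim, dim=0)
    i, j = gt_matches[:, 0], gt_matches[:, 1]
    return -log_p[i, j].mean()
```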
Fairness of localization comparisons: We note that CUT3R is trained on ScanNet++ and ARKitScenes (which shares the same underlying data as CA-1M). Therefore, we believe the comparison to CUT3R is fair --- they will have seen the same sets of images as our method and, in fact, many more since they train on much more than just these datasets. For DROID-SLAM, we note that its training is emphasized to be quite generalizable. Additionally, for DROID-SLAM, we believe the failures on ScanNet++ are less due to training datasets and more fundamental to SLAM systems, which cannot easily handle the non-smooth/sparser trajectories present in ScanNet++.
Runtime and scalability: We provide some basic timings for a video (100 frames) running on a V100:
- Detection of objects: ~65 ms per frame (640x480 image)
- Average time to "add" a frame to the view graph (perform matching, relative pose estimation to all other frames): 700ms
- ~10 seconds to "finalize" the graph (averaging/filtering)
The optional bundle adjustment runs on CPU using PyCeres and takes approximately 3-5 minutes (depending on the number of objects).
Further optimization could be applied to steps (1) and (2) to chunk/batch these operations which are highly parallel.
Generalization...: Generalization to other domains (e.g., outdoors) is an exciting direction for further research and something we are actively investigating; however, we note that our method is only applied to indoor 3D object detection. We do find our method has slightly degraded performance at lower resolutions (as the CuTR method slightly degrades), but it can work successfully down to resolutions like 256 x 192 and is especially robust if depth is also provided at lower resolutions. Regarding motion blur, most videos in CA-1M have a large amount of motion blur due to the scanning pattern speed, and so we find our method is capable of working with motion blur since it is trained on motion-blurred data to start with.
Robustness in sparse...: Regarding sparse scenes, we believe the impact may be seen in the ablation (Table 4). In particular, when we have fewer objects due to a smaller taxonomy (e.g., ARKitScenes), this will impact the ability to localize successfully. This shows the importance of a detector like CuTR, which is trained to detect nearly every object, albeit in a class-agnostic manner. Additionally, regarding overlap, FoV does matter, as seen in Table 5, where a narrow-FoV camera does lead to worsened results. We believe further research can help reduce this requirement by incorporating more priors/context (like dust3r/mast3r) into this type of framework.
I appreciate the authors for addressing my questions regarding training details, fairness of the localization comparison, runtime, generalization, and robustness to sparse scenes.
Most concerns have been well addressed. However, I remain unconvinced about the fairness of the comparison to DROID-SLAM, as it was not finetuned on the ScanNet++ or ARKitScenes datasets. Finetuning DROID-SLAM on these datasets and then conducting the comparison would provide a more convincing evaluation for readers.
While I suggest the authors include the finetuned DROID-SLAM results to strengthen the narrative, the experiments presented in the main paper are sufficient to demonstrate the effectiveness of the proposed method. Thus, I will maintain my rating as accept, especially since the direction on object-level localization and mapping offers an interesting perspective.
In this paper, the authors propose Rooms from Motion -- a mapping and localization framework that is grounded around objects instead of traditional points. The authors use an off-the-shelf 3D bounding box detector, match detections across images, and then run "object corner point" based matching to get the relative pose between two images. After that, utilizing standard structure-from-motion machinery, optimized poses and a map can be retrieved. Unlike a traditional point-based map, this map is grounded in objects and is more lightweight. The authors also show that their method can work in pose + RGB-D, RGB-D, pose + RGB, and RGB-only settings. Through experiments on the CA-1M and ScanNet++ datasets, the authors show that their method beats other baselines in the respective categories.
Strengths and Weaknesses
- The authors use an off-the-shelf 3D bounding box detector (Cubify Transformer - CuTR) and base their pipeline on top of it. While this is interesting, there's limited originality. In particular, the authors have just replicated the structure-from-motion pipeline for objects instead of points. Even if we consider this as innovation, the relative pose estimation still relies on a point-based method. The authors use corners of detected 3D bounding boxes, project them to 2D, and then match among these points. So, the relative pose still comes from points and not objects as claimed in the paper.
- Another aspect this paper is missing is that it uses an off-the-shelf 3D object detector. This is fine, but we can also just generate dense scenes using off-the-shelf 3D reconstruction methods such as VGGT and then run point-cloud-based 3D object detection. The presented method might perform better, but it needs to be compared with others.
- I also have concerns about evaluation. The paper needs to compare with SOTA baselines in the respective areas and not just ANY baseline. For example, a comparison with point-based localization and mapping is completely missing. I know that DROID-SLAM is point-based, but it's SLAM, while what the authors are presenting is SfM. Also, among RGB-only methods, the paper uses CUT3R as a baseline, while much better baselines do exist (e.g. MASt3R).
- Regarding evaluation, I think the authors should follow the standards of the respective evaluations. For example, for 3D object mapping it's common to report mAP at various thresholds (not just AP and AR). Similarly, for pose estimation it's common to report the percentage of images localized within different error thresholds. I would highly encourage the authors to read relevant baselines and follow their standards.
- Overall, I believe this is an interesting application of structure-from-motion for objects, but there's little additional insight provided by the authors that adds value to NeurIPS readers.
Questions
- Table 1: FCAF is cited wrongly
- Can authors report mAP, similar to other baseline methods? (e.g. UniDet3D report mAP)
- Why do we only compare with CUT3R in the RGB pose estimation baseline (Table 3)? This is the place where the authors need to prove their localization can achieve better results than point-based methods (COLMAP [1], MASt3R-SLAM [2], VGGT [3]).
[1] Johannes L. Schönberger. Robust Methods for Accurate and Efficient 3D Modeling from Unstructured Imagery. ETH Zürich, 2018.
[2] Murai R, Dexheimer E, Davison AJ. MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025 (pp. 16695-16705).
[3] Wang J, Chen M, Karaev N, Vedaldi A, Rupprecht C, Novotny D. VGGT: Visual Geometry Grounded Transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025 (pp. 5294-5306).
Limitations
Authors have discussed limitations of their work in sufficient detail.
Final Justification
I read the authors' rebuttal and the other reviewers' responses. Based on all the discussion, I agree that the paper does a bit more than just using objects instead of points. The authors have been responsive in answering my questions. Based on all the evidence, I increased my rating from reject to weak accept.
Formatting Issues
No concerns
limited originality: We believe that our work has significant originality and is more than "replicated structure-from-motion pipeline for objects instead of points". We believe this underestimates the difficulty of making such a framework work in a general and large-scale manner. As far as we know, no previous work has considered a learned object centric matching as a basis for estimating relative pose. Similarly, no work proves that consistent 3D object tracks can be established without a priori pose, as we do in this work. Furthermore, we introduce early insights into the idea of global optimization (bundle adjustment) on 3D boxes, which has not been explored in previous work, as far as we know. Instead, previous work generally uses 2D keypoints to detect features of interest with little semantic value. We show that 3D objects (as detected from images) are themselves a sufficient primitive that can be matched across images. Notably, this matching uses metric 3D points derived from the set of detected objects (using the corners in a metric 3D space). Furthermore, this matching is enforced to respect the object-level matching and has high semantic value. We emphasize that this matching is not done using 2D pixels/points, instead, we only show the projections in Figure 2 as a means of visualization.
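To make the metric 3D matching concrete, here is a closed-form sketch of how a relative pose between two frames could be recovered from matched metric box corners (illustrative only, not necessarily the exact solver used in the paper); in practice such an estimate would typically sit inside a robust/RANSAC loop:

```python
import numpy as np

def rigid_pose_from_matched_corners(P, Q):
    """Closed-form rigid transform (R, t) with Q ~= R @ P + t, estimated from
    matched metric 3D points such as corners of matched boxes in two frames.

    P, Q: (K, 3) arrays of corresponding 3D points in the two camera frames.
    """
    mu_p, mu_q = P.mean(0), Q.mean(0)
    H = (P - mu_p).T @ (Q - mu_q)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T       # guard against reflections
    t = mu_q - R @ mu_p
    return R, t
```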
we can also just generate dense scenes using off-the-shelf 3D reconstruction methods such as VGGT: While it is true that methods like VGG-T/Mast3r-SLAM/colmap are exciting for dense reconstruction, these methods output normalized reconstructions (i.e., up to a scale factor). This is fundamentally different from our approach, which outputs metric camera poses and 3D boxes (expressed in real-world meters). This is a fundamental requirement for a 3D object detector --- it must produce distances and dimensions in a real-world scale in order to be useful. Additionally, one important aim of our work is to show the plausibility of using objects (as 3D boxes) as the sole primitive for performing localization and mapping in a large-scale setting. As we show in the RGB-D/posed case, the methods we compare to (FCAF, UniDet3D) are, in fact, operating on the dense reconstruction as an aggregated point cloud or mesh. Nonetheless, we show that these methods can be outperformed by RfM, which challenges the idea that a dense reconstruction is even the best representation for 3D object detection. Furthermore, in order to efficiently process a dense reconstruction, one needs additional machinery like sparse 3D convolutions which are quite costly and not easily applied in certain settings.
FCAF is cited wrongly: Thank you, we will correct this.
mAP: We clarify that we do report mAP (AP averaged over all classes) in the exact manner as FCAF, UniDet3D, and others. We will update the notation to reflect this.
comparison with point-based localization and mapping is completely missing: We emphasize that the main goal of this paper is un-posed 3D object detection. We primarily intend the localization evaluation and experiments to calibrate our method and show that 3D object detection can be performed as a localization and mapping task using 3D objects as a primitive.
only compare with CUT3R in RGB: We note that the comparisons in our paper are with respect to metric camera pose estimation and 3D object detection. The baselines mentioned (COLMAP, Mast3r-SLAM, VGG-T) all produce camera poses and reconstruction up to an unknown scale. The best suitable metric-capable baseline we have found is CUT3R, which acts similarly to Mast3r, and generally gives results comparable if not better than Mast3r-GA.
I thank the authors for clarifying their matching method and the overall contribution of the paper. I mistook CubifyMatch for 2D corner matching.
While I understand the authors' reluctance to compare with learned 3D reconstruction / point-based SfM methods (COLMAP), such a comparison is essential considering the paper is about localization and mapping. I can understand that viewing objects as an anchor point for a localization and mapping pipeline is an interesting direction, but not comparing with existing 3D reconstruction methods does not give readers the full context required to make sense of the current work. I see other reviewers (p8G3 and ZnXN) raising similar concerns.
I also read through other reviews, and do think that adding objects into mapping + localization pipeline is an interesting direction.
I am inclined to raise my rating to weak accept considering all this. But I strongly recommend the authors compare with SoTA localization + mapping pipelines. I don't think the reviewers are looking for a result that this method beats other methods, but for readers to have full context it is required to have this comparison.
Thank you. We are committed to adding results from other methods (including colmap, Mast3r, and Mast3r-SLAM). We have produced preliminary results with colmap (using scale alignment) and will continue to add others including Mast3r (these are being computed but take over 15 minutes per video).
Copying the colmap results here for clarity:
CA-1M:
| Method | ARE (deg) | ATE (cm) | Registered |
|---|---|---|---|
| colmap (scaled) | 18.5 (50.1) | 38.5 (68.0) | 100% |
ScanNet++:
| Method | ARE (deg) | ATE (cm) | Registered |
|---|---|---|---|
| colmap (scaled) | 9.2 (33.3) | 28.8 (61.0) | 100% |
Thank you! Please update here with a comment when the results are ready and added to the paper. I will take a look at it again and update my rating accordingly.
We provide some preliminary translation error results when using mast3r GA on ScanNet++ and CA-1M. For ScanNet++, we provide results for 50 and 100 frames. For CA-1M, we only provide 50 frames and will provide 100 frames in an updated manuscript. This is because mast3r takes 15 minutes for 50 frames and over an hour for 100 frames. We don't expect (as seen from ScanNet++) that 100 frames will make a significant difference. Rotation error results will also be provided in the updated manuscript. Generally, we see that Rooms from Motion has significantly lower translation error than mast3r, while mast3r is somewhat better than or on par with cut3r.
CA-1M (50 frames):
| Method | ATE (cm) | Registered |
|---|---|---|
| mast3r (metric) | 24.5 (29.0) | 100% |
ScanNet++ (50 frames):
| Method | ATE (cm) | Registered |
|---|---|---|
| mast3r (metric) | 25.8 (31.7) | 100% |
ScanNet++ (100 frames):
| Method | ATE (cm) | Registered |
|---|---|---|
| mast3r (metric) | 26.2 (32.3) | 100% |
Hi Reviewers,
Thanks for your effort in reviewing for NeurIPS. We are now in the reviewer-author discussion phase. Please look at each other's reviews and the authors' responses, and further clarify any doubts, especially any points of disagreement with the authors, before Aug 6 11:59pm AoE.
--AC
This paper receives 2x borderline accepts and 2x accepts. The reviewers believe that the proposed approach is novel and promising. It also has the potential to inspire future work in this direction. The work shows clear goals and contributions. However, there are still some remaining concerns about the comparison with stronger baselines. The authors are strongly encouraged to include a comparison with fine-tuned DROID-SLAM in the final paper. Nonetheless, the reviewers think that this missing comparison does not affect the acceptance of the paper. The ACs follow the final ratings of the reviewers to accept the paper.