Unsupervised Discovery of Object-Centric Neural Fields
We propose uOCF, an unsupervised 3D object discovery approach that discovers objects from single real-world images by disentangling objects' intrinsics and extrinsics, allowing applications like 3D object segmentation and scene manipulation.
Abstract
Reviews and Discussion
This paper proposes a novel approach called unsupervised discovery of Object-Centric neural Fields (uOCF) for inferring 3D object-centric scene representations from a single image. To generalize to real-world scenes, the paper focuses on disentangling the learning of object intrinsics from object extrinsics. The proposed approach significantly improves systematic generalization, enabling unsupervised learning of high-fidelity object-centric scene representations from sparse real-world images. The approach allows for the discovery of visually rich objects from a single real image, enabling applications such as 3D object segmentation and scene manipulation.
Strengths
- The proposed 3-stage training process is technically sound and is proven to be effective.
- It is nice to see the effectiveness of the proposed approach in terms of zero-shot generalizability on various datasets.
- This paper is well-written and well-organized, providing a detailed explanation of the proposed approach. Its evaluation on various datasets demonstrates the generalization ability.
- The proposed method enables unsupervised learning of high-fidelity object-centric scene representations from sparse real-world images, which has potential applications in 3D object segmentation and scene manipulation.
Weaknesses
- Compared with the previous method uORF, the main additional contribution of this paper should be disentangling the learning of object intrinsics and extrinsics. The performance comparisons in the experimental parts demonstrate the effectiveness of the whole pipeline. However, it is not very clear to me which parts contribute most to the performance gain. The object-centric sampling mentioned in Sec. 3.3 seems reasonable and should lead to better rendering performance, but I didn't see any relevant ablation study on this.
- The authors did not provide a detailed analysis of the limitations or failure cases of the proposed approach, which could limit its applicability in certain scenarios.
- The paper does not provide a clear explanation of the training process and hyperparameters used in the experiments, which could limit its reproducibility and further research.
Questions
- How to decide the hyperparameter K in Eq.(1)? How to guarantee the redundant slots do not affect the learning of other slots?
- Did you test with transparent objects? Would it also work properly?
- Did you test the geometric quality of the learned object representations? Can your method recover accurate shape of the objects?
- Which module do you think contributes most to the final performance improvement? Could you provide more detailed ablation study on this?
Thank you for recognizing the empirical success, wide generalizability, and broad applicability of our method. We have addressed your main concerns as follows:
Q1. Ablation studies on encoder design, training pipeline, and object-centric sampling.
A1. We provide additional ablation studies in Appendix E.1 following your suggestions. These studies justify the necessity of our three-stage training pipeline, showing that omitting the synthetic dataset training stages significantly impairs the model's performance. Additionally, our findings illustrate that the integration of DINO features, while beneficial, requires the concurrent use of a U-Net route for better performance. Ablation studies on object-centric sampling also confirm its role in enhancing background reconstruction. We provide a snapshot of the results below; the full details are available in Tables 4 and 5 and Figures 12 and 13.
Ablation Studies on Encoder Design
| Method | ARI | FG-ARI | NV-ARI | PSNR | SSIM | LPIPS |
|---|---|---|---|---|---|---|
| uORF | 0.649 | 0.108 | 0.587 | 24.37 | 0.688 | 0.251 |
| uORF-DINO | 0.688 | 0.677 | 0.656 | 25.33 | 0.739 | 0.240 |
| uOCF-DINO | 0.652 | 0.346 | 0.621 | 26.06 | 0.723 | 0.251 |
| uOCF | 0.802 | 0.785 | 0.747 | 28.96 | 0.803 | 0.121 |
Ablation Studies on Training Pipeline
| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| uORF | 19.23 | 0.602 | 0.336 |
| Omit stage 1&2 | 19.67 | 0.565 | 0.568 |
| Omit stage 2 | 24.52 | 0.769 | 0.157 |
| Omit object-centric sampling | 27.89 | 0.843 | 0.083 |
| uOCF | 28.29 | 0.842 | 0.069 |
As for the most crucial component, we emphasize the integration of position disentanglement and a three-stage training pipeline. The former enables the model to learn object-centric NeRFs and object positions in 3D scenes from single images, while the latter enables generalization to complex real-world scenes. These two components are complementary and indispensable for our method's success.
Q2. Limitation analysis
We have added discussions on our method's limitations in Appendix D. The current constraints include limited diversity in object appearance and background complexity in our datasets and challenges in reconstructing foreground objects with complex textures. These limitations, shared with other generalizable NeRF methods, are areas we aim to improve in future work.
Q3. Explanation of training configurations
Details on datasets and training configurations are provided in Appendix C. Additionally, to facilitate reproducibility, we have dedicated a section in our paper and committed to releasing all codes, datasets, models, and detailed instructions upon paper acceptance.
Q4. Hyperparameter (number of objects)
While the scenes in our datasets each contain four objects, our method can also generalize to scenes with fewer objects, reconstructing and discovering all of them. Further details and visual demonstrations are available in Appendix E.6 and Figure 19.
Q5. Results on transparent objects
As for transparent objects, we have included new experiments in Appendix E.7, with visual results in Figure 20. Our model ignores transparency due to the absence of transparent objects in its training dataset. However, it still demonstrates reasonable object segmentation and reconstruction capabilities.
Q6. Geometric quality of learned object representations
Recovering object geometric shapes from NeRFs is indeed a challenging problem. NeRF primarily represents objects using density values for rendering, which doesn't explicitly encode object geometry. This can lead to inaccuracies, especially in complex geometries or scenarios with limited view coverage. However, significant advancements have been made in this area. Recent research efforts have focused on enhancing NeRF's capability to reconstruct object geometry more accurately. These include improved sampling strategies [1], integration of geometric constraints [2], and hybrid approaches that blend traditional 3D representations with NeRF [3]. Integrating these advancements into our method is a promising direction for future work.
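For context, a common way to pull an explicit surface out of a trained density field is to sample it on a regular grid and run marching cubes on a density threshold; below is a minimal sketch of that procedure (our illustration under assumed function names, not part of the paper or the cited works):

```python
import numpy as np
from skimage import measure

def extract_mesh(density_fn, bbox_min, bbox_max, resolution=128, threshold=10.0):
    """Extract a triangle mesh from a NeRF density field via marching cubes.

    density_fn: callable mapping [N, 3] world points to [N] densities (assumed).
    threshold:  density level treated as the surface (a heuristic, not geometry-aware).
    """
    xs = np.linspace(bbox_min[0], bbox_max[0], resolution)
    ys = np.linspace(bbox_min[1], bbox_max[1], resolution)
    zs = np.linspace(bbox_min[2], bbox_max[2], resolution)
    grid = np.stack(np.meshgrid(xs, ys, zs, indexing='ij'), -1).reshape(-1, 3)
    sigma = density_fn(grid).reshape(resolution, resolution, resolution)
    verts, faces, _, _ = measure.marching_cubes(sigma, level=threshold)
    # Map voxel indices back to world coordinates.
    scale = (np.array(bbox_max) - np.array(bbox_min)) / (resolution - 1)
    verts = verts * scale + np.array(bbox_min)
    return verts, faces
```

The density threshold is the weak point of this recipe, which is why the cited works add better sampling, geometric constraints, or hybrid surface representations.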
References
[1] Li et al. "NerfAcc: Efficient Sampling Accelerates NeRFs". In ICCV, 2023.
[2] Rematas et al. "ShaRF: Shape-conditioned Radiance Fields from a Single View". In ICML, 2021.
[3] Yariv et al. "Volume Rendering of Neural Implicit Surfaces." In NeurIPS, 2021.
First of all, I would like to thank the authors for addressing my raised issues and conducting more experiments to support the claims. The overall framework indeed looked appealing to me at first glance. Unsupervised reconstruction of the shapes of objects in a scene is an interesting problem, and I also believe this topic is important for many robotic applications. However, after reading other reviewers' comments, I agree that the current framework should still face challenges in handling over-segmentation and under-segmentation. I have decided to downgrade my rating to marginally above the acceptance threshold. The reason why I still lean toward acceptance is that this paper indeed proposes an unsupervised pipeline for reconstructing NeRFs for objects in the scene, and hopefully the remaining issues can be handled by additional uncomplicated designs.
The paper studies unsupervised identification of objects in 3D radiance space from a single image. Specifically, given an input image, the encoder learns per-object features and per-object locations in 3D space. Each object is encoded into a separate NeRF model. By composing all individual 3D objects and the background and rendering back to 2D images, the entire network is supervised with multi-view images via photometric losses. Experiments are conducted on collected datasets and promising results are obtained.
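For readers unfamiliar with this compositional rendering setup, here is a generic sketch of composing per-object radiance fields with a background field and volume-rendering a single ray (a simplification with assumed callables; not the paper's exact formulation):

```python
import numpy as np

def composite_render(ray_o, ray_d, object_fields, object_positions, bg_field,
                     near=0.5, far=6.0, n_samples=64):
    """Volume-render one ray through a composition of per-object NeRFs.

    object_fields:    list of callables f(x) -> (density [N], rgb [N, 3]),
                      each defined in its own object-centric frame (assumed).
    object_positions: list of 3D object centers in world coordinates.
    bg_field:         callable for the background field (world frame).
    """
    t = np.linspace(near, far, n_samples)                   # sample depths
    pts = ray_o[None, :] + t[:, None] * ray_d[None, :]      # [N, 3] world points

    densities, colors = [], []
    for f, p in zip(object_fields, object_positions):
        sigma, rgb = f(pts - p[None, :])                    # query in object frame
        densities.append(sigma); colors.append(rgb)
    sigma_bg, rgb_bg = bg_field(pts)
    densities.append(sigma_bg); colors.append(rgb_bg)

    sigma = np.stack(densities, 0)                          # [K+1, N]
    rgb = np.stack(colors, 0)                               # [K+1, N, 3]
    sigma_total = sigma.sum(0)                              # mix densities
    w = sigma / np.clip(sigma_total[None, :], 1e-8, None)   # per-field weights
    rgb_mixed = (w[..., None] * rgb).sum(0)                 # [N, 3]

    # Standard volume rendering (alpha compositing along the ray).
    delta = np.diff(t, append=t[-1] + (t[-1] - t[-2]))
    alpha = 1.0 - np.exp(-sigma_total * delta)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1] + 1e-10]))
    weights = alpha * trans
    return (weights[:, None] * rgb_mixed).sum(0)             # rendered RGB
```

Training then compares such rendered pixels against the observed multi-view images with a photometric loss.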
Strengths
- The paper studies a critical problem in object-centric learning without any human annotations, especially discovering objects in 3D space.
- Using posed multi-views as supervision signals to learn radiance fields is reasonable, providing more information and constraints for object discovery.
- Additional datasets are provided to evaluate the proposed method; they should be beneficial to the community in the future.
Weaknesses
- About the motivation
In the beginning, the paper states that the bottleneck of existing works is the entanglement of object attributes such as shape and appearance with extrinsic properties such as object location. However, there is a lack of concrete evidence to support this claim. In this paper, why is separately learning per-object locations significantly better than others?
- About the method
2.1. In Sec. 3.1, the description of the Encoder/Latent Extraction Module does not seem to match the provided Figure 2. It's unclear how the two sets of feature maps f_g and f_l are fed into the second module.
2.2. In the Latent Extraction Module, for each object, its position p_i^{wd} in the world frame will be learned. In this case, what is the world coordinate frame? Is it predefined for the whole set of multi-view images of every single scene? If so, given two different input images of the same scene, the same 3D object needs to learn exactly the same 3D location. In this way, the network actually needs to learn a separate coordinate system conditioned on every input image, which seems not that sensible if I understand correctly.
2.3. The paper models every object/background by a separate NeRF. How is such a strategy able to deal with the potential under/over-segmentation of 3D objects? For example, what if two chairs are grouped into a single NeRF (the center of the two chairs may be learned as the combined object center)? What if a single chair is learned as two objects at the very beginning? Why must the entire background structure (usually complex) be grouped into a single NeRF? In practice, there are typically a variable number of objects in each 3D scene to discover. There could be objects that are visible in the input image but invisible in the multi-view images, or vice versa.
2.4. The paper states that it incorporates two novel techniques: 1) object-centric prior learning and 2) object-centric sampling. However, on page 5, these two techniques are only briefly described in two short paragraphs. As to object-centric prior learning, what types of priors are planned to be learned? Why are they helpful for the subsequent object discovery? As to object-centric sampling, how does it work in detail and why is it important?
- About the experiments
3.1. For the collected three datasets. How many objects in each image? From the samples in Figure 4, it seems every image has exactly four objects.
3.2. As to the two real-world datasets Kitchen-easy/hard, the objects are rather simple in terms of both shape and appearance. It's suggested to evaluate on more complex real-world images such as ScanNet[1].
3.3. For the evaluation metrics of object segmentation, only ARI related scores are reported. It's suggested to report additional metrics such as AP scores as analyzed by [2] in the community of unsupervised object segmentation.
3.4. For the experimental results in Tables 1/2/3, why are the existing methods inferior in both segmentation and novel view synthesis? It's suggested to give more concrete discussions.
3.5. There is a lack of ablation studies to analyze the effectiveness of the proposed components.
[1] ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes, CVPR'17
[2] Promising or Elusive? Unsupervised Object Segmentation from Real-world Single Images, NeurIPS'22
- Minor suggestions:
4.1. On page 2, the claimed first contribution seems not very meaningful at all.
4.2. For Equations 1/2/3/4/5, it's suggested to use mathematical symbols rather than English words. It's a bit hard to track the meanings.
4.3. The work ONeRF [3] should be discussed, as it also uses NeRF for unsupervised 3D object discovery.
[3] ONeRF: Unsupervised 3D Object Segmentation from Multiple Views, arXiv'22
Questions
Details given above.
Q6. Ablation studies on other model components
Additional ablation studies (Appendix E.1) reveal that while substituting DINO for a shallow U-Net encoder improves performance, it does not inherently overcome existing models' limitations. Specifically, Table 4 (also attached below) illustrates how uORF-DINO and uORF-DR (uORF with our dual-route encoder) improve upon the standard uORF, yet still fail to correctly disentangle foreground and background elements. Note that uORF-DR binds all foreground objects to the background, leading to an ARI score of zero. Besides, uOCF-DINO, which drops the U-Net route and utilizes the DINO encoder only, and uOCF-IM, which substitutes DINO's intermediate layer features for the shallow encoder in uOCF, both achieve inferior performance to uOCF. This reaffirms the superiority of our encoder design.
| Method | ARI | FG-ARI | NV-ARI | PSNR | SSIM | LPIPS |
|---|---|---|---|---|---|---|
| uORF | 0.649 | 0.108 | 0.587 | 24.37 | 0.688 | 0.251 |
| uORF-DINO | 0.688 | 0.677 | 0.656 | 25.33 | 0.739 | 0.240 |
| uORF-DR | 0 | 0 | 0 | 25.38 | 0.698 | 0.322 |
| uOCF-DINO | 0.652 | 0.346 | 0.621 | 26.06 | 0.723 | 0.251 |
| uOCF-IM | 0.806 | 0.749 | 0.752 | 27.77 | 0.753 | 0.182 |
| uOCF | 0.802 | 0.785 | 0.747 | 28.96 | 0.803 | 0.121 |
Q7. Number of objects in the scene.
While the scenes in our datasets each contain four objects, our method can also generalize to scenes with fewer objects, reconstructing and discovering all of them. Further details and visual demonstrations are available in Appendix E.5 and Figure 17.
Q8. Evaluation on more challenging datasets
Given that ScanNet does not offer textured meshes or calibrated images, we use the HM3D dataset [6], a dataset focused on building-scale indoor 3D reconstruction, for evaluating more complex real-world scenes. We conducted our evaluation using zero-shot inference, with detailed results presented in Appendix E.2 and illustrated in Figure 15. Our model, pre-trained on the Room-Texture dataset, demonstrates robustness in discovering and segmenting chair instances even in unseen complex scenes.
Q9. Additional evaluation metrics
We have adopted the official AP computation code from [7] and included the AP evaluation metric in Appendix E.4, with detailed results in Table 7 (also attached below). Specifically, we consider two kinds of APs: Input view-AP for the input view, and Novel view-AP for novel views. Our method's high AP scores demonstrate its superior scene segmentation capabilities.
| Metric | uORF | QBO | COLF | uOCF |
|---|---|---|---|---|
| Input view-AP | 0.005 | 0.359 | 0.315 | 0.782 |
| Novel view-AP | 0.001 | 0.195 | 0.015 | 0.770 |
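For reference, here is a simplified sketch of scoring predicted instance masks against ground truth at a fixed IoU threshold; the official protocol of [7] may differ in its matching and threshold details, so treat this only as an illustration of the idea:

```python
import numpy as np

def mask_matching_precision(pred_masks, gt_masks, iou_thresh=0.5):
    """Greedy IoU matching of predicted to ground-truth instance masks.

    pred_masks, gt_masks: lists of boolean [H, W] arrays.
    Returns precision at the given IoU threshold; averaging such scores over
    thresholds and scenes would give an AP-style metric.
    """
    matched_gt = set()
    tp = 0
    for pm in pred_masks:
        best_iou, best_j = 0.0, -1
        for j, gm in enumerate(gt_masks):
            if j in matched_gt:
                continue
            inter = np.logical_and(pm, gm).sum()
            union = np.logical_or(pm, gm).sum()
            iou = inter / union if union > 0 else 0.0
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= iou_thresh:
            tp += 1
            matched_gt.add(best_j)
    return tp / max(len(pred_masks), 1)
```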
Q10. Performance analysis with comparison methods
A key strength of our work lies in its superior reconstruction quality. In contrast, uORF struggles to distinguish between background and foreground elements, particularly in complex datasets such as Kitchen-Hard and Room-Texture, resulting in low segmentation scores. This difficulty extends to uORF's background NeRF, leading to blurry reconstructions due to its inability to effectively focus on background elements.
Our approach builds upon uORF but introduces a critical enhancement: the disentanglement of object positions. This advancement not only enhances systematic generalization but also facilitates the application of techniques like object-centric prior learning and sampling. Specifically, object-centric prior learning allows better disentanglement of the background, and object-centric sampling allows for more samples with the same amount of computation. This results in a significant improvement in reconstruction quality for both the background environment and foreground objects, setting our method apart from uORF's limitations.
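As an illustration of what such object-centric sampling could look like (an assumption about the mechanism, since the exact scheme is not reproduced in this response), ray samples can be concentrated on the segment where a ray passes near a predicted object center, so the same sample budget is spent where the object actually is:

```python
import numpy as np

def object_centric_depths(ray_o, ray_d, obj_center, radius=0.6, n_samples=32,
                          near=0.5, far=6.0):
    """Allocate ray sample depths inside a sphere around a predicted object center.

    Falls back to uniform near/far sampling if the ray misses the sphere.
    This is an illustrative guess at object-centric sampling, not the paper's code.
    """
    oc = ray_o - obj_center
    b = np.dot(oc, ray_d)                      # ray_d assumed unit-norm
    c = np.dot(oc, oc) - radius ** 2
    disc = b ** 2 - c                          # ray/sphere intersection discriminant
    if disc <= 0:
        return np.linspace(near, far, n_samples)
    t0, t1 = -b - np.sqrt(disc), -b + np.sqrt(disc)
    lo, hi = max(t0, near), min(t1, far)
    if lo >= hi:
        return np.linspace(near, far, n_samples)
    return np.linspace(lo, hi, n_samples)
```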
Q11. Minor issues
Following your advice, we have revised the contributions for clarity and conciseness, updated the mathematical notations for positional encodings, and added discussions about the ONeRF paper.
References
[1] Biza et al. "Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames." In ICML, 2023.
[2] Traub et al. "Learning What and Where: Disentangling Location and Identity Tracking Without Supervision." In ICLR, 2023.
[3] Fuchs et al. "SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks." In NeurIPS, 2020.
[4] Spelke. "Principles of object perception." Cognitive Science, 1990.
[5] Chen et al. "Learning to Infer 3D Object Models From Images." In ECCV, 2022.
[6] Ramakrishnan et al. "Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI". arXiv, 2021.
[7] Yang and Yang. "Promising or Elusive? Unsupervised Object Segmentation from Real-world Single Images." In NeurIPS, 2022.
Thanks for providing the concrete feedback. I believe my main concerns have been addressed, and I would change to a positive rating.
Thank you for recognizing the empirical success and wide applicability of our method. We have addressed your main concerns as follows:
Q1. Concerns on motivation
A key limitation of current object representation methods is their dependence on the viewer's frame: even minor positional shifts or camera movements can significantly alter an object's latent representation. This entanglement of intrinsic attributes with extrinsic properties limits generalization in complex real-world scenes (Figure 1). Furthermore, as highlighted in recent 2D object-centric learning literature [1,2] and by the success of convolutional networks, position invariance is essential for systematic generalization. This point is also emphasized in recent 3D supervised learning literature [3]. Therefore, we propose to disentangle object position and introduce the concept of position-invariant object-centric NeRFs. This disentanglement not only allows learning generalizable object priors from category-agnostic synthetic data (object-centric prior learning) but also improves sample efficiency (object-centric sampling). Extensive experiments in Section 4 and Appendix E validate the superiority of our method on our newly curated datasets.
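To make the disentanglement concrete, one standard way to express such position-invariance (a sketch consistent with the description above, not an equation copied from the paper) is to query each object's field in its own frame:

```latex
% Intrinsics live in the latent z_i; the extrinsic position p_i only shifts coordinates.
\big(\sigma_i(\mathbf{x}),\, \mathbf{c}_i(\mathbf{x})\big)
  \;=\; F_{\mathrm{obj}}\!\big(\mathbf{x} - \mathbf{p}_i,\; \mathbf{z}_i\big)
```

Under this form, translating an object only changes p_i, while z_i (and hence the object's shape and appearance) stays fixed.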
Q2. How to integrate local and global features?
Section 3.4 details our approach to integrating local and global features. One of the two feature maps serves as the input to the latent extraction module, and its outputs are concatenated with an attention-weighted mean of the other feature map, yielding the final slot latents; the attention weights are likewise derived from the latent extraction module. This design efficiently balances global and local information, maximizing the utility of pre-trained models.
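As a rough illustration of this fusion step (a sketch only; which of f_l and f_g plays which role here is our assumption, not a statement of the paper's exact design):

```python
import torch

def fuse_features(slots, attn, f_global):
    """Combine latent-extraction outputs with attention-pooled global features.

    slots:    [K, D_l]   slot outputs of the latent extraction module
    attn:     [K, H*W]   per-slot attention weights over spatial locations
    f_global: [H*W, D_g] global (e.g. DINO) feature map, flattened
    """
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    pooled = attn @ f_global                    # [K, D_g] attention-weighted mean
    return torch.cat([slots, pooled], dim=-1)   # final slot latents [K, D_l + D_g]
```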
Q3. Inferring the object's position in the world coordinates
Instead of directly learning the object position in world coordinates, our approach first estimates the object's position on the image plane as the attention-weighted mean over a spatial grid, which is then converted to world coordinates using the intersection of the corresponding camera ray with the ground plane. A bias term is added to the converted position to adjust for discrepancies in this conversion. Since the position estimation happens in image space instead of 3D space, our approach does not need to learn separate coordinate systems for each image.
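A minimal sketch of this image-plane-to-world lifting, assuming a standard pinhole camera and a ground plane at a known height (the parameterization and names below are ours, not the paper's):

```python
import numpy as np

def image_point_to_world(uv, K, cam2world, ground_z=0.0, bias=np.zeros(3)):
    """Lift a predicted 2D object position to 3D via the ground plane.

    uv:        (u, v) pixel coordinates (e.g., an attention-weighted mean).
    K:         3x3 camera intrinsics.
    cam2world: 4x4 camera-to-world extrinsics.
    ground_z:  height of the ground plane z = ground_z in world coordinates.
    bias:      correction term (assumed additive here).
    """
    # Back-project the pixel into a ray direction in camera coordinates.
    d_cam = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    # Rotate the ray into world coordinates; the camera center is the ray origin.
    R, t = cam2world[:3, :3], cam2world[:3, 3]
    d_world = R @ d_cam
    o_world = t
    # Intersect the ray o + s * d with the plane z = ground_z.
    s = (ground_z - o_world[2]) / d_world[2]
    return o_world + s * d_world + bias
```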
Q4. How to tackle under/over-segmentation? How to handle objects visible in multi-view images but invisible in the input image, and vice versa?
Under/over-segmentation is a common challenge in existing approaches, often resulting in binding multiple objects into a single foreground slot (missing objects) or all objects into the background slot (blurry object reconstruction), as visualized in Figures 5 and 7. The technical contribution of our work is exactly to help address this challenge. In particular, we introduce the generalizable object-centric prior learning pipeline. This pipeline begins with learning object priors from a synthetic dataset. The learned object priors might include generalizable visual clues such as physical coherence [4], which is critical to unsupervised segmentation [5]. The subsequent stage refines the model's ability to predict object positions and segregate them into individual slots. As shown in Figure 7, our method can effectively separate and reconstruct individual objects in real scenes, even when the object is largely occluded.
However, although our method can tackle object occlusion, it cannot handle objects that are invisible in the input image, as the model cannot discover them. Conversely, our model can still discover and reconstruct an object when it is visible in the input image but invisible in the multi-view images. This property is naturally inherited from the volumetric rendering technique utilized in NeRF.
Q5. Effect of object-centric prior learning and sampling
We have discussed the effect of object-centric prior learning in previous responses. For quantitative evidence, ablation studies (Appendix E.1) show that omitting the synthetic dataset training stages drastically reduces performance. Meanwhile, our object-centric sampling technique enables more samples within the same computational budget, significantly enhancing background reconstruction quality. Key results are summarized below, with full details in Tables 4 and 5 and Figures 12 and 13.
| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| uORF | 19.23 | 0.602 | 0.336 |
| Omit stage 1&2 | 19.67 | 0.565 | 0.568 |
| Omit stage 2 | 24.52 | 0.769 | 0.157 |
| Omit object-centric sampling | 27.89 | 0.843 | 0.083 |
| uOCF | 28.29 | 0.842 | 0.069 |
This paper addresses the challenge of learning object-centric representations from a single RGB image with multi-view supervision. The authors present an object-centric learning framework built upon the foundation of uORF, which incorporates slot attention, a pre-trained DINOv2 encoder, and a compositional NeRF.
The training process is carried out in multiple stages. In the initial stage, the model is trained on synthetic scenes, each featuring a single object. Subsequently, the training extends to synthetic scenes containing four objects. Finally, the model undergoes further training on a real-world dataset characterized by a substantial domain gap in comparison to the synthetic dataset.
Strengths
The exploration of object-centric learning within real-world datasets is an important research direction. This study takes a step in the pursuit of this objective. The writing is largely lucid and comprehensible. Notably, the reconstruction and segmentation outcomes outperform previous methods, particularly in scenes featuring intricate textures.
Weaknesses
This model is essentially an integration of prior research efforts, with key components such as slot-attention, DINO, and object-compositional NeRF having been individually introduced in previous works. Consequently, the training pipeline itself lacks a significant degree of novelty.
I have major concerns regarding the utilization of a pre-trained DINO encoder. DINO is already proficient at extracting meaningful and somewhat object-aware features from images, leading to the possibility that the model merely aligns these features with latent variables. This raises questions about how much the proposed pipeline genuinely contributes to object discovery.
The "l_pos" loss, which calculates the positional disparity between object poses from different viewing directions, relies on the fact that, during the initial training stage, there is only one object per scene, eliminating object-matching challenges. However, strictly speaking, the use of a single-object dataset disqualifies the proposed pipeline as an unsupervised model while making the learning problem considerably less challenging. To apply this method to a general dataset, one must manually select scenes containing only one object, a process akin to labeling. In other words, the training pipeline appears tailored specifically for the proposed dataset. While the author presents object-centric prior learning as a contribution, it might be perceived more as a limitation.
The fixed number of objects in each scene restricts the method, while most object-centric learning frameworks merely set a maximum number of objects in the scenes.
The limitations of the work are not adequately discussed or disclosed. The ablation study is rather limited in scope.
Questions
Could the author provide insights into why the proposed method demonstrates superior background reconstruction quality compared to uORF? While the inference pipeline shares many similarities with uORF, the reconstruction quality appears to be significantly enhanced.
It would be valuable to know the specific value of the parameter $K$ employed during the training process.
Is it possible for the author to conduct training and evaluation of the proposed method on the Room-Chair dataset, which was utilized in the uORF paper?
Moreover, it would be beneficial to observe results on real-world datasets without the pre-training on synthetic datasets. It remains unclear why pre-training is deemed essential, especially when facing a substantial domain gap.
Q5. Fixed number of objects in the scene
While the scenes in our datasets each contain four objects, our method can also generalize to scenes with fewer objects, reconstructing and discovering all of them. Further details and visual demonstrations are available in Appendix E.6 and Figure 19.
Q6. Why our background reconstruction quality is much higher than uORF?
A key strength of our work lies in its superior reconstruction quality. uORF's failure to differentiate between background and foreground elements, particularly in complex datasets like Kitchen-Hard and Room-Texture, prevents its background NeRF from focusing on background reconstruction. Our approach introduces a critical improvement: the disentanglement of object positions. This not only enhances systematic generalization but also facilitates object-centric prior learning and object-centric sampling. Specifically, object-centric prior learning allows better disentanglement of the background, and object-centric sampling allows for more samples with the same amount of computation. This results in a significant improvement in reconstruction quality for both the background environment and foreground objects, setting our method apart from uORF's limitations.
Q7. Results on Room-Diverse dataset
We have included new experiments on the Room-Diverse dataset in Appendix E.3. The results in Table 6 (also attached below) and Figure 16 show our method's superior performance against all baseline comparisons. Additionally, we have explored another newly curated real-world Planters dataset in Appendix E.5, providing a comprehensive analysis with sample images, and results in Figures 17, 18, and Table 8.
| Method | ARI | FG-ARI | NV-ARI | PSNR | SSIM | LPIPS |
|---|---|---|---|---|---|---|
| uORF | 0.638 | 0.705 | 0.494 | 25.11 | 0.683 | 0.266 |
| uORF-DINO | 0.692 | 0.555 | 0.633 | 25.50 | 0.698 | 0.239 |
| uORF-DR | 0.742 | 0.653 | 0.680 | 26.00 | 0.707 | 0.209 |
| QBO | 0.724 | 0.716 | 0.618 | 24.49 | 0.680 | 0.182 |
| uOCF | 0.769 | 0.828 | 0.688 | 27.31 | 0.751 | 0.141 |
Q8. Novelty Concerns
We argue that our core contribution is proposing object-centric modeling, which addresses a fundamental bottleneck in generalizing 3D object discovery to more complex real scenes than synthetic scenes. In particular, existing 3D object discovery methods represent objects in the viewer's coordinate frame, entangling camera extrinsics with object intrinsics. This entanglement prevents them from effectively learning object priors and generalizing to complex scenes. Our object-centric modeling allows object-centric prior learning instantiated by our 3-stage training pipeline, which shows promising results in generalization, even in a zero-shot setting.
Therefore, our innovation mainly lies in object-centric modeling which further enables prior learning and object-centric sampling.
Regarding the integration of DINO, our new ablation studies (Appendix E.1, Table 4) show that a mere substitution of DINO in existing frameworks cannot overcome their limitations. The three-stage training pipeline, particularly the initial stages on synthetic datasets, is essential for our model's generalization, as evidenced by the substantial performance drop when omitted (Appendix E.1, Table 5). Moreover, the object priors obtained on synthetic datasets are agnostic to the real-world dataset's object categories, and thus can even be learned from simple CLEVR shapes. These results underscore the effectiveness of the proposed object-centric modeling.
References
[1] Spelke. "Principles of object perception." Cognitive Science, 1990.
[2] Chen et al. "Learning to Infer 3D Object Models From Images." In ECCV, 2022.
I appreciate the author's effort in providing more results in such a short time frame. I read other reviewers' comments and all rebuttals.
The additional results indeed demonstrate the potential of the proposed method. However, I think there are still flaws in the experiment design, making this paper below the bar of acceptance.
About Stage 1: The authors claim that the use of single-object data is not akin to labeling since the same dataset can be used to train models targeting different environments. In other words, the single-object dataset is in some sense universal. I found this argument less supported. There is no explanation for why a large domain gap can be overcome. At this stage, this is merely an observation.
About Stage 2: I believe the experiment design of this stage is not sound. For each data sample, there are 4 objects and 4 objects only. In other words, the dataset is tailored. Then, the number of slots, i.e. K, an important hyperparameter in the slot attention model, is set to precisely 4. As a result, there is no room for over-segmentation or under-segmentation, making the learning task significantly easier. In previous works, preventing over-segmentation is a challenge and commonly requires some special designs to overcome. To be more specific, it is a challenge because over-segmentation normally leads to even higher reconstruction accuracy, which is the main optimization objective. However, in this work, this challenge is sidestepped. Thus, this is not a strictly unsupervised setup. The number of objects in a scene is known and is actively used in the design of the network. This is also evidenced by the results in Figure 19: if there are fewer than 4 objects in a scene, the same object will be detected twice. From the downstream applications' point of view, the object slots extracted by the proposed model cannot be directly used as there are duplicates.
This also in return could explain why a large domain gap can be overcome. During stage 1, the scene layout component is trained to model a flat surface and the slot is trained to capture an object of compact shape above the surface. Thus, for a new dataset, as the number of slots is exactly the same as the number of objects, each slot easily finds an object to represent and adapt its representation accordingly.
PLEASE note that when I say previous works, I mean the majority of the object-centric learning methods, not uORF specifically. The results fully demonstrate the superiority over uORF. There is no doubt about that.
About DINO: DINO is demonstrated to carry object instance information already. Many works aim to extract object segmentation masks directly from DINO without any learning components. For example [1]. DINO also significantly increases the FG-ARI for uORF. My concern is that DINO contributes to object discovery significantly, while the pipeline design in this work also helps. However, this concern is minor.
[1] Cut and Learn for Unsupervised Object Detection and Instance Segmentation
In summary, after considering all the pros and cons, I believe the experiment setup needs to be fixed to make all claims in the paper sound. I have no choice but to lower my rating. I'd like to emphasize again that the pipeline proposed in this work is certainly better than uORF. I also think that there should be no problems in training with the presence of 0-K objects, with a design to prevent over-segmentation.
Thank you for recognizing the empirical success of our work. We have addressed your main concerns as follows:
Q1. Utilization of DINO encoder
We note that simply substituting DINO for the shallow U-Net encoder in existing models does not inherently overcome their limitations. To see this, we have conducted comprehensive ablation studies detailed in Appendix E.1. Specifically, Table 4 (also attached below) illustrates how uORF-DINO and uORF-DR (uORF with our dual-route encoder) improve upon the standard uORF, yet still fail to correctly disentangle foreground and background elements. Note that uORF-DR binds all foreground objects to the background, leading to an ARI score of zero. Besides, Figure 12 provides visual comparisons to demonstrate these distinctions in performance.
| Method | ARI | FG-ARI | NV-ARI | PSNR | SSIM | LPIPS |
|---|---|---|---|---|---|---|
| uORF | 0.649 | 0.108 | 0.587 | 24.37 | 0.688 | 0.251 |
| uORF-DINO | 0.688 | 0.677 | 0.656 | 25.33 | 0.739 | 0.240 |
| uORF-DR | 0 | 0 | 0 | 25.38 | 0.698 | 0.322 |
| uOCF | 0.802 | 0.785 | 0.747 | 28.96 | 0.803 | 0.121 |
Q2. Utilization of single-object datasets
We clarify that in the first stage of training, our model is trained on a single-object synthetic dataset (which can be easily synthesized) to learn general object priors agnostic to the real-world dataset's object categories. Therefore, it is not akin to a manual labeling process. For instance, we trained our model on synthetic chairs and successfully applied the learned priors to real-world kitchen datasets. We further validate this generalizability using a very simple CLEVR dataset (colored primitives like cubes and spheres) instead of the chair dataset for object prior learning. We added the results in Table 5 and Figure 12 in Appendix E.1. We also attach the table below.
| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| uOCF (adapt from CLEVR) | 27.32 | 0.833 | 0.092 |
| uOCF (adapt from chairs) | 28.29 | 0.842 | 0.069 |
Q3. The necessity of pre-training on synthetic datasets
The pre-training stages on synthetic datasets are critical to our method's success. This three-stage training pipeline begins with teaching the model the basics of object-centric NeRF, including essential aspects like physical coherence [1], which is crucial to unsupervised segmentation [2]. The subsequent stage then enhances the model's ability to predict object positions and segregate them into individual slots, thereby equipping it for handling complex real-world scenes. Our comparative analysis, detailed in Appendix E.1 and quantitatively supported by the results on the Kitchen-Hard dataset in Table 5 (also attached below), clearly shows the significant performance drop when omitting these initial synthetic dataset training stages. We show qualitative results in Figure 13 in the updated paper.
| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| uORF | 19.23 | 0.602 | 0.336 |
| Omit stage 1&2 | 19.67 | 0.565 | 0.568 |
| Omit stage 2 | 24.52 | 0.769 | 0.157 |
| Omit object-centric sampling | 27.89 | 0.843 | 0.083 |
| uOCF | 28.29 | 0.842 | 0.069 |
Q4. Limitation analysis
We have added discussions on our method's limitations in Appendix D. The current constraints include limited diversity in object appearance and background complexity in our datasets and challenges in reconstructing foreground objects with complex textures. These limitations, shared with other generalizable NeRF methods, are areas we aim to improve in future work.
Dear reviewer TQPk,
Thanks for your valuable insights on the experimental design. We address your concerns below with additional experiments. Due to the limited time before the discussion period ends, the number of training epochs for the following experiments is fewer than standard. The additional discussions and results have been added to Appendix E.5. If accepted, we will refine our experiments section in the main text in the camera-ready version.
Experiment configuration: We render a new synthetic dataset with each scene containing 2-4 chair instances. Real scenes still include exactly 4 objects. We test both $K=4$ and $K=5$. The results are shown in Tables 8, 9 (also attached below), and Figure 18.
After using a training set consisting of a randomized number of objects, we observe that empty slots now appear (please see Figure 18), overcoming the over-segmentation problem. Besides, we note that our method does incorporate a specific design to prevent over-segmentation. Firstly, the slot attention mechanism enforces areas of similar features to bind to the same slot. Moreover, we add slot-specific positional encoding to the keys and values to let the slot latents emphasize local information, thus preventing parts of different instances from binding together, which is a common problem in previous methods (Figures 12, 23, also ObSuRF in Figure 4 in [Sajjadi et al.]).
[Sajjadi et al.] Sajjadi et al. "Object Scene Representation Transformer." In NeurIPS, 2022.
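As a rough sketch of how slot attention with such slot-specific positional encodings might look (an illustrative single iteration under our own assumptions about how positions modulate keys and values; the paper's exact form may differ):

```python
import torch
import torch.nn.functional as F

def slot_attention_step(slots, feats, grid, slot_pos, proj_q, proj_k, proj_v, pos_mlp):
    """One slot-attention iteration with slot-specific positional encoding.

    slots:    [K, D]    current slot latents
    feats:    [HW, D]   flattened input feature map
    grid:     [HW, 2]   normalized pixel coordinates of each feature location
    slot_pos: [K, 2]    current 2D position estimate of each slot
    proj_*, pos_mlp:    learned projections (e.g., torch.nn.Linear modules)
    """
    q = proj_q(slots)                                      # [K, D]
    # Encode each location relative to every slot's position -> slot-specific k, v.
    rel = grid[None, :, :] - slot_pos[:, None, :]          # [K, HW, 2]
    pe = pos_mlp(rel)                                      # [K, HW, D]
    k = proj_k(feats)[None] + pe                           # [K, HW, D]
    v = proj_v(feats)[None] + pe
    logits = torch.einsum('kd,khd->kh', q, k) / q.shape[-1] ** 0.5
    attn = F.softmax(logits, dim=0)                        # slots compete per location
    attn = attn / attn.sum(dim=1, keepdim=True).clamp(min=1e-8)
    updates = torch.einsum('kh,khd->kd', attn, v)          # weighted mean per slot
    return updates, attn
```

Because the encoding is centered on each slot's position estimate, locations far from a slot's object are down-weighted for that slot, which is the local-emphasis effect described above.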
Quantitative Results on Room-Texture (15 epochs, 75000 iterations)
| Method | ARI | FG-ARI | NV-ARI | PSNR | SSIM | LPIPS |
|---|---|---|---|---|---|---|
| K=4, n_obj=4 | 0.819 | 0.643 | 0.743 | 28.68 | 0.803 | 0.139 |
| K=4, n_obj ∈ {2,3,4} | 0.828 | 0.743 | 0.756 | 30.11 | 0.831 | 0.112 |
| K=5, n_obj ∈ {2,3,4} | 0.819 | 0.559 | 0.769 | 28.72 | 0.805 | 0.132 |
Quantitative Results on Kitchen-Hard (150 epochs, 48600 iterations)
| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| K=4, n_obj=4 | 27.36 | 0.820 | 0.140 |
| K=4, n_obj ∈ {2,3,4} | 27.10 | 0.825 | 0.131 |
| K=5, n_obj ∈ {2,3,4} | 28.26 | 0.837 | 0.120 |
As shown in these tables, uOCF performs reasonably well when trained on scenes with a varying number (2-4) of objects, justifying that the outstanding performance of uOCF in real-world scenes is not because the number of objects is identical between stage 2 and stage 3. In other words, applying the object prior learned from synthetic scenes to real-world scenes neither requires the two datasets to have the same number of objects nor requires $K$ to equal the number of objects in the scene. We can still assume a shared maximum number of objects across the two datasets, similar to previous unsupervised object discovery literature.
Regarding under-segmentation.
In the additional experiments above we see that our method is able to overcome over-segmentation. Regarding under-segmentation, in Figure 12, we observe that only our model (both the previously trained model and the newly trained model) can overcome this drawback, whereas previous methods fail to segment all the chair details.
On the generality of the object prior.
We agree with your possible explanation of how the generalization overcomes large domain gaps, i.e., the learned object priors might be "compact shapes on a flat surface." We note that what defines an object is not entirely settled in cognitive science. In the core knowledge theory [Spelke 1990], fundamental principles of objects include "physical cohesion" (i.e., objects should move cohesively), which indicates that an object should be a cohesive, compact entity, and "support" (i.e., objects do not float in mid-air without support). Therefore, we argue that "compact shapes on a flat surface" is a valid category-agnostic object prior.
This argument justifies stage 2, which applies the learned object prior: since the learned object priors are general, they enable handling the domain gap between synthetic and real data, regardless of the number of objects in these scenes.
[Spelke 1990] Spelke. "Principles of object perception." Cognitive Science, 1990.
Finally, we wish to reiterate that the main contribution of this work is to show the generalization power of object-centric modeling in unsupervised 3D object discovery, which allows real-world 3D scene editing from a single image, as well as zero-shot generalization to unseen real scenes. These have never been shown in previous object-centric learning methods.
We appreciate your efforts and hope our experiments and analysis can address your concerns. Your suggestions indeed help us improve our work. We are more than happy to discuss if you have further concerns.
Sincerely,
Authors
The paper proposes a method to infer 3D object-centric representations from a single RGB image. Unlike previous works, the proposed method disentangles the intrinsic and extrinsic properties of an object when learning the scene representation, which allows it to generalize better. The paper shows various applications of the method: novel view synthesis, segmentation, object removal, and scene rearrangement.
Strengths
- The paper is clearly written and has good figures which help in getting the point across.
- The main novelty of the work is in disentangling the intrinsic and extrinsic properties of an object when learning object centric representations. This helps the method to generalize better.
- The paper also releases 3 new datasets to help test this approach which can help the research community to build on top of the method.
- The paper shows excellent results on different tasks - scene segmentation, novel view synthesis, object re-arrangement, object removal.
Weaknesses
The major weakness of the paper is the lack of ablations performed. The paper mentions various design choices, however, I didn't find any ablations in the main paper justifying any of them. For example, the paper mentions that 3 stage training helps in getting a better performance, however there is no ablation to justify that. What if you skip the stage 2 and directly train on stage 3 after learning object priors using stage 1? How much will the performance degrade? How much does stage 2 training actually help here? Also, authors mention learning a global representation using ViT and a local representation using shallow UNet. Again, I didn't find an ablation justifying using two separate networks for these representations. Why not use some intermediate layer (or the spatial features) from ViT as local features? What's the intuition behind shallow UNet? Is it to prevent features from different objects entangling with each other? The authors should provide such ablations/intuitions in the main paper.
Questions
I would like authors to address the points raised in the weakness section.
- What is the impact of stage 2 training? How much value is it adding? What if you skip stage 2 training and directly train on real datasets in stage 3?
- What is the significance of using a separate shallow U-Net for local features? Why not just use the intermediate features from ViT as local features?
- How do authors deal with pose ambiguity. Most of the objects used in the paper are symmetric (plates, chairs, etc). Will that symmetry cause problems or make the method stuck in some local minima?
Thank you for recognizing the empirical success and wide applicability of our method. We have addressed your main concerns as follows:
Q1. Ablation studies on the training pipeline.
We appreciate your suggestion regarding the inclusion of additional ablation studies. In Appendix E.1, we have expanded our analysis to emphasize the critical role of our three-stage training pipeline in achieving robust generalization to complex real-world scenes. The ablation results in Table 5 (also attached below) and Figure 13, demonstrate how omitting the initial synthetic dataset training stages significantly compromises the model's capability to distinguish foreground from background elements. These results confirm the indispensability of each stage in our pipeline for achieving the model's full potential.
| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| uORF | 19.23 | 0.602 | 0.336 |
| Omit stage 1&2 | 19.67 | 0.565 | 0.568 |
| Omit stage 2 | 24.52 | 0.769 | 0.157 |
| Omit object-centric sampling | 27.89 | 0.843 | 0.083 |
| uOCF | 28.29 | 0.842 | 0.069 |
Q2. Ablation studies on encoder design.
Following your suggestions, we conducted ablation studies on the encoder design in Appendix E.1. The results in Table 4 and Figure 12 indicate that while integrating DINO features enhances performance, the parallel maintenance of a U-Net route is crucial for better performance. The comparative results on the Room-Texture dataset illustrate this point:
| Method | ARI | FG-ARI | NV-ARI | PSNR | SSIM | LPIPS |
|---|---|---|---|---|---|---|
| uORF | 0.649 | 0.108 | 0.587 | 24.37 | 0.688 | 0.251 |
| uORF-DINO | 0.688 | 0.677 | 0.656 | 25.33 | 0.739 | 0.240 |
| uOCF-DINO | 0.652 | 0.346 | 0.621 | 26.06 | 0.723 | 0.251 |
| uOCF-IM | 0.806 | 0.749 | 0.752 | 27.77 | 0.753 | 0.182 |
| uOCF | 0.802 | 0.785 | 0.747 | 28.96 | 0.803 | 0.121 |
Specifically, while uORF-DINO improves over standard uORF by replacing the shallow U-Net encoder with DINO, it still fails to correctly disentangle foreground and background elements. Besides, uOCF-DINO, which drops the U-Net route and utilizes the DINO encoder only, and uOCF-IM, which substitutes DINO's intermediate layer features for the shallow encoder in uOCF, both achieve inferior performance to uOCF. This reaffirms the superiority of our encoder design.
Q3. Handling the pose ambiguity.
We appreciate your focus on the challenge of pose ambiguity, especially regarding symmetric objects. As detailed in Section 3, our method primarily targets the inference of object-centric NeRFs and object positions in 3D scenes from single images, excluding the disentanglement of object orientation. This choice is based on (1) the absence of a universal canonical orientation for diverse object categories, particularly the symmetric objects you have mentioned; (2) the technical challenges in accurate orientation estimation from varied real-world images; and (3) empirical evidence (as detailed in Appendix B) suggesting our model proficiently learns object scale and orientation after position disentanglement. This approach enables our method to effectively handle real-world scene complexities without delving into the currently challenging problem of pose estimation.
I thank the authors for addressing my concerns.
Summary of Paper Updates
We thank all reviewers for their insightful feedback and valuable suggestions. In response, we have comprehensively updated the paper and the appendix. Below is a summary of the changes:
New Experimental Results
[All reviewers] Ablation Studies on Encoder Design. We have introduced thorough ablation studies in Appendix E.1, elaborated in Table 4 and Figure 12. These studies reveal that substituting a shallow U-Net with DINO is insufficient for separating foreground objects and disentangling foreground from background. Importantly, our design for maintaining a U-Net route alongside DINO features is crucial for enhanced performance and visual fidelity.
[All reviewers] Ablation Studies on Training Pipeline. Additional ablation studies in Appendix E.1, detailed in Table 5 and Figure 13, emphasize the significance of the dual-stage training on synthetic datasets for robust generalization to complex real-world scenes. More discussions on the effect of object-prior learning and sampling techniques are also included.
[jMSr] Additional Zero-Shot Generalizability Analysis. Our method's zero-shot generalizability is now tested on more challenging datasets, with findings detailed in Appendix E.2 and illustrated in Figure 15. Our approach can accurately identify and segment the chair instances in the scene and deliver plausible reconstruction results.
[TQPk] Extended Results on Room-Diverse Dataset. We have included additional experiments on the Room-Diverse dataset in Appendix E.3, with comparative results presented in Table 6. Our method demonstrably surpasses all baseline methods in qualitative performance and visual quality.
[jMSr] Additional Evaluation Metrics. Our scene segmentation evaluation has been expanded to include the Average Precision (AP) metric [1] in Appendix E.4, with results detailed in Table 7.
[TQPk,jMSr,ZWx8] Handling Scenes with Fewer Objects. We clarify the datasets' composition: all scenes contain exactly four objects. However, our method performs effectively in scenes with fewer than four objects. Additional discussions and visualizations are provided in Appendix E.5 and Figure 17.
[TQPk,jMSr] Introduction of New Real-World Dataset. The newly curated Planters dataset is introduced in Appendix E.6. Sample images and comprehensive evaluation results are available in Figure 19, Table 10, and Figure 20.
[ZWx8] Experiments with Transparent Objects. New experimental results for scenes containing transparent objects are presented in Appendix E.7, with visual demonstrations in Figure 21. Our model ignores transparency due to its training dataset's absence of transparent objects. However, it still demonstrates reasonable object segmentation and reconstruction capabilities.
Paper Writing Updates
[jMSr] Enhanced Paper Contributions. The contributions section has been restructured for clarity and brevity, emphasizing our novel three-stage training pipeline. See the revised Section 1 for details.
[jMSr] Expanded Relevant Paper Discussions. Discussions about the ONeRF paper have been added to Section 2, enriching our literature review and contextual understanding.
[ZWx8] Reproducibility Statement. A new section on reproducibility has been added. We pledge to release all codes, datasets, models, and detailed instructions upon acceptance of the paper, ensuring replicability and transparency.
[TQPk,ZWx8] Limitation Analysis. We have added discussions on the limitations of our method in Appendix D, addressing the reviewers' concerns.
References
[1] Yang and Yang. "Promising or Elusive? Unsupervised Object Segmentation from Real-world Single Images." In NeurIPS, 2022.
Effect of Number of Slots ($K$). We further explore the effect of $K$ in uOCF. Specifically, we consider randomizing the number of objects in stage 2 (each scene contains 2-4 chair instances) and setting the number of slots to exceed the maximum number of objects in the scene ($K=5$). We show that the success of uOCF neither requires the synthetic and real datasets to have the same number of objects nor requires $K$ to equal the number of objects in the scene. Detailed discussions are provided in Appendix E.5, with quantitative and qualitative results in Tables 8, 9, and Figures 12, 18.
This paper tackles the task of unsupervised object discovery, and the key insight compared to prior work like uORF is to explicitly disentangle the 3D position of the objects from the latents that instantiate object-centric neural fields. The paper also introduces a new (real-world) dataset and demonstrates promising results across this and prior (synthetic) data.
The ratings from the reviewers were mixed, with one recommending accept, one reject, and two others borderline positive. The three positive reviewers appreciated the importance of the task studied, the dataset contribution, as well as the improvements over prior work. The reviewer recommending rejection, however, expressed concerns that the three-stage training process (e.g. initially requiring training on one object at a time) and the assumed known number of objects conflict with the positioning of the work as 'unsupervised'. The authors provided additional experiments to address these concerns, showing some robustness to not knowing the number of objects (e.g. 5 slots with the number of objects varying from 2-4), but this did not fully convince the reviewer.
On the plus side, the new dataset would be helpful for the community, and the approach does yield clear empirical improvements and an ability to generalize. However, the AC also agrees that the 3-stage training protocol and the assumption of an (approximately) known number of objects are a bit limiting. Another concern is that the paper's technical contribution is perhaps limited: the key insight is to explicitly model the 3D location, and this is inferred via a pixel location and a known ground plane (which seems to be a key piece of assumed information and can be a restrictive assumption). Overall, the AC is a bit torn about this paper but feels that despite the empirical results and dataset contributions, this paper is perhaps just below the bar.
Why not a higher score
While this is a well-ablated paper with strong empirical results, the technical contribution is a bit limited. Moreover, the assumptions required (e.g. ability to capture single-object scenes for multi-stage training, or known ground plane, or prior on number of instances) make it slightly difficult to understand whether the approach would really work in more challenging settings.
Why not a lower score
N/A
Reject