PaperHub

Average rating: 6.5 / 10 (Poster, 4 reviewers; min 6, max 8, std 0.9)
Individual ratings: 8, 6, 6, 6
Confidence: 3.5 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 2.8
ICLR 2025

SPA: 3D Spatial-Awareness Enables Effective Embodied Representation

Submitted: 2024-09-16 · Updated: 2025-02-26
TL;DR

We introduce SPA, a novel framework that enhances 3D spatial awareness in embodied AI representation learning, outperforming existing models across 268 tasks and 8 simulators.

Abstract

Keywords

embodied AI, representation learning, 3D spatial awareness, multi-view image, robot manipulation, neural rendering

Reviews and Discussion

Review
Rating: 8

This paper introduces a new way of incorporating 3D spatial awareness into a 2D visual setting for embodied AI. The authors achieve this by first extracting 2D features from images via a ViT, then constructing a 3D feature volume from the multi-view feature maps, employing differentiable neural rendering to connect the 2D and 3D domains, predicting color, depth, and semantic features per pixel, and finally training the whole model with a rendering loss along with some regularizations. The paper also presents an extensive embodied evaluation benchmark.
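For readers who want a concrete picture of the pipeline described above, here is a minimal PyTorch-style sketch of one pre-training step. The module handles (`vit_encoder`, `build_feature_volume`, `neus_renderer`), loss weights, and output shapes are placeholders for illustration, not the authors' actual implementation.

```python
import torch

# Hypothetical module handles; names and signatures are illustrative only.
# vit_encoder: ViT returning per-patch features for (possibly masked) multi-view images
# build_feature_volume: lifts multi-view 2D feature maps into a 3D feature volume
# neus_renderer: differentiable NeuS-style renderer producing per-pixel outputs

def spa_pretrain_step(vit_encoder, build_feature_volume, neus_renderer,
                      images, intrinsics, extrinsics, targets, optimizer,
                      mask_ratio=0.5, w_rgb=1.0, w_depth=1.0, w_sem=1.0, w_reg=0.01):
    """One pre-training step: 2D features -> 3D volume -> rendered RGB/depth/semantics."""
    feats_2d = vit_encoder(images, mask_ratio=mask_ratio)            # (V, N_patches, C)
    volume = build_feature_volume(feats_2d, intrinsics, extrinsics)  # (X, Y, Z, C)
    rgb, depth, sem, reg = neus_renderer(volume, intrinsics, extrinsics)

    loss = (w_rgb * torch.nn.functional.mse_loss(rgb, targets["rgb"])
            + w_depth * torch.nn.functional.l1_loss(depth, targets["depth"])
            + w_sem * torch.nn.functional.mse_loss(sem, targets["semantic"])
            + w_reg * reg)  # reg stands in for the paper's regularization terms

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```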

Strengths

  1. The paper is very nicely written.

  2. The architecture is quite well thought out, tying several components together effectively, a feat that is not easy to achieve or make work.

  3. The evaluation benchmark has 268 tasks, which is quite extensive and a big improvement over previous benchmarks.

  4. Thorough ablations (mask ratio importance, dataset impact, etc.) are very informative.

  5. Results are quite nice, showing the potential of SPA.

Weaknesses

  1. Benchmark descriptions are not well written. Not clear what the tasks are supposed to be.

  2. Tables do not have sufficient captions, and it is a bit difficult to understand the metrics from the tables themselves.

  3. It is not clear from the tables which methods are adapted from vision community to solve embodied AI tasks, thus making it difficult to assess the fairness of the comparison.

  4. Real world task setting is missing the most common vision language tasks which might benefit from spatial awareness. How well does this perform for tasks solved by papers like "Spatial VLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities"?

Questions

  1. What are the mask tokens that are used to fill the masked patches in section 2.1? are they learnt?

  2. Can you describe the tasks and what they entail properly? It is not very clear from the paper itself what the single- and multi-task benchmarks are about.

  3. Can you provide results for proper established real world tasks (object detection or vision language based spatial aware tasks)? You can check the OpenEQA dataset, or the paper "Spatial VLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities" for this.

Comment

The results, which report overall Top-K accuracy on the test set, are provided below. These findings are consistent with our previous results and highlight that SPA outperforms other representation learning methods in the monocular grasp pose detection task. The details of this experiment have also been added into Appendix H.

| Method (ViT-Base) | CLIP | DINOv2 | MoCoV3 | MAE | SPA |
| --- | --- | --- | --- | --- | --- |
| Overall Accuracy | 21.10 | 22.08 | 29.39 | 31.03 | 31.20 |

[1] Gou, Minghao, et al. “RGB Matters: Learning 7-DoF Grasp Poses on Monocular RGBD Images”. Proceedings of the International Conference on Robotics and Automation (ICRA). 2021.

[2] Fang, Hao-Shu, et al. “GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping”. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020.

“What are the mask tokens that are used to fill the masked patches in section 2.1? are they learnt?”

Thank you for your question. Yes, the mask tokens used to fill the masked patches are learnable, similar to the approach employed in MAE.
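As a concrete illustration of what "learnable mask tokens" means here (our sketch, not the authors' code), an MAE-style mask token is a single trainable vector that is broadcast into the positions of masked patches:

```python
import torch
import torch.nn as nn

class MaskedPatchFiller(nn.Module):
    """Illustrative MAE-style mask token: one learnable vector shared by all masked patches."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        nn.init.normal_(self.mask_token, std=0.02)

    def forward(self, patch_tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, C); mask: (B, N) with True where the patch is masked
        return torch.where(mask.unsqueeze(-1),
                           self.mask_token.expand_as(patch_tokens),
                           patch_tokens)
```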

Comment

Dear Reviewer,

Thank you for your detailed review and positive feedback! We are delighted that you found our paper well-written and appreciated the design of our architecture. We are also pleased that you recognized the extensiveness of our evaluation benchmark and found the ablation studies informative. We appreciate your insights and will address your comments and questions below.

“Benchmark descriptions are not well written. Not clear what the tasks are supposed to be.”

“Can you describe the tasks and what they entail properly? It is not much clear from the paper itself what the single and multi task benchmarks are about.”

We are sorry for the confusion. We have added detailed task descriptions in Appendix B in the revised paper.

“Tables do not have sufficient captions, and is a bit difficult to understand the metrics from the tables themselves.”

Sorry for the confusion! To clarify, for the embodied benchmarks, the Mean S.R. metric refers to the "Mean Success Rate," representing the average success rate across all tasks, and serves as an indicator of overall performance. The Mean Rank reflects the average ranking of each method's success rate across tasks, offering a measure of relative performance.

For the camera pose estimation experiment in Table 5, Trans. denotes the translation error, which is computed as the Euclidean distance between the predicted and ground-truth camera pose translations. Rot. refers to the rotation error, measured as the geodesic distance between the predicted and ground-truth rotation quaternions. Further details on the camera pose evaluation can be found in Appendix E.
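For concreteness, the two pose-error metrics described above can be computed as in the following sketch (not the authors' evaluation code); it assumes unit quaternions in a consistent ordering and reports the rotation error in radians.

```python
import numpy as np

def translation_error(t_pred: np.ndarray, t_gt: np.ndarray) -> float:
    """Euclidean distance between predicted and ground-truth camera translations."""
    return float(np.linalg.norm(t_pred - t_gt))

def rotation_error(q_pred: np.ndarray, q_gt: np.ndarray) -> float:
    """Geodesic distance between two unit quaternions; |dot| handles the q / -q ambiguity."""
    q_pred = q_pred / np.linalg.norm(q_pred)
    q_gt = q_gt / np.linalg.norm(q_gt)
    dot = abs(float(np.dot(q_pred, q_gt)))
    return float(2.0 * np.arccos(np.clip(dot, -1.0, 1.0)))
```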

We have revised the captions in the updated paper to make the metrics clearer and more self-explanatory. Thank you again for your helpful feedback.

“It is not clear from the tables which methods are adapted from vision community to solve embodied AI tasks, thus making it difficult to assess the fairness of the comparison.”

Sorry for the confusion! We categorized the methods into three groups: vision-centric, multi-modal, and embodied-specific. The vision-centric methods, including MoCoV3, MAE, and DINOv2, are originally from the vision community, and we evaluate their effectiveness in embodied AI tasks. The multi-modal methods include CLIP, EVA, and InternViT; they are typically CLIP-style language-image pre-trained models and are used specifically for VLMs. The embodied-specific methods, including MVP, VC-1, and our SPA, are designed and pre-trained specifically for embodied AI tasks. We have summarized the categories in Table 2 and Table 3, and we have added clearer explanations in Section 5.1 of the revised paper.

“Real world task setting is missing the most common vision language tasks which might benefit from spatial awareness. How well does this perform for tasks solved by papers like "Spatial VLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities"?”

“Can you provide results for proper established real world tasks (object detection or vision language based spatial aware tasks)?”

Thank you for your thoughtful suggestion! We agree that VLMs represent an exciting area of research, and we are eager to explore their potential in future work. This direction, including works like SpatialVLM and OpenEQA, is further discussed in our "Future Work" section. A key aspect of VLMs is the alignment or grounding between vision and language, which is typically achieved through the use of models like CLIP or its variants as vision encoders. In the current implementation of SPA, language grounding has not yet been incorporated during pre-training. However, it can be easily extended by rendering additional CLIP feature maps, for example.

That said, we fully agree that presenting results on relevant real-world tasks would further strengthen our work. To address this, we have conducted an additional experiment on monocular grasp pose detection, a task similar to monocular 3D object detection and closely related to embodied AI. We followed the experimental setup in [1], which involves training a neural network to detect 7-DoF grasp poses from monocular image observations. The experiment was conducted on the GraspNet-1Billion dataset [2], a large-scale benchmark for real-world object grasping. We adhered to the official implementation’s hyperparameters and settings, with the exception of replacing the default ResNet encoder with various pre-trained ViT models for feature extraction. All models used the ViT-Base architecture, and the pre-trained representations were frozen during training.
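A minimal sketch of the encoder swap described above is shown below, using the `timm` library to instantiate a ViT-Base backbone and freeze it. The checkpoint loading and the downstream grasp head are placeholders for illustration, not the setup of [1] or the authors' exact code.

```python
import timm
import torch
import torch.nn as nn

# ViT-Base backbone used as a frozen feature extractor; in practice the pre-trained
# SPA / CLIP / DINOv2 / MAE weights would be loaded into it instead.
encoder = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

# Hypothetical grasp head: maps frozen ViT features to a 7-DoF grasp pose
# (3D translation + quaternion); the real pipeline in [1] is more involved.
grasp_head = nn.Sequential(nn.Linear(encoder.num_features, 256), nn.ReLU(), nn.Linear(256, 7))

images = torch.randn(4, 3, 224, 224)   # dummy monocular RGB batch
with torch.no_grad():
    feats = encoder(images)             # (4, 768) pooled features from the frozen encoder
grasp_pred = grasp_head(feats)          # (4, 7); only the head is trained with the task loss
```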

<CONTINUED>

Comment

Dear reviewer,

We wonder if our response answers your questions and addresses your concerns? If yes, would you kindly consider raising the score? Thanks again for your very constructive and insightful feedback!

Comment

Dear reviewer,

As the discussion stage is ending soon, we wonder if our response answers your questions and addresses your concerns? If yes, would you kindly consider raising the score? Thanks again for your very constructive and insightful feedback!

Comment

Thank you for your effort on the rebuttal. My concerns have been addressed and I would like to keep my score. Good luck.

Comment

Dear Reviewer,

Thank you very much for the response! We welcome any follow-up discussions!

Review
Rating: 6

The paper presents SPA, an innovative framework for representation learning that enhances 3D spatial awareness in embodied AI. SPA integrates differentiable neural rendering on multi-view images to give a ViT a strong sense of spatial understanding, enabling it to excel in various embodied tasks. The authors conducted an extensive evaluation across 268 tasks in 8 different simulators, addressing both single-task and language-conditioned multi-task scenarios. SPA achieves superior performance with less training data. Real-world experiments confirm SPA’s practical effectiveness, underscoring the importance of 3D spatial awareness in representation learning for embodied AI. Overall, the paper is well-written and the experiments are extensive.

Strengths

The paper is well-written. The method named SPA is simple yet effective. The authors have carefully designed their experiments, showcasing SPA's impressive performance across a wide range of task types, including both single-task and language-conditioned multi-task scenarios. This level of comprehensive evaluation highlights the versatility and robustness of SPA. I greatly appreciate the extensive efforts the authors have invested in the evaluation process, as they rigorously compared SPA against multiple state-of-the-art representation methods.

Weaknesses

My main concerns about this paper are listed below.

  1. The author claims that the paper proposes a significant spatial hypothesis that 3D spatial awareness is crucial for embodied representation learning. However, I believe this hypothesis has been clear for a long time, and that is also why many works try to use 3D features for embodied tasks. I appreciate the authors' efforts to demonstrate this, but I do not think it is a significant 'new' hypothesis.

  2. About the methodology. Previous works have tried to generate 3D representations from 2D images and use them for embodied tasks. But here, the authors randomly mask patches across multi-view images, so I'm worried about the quality of the volume construction. MAE leverages a high mask ratio, and I'm wondering whether the quality of the construction will be affected, which would then make the training objective too easy during the volume rendering.

  3. As the paper is about how to integrate 3D spatial awareness into 2D backbones, I believe some work about learning 3D features from 2D images should be further discussed. However, in section "Representation Learning for Embodied AI", I didn't see too much about this.

Overall, I think there are still some questions about this paper, both on writing and methodology. But I may consider raising the score if the authors can explain more about question 2, and I would like to know the authors' attitude towards questions 1 and 3.

Questions

My questions are listed in the weaknesses section.

Comment

Finally, our volume construction process is explicit, which projects voxel features into multi-view image spaces based on their coordinates. Additionally, our use of deformable attention expands the receptive field, allowing the model to attend to unmasked areas. Therefore, even with a relatively high mask ratio, the quality of volume construction remains robust.
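To make the explicit projection step concrete, the following sketch (assuming pinhole cameras; not the authors' code) projects voxel centers into each view and bilinearly samples the corresponding 2D features; in SPA the sampled features would additionally be aggregated with deformable attention.

```python
import torch
import torch.nn.functional as F

def project_voxels_to_views(voxel_xyz, feat_maps, intrinsics, extrinsics):
    """voxel_xyz: (N, 3) world coordinates; feat_maps: (V, C, H, W);
    intrinsics: (V, 3, 3); extrinsics: (V, 4, 4) world-to-camera. Returns (V, N, C)."""
    V, C, H, W = feat_maps.shape
    ones = torch.ones(voxel_xyz.shape[0], 1, device=voxel_xyz.device)
    homo = torch.cat([voxel_xyz, ones], dim=1)                      # (N, 4) homogeneous coords
    cam = torch.einsum("vij,nj->vni", extrinsics, homo)[..., :3]    # (V, N, 3) camera coords
    pix = torch.einsum("vij,vnj->vni", intrinsics, cam)             # (V, N, 3)
    uv = pix[..., :2] / pix[..., 2:3].clamp(min=1e-6)               # perspective divide
    # normalize pixel coordinates to [-1, 1] for grid_sample
    grid = torch.stack([uv[..., 0] / (W - 1) * 2 - 1,
                        uv[..., 1] / (H - 1) * 2 - 1], dim=-1)      # (V, N, 2)
    sampled = F.grid_sample(feat_maps, grid.unsqueeze(2), align_corners=True)  # (V, C, N, 1)
    return sampled.squeeze(-1).permute(0, 2, 1)                     # (V, N, C)
```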

| Mask Ratio | Adroit | MetaWorld | DMControl | TriFinger | Mean Success Rate |
| --- | --- | --- | --- | --- | --- |
| 0.00 | 53.3±4.6 | 88.5±5.7 | 57.5±2.6 | 74.1±0.6 | 70.36 |
| 0.25 | 52.7±3.1 | 89.6±4.5 | 57.6±3.0 | 70.4±1.7 | 70.17 |
| 0.50 | 53.3±4.2 | 88.8±1.6 | 60.1±3.1 | 72.6±0.7 | 71.18 |
| 0.75 | 51.3±1.2 | 88.0±3.5 | 61.1±3.5 | 73.0±0.8 | 71.01 |
| 0.95 | 51.3±1.2 | 85.6±4.0 | 62.5±5.3 | 73.1±0.2 | 70.67 |

“As the paper is about how to integrate 3D spatial awareness into 2D backbones, I believe some work about learning 3D features from 2D images should be further discussed. However, in section ‘Representation Learning for Embodied AI’, I didn't see too much about this.”

Thank you for this valuable suggestion! As we mentioned earlier, most previous works on 3D robot learning have predominantly focused on using explicit 3D input observations [1][2][3] or lifting 2D features into 3D spaces [4][5]. In contrast, our approach directly encodes 3D knowledge into the 2D backbone without explicit 3D input data, with a focus on representation pre-training.

While there are several computer vision methods that operate in a similar space [6][7][8], our approach offers certain advantages. For example, unlike [8], which requires point cloud data for pair-wise contrastive learning with image pixels, our method does not rely on such inputs, making it more versatile and accessible.

That said, we agree that a more detailed discussion of related work on 3D robot learning and 3D representation learning in computer vision would further enrich the paper. In response to your suggestion, we have expanded the related work section (see “3D Robot Learning and 3D-Aware Computer Vision”) to include these discussions. Thank you again for helping us improve the paper.

[4] Ke, Tsung-Wei, et al. “3D Diffuser Actor: Policy Diffusion with 3D Scene Representations.” Conference on Robot Learning (CoRL). 2024.

[5] Goyal, Ankit, et al. "RVT-2: Learning Precise Manipulation from Few Demonstrations." Robotics: Science and Systems (RSS). 2024.

[6] Yang, Honghui, et al. "UniPAD: A Universal Pre-training Paradigm for Autonomous Driving." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2024.

[7] Yue, Yuanwen, et al. "Improving 2D feature representations by 3D-aware fine-tuning." European Conference on Computer Vision (ECCV). 2025.

[8] Zhang, Sha, et al. “Hvdistill: Transferring knowledge from images to point clouds via unsupervised hybrid-view distillation.” International Journal of Computer Vision (IJCV). 2024.

Comment

Dear Reviewer,

We truly appreciate your thorough review, positive feedback, and time and efforts taken to help us strengthen the paper even more! We are thrilled that you found the paper well-written and recognized the simplicity and effectiveness of SPA. We are especially pleased that you appreciated our comprehensive evaluation. Your acknowledgment of the careful experimental design and the rigorous comparisons with state-of-the-art methods is greatly appreciated. We will address your concerns and questions below.

“The author claims that the paper proposes a significant spatial hypothesis that 3D spatial awareness is crucial for embodied representation learning. … I appreciate the authors' efforts to demonstrate this, but I do not think it is a significant 'new' hypothesis.”

Thank you for your thoughtful comment. We agree that 3D robot learning has been explored in numerous works, such as [1][2][3], as we acknowledge in the paper (e.g., Lines 046–048). However, much of the prior research relies on explicit 3D input observations, which are often challenging to obtain and scale effectively. In contrast, the focus of our work is on representation learning: learning pre-trained knowledge from large-scale, unlabeled raw images. While both approaches involve 3D spatial understanding, they address different challenges and contexts.

Our hypothesis emphasizes the importance of 3D spatial awareness specifically within the domain of representation learning for embodied AI, a perspective we believe has not been systematically explored in prior work. To the best of our knowledge, SPA is the first approach to learn 3D spatial-aware representations using a vanilla 2D encoder in the context of representation learning for embodied AI. Although it may seem intuitive that 3D spatial awareness can benefit embodied representation learning, previous methods have not explicitly incorporated or empirically validated this hypothesis. Therefore, we believe our focus on this hypothesis, along with our proposed method and large-scale evaluation, represents a novel and significant contribution to the field.

[1] Mohit, Shridhar, et al. "Perceiver-actor: A multi-task transformer for robotic manipulation." Conference on Robot Learning (CoRL). 2023.

[2] Zhu, Haoyi, et al. “Point Cloud Matters: Rethinking the Impact of Different Observation Spaces on Robot Learning.” Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track. 2024.

[3] Ze, Yanjie, et al. “3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations.” Robotics: Science and Systems (RSS). 2024.

“Previous works have tried to generate 3D representations from 2D images and use them for embodied tasks. But here, the authors randomly mask patches across multi-view images, so I'm worried about the quality of the volume construction. MAE leverages a high mask ratio, and I'm wondering whether the quality of the construction will be affected, which would then make the training objective too easy during the volume rendering.”

Thank you for raising this excellent question! You are correct that higher mask ratios can impact the quality of rendering output during pre-training. For instance, MAE uses a mask ratio as high as 75%, which results in relatively low reconstruction quality. However, as observed in both MAE and in our experiments, better reconstruction or rendering quality does not necessarily lead to better representations for downstream tasks (actually sometimes worse). Our primary objective is to learn effective representations rather than to achieve high-fidelity reconstructions.

Additionally, similar to MAE, we apply masking only during the pre-training phase. In downstream embodied tasks, no masking is applied, and the rendering decoder is discarded. We only utilize the pre-trained ViT encoder for feature extraction. As a result, the masking strategy and volume construction during pre-training do not directly influence the downstream tasks.

Moreover, while we adopt a multi-view masking strategy, it differs slightly from the original MAE. We use a mask ratio of 50%, which was found to be optimal based on our ablation studies (see Table 7, also provided below for convenience). The masked patches are selected independently across views, meaning that as long as there is overlap between the multi-view images, the model can infer the missing areas from other views or adjacent patches, thereby enhancing its spatial awareness. Since SPA is asked to render not only RGB images but also depth images during pre-training, the training objective remains non-trivial.
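As an illustration of the multi-view masking strategy described above (a sketch under the assumption of per-view uniform random sampling, not the released code), the masks are drawn independently for each view:

```python
import torch

def independent_multiview_masks(num_views: int, num_patches: int,
                                mask_ratio: float = 0.5) -> torch.Tensor:
    """Boolean masks of shape (num_views, num_patches); True marks a masked patch.
    Because masks are sampled independently per view, a region hidden in one view is
    usually visible in another, which is what forces cross-view spatial reasoning."""
    num_masked = int(num_patches * mask_ratio)
    scores = torch.rand(num_views, num_patches)
    idx = scores.argsort(dim=1)[:, :num_masked]        # lowest-score patches are masked
    masks = torch.zeros(num_views, num_patches, dtype=torch.bool)
    masks.scatter_(1, idx, True)
    return masks

masks = independent_multiview_masks(num_views=4, num_patches=196, mask_ratio=0.5)
```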

<CONTINUED>

Comment

Dear reviewer,

We wonder if our response answers your questions and addresses your concerns? If yes, would you kindly consider raising the score? Thanks again for your very constructive and insightful feedback!

Comment

Thanks for the efforts! The authors have addressed most of my concerns and I would like to raise my score.

Comment

Dear Reviewer,

Thank you so much for the update! We sincerely appreciate your constructive feedback and valuable advice!

Review
Rating: 6

This work introduces SPA, a representation learning method that incorporates 3D spatial awareness into Vision Transformers using differentiable rendering as a pretraining objective. Starting with multi-view images and camera poses, the method constructs feature volumes using deformable attention and employs NeuS-based volume rendering to generate self-supervised RGBD and semantic maps for pretraining. The authors claim their 3D pre-training objective better captures the 3D spatial relationships for embodied tasks. They benchmark their approach through an extensive evaluation, spanning five existing benchmarks (VC-1, Franka, Meta-World, RLBench and LIBERO). The results demonstrate consistent improvements over both vision-centric and embodied-specific baselines, with particularly strong performance on zero-shot camera pose estimation tasks.

Strengths

Method addresses an important gap in current representation learning approaches by explicitly incorporating 3D spatial understanding through differentiable rendering.

Comprehensive evaluation across a large number of tasks and simulators demonstrates the broad applicability of their approach.

The self-supervised nature of their training signal (generated RGBD and semantic maps) is an interesting direction that reduces the need for expensive labeled data.

Weaknesses

The paper's performance on LIBERO-spatial (Table 4) is somewhat counterintuitive. This seems like quite an important benchmark out of all the evaluation tasks, and given SPA's neural rendering pretraining objective, one would expect stronger results on spatial tasks.

It seems to me that AM-RADIO should be a baseline comparison in Table 3, given that its feature maps are used as supervision during pre-training.

Questions

Could the authors clarify if Section 2.2 represents a novel contribution or builds on existing methods?

I believe the DROID dataset consists of dynamic scenes, which is not explicitly handled by the NeuS volume rendering. Did the authors do anything special to handle this?

Minor suggestions:

Table 1 appears to be structured more like an ablation study than a main result. For reading flow, it might be helpful to move this to a later section.

Figure 1 is a bit hard to read due to choice of colors.

Comment

“Table 1 appears to be structured more like an ablation study than a main result. For reading flow, it might be helpful to move this to a later section.”

Thanks for the suggestion! We have revised the paper and adjusted the section order to improve the reading flow.

“Figure 1 is a bit hard to read due to choice of colors.”

Thank you for pointing this out! In the revised paper, we have adjusted the color scheme of Figure 1 to improve clarity. We appreciate your feedback and welcome any further suggestions.

Comment

Dear Reviewer,

Thank you for your insightful review and positive feedback. We are pleased that you found our incorporation of 3D spatial understanding through differentiable rendering valuable and appreciated the broad applicability of our evaluation. We are also glad that the self-supervised training signal and its potential to reduce the need for labeled data resonated with you. Below, we address your concerns and comments:

“The paper's performance on LIBERO-spatial (Table 4) is somewhat counterintuitive. …, one would expect stronger results on spatial tasks.”

This is an insightful observation. It is important to clarify the distinction between the types of "spatial" reasoning being evaluated. In the LIBERO-Spatial benchmark, the term "spatial" refers specifically to spatial reasoning based on language relationships between objects, which differs from the visual "spatial awareness" that is central to our work. For example, the LIBERO-Spatial tasks involve instructions such as:

  • pick up the black bowl between the plate and the ramekin and place it on the plate
  • pick up the black bowl next to the ramekin and place it on the plate
  • pick up the black bowl from table center and place it on the plate

As these examples show, the tasks require the model to understand spatial relationships as described in language. Consequently, models like EVA, which are pre-trained on language-image data, tend to perform well on these tasks. In contrast, SPA does not incorporate language supervision, which may explain its comparatively lower performance in LIBERO-Spatial. While it would be relatively straightforward to extend SPA with language supervision, for example, by incorporating CLIP feature map rendering, this is not the primary focus of our current work, as outlined in Section 2.4.

Moreover, in our real-world tasks, the policy does not rely on any language inputs, and objects are placed in random locations and orientations during both training and evaluation. These tasks focus purely on spatial factors, and the results effectively demonstrate SPA’s strong spatial awareness, independent of language-based spatial reasoning.

“It seems to me that AM-RADIO should be a baseline comparison in Table 3, given that the feature maps are used as supervision during pre-training.”

Thanks for the suggestion! We have updated the table in the revised paper.

“Could the authors clarify if Section 2.2 represents a novel contribution or builds on existing methods?”

Thank you for this great question! While Section 2.2 builds upon established techniques from the autonomous driving community, we believe it also constitutes a novel contribution within the context of embodied representation learning. Specifically, the application of 3D volume construction without depth or point cloud information remains uncommon in robot learning—particularly in embodied representation learning. Just as works like MVP and VC-1 adapt the same method from MAE to the embodied representation domain, we believe that applying these techniques in a new context represents a meaningful and novel contribution. Building on this framework, we have discovered the critical importance of 3D spatial awareness in pre-training, and we conducted extensive validation and evaluation to substantiate our claims. We believe these findings could inspire further advancements in the field.

“I believe the DROID dataset consists of dynamic scenes, which is not explicitly handled by the NeuS volume rendering. Did the authors do anything special to handle this?”

Great point! Our current work primarily focuses on 3D spatial awareness, which is orthogonal to the temporal dimension. As a result, we process single timestamp images as input, similar to approaches like MVP and VC-1, which extract individual frames from dynamic videos. For the Droid dataset, due to the high similarity between frames, the videos are first downsampled by a factor of 15 during pre-processing, resulting in 1.78 million extracted image frames. More pre-processing details can be found in Appendix C.1.

Looking ahead, we agree that handling dynamic scenes could be a promising future research direction. Thanks to recent advances in dynamic rendering [1], we could replace the NeuS renderer with methods such as D-NeRF [2] or 4D-GS [3] that support dynamic scene rendering.

[1] Yunus, Raza, et al. "Recent Trends in 3D Reconstruction of General Non-Rigid Scenes." COMPUTER GRAPHICS. 2024.

[2] Pumarola, Albert, et al. "D-nerf: Neural radiance fields for dynamic scenes." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021.

[3] Wu, Guanjun, et al. "4D Gaussian Splatting for Real-Time Dynamic Scene Rendering." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2024.

<CONTINUED>

Comment

Dear reviewer,

We wonder if our response answers your questions and addresses your concerns? If yes, would you kindly consider raising the score? Thanks again for your very constructive and insightful feedback!

Comment

Dear reviewer,

As the discussion stage is ending soon, we wonder if our response answers your questions and addresses your concerns? If yes, would you kindly consider raising the score? Thanks again for your very constructive and insightful feedback!

Comment

Thank you for your detailed response and clarification on the Libero-Spatial benchmark.

Regarding Section 2.2: Ok great! Thank you for the clarifications. I agree with the authors that this module serves as a contribution for robot learning even though the components are from prior work.

As of now, I am leaning towards keeping my original score but will wait for additional reviewer discussion to update my score.

Comment

Dear Reviewer,

Thank you very much for the response! We welcome any follow-up discussions!

Review
Rating: 6

This paper proposes a representation learning framework named SPA to incorporate 3D spatial awareness into embodied AI. SPA represents an advancement in embodied representation learning by enhancing a standard ViT with intrinsic 3D spatial understanding through neural rendering. The comprehensive evaluation and consistent empirical success show the effectiveness of spatial awareness in embodied AI tasks.

Strengths

  1. The paper conducts one of the most extensive evaluations in embodied AI representation learning to date, covering 268 tasks across 8 different simulators. This large-scale evaluation provides a thorough comparison with multiple state-of-the-art methods showing a significant level of empirical rigor.
  2. SPA uses neural rendering and multi-view images to enhance the 3D spatial awareness of the ViT, which is an effective way to give the model a better understanding of depth information and spatial relationships in 3D scenes.
  3. The paper makes an important conceptual contribution by proposing and validating the spatial awareness hypothesis, which could guide future research in embodied AI representation learning.

Weaknesses

  1. SPA simply extends and adapts the ViT by adding neural rendering and 3D spatial features to enhance expressiveness, but the underlying architecture is still the same ViT that already exists; in contrast, many new model architectures make more independent innovations at the algorithmic level. This paper lacks sufficient groundbreaking innovation in model architecture.
  2. The evaluation of SPA focuses primarily on imitative learning and does not fully explore reinforcement learning or other complex learning paradigms. This limits the scope of understanding the generalizability and performance of the model under different real-world conditions.
  3. SPA is designed for static multi-view images, which constrains the ability to model dynamics and temporal changes that are essential in many embodied AI scenarios. Moreover, the paper proposes that 3D spatial awareness is critical; however, it does not consider temporal relationships.
  4. While the paper claims that SPA has achieved significant improvements in 3D spatial perception, there is a lack of independent comparative experiments to validate the contribution of neural rendering for the overall performance improvement. The specific effects of multi-view learning and 3D spatial understanding are not isolated. This makes it impossible to determine whether the improvements come from 3D spatial awareness or from better data or other improvements in training strategies, etc.

Questions

  1. SPA currently focuses on static multi-view scenes. How does the model generalize to dynamic environments, especially where temporal information is critical?
  2. The evaluation is focused solely on imitation learning. How would SPA perform in a reinforcement learning setting?
  3. Are there any optimizations or lightweight versions of SPA that could reduce computational costs?

Comment

Dear Reviewer,

Thank you for your thoughtful review, valuable feedback, and the time taken to provide constructive feedback to strengthen our work further! We are pleased that you recognize the scale of our evaluation, the effectiveness of our approach in enhancing 3D spatial awareness, and the conceptual contribution of the spatial awareness hypothesis. We will address your concerns and questions below.

“SPA simply extends and adapts the ViT... This paper lacks sufficient groundbreaking innovation in model architecture.”

Thanks for the question. We intentionally build our method on top of the vanilla ViT encoder, as our primary focus is on representation learning rather than model architecture design, which we believe are complementary but orthogonal areas of research. Similar to well-established representation learning methods in computer vision, such as MAE, MoCo, and DINOv2, which also use the ViT encoder, we consider this a common and reasonable choice for studying representation learning.

Additionally, the ViT architecture is highly versatile and can be easily integrated into current Vision-Language Models (VLMs), ensuring that our method can be potentially applied in a broader range of applications.

Lastly, we opted for the vanilla ViT to ensure a fair comparison with other methods. While incorporating architectural modifications could potentially enhance performance, doing so would introduce variability that could affect the fairness of the comparisons. Additionally, this lies outside the scope of our current research, which is focused on advancing representation learning specifically.

“The evaluation of SPA focuses primarily on imitative learning and does not fully explore reinforcement learning or other complex learning paradigms.”

“The evaluation is focused solely on imitation learning. How would SPA perform in a reinforcement learning setting?”

Thank you for your insightful comment. We recognize that robot learning encompasses a variety of paradigms, including both imitation learning and reinforcement learning. Our initial focus was on imitation learning due to its practicality and widespread adoption in real-world applications. For this reason, we implemented a diverse set of policy methods (e.g., MLP, transformer, RVT, diffusion policy). Additionally, focusing on imitation learning for embodied representation evaluation is a common practice, as demonstrated in several recent works [1][2][3].

[1] Nair, Suraj, et al. "R3M: A Universal Visual Representation for Robot Manipulation." Conference on Robot Learning (CoRL). 2023.

[2] Karamcheti, Siddharth, et al. "Language-driven representation learning for robotics." Robotics: Science and Systems (RSS). 2023.

[3] Zeng, Jia, et al. "Learning Manipulation by Predicting Interaction." Robotics: Science and Systems (RSS). 2024.

That said, we agree that incorporating reinforcement learning or other paradigm experiments could further strengthen our work. In response to your suggestion, we have conducted two extra experiments and added the discussion to Appendix G and Appendix H.

<CONTINUED>

Comment

Additional Experiment 1: Reinforcement Learning

We conduct additional RL experiments following the settings in [4] and using DrQ-v2 [5], a state-of-the-art off-policy actor-critic approach for continuous vision-based control. We train RL agents with different pre-trained vision representations, all with the ViT-Base architecture. The vision encoders are frozen during RL training. Five tasks from the Meta-World benchmark are chosen, as shown below. We train for a total of 1.1M frames, and all other hyper-parameters, including random seeds, are kept at their defaults and identical across runs. We run three seeds for each experiment. We report the evaluation success rate and episode reward below, as well as in Table 9 in the revised paper. The reward curves are also visualized in Figure 7 in Appendix G. From the results, we can observe that under RL settings, our 3D spatial-aware representation still performs better than other representation learning methods.

| Meta-World RL Task | Method (ViT-B) | Success Rate | Episode Reward |
| --- | --- | --- | --- |
| button-press-topdown-v2 | CLIP | 0.93 | 653.97 |
| | DINOv2 | 1.00 | 746.04 |
| | MAE | 0.46 | 517.54 |
| | MoCoV3 | 0.99 | 749.93 |
| | SPA (Ours) | 1.00 | 778.47 |
| hammer-v2 | CLIP | 0.00 | 401.41 |
| | DINOv2 | 0.67 | 746.74 |
| | MAE | 0.66 | 720.19 |
| | MoCoV3 | 0.59 | 645.46 |
| | SPA (Ours) | 1.00 | 870.32 |
| lever-pull-v2 | CLIP | 0.00 | 478.18 |
| | DINOv2 | 0.00 | 694.73 |
| | MAE | 0.00 | 540.44 |
| | MoCoV3 | 0.23 | 598.54 |
| | SPA (Ours) | 0.15 | 646.33 |
| coffee-pull-v2 | CLIP | 0.00 | 181.40 |
| | DINOv2 | 0.00 | 180.72 |
| | MAE | 0.00 | 184.56 |
| | MoCoV3 | 0.00 | 225.73 |
| | SPA (Ours) | 0.00 | 262.11 |
| drawer-close-v2 | CLIP | 1.00 | 1228.90 |
| | DINOv2 | 1.00 | 1236.30 |
| | MAE | 1.00 | 1233.91 |
| | MoCoV3 | 1.00 | 1233.46 |
| | SPA (Ours) | 1.00 | 1235.81 |
| Mean | CLIP | 0.39 | 588.77 |
| | DINOv2 | 0.53 | 720.91 |
| | MAE | 0.42 | 639.33 |
| | MoCoV3 | 0.56 | 690.63 |
| | SPA (Ours) | 0.63 | 758.61 |

[4] Hu, Yingdong, et al. “For Pre-Trained Vision Models in Motor Control, Not All Policy Learning Methods are Created Equal.” International Conference on Machine Learning (ICML). 2023.

[5] Yarats, Denis, et al. "Mastering visual continuous control: Improved data-augmented reinforcement learning." International Conference on Learning Representations (ICLR). 2022.

Additional Experiment 2: Monocular Grasp Pose Detection

We also conduct a monocular grasp pose detection experiment to further investigate more complex robotics learning paradigms. We follow similar settings in [6], which train a neural network to detect the 7-DoF grasp poses on monocular image observations. The experiment is conducted on GraspNet-1Billion [7], a large-scale real-world object grasping benchmark. We follow the hyper-parameters and setups in the official implementation, except that we replace the default ResNet with different pre-trained ViT models for feature extraction. All pre-trained representations are with ViT-Base architecture and are frozen during training. We report the overall Top-K accuracy on the test set below. The results align well with our findings and indicate that SPA also outperforms other representation learning methods in the monocular grasp pose detection task.

| Method (ViT-B) | CLIP | DINOv2 | MoCoV3 | MAE | SPA |
| --- | --- | --- | --- | --- | --- |
| Test Accuracy | 21.10 | 22.08 | 29.39 | 31.03 | 31.20 |

[6] Gou, Minghao, et al. “RGB Matters: Learning 7-DoF Grasp Poses on Monocular RGBD Images”. Proceedings of the International Conference on Robotics and Automation (ICRA). 2021.

[7] Fang, Hao-Shu, et al. “GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping”. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020.

<CONTINUED>

Comment

“SPA is designed for static multi-view images… it does not consider temporal relationships.”

“SPA currently focuses on static multi-view scenes. How does the model generalize to dynamic environments, especially where temporal information is critical?”

Great point! Currently, SPA primarily focuses on static image representation learning, similar to methods like MAE, MVP, and VC-1. Our primary goal is to investigate the effectiveness of spatial awareness, and thus we have centered our approach on this aspect. The utility of temporal information has already been demonstrated by several existing works, such as R3M and Voltron.

However, we fully agree that incorporating dynamic temporal information into SPA, which is orthogonal to 3D spatial awareness, could further improve performance. We briefly touch on this possibility in the future work section (Section 7) of our paper, and we are excited about the potential benefits this extension could bring. While we leave this exploration for future work, we can suggest some possible approaches:

  • We could leverage recent advancements in dynamic rendering techniques from 3D vision and replace the current NeuS renderer with a dynamic renderer, such as D-NeRF [1] or 4D-GS [2]. Given the rapid development in the dynamic rendering space [3], we believe this extension would be a natural and promising direction.

  • Another approach could involve integrating temporal representation learning frameworks, such as R3M [4] and MPI [5], to introduce additional temporal contrastive or prediction tasks. For instance, we could compute temporal contrastive losses between the rendered outputs at different timestamps to capture the temporal dynamics.

[1] Pumarola, Albert, et al. "D-nerf: Neural radiance fields for dynamic scenes." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021.

[2] Wu, Guanjun, et al. "4D Gaussian Splatting for Real-Time Dynamic Scene Rendering." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2024.

[3] Yunus, Raza, et al. "Recent Trends in 3D Reconstruction of General Non-Rigid Scenes." COMPUTER GRAPHICS. 2024.

[4] Nair, Suraj, et al. "R3M: A Universal Visual Representation for Robot Manipulation." Conference on Robot Learning (CoRL). 2023.

[5] Zeng, Jia, et al. "Learning Manipulation by Predicting Interaction." Robotics: Science and Systems (RSS). 2024.

“…, there is a lack of independent comparative experiments to validate the contribution of neural rendering for the overall performance improvement. … This makes it impossible to determine whether the improvements come from 3D spatial awareness or from better data or other improvements in training strategies, etc.”

Thank you for your insightful question! To clarify the contribution of neural rendering to the overall performance of SPA, we conducted an additional ablation study. In this study, we maintained all settings identical—data loading, training techniques, hyperparameters, and the encoder—while replacing the volume neural rendering decoder with a multiview transformer-based decoder, similar to the MAE decoder. This alternative decoder receives masked patches filled with mask tokens corresponding to multiview images. Additional camera pose embeddings are added, and attention layers are used to fuse the multiview information and reconstruct RGB and depth images. We refer to this baseline as MV-MAE. It was trained on the ScanNet dataset without semantic supervision, ensuring a fair comparison with the result in the last line of Table 7 of the mask ratio and loss components ablation study. The results from this experiment demonstrate that neural rendering is crucial for incorporating explicit 3D spatial information. Simple multiview attention-based interaction, as used in MV-MAE, does not perform as effectively in learning 3D spatial awareness.

| Method (ViT-B) | Meta-World | DMControl |
| --- | --- | --- |
| MV-MAE | 84.8±5.8 | 59.6±3.2 |
| SPA | 88.0±4.5 | 61.5±3.4 |
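For reference, the MV-MAE baseline described above could look roughly like the following sketch. The layer count, embedding size, and pose-embedding MLP are our assumptions for illustration, not the exact ablation code.

```python
import torch
import torch.nn as nn

class MVMAEDecoder(nn.Module):
    """Illustrative multi-view MAE-style decoder: fuse per-view patch tokens plus
    camera-pose embeddings with self-attention, then reconstruct RGB and depth per patch."""
    def __init__(self, embed_dim=512, depth=4, num_heads=8, patch_size=16):
        super().__init__()
        self.pose_embed = nn.Linear(16, embed_dim)        # flattened 4x4 camera pose
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.rgb_head = nn.Linear(embed_dim, patch_size * patch_size * 3)
        self.depth_head = nn.Linear(embed_dim, patch_size * patch_size)

    def forward(self, tokens, poses):
        # tokens: (B, V, N, C) per-view patch tokens with mask tokens already filled in
        # poses:  (B, V, 4, 4) camera extrinsics
        B, V, N, C = tokens.shape
        pose_tok = self.pose_embed(poses.flatten(2))      # (B, V, C)
        x = tokens + pose_tok.unsqueeze(2)                # add pose embedding to every patch
        x = self.blocks(x.reshape(B, V * N, C))           # attention fuses info across views
        x = x.reshape(B, V, N, C)
        return self.rgb_head(x), self.depth_head(x)       # per-patch RGB and depth targets
```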

<CONTINUED>

Comment

Moreover, to further substantiate that SPA’s performance improvements are primarily due to its 3D spatial awareness capabilities, rather than data quality or training strategies, we have conducted another ablation study, with results presented in Table 6 of the paper. Specifically, we used an ImageNet pre-trained MAE model, a strong representation learner, and loaded both its encoder and decoder weights. We then continued to pre-train this model using the same datasets, training strategies, and pre-training steps as SPA. This baseline is referred to as SPA-MAE. Under these controlled conditions, SPA consistently outperforms SPA-MAE, reinforcing the conclusion that the improvements in SPA’s performance stem from its pre-training objective, which explicitly incorporates 3D spatial awareness, rather than from differences in data or training methodologies.

| Method (ViT-B) | Adroit | Meta-World | DMControl | TriFinger | Mean Success Rate |
| --- | --- | --- | --- | --- | --- |
| SPA-MAE | 55.33±3.06 | 90.67±6.00 | 63.85±3.60 | 70.14±0.98 | 73.11 |
| SPA | 52.00±3.46 | 92.00±4.16 | 64.21±3.52 | 73.06±0.51 | 73.66 |

“Are there any optimizations or lightweight versions of SPA that could reduce computational costs?”

Thank you for this insightful suggestion! Our primary results are based on ViT-Large, as this model size is commonly used in recent representation learning methods, including MAE, MoCoV3, DINOv2, CLIP, InternViT, VC-1, and others. Furthermore, in real-world scenarios, we observe that a single 4090 GPU is capable of supporting real-time inference with ViT-Large.

For convenience, we have also pre-trained a ViT-Base version of SPA, which demonstrates competitive performance and outperforms previous ViT-Base representation methods, as shown in Table 4. ViT-Base is another widely used model size in practical applications. While we have not yet pre-trained smaller versions, such as ViT-Small or ViT-Tiny, we would be happy to explore these lighter architectures if the community is interested.

Comment

Dear reviewer,

We wonder if our response answers your questions and addresses your concerns? If yes, would you kindly consider raising the score? Thanks again for your very constructive and insightful feedback!

Comment

Dear reviewer,

As the discussion stage is ending soon, we wonder if our response answers your questions and addresses your concerns? If yes, would you kindly consider raising the score? Thanks again for your very constructive and insightful feedback!

Comment

Dear reviewer,

As the discussion stage is ending soon, we wonder if our response answers your questions and addresses your concerns? If yes, would you kindly consider raising the score? Thanks again for your very constructive and insightful feedback!

Comment

Thanks for your response and the additional experiments. I will raise my score.

Comment

Dear Reviewer,

Thank you so much for the update! We sincerely appreciate your constructive feedback and valuable advice!

Comment

Dear reviewers and meta-reviewers,

We are grateful for the time you have spent providing us with constructive feedback and thoughtful comments. We sincerely appreciate that all reviewers found our large-scale evaluation, methodological contributions, and the clarity of our paper to be significant strengths.

For this rebuttal, we have conducted additional experiments, ablations, and analyses to further address concerns and provide more insights. We are glad that SPA's novel approach to incorporating 3D spatial understanding, its comprehensive evaluation, and the potential impact on embodied AI were well-received. The paper has been updated according to the suggested revisions, and the changes are highlighted in orange.

We welcome any follow-up discussions and are excited to improve the paper further based on your feedback!

AC Meta-Review

The paper introduces SPA, a framework that integrates 3D spatial awareness into Vision Transformers (ViTs) for embodied AI tasks. SPA uses differentiable neural rendering on multi-view images to improve the ViT’s understanding of 3D spatial relationships. The authors evaluate SPA across 268 tasks in 8 simulators, showing that it outperforms over 10 state-of-the-art methods while using less training data. Real-world experiments further confirm its practical utility, highlighting the importance of 3D spatial awareness in embodied AI.

Strengths of the Paper:

-- The paper presents a large-scale evaluation of SPA across 268 tasks in diverse simulators, providing strong empirical evidence of its effectiveness in both single-task and multi-task scenarios.

-- By integrating neural rendering, SPA improves the ViT’s spatial understanding, which is crucial for embodied AI tasks involving 3D scenes.

-- The paper includes real-world experiments and a commitment to open-source the code, promoting further research in embodied AI.

Weaknesses of the Paper:

-- The paper focuses primarily on imitation learning, without exploring reinforcement learning or dynamic environments, which are critical for many real-world applications.

-- SPA only works with static multi-view images and doesn’t handle temporal dynamics. Additionally, the paper doesn’t isolate the impact of its key components, making it unclear if improvements are due to the 3D spatial features or other factors.

-- The evaluation lacks commonly used real-world benchmarks like object detection or vision-language tasks that could better demonstrate SPA’s broader applicability.

After carefully reading the paper, reviews and rebuttal discussions, the AC agrees with the majority of the reviewers on accepting the paper.

Additional Comments from Reviewer Discussion

The weaknesses are described above. The authors have addressed most comments in rebuttal and the reviewers generally agree to accept the paper.

Final Decision

Accept (Poster)