Unifying 2D and 3D Vision-Language Understanding
UniVLG unifies 2D and 3D vision-language tasks, transferring 2D knowledge to the 3D domain.
Abstract
Reviews and Discussion
The paper proposes UniVLG, a model that can be trained on both 2D and 3D vision-language data for both 2D and 3D tasks. Specifically, the model relies on pre-trained 2D image features and lifts 2D data to 3D to take advantage of large-scale 2D datasets. It also defines a mask decoding head which outperforms bounding box decoders. The method directly takes sensor-generated data as input, which is a more realistic evaluation setting than the mesh-reconstructed point clouds used by previous methods. Experiments on tasks including 3D referential grounding, 3D question answering, and 2D referential grounding demonstrate the effectiveness of this method.
Update after rebuttal
The rebuttal addressed my questions, so I raise my score.
Questions for Authors
NA
Claims and Evidence
The paper made the following claims:
- The method unifies 2D and 3D visual grounding, as illustrated by experiments on both 2D and 3D referential grounding tasks with the same model.
- It achieves state-of-the-art performance on both in-domain and out-of-domain 3D referential grounding datasets, which is shown by the experimental results.
- It proposes a language-conditioned 3D mask decoder, whose superior performance is shown in the ablation study (Table 5).
- It uses a realistic evaluation setting, which is also shown in experiments (Table 1 and Table 3).
Methods and Evaluation Criteria
The methods and evaluation make sense.
- For the method, there are way more 2D datasets than 3D ones, so it makes sense to utilize the vast body of 2D data for enabling 3D tasks. The network design also makes sense to me.
- The method is evaluated on several tasks including 3D and 2D referential grounding, 3D question answering and 3D instance segmentation using widely used benchmarks.
Theoretical Claims
No theoretical claim made.
Experimental Design and Analysis
The experimental designs and analyses make sense to me. It did experiments on three tasks in the main paper and one in the appendix. It also did a thorough ablation study to validate key designs. To make fair comparisons, it evaluated all methods using the same point clouds as well as retraining a subset of methods on sensor point clouds.
Supplementary Material
I reviewed the whole appendix. No further question.
Relation to Broader Literature
- The paper proposed a way to utilize the rich 2D data for 3D vision-language understanding tasks, which largely improves their performance and bridges the gap between 2D and 3D models.
- The paper proposed a unified architecture for both 2D and 3D tasks.
- There are several findings that could inspire future research, e.g. the comparison between non-parametric and parametric queries, and the importance of updating visual tokens during mask decoding.
Essential References Not Discussed
NA
Other Strengths and Weaknesses
- Strengths: please see the sections above.
- Weaknesses: (1) Qualitative results are currently included in the appendix (Figures 3 and 4). They should be included in the main paper. (2) Since the paper claims it has a more realistic embodied-aligned evaluation setting, it would be beneficial to include results on data without ground-truth pose and depth. Currently, there are experiments with noisy pose and depth on SR3D in the appendix, but I am more interested in real-life data rather than manually added Gaussian noise.
Other Comments or Suggestions
- Duplicate "performance" in L73.
- Duplicated text in L668-669.
Thank you for your review.
(1) Qualitative results are currently included in the appendix (Figures 3 and 4). They should be included in the main paper.
Thank you for the feedback. We agree that these should be included in the main paper and we will incorporate them using the additional page given for the camera-ready version.
(2) Since the paper claims it has a more realistic embodied-aligned evaluation setting, it would be beneficial to include results on data without ground-truth pose and depth. Currently, there are experiments with noisy pose and depth on SR3D in the appendix, but I am more interested in real-life data rather than manually added Gaussian noise.
We want to clarify that our experiments do not use "ground-truth" pose or depth. The SR3D, NR3D, and ScanRefer datasets rely on 3D scenes from ScanNet, which captures depth using a Kinect sensor and estimates camera poses with the widely used BundleFusion algorithm. Therefore, the depth and pose data in our experiments are already realistic and embodied-aligned, naturally containing real-world noise. The experiments in Appendix A.9 further explore the effects of adding additional noise beyond the existing real-world sensor noise.
We are happy to clarify any remaining concerns that you may have that might lead you to reconsider increasing your score. Thank you.
This paper proposes a unified architecture for 2D and 3D vision-language understanding. The method is based on Jain et al. 2024, where the additional innovations are in sharing all parameters between 2D and 3D instead of a subset, and in extending the application to referential grounding. The paper uses a number of existing SOTA methods in different parts of its pipeline, starting with a special finding that instead of freezing the visual features, updating them is very crucial for 3D referential grounding.
Questions for Authors
Can we know if there are any adverse effects of changing the visual features? What if the points from Wang et al. have noise? How does the method behave? Can we see videos of the segmentation in the supplementary material?
What are the failure cases? What is the future work? Why did MaskFormer have to be introduced? Lines 194-200 are a bit convoluted; please make them simpler and more illustrative. How does this method perform w.r.t. LLaVA-3D?
Claims and Evidence
The claims are about unified 2D-3D visual grounding, updating the visual features for improved referential grounding, and a language-conditioned mask decoder. The evidence is provided in the results section through quantitative and qualitative results.
Methods and Evaluation Criteria
The method is primarily adapted from Jain et al. 2024, with the updating of visual features and a language-conditioned mask decoder. The evaluations are done on out-of-domain referential grounding, 3D question answering, and 2D referential grounding. The incremental additions to the method seem to work.
Theoretical Claims
No theoretical claims
Experimental Design and Analysis
Yes, the experimental designs are good.
Supplementary Material
No supplementary material
Relation to Broader Literature
The contribution is incremental w.r.t the broader literature.
Essential References Not Discussed
Yes, the references are good.
Other Strengths and Weaknesses
The writing of the paper is good, the method is well positioned, and the results are validated. The paper tries to state where it innovates and the impact of the innovations. I like the way this paper is written.
Other Comments or Suggestions
Nothing.
Thank you for your review.
“special finding that instead of freezing the visual features, updating them is very crucial for 3D....”
We want to clarify a potential misunderstanding: one of our significant findings lies not in unfreezing the visual features but in allowing them to attend and propagate through the decoder (Tables 5 & 7). In contrast, prior mask decoders allow visual features to receive gradients yet prevent them from attending to object queries and language features. As we reiterate below, this is just one aspect of our contribution.
“Incremental over Jain et al., 2024 (ODIN); Incremental w.r.t the broader literature.”
We offer the following counterarguments and invite you to reconsider your position:
- UniVLG tackles referential grounding and question-answering, while ODIN focuses solely on object segmentation.
- UniVLG significantly modifies the mask decoder to better incorporate language information. Our design choices—updating visual features, using parametric queries, and adding a box loss—are crucial for referential grounding (Tables 5 & 7). Reviewer aKaE acknowledges this, stating that insights like “updating the visual feature” are novel contributions.
- We generate synthetic 3D data by lifting 2D RGB images into 3D pointmaps, sharing all parameters across 2D and 3D pathways rather than a subset. Reviewers aKaE, 5zKN, and SyhT recognize this as a unique strength or “novel.”
With these advancements, UniVLG outperforms ODIN by over 25% and LLaVA-3D by 9.4% on 3D language grounding, underscoring the importance of our design choices. None of these changes were trivial or obvious, and together they lead to significant performance improvements.
Broader Impact:
- UniVLG demonstrates that unified 2D-3D visual language models can enhance data-scarce 3D modalities without sacrificing 2D performance. The 3D VLM field is bottlenecked by a lack of large-scale, diverse datasets, and we believe that showing ways of using 2D datasets and pre-trained weights for VLMs is an important contribution. While ODIN also tried this for the 3D segmentation task, its strategy of skipping 3D layers for 2D data leads to suboptimal performance (Table 6) and near-zero 2D-to-3D generalization (Table 12). We not only show more diverse results than ODIN, but also improve sharing between the 2D and 3D modalities further and benchmark all our methods on more realistic setups, a setting that is “largely overlooked by prior studies” (Reviewers aKaE, a8kZ) in the 3D vision-grounding literature. As Reviewer SyHt notes, “There are several findings that could inspire future research, e.g. the comparison between using non-parametric and parametric queries, and the essence of visual tokens updating during mask decoding.”
- UniVLG achieves SOTA results in 3D referential grounding, a highly active field with dozens of papers published annually. It surpasses prior work by over 10%. If our approach were merely an incremental change over ODIN, a similar baseline would already exist. Furthermore, if a seemingly “incremental” modification yields substantial improvements—especially in more realistic settings—we argue that it is highly relevant to the community and is worthy of dissemination.
Thank you for your other questions! Our paper and supplementary file address almost all of them. It seems you may have missed our supplementary material, which starts on Page 13 of the main PDF. Here are the relevant sections for your convenience:
"Adverse effect of changing the visual features with noisy inputs?"
See Section A.9 (Page 15) and Figure 6 (Page 19), where we experiment with varying noise levels in camera pose and depth maps. Our method remains robust to noisy inputs.
“Videos of segmentation in suppl.?”
Extensive qualitative visualizations are available in Figure 3 (Page 16), Figure 4, and Figure 5 (Page 17) of the appendix.
“Failure cases?”
See Section A.8 (Page 15) for a detailed discussion and visualizations of failure modes.
“Future work?”
- We currently train with significantly less 2D data than SOTA 2D VLMs. A natural next step is scaling with more 2D data and studying its impact.
- Our method is designed for static 3D scenes; extending it to dynamic 3D environments is an important future direction.
We will include a dedicated future work section in the camera-ready version.
“Why did MaskFormer have to be introduced?”
We built our mask decoder head on Mask2Former, as we find that it outperforms box-decoding heads (Tables 5 & 7). Additionally, mask decoding unifies the output space across 2D and 3D via per-pixel segmentation masks.
"Lines 194-200 are convoluted. Please simplify."
Thank you for the feedback. We will revise this for clarity.
"How this method perform w.r.t LLava-3D?"
As shown in Table 1, UniVLG outperforms LLaVA-3D by 9.4% on ScanRefer, despite LLaVA-3D using mesh point clouds while UniVLG relies on sensor point clouds.
We will move some of these to the main paper.
Many of the questions are answered in the rebuttal, and hence I am raising my score. Thank you!
This paper presents UniVLG, a unified vision-language model designed to bridge the gap between 2D and 3D vision-language understanding in embodied AI systems. Given the scarcity of well-annotated 3D datasets, UniVLG explores the transfer of vision-language knowledge from well-curated 2D data to enhance 3D reasoning. The model leverages pre-trained 2D VLMs and is trained on a diverse set of 2D and 3D vision-language tasks, allowing effective cross-modal learning. UniVLG processes 2D images or RGB-D inputs during training and inference, eliminating the need for explicit 3D mesh reconstructions. This design makes it more aligned with realistic embodied AI applications, where direct sensor data is often the primary input.
Questions for Authors
I have no questions for the authors.
Claims and Evidence
The paper focuses on unifying 2D and 3D vision-language understanding by leveraging 2D-to-3D lifting strategies to enhance 3D reasoning. UniVLG achieves SOTA performance on multiple 3D referential grounding and question-answering benchmarks. Comprehensive experiments support the claims and demonstrate the effectiveness of transferring 2D vision-language knowledge to 3D tasks.
Methods and Evaluation Criteria
The methods and evaluation criteria used in this paper are well-designed and appropriate for the problem setting.
Theoretical Claims
The paper does not make explicit theoretical claims or include formal proofs. Its contributions are primarily focused on model design and performance improvements through practical innovations such as the 2D-to-3D lifting strategy and the language-conditioned mask decoder.
Experimental Design and Analysis
The paper demonstrates strong performance across a range of 3D understanding tasks, including referential grounding and question answering. While the model’s 2D performance is not degraded, the discussion on 2D results is relatively limited.
Supplementary Material
I reviewed the additional experiments in the appendix, which further demonstrate the effectiveness of the proposed method.
Relation to Broader Literature
The paper builds upon recent advances in vision-language models and leverages DINO features for visual encoding and MoGe for generating point maps from 2D images. Additionally, its approach of using 2D-to-3D lifting strategies effectively bridges the gap between 2D and 3D vision understanding. By combining these ideas, UniVLG offers a unified framework that advances both 3D referential grounding and 3D question-answering tasks. Compared to previous methods, UniVLG eliminates the reliance on 3D mesh reconstructions, instead utilizing sensor data, making it better aligned with realistic embodied AI scenarios.
Essential References Not Discussed
The citations in this paper are primarily limited to point clouds and NeRF (e.g., Panoptic-Lifting). However, relevant works on 3D Gaussian Splatting (3DGS) have been omitted, such as GOI [1] which leverages 2D RES models to achieve 3D RES.
[1] GOI: Find 3D Gaussians of Interest with an Optimizable Open-Vocabulary Semantic-Space Hyperplane
Other Strengths and Weaknesses
Strengths:
- This paper presents UniVLG, a unified model capable of handling both 2D and 3D vision-language tasks, promoting seamless integration across modalities.
- The paper employs 2D-to-3D lifting strategies to enhance 3D reasoning, improving adaptability to real-world embodied AI scenarios that rely on sensor data rather than 3D mesh reconstructions.
- UniVLG achieves SOTA results on multiple 3D referential grounding and 3D question-answering benchmarks while maintaining non-degraded 2D performance.
Weaknesses:
- In Table 4, the comparison between the 2D-3D and 2D-only settings shows no performance degradation in the 2D-3D setting. However, since the baselines considered are primarily from work published before 2024, this raises the question of whether 2D referential grounding performance has reached its upper limit.
- As shown in Appendix A.11 (2D-3D Generalization Test), the model does not appear to fully utilize the 2D knowledge, which may limit its potential to leverage broader 2D datasets for future improvements.
Other Comments or Suggestions
I have no additional comments or suggestions.
Thank you for your feedback.
“The citations in this paper are primarily limited to point clouds and NeRF (e.g., Panoptic-Lifting). However, relevant works on 3D Gaussian Splatting (3DGS) have been omitted, such as GOI [1] which leverages 2D RES models to achieve 3D RES.”
Thank you for this suggestion, we will add this to our related work.
“In Table 4, the comparison between the 2D-3D and 2D-only settings shows no performance degradation in the 2D-3D setting. However, since the baselines considered are primarily from work published before 2024, this raises the question of whether 2D referential grounding performance has reached its upper limit.”
This is correct: we chose these baselines since other recent 2D grounding models train on an order of magnitude more data than the 2D datasets we used. Certainly, scaling with more 2D data is a direct avenue of future work. The main point of this experiment was to show that, indeed, we can build a unified 2D-3D visual grounding model which benefits the data-scarce 3D modality without sacrificing performance on 2D datasets.
“As shown in Appendix A.11 (2D-3D Generalization Test), the model does not appear to fully utilize the 2D knowledge, which may limit its potential to leverage broader 2D datasets for future improvements.”
Absolutely! Our intention with this experiment was to explicitly demonstrate this generalization gap and inspire future research on achieving stronger 2D-3D generalization. Further improving this generalization will be crucial to truly utilize 2D datasets for 3D tasks.
Happy to clarify any concerns that might come up and help increase your score further.
This paper presents a novel model called UniVLG for 3D vision-language tasks, including 3D visual grounding and 3D question-answering. By leveraging 2D visual grounding datasets, the model gains additional benefits, and the authors provide several empirical findings on improving performance—such as updating visual features. The proposed model demonstrates strong results on existing benchmarks.
Questions for Authors
NA
Claims and Evidence
Most claims made in the submission are supported by clear and convincing evidence.
Methods and Evaluation Criteria
- The method section focuses mainly on 3D-based visual grounding but provides limited details on the 2D-based visual grounding, making that aspect unclear.
- The authors mention a 2D–3D lifting with a 50% probability. What is the rationale for this design choice, and what purpose does it serve?
Theoretical Claims
NA.
Experimental Design and Analysis
- In Table 1, the GT accuracies for UniVLG on Sr3D and Nr3D are not provided.
- For the DET and GT setups, does UniVLG differ only in how the predicted bounding box is chosen, with the inputs and inference pipelines remaining unchanged between these two settings?
- The paper is titled “Unifying 2D and 3D Vision-Language Understanding,” yet the method design and experimental results focus predominantly on 3D vision-language understanding. The 2D data is used mainly to boost 3D performance, as shown by Table 4, where 3D data does not improve 2D performance. This suggests that the title may be overreaching.
- A key contribution is the ability to leverage 2D data for joint training. To better illustrate this, it would be helpful to show the scaling effects of adding more 2D data—namely, how larger quantities of 2D data incrementally improve performance.
Supplementary Material
I only reviewed the "Additional Implementation details" section in the supplementary material.
Relation to Broader Literature
- The joint incorporation of 2D and 3D with 2D–3D lifting is novel.
- Although the decoder follows approaches in previous works such as Mask2Former and “Grounded 3D-LLM with Referent Tokens” (which should be included in the related work), the empirical insights—for example, “updating the visual feature”—are new contributions.
- The results are impressive, surpassing prior state-of-the-art performances.
- The focus on projected point clouds instead of mesh point clouds is significant and has been largely overlooked by earlier studies.
Essential References Not Discussed
- "Grounded 3D-LLM with Referent Tokens" (https://arxiv.org/abs/2405.10370)
Other Strengths and Weaknesses
Overall, this work is impressive and appears to be a valuable contribution to the field. However, there is still room for further improvement. Please refer to the previous sections for more details.
Other Comments or Suggestions
NA
Thank you for your review. We try to address your concerns below:
“The method section focuses mainly on 3D-based visual grounding but provides limited details on the 2D-based visual grounding, making that aspect unclear.”
The method is identical between 2D and 3D visual grounding. The model takes as input a language query, N RGB images of shape N × H × W × 3, and an associated 3D pointmap of shape N × H × W × 3. For 2D datasets, we have N=1 frame and we obtain the pointmap using neural 2D-to-3D lifting modules. For 3D datasets, we obtain the pointmap by unprojecting the RGB-D images with the camera parameters. The output consists of segmentation masks for each object mentioned in the sentence, a corresponding text span that refers to each segmented object, and optionally, generated text that answers the question. The segmentation mask shares the same representation between 2D and 3D – it is obtained as a K × M output mask, where K is the number of objects and M is the spatial dimension of the feature map. For a single RGB image, we flatten the 2D feature map, resulting in M = H * W; in the multi-view (3D) case, we simply include all N frames, giving M = N * H * W. All layers and losses are shared between the two.
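For concreteness, here is a minimal, hypothetical shape sketch of this unified mask representation (the function names and dimensions below are our own toy illustration, not the released code):

```python
import torch

# Illustrative sketch only: a unified K x M mask head over flattened visual tokens.
def unified_mask_logits(visual_feats, object_queries):
    # visual_feats: (N, H, W, C) feature maps from N views
    #   (N = 1 for a 2D sample, N > 1 for multi-view RGB-D).
    # object_queries: (K, C), one query per object mentioned in the sentence.
    N, H, W, C = visual_feats.shape
    tokens = visual_feats.reshape(N * H * W, C)       # M = N * H * W spatial tokens
    return object_queries @ tokens.T                  # (K, M) mask logits, same form in 2D and 3D

queries = torch.randn(8, 256)
feats_2d = torch.randn(1, 32, 32, 256)                # single lifted RGB image
feats_3d = torch.randn(5, 32, 32, 256)                # five unprojected RGB-D frames
print(unified_mask_logits(feats_2d, queries).shape)   # torch.Size([8, 1024])
print(unified_mask_logits(feats_3d, queries).shape)   # torch.Size([8, 5120])
```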
Let us know if this makes sense; we will make it clearer in our camera-ready version.
“The authors mention a 2D–3D lifting with a 50% probability. What is the rationale for this design choice, and what purpose does it serve?”
The 2D-to-3D lifting can be imperfect, and although it helps in 3D referential grounding tasks (like on ScanNet datasets), the noise from this lifting can hurt the 2D-only grounding performance. Thus we feed the original 2D image 50% of the time, so that at test time, the model can perform better on 2D referential grounding. When we do not do the 2D-to-3D lifting, we simply skip all 3D layers, and only pass the images through the 2D layers.
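As a rough illustration, this choice amounts to a per-sample coin flip during training, as in the following sketch (hypothetical names; not our exact implementation):

```python
import random

def prepare_2d_sample(image, lift_to_pointmap, lift_prob=0.5):
    # With probability lift_prob, lift the 2D image to a pointmap and route it
    # through both the 2D and 3D layers; otherwise skip all 3D layers.
    if random.random() < lift_prob:
        return image, lift_to_pointmap(image)   # noisy but 3D-aware pathway
    return image, None                          # clean 2D-only pathway
```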
“In Table 1, the GT accuracies for UniVLG on Sr3D and Nr3D are not provided.”
The GT setup is not our main focus as it makes an unrealistic assumption of provided GT 3D boxes - we only include it for completeness as several prior methods report this number. We only train the UniVLG-3D-only model on the GT setup because in the joint setup we want to use 2D data, but using GT boxes is uncommon for 2D datasets — and we didn’t want to expend compute on that.
“For the DET and GT setups, does UniVLG differ only in how the predicted bounding box is chosen, with the inputs and inference pipelines remaining unchanged between these two settings?”
Not quite. In the DET setup, UniVLG decodes a segmentation mask from scratch – it does not select a box out of a set of proposals. In the GT setup, we assume access to GT masks as input: we pool visual features inside the given ground-truth masks, and the object queries predict a segmentation mask over the “pooled” feature tokens, one token per object. Thus, the GT setup assumes GT masks as input, while the DET setup does not assume any extra input (apart from RGB-D frames, camera parameters, and language).
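A minimal sketch of the GT-setup pooling described above (simplified, with assumed names; mask-weighted averaging is one natural reading of “pooling”):

```python
import torch

def pool_features_in_masks(feats, gt_masks):
    # feats: (M, C) flattened visual tokens; gt_masks: (K, M) binary ground-truth masks.
    weights = gt_masks.float()
    weights = weights / weights.sum(dim=1, keepdim=True).clamp(min=1)
    return weights @ feats   # (K, C): one pooled feature token per ground-truth object
```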
“The paper is titled “Unifying 2D and 3D Vision-Language Understanding,” yet the method design and experimental results focus predominantly on 3D vision-language understanding. The 2D data is used mainly to boost 3D performance, as shown by Table 4, where 3D data does not improve 2D performance. This suggests that the title may be overreaching.”
You’re right that the focus is on how to use 2D to help 3D performance, but one of the key ideas of this paper is: to achieve this goal of helping 3D performance with 2D data, unifying the model design is a promising direction. The design of UniVLG emphasizes jointly training on both 2D and 3D data, sharing all model parameters and losses, and as we show in Table 4, it retains its 2D performance while helping 3D performance.
“A key contribution is the ability to leverage 2D data for joint training. To better illustrate this, it would be helpful to show the scaling effects of adding more 2D data—namely, how larger quantities of 2D data incrementally improve performance.”
We agree, and this is an important direction for future work, in addition to scaling the total amount of 2D data further. However, we did not have the available compute resources to perform this additional experiment.
“‘Grounded 3D-LLM with Referent Tokens’ should be included in the related work”
We agree, thanks for pointing it out.
Happy to clarify any concerns that might come up and help increase your score further.
I appreciate the authors’ rebuttal, which addresses most of my concerns. However, I still strongly recommend reconsidering the title “Unifying 2D and 3D Vision-Language Understanding.” The primary focus of this work is clearly on 3D understanding, and the current title overstates the scope. Such overclaims may set a bad precedent for the field.
The paper presents UniVLG, a unified vision-language model that bridges the gap between 2D and 3D vision-language understanding. Reviewers acknowledged the novelty of the model and its strong performance on 3D benchmarks. Initial concerns centered around the clarity of the design choices and the limited discussion of 2D results. The rebuttal successfully addressed these issues. After the rebuttal, the paper received three Accepts and one Weak Accept. The AC therefore recommends acceptance.