OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling
Abstract
Reviews and Discussion
The authors propose a unified one-tower referring framework that introduces the MRefM paradigm to capture the referential relationships between vision and text. They demonstrate the effectiveness and generality of MRefM across three settings on REC, PG, and RES tasks, consistently achieving good results.
Strengths
The proposed method is structurally simpler than previous methods and achieves higher grounding and RES performance.
The ablation experiments are thorough, validating the effectiveness of the two Ref mask modeling strategies.
Weaknesses
Adapting BEiT, a robust general-purpose multimodal model, to downstream grounding tasks and achieving performance improvement does not seem to be a particularly interesting finding. Additionally, it is puzzling why introducing two types of relation scores in MIM and MLM would enhance the model's referring performance. Especially regarding the four types of masks in the Visual target-relation score, their effective mechanism is not intuitive and lacks explanation. Why did the authors decide to introduce referring information by predicting these four scores? Are there any reference works?
In Table 3, comparisons with some related works, such as UNINEXT [1] and HIPIE [2], are missing.
[1] Universal instance perception as object discovery and retrieval. CVPR 2023.
[2] Hierarchical Open-vocabulary Universal Image Segmentation. NeurIPS 2023.
Questions
See weaknesses.
Limitations
Yes.
We sincerely thank the reviewer for the thoughtful feedback. Please find below our responses to the questions raised in the review.
Firstly, due to the character limit in the reply box, we have included the figure and experimental tables in the PDF file at the top of this page. Please click on the file to view the corresponding figure and tables.
Q1. Adapting BEiT, a robust general-purpose multimodal model, to downstream grounding tasks and achieving performance improvement does not seem to be a particularly interesting finding.
We would like to highlight the value of our work from two perspectives.
- Firstly, although BEiT-3 is a general-purpose foundation model trained with the Mask Visual Language Modeling (MVLM) paradigm, it does not perform well on grounding and referring tasks. Our work is the pioneering attempt to implement referring tasks based on the BEiT-3 model. We propose the one-tower UniRef framework, which significantly improves fine-grained cross-modal perception compared to the original BEiT-3 model, and it offers a new direction for cross-modal perception in the grounding field.
- Secondly, our aim is not only to transfer the BEiT-3 model to the grounding task but also to propose a novel paradigm with a certain generality. Specifically, while the existing MVLM is a general pre-training paradigm, it can only learn coarse-grained visual and linguistic knowledge. However, fine-grained cross-modal referring relations are common in our lives, and such a paradigm cannot model them; no previous work has explored them either. Therefore, our aim is to learn these subtle referring relations with the help of the general mask modeling paradigm. As shown in the experimental results of Tab. R2 and Tab. R3, our approach has also been proven effective for VQA and cross-modal retrieval tasks. Our work represents a novel attempt that we believe can provide valuable insights for future research.
Q2. Additionally, it is puzzling why introducing two types of relation scores in MIM and MLM would enhance the model's referring performance.
We explain the intrinsic mechanism of the two relation scores from the perspective of MIM and MLM, respectively.
- In the current MIM paradigm, reconstruction is limited to relying solely on the visual features within the image. To enhance content reconstruction by leveraging cross-modal information as much as possible, our Referring MIM incorporates visual target-relation scores alongside the visual modality content during reconstruction. This modeling approach is more difficult because it requires relying on textual information to reconstruct the two more complex visual branches. Consequently, our model achieves a more comprehensive understanding of both visual and textual information: it not only perceives the information of the image modality itself but also gains a more accurate understanding of the location and correlation of the key object features in different regions.
- Similarly, the Referring MLM also aims to enhance the model's global comprehension and reasoning capabilities over both visual and textual information. Specifically, existing MLM methods rely solely on contextual information within the text to reconstruct masked words. Beyond leveraging image modality information to restore the masked words as much as possible, we also guide the model to identify the specific words within the grounding text that require special attention during the referring process. The learning process of the semantic target-relation score bears resemblance to knowledge distillation. In this way, the model acquires a more comprehensive understanding of the referring text.
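To make the two objectives concrete, below is a minimal, hypothetical sketch of how such a Referring MLM loss could combine masked-word reconstruction with a semantic target-relation score. The function name, head interfaces, loss forms, and weighting are illustrative assumptions for this rebuttal, not our exact implementation:

```python
import torch
import torch.nn.functional as F

def referring_mlm_loss(token_feats, mask_pos, word_labels, relation_target,
                       vocab_head, relation_head, alpha=1.0):
    """Hypothetical Referring MLM objective (a sketch, not the exact paper code).

    token_feats:     (B, L, D) backbone features of the masked text tokens.
    mask_pos:        (B, L) bool tensor marking the masked word positions.
    word_labels:     (B, L) vocabulary ids of the original (masked) words.
    relation_target: (B, L) soft scores indicating how strongly each word
                     relates to the referred target (e.g., distilled labels).
    vocab_head / relation_head: e.g., nn.Linear(D, V) and nn.Linear(D, 1).
    """
    feats = token_feats[mask_pos]                      # (N, D) masked tokens only
    # (1) Standard MLM: reconstruct the masked words from cross-modal context.
    word_logits = vocab_head(feats)                    # (N, V)
    loss_mlm = F.cross_entropy(word_logits, word_labels[mask_pos])
    # (2) Semantic target-relation score: a distillation-style regression that
    # teaches the model which words matter most for the referring process.
    pred_rel = relation_head(feats).squeeze(-1)        # (N,)
    loss_rel = F.smooth_l1_loss(pred_rel, relation_target[mask_pos])
    return loss_mlm + alpha * loss_rel
```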
Q3. Especially regarding the four types of masks in the Visual target-relation score, their effective mechanism is not intuitive and lacks explanation. Why did the authors decide to introduce referring information by predicting these four scores? Are there any reference works?
- The response to this issue is placed in the global rebuttal field. Please refer to Common Question 1 above on this page.
Q4. In Table 3, comparisons with some related works, such as UNINEXT [1] and HIPIE [2], are missing.
As shown in Table R4, we compare our RES results with UNINEXT [1] and HIPIE [2] under the oIoU metric, and we will include these results in Tables 2 and 3 of the revised paper.
Besides, both works belong to the setting of multi-task, multi-dataset mixed training. Compared with these two works, our base model exceeds UNINEXT by 2.79% (testA), 5.15% (testA), and 2.87% (test) on the REC task over the three RefCOCO/+/g datasets, respectively. On the RES task, our base model surpasses UNINEXT by 1.42% (val), 3.85% (val), and 2.51% (val), and surpasses HIPIE by 1.02% (val), 3.85% (val), and 2.75% (val) on the same three datasets. It should be noted that while our model only utilizes RefC data for intermediate pre-training, both UNINEXT and HIPIE employ additional datasets as well as multi-task pre-training. Despite this distinction, our model already demonstrates superior performance compared to theirs.
We thank the reviewers for their efforts. As the discussion phase is nearing its end, we kindly remind reviewer SMKQ to reply to our responses if convenient. We sincerely thank reviewer SMKQ for the positive rating of our paper. Currently, no new questions have been raised by the reviewer, so we sincerely hope that our reply has successfully addressed his/her concerns. If not, we would sincerely appreciate any further comments that could improve the quality of our paper. We will incorporate all suggestions from this rebuttal to comprehensively revise our paper to a higher standard. Lastly, we thank reviewer SMKQ once again for the time and effort spent on reviewing our work. Thank you!
This paper proposes a Mask Referring Modeling (MRefM) paradigm and a unified, extremely concise grounding and referring segmentation framework named UniRef that no longer requires Transformer-based fusion or interaction modules or special grounding tokens. MRefM is proposed to model the referential relationship and encompasses referring-aware mask image modeling and referring-aware mask language modeling.
Strengths
A mask referring modeling paradigm is proposed to effectively model the referential relation between vision and language. The authors also propose a one-tower framework for grounding and referring segmentation in a unified modality-shared feature space. Experiments demonstrate the effectiveness of this method and its components.
Weaknesses
- Lack of discussion of related work [a]. I suggest discussing the difference between this paper's referring-aware mask language modeling and masked contrastive learning in [a].
- Lack of computation costs analysis with other methods, including parameters/FLOPs/speed.
[a] VLT: Vision-Language Transformer and Query Generation for Referring Segmentation. In TPAMI 2023.
Questions
See weakness.
Limitations
Yes.
We sincerely thank the reviewer for the thoughtful feedback. Please find below our responses to the questions raised in the review.
Q1. Lack of discussion of related work [a]. I suggest discussing the difference between this paper's referring-aware mask language modeling and masked contrastive learning in [a].
We will include a comparative discussion between our Referring-aware Mask Language Modeling (Referring MLM) approach and the Masked Contrastive Learning (MCL) approach proposed in VLT [a] in Sec. 2.2 and 3.3 of the revised paper. We briefly present the discussion of the differences during this rebuttal stage below:
- Firstly, the MCL proposed in VLT [a] constructs contrastive learning over three different types of referring samples in the training batch (i.e., same image same object (SISO), same image different object (SIDO), and different image (DI)). The aim is to pull the features of SISO samples as close as possible and push the features of SIDO samples as far apart as possible. At the same time, MCL randomly masks prominent words in the query text of SISO samples in a probability-guided manner to construct new positive samples, thereby increasing sample discrimination and model generalization. MCL to some extent proves that text masking is effective in referring tasks.
- Secondly, our Referring MLM masks the query text using a referring-aware text masking strategy and reconstructs both the linguistic content and the semantic target-relation score of the masked words. Our approach therefore differs from MCL: while both employ text masking, our proposed Referring MLM requires not only text masking but also the reconstruction of the textual content and the target-relation probability distribution of the masked text. It is thus a relatively more comprehensive approach to mask modeling.
Q2. Lack of computation costs analysis with other methods, including parameters/FLOPs/speed.
We compare the computational efficiency of our model with several well-known state-of-the-art works on the REC task from various perspectives, including the number of parameters, computational complexity (FLOPs), inference speed (FPS), and test time (s). The results are presented in Table R1, which can be found in the rebuttal PDF file located at the top of this page.
In this paper, we highlight two significant advantages of our model architecture over other frameworks: (a) instead of using a Transformer to fuse visual and language features, we only employ a simple, lightweight task head; (b) our one-tower architecture eliminates the need for early interaction techniques in the backbone network, thereby reducing the computational complexity of the model.
As can be seen from Table R1, due to the simplicity of our model's structure, both the number of parameters and the computational complexity are significantly lower than those of other well-known models. Specifically, our feature fusion and grounding head module only requires 1.7M parameters, while other methods use 20M, meaning we only have about 8.5% of their parameter count. Additionally, our computation is only 34.9% of Grounding-DINO's and 25.2% of MDETR's. Moreover, our inference speed is 10 times faster than Grounding-DINO and TransVG++ (the speed is also related to the image size used by each model). Despite this simplification, thanks to the modality-shared feature space, we outperform all of these well-known works. We will include this experiment in the revised version.
We thank the reviewers for their efforts. As the discussion phase is nearing its end, we kindly remind reviewer vP75 to reply to our responses if convenient. We sincerely thank reviewer vP75 for the positive rating of our paper. Currently, no new questions have been raised by the reviewer, so we sincerely hope that our reply has successfully addressed his/her concerns. If not, we would sincerely appreciate any further comments that could improve the quality of our paper. We will incorporate all suggestions from this rebuttal to comprehensively revise our paper to a higher standard. Lastly, we thank reviewer vP75 once again for the time and effort spent on reviewing our work. Thank you!
This manuscript proposes UniRef, a framework aimed at unifying the visual and linguistic feature spaces for referring expression comprehension and segmentation. The key contribution is the Masked Referring Modeling (MRefM) paradigm, which includes referring-aware MIM and MLM. This approach seeks to streamline the architecture by eliminating the need for separate modality-specific encoders and complex interaction modules, achieving state-of-the-art performance on several datasets.
Strengths
- The introduction of the MRefM paradigm effectively captures the referential relationship between visual and linguistic features, contributing to the robustness and accuracy of the model.
- The integration of visual and linguistic feature spaces into a unified framework simplifies the model architecture and potentially improves computational efficiency.
- The authors provide extensive experimental results across five datasets, demonstrating the effectiveness of the proposed approach in outperforming existing methods.
Weaknesses
- The technical contribution of the proposed method appears to be insufficient, as it primarily adapts traditional Masked Autoencoders into a 'Referring MAE', despite the rich experimentation presented.
- Section 3.2 is intricate and challenging to follow. I recommend a thorough proofreading and revision of this section to enhance its clarity and readability.
- Could you explain the rationale behind designing the system to encompass four masks: x-, y-, w-, and h-masks?
- The manuscript asserts that the approach is lightweight, raising questions about its real-time applicability. It would be helpful if the authors detailed the computational demands and processing speed of the method in practical scenarios.
- (Minor) While the manuscript has discussed the limitations of the proposed approach, I am interested in understanding how MRefM performs in more general contexts. For instance, evaluating the backbone with linear probing (or its integration with LLMs) could demonstrate broader applicability. This should certainly be considered for future work.
Questions
See weakness.
Limitations
None.
We sincerely thank the reviewer for the thoughtful feedback. Please find below our responses to the questions raised in the review:
Firstly, due to the character limit in the reply box, we have included the figure and experimental tables in the PDF file at the top of this page. Please click on the file to view the corresponding figure and tables.
Q1. The technical contribution of the proposed method appears to be insufficient, as it primarily adapts traditional Masked Autoencoders into a 'Referring MAE', despite the rich experimentation presented.
We would like to highlight our technical contributions from the following points, while explaining the benefits resulting from our approach.
- Firstly, to address the limitation of the previous MIM paradigm in capturing cross-modal referential relations, we propose the Referring-aware MIM paradigm, which enhances the backbone network's ability to comprehend fine-grained cross-modal information.
- Secondly, we propose a referring-aware dynamic image masking strategy that improves upon the previous random image masking strategy and effectively directs the model's attention to the referred region (a minimal illustrative sketch is given below).
- Thirdly, we propose a referring-aware MLM paradigm that not only empowers the existing backbone model to reconstruct masked words based on contextual information but also enhances the model's attention and weighting towards crucial referring words.
- Fourthly, based on the proposed MRefM paradigm, we design two lightweight task heads and introduce a remarkably simplified one-tower grounding and referring segmentation framework, thereby obviating the need for complex cross-modal interaction techniques and cumbersome fusion codec modules. This offers a new solution for the grounding field.
Existing methods such as MAE reconstruct the image based on its own context and can only learn uni-modal representations. With our proposed method, we are able to learn more general cross-modal representations and thus enhance the model's global reasoning ability.
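As a concrete illustration of the referring-aware dynamic image masking mentioned above, the following is a minimal sketch of one plausible policy, assuming that patches inside the referred box are masked with a higher ratio than background patches. The function name, masking ratios, grid size, and box format are illustrative assumptions rather than the exact strategy used in our paper:

```python
import torch

def referring_aware_mask(ref_box, grid=14, ratio_in=0.75, ratio_out=0.25):
    """Hypothetical referring-aware dynamic masking policy (a sketch under our
    own assumptions): patches whose centers fall inside the referred box are
    masked with a higher probability than background patches, steering
    reconstruction toward the referred region.
    """
    cx, cy, w, h = ref_box                       # referred box, normalized xywh
    # Patch centers on a regular grid, normalized to [0, 1].
    coords = (torch.arange(grid, dtype=torch.float32) + 0.5) / grid
    py, px = torch.meshgrid(coords, coords, indexing="ij")
    inside = (px - cx).abs() <= w / 2
    inside &= (py - cy).abs() <= h / 2
    # Assign a per-patch masking probability and sample the binary mask.
    prob = torch.where(inside, torch.tensor(ratio_in), torch.tensor(ratio_out))
    return torch.bernoulli(prob).bool()          # (grid, grid), True = masked

mask = referring_aware_mask((0.62, 0.40, 0.25, 0.30))
print(mask.float().mean())                       # overall mask ratio
```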
Q2. Section 3.2 is intricate and challenging to follow. I recommend a thorough proofreading and revision of this section to enhance its clarity and readability.
Upon receiving this comment, we conducted a thorough proofreading of Sec. 3.2 and identified some challenging aspects of its writing. Below are several points that may cause confusion; we will update them in the revised version.
- Paragraph 1 in Sec. 3.2: the explanation of Referring MIM's motivation is insufficient. Explanation: this issue is addressed in Q3, point 1.
- Paragraph 2 in Sec. 3.2: the effectiveness mechanism of the four x-, y-, w-, and h-masks needs to be further explained. Explanation: this issue is addressed in Q3, point 2. To facilitate the explanation, we further draw Figure R1 based on Figure 2 of the main text, which will also be included in Sec. 3.2 of the revised version.
- Paragraph 4 in Sec. 3.2: this part lacks the global execution logic of the 'referring-aware dynamic image masking strategy'. Explanation: we have revised the writing logic of this paragraph.
We hope the above-mentioned points meet the reviewer's requirements.
Q3. Could you explain the rationale behind designing the system to encompass four masks: x-, y-, w-, and h-masks?
- The response to this issue is placed in the global rebuttal field. Please refer to Common Question 1 above on this page.
Q4. It would be helpful if the authors detailed the computational demands and processing speed of the method in practical scenarios.
We analyze the computational efficiency gains of our model in Tab. R1. As can be seen from the table, due to the simplicity of our model's structure, both the number of parameters and the computational complexity are significantly lower than those of other well-known models. Specifically, our modality fusion and grounding head module only requires 1.7M parameters, while other methods use 20M, meaning we only have about 8.5% of their parameter count. Additionally, our computation is only 34.9% of Grounding-DINO's and 25.2% of MDETR's. Moreover, our inference speed is 10 times faster than Grounding-DINO and TransVG++. Despite the reduction in computational requirements, we already outperform all of these well-known works.
Q5. (Minor) While the manuscript has discussed the limitations of the proposed approach, I am interested in understanding how MRefM performs in more general contexts.
To verify the generality of MRefM, as shown in Tab. R2 and Tab. R3, we followed the experimental framework of BEiT-3 and conducted VQA fine-tuning experiments (Tab. R2) as well as cross-modal retrieval experiments (Tab. R3) on the MS COCO and Flickr30K datasets.
Our proposed MRefM is a fine-grained multi-modal pre-training paradigm that significantly enhances cross-modal tasks involving logical reasoning and referring. As shown in the tables, our MRefM pre-training also leads to considerable improvements on both VQA and retrieval tasks, demonstrating the effectiveness of MRefM in cross-modal representation learning. Additionally, integrating MRefM into LLMs is an interesting direction, and we will attempt corresponding studies in future work.
Thank you for your response. After reviewing other reviews and responses, I have decided to increase my rating to 6. I hope you can carefully integrate the rebuttals into the revised version and thoroughly proofread the entire manuscript.
We sincerely thank the reviewer a3fR for the positive feedback and the increased rating of our paper! We would like to express our deep gratitude for the reviewer's valuable comments provided in this review, and we will incorporate all suggestions during this rebuttal to comprehensively revise our paper to achieve a higher standard and quality. Lastly, we thank the reviewer once again for the time and effort spent on reviewing our work! Thank you!
Dear reviewers, area chairs, senior area chairs, and program chairs,
We sincerely thank the reviewers for their valuable and thoughtful comments. It is a pleasure that this work has been recognized by the three reviewers, including "the MRefM paradigm effectively captures the referential relationship", "a simple yet effective architecture", "the experiments are thorough", etc. The common concerns among the three reviewers are: (a) the mechanism behind the effectiveness of the visual target-relation score is not clearly explained; (b) a lack of computational efficiency comparison experiments; (c) the need to include several relevant references. To address these concerns, we have provided detailed point-by-point explanations within each reviewer's response box and have also included the relevant experiments, with the results presented in the rebuttal PDF file. We look forward to a better appreciation of this manuscript, which incorporates our great efforts. Furthermore, the manuscript has been carefully revised according to the reviewers' suggestions.
The following are our detailed responses. We greatly appreciate the constructive suggestions, which have significantly helped improve the quality of our paper.
Best regards,
The authors.
Due to the character limit in the response field, we address the first commonly asked question below.
Common Question 1: The mechanism for the effectiveness of the visual target-relation score is not clearly explained.
- (Reviewer a3fR) Could you explain the rationale behind designing the system to encompass four masks: x-, y-, w-, and h-masks?
- (Reviewer SMKQ) Regarding the four types of masks in the visual target-relation score, their effective mechanism is not intuitive and lacks explanation. Why did the authors decide to introduce referring information by predicting these four scores? Are there any reference works?
In order to provide a clearer explanation, we would like to address the rationale behind the visual target-relation score (i.e., the four x-, y-, w-, and h-masks) designed in Sec. 3.2 of the main text from two perspectives. Specifically:
- Point 1: The purpose of designing the Referring MIM algorithm.
In the existing MIM paradigm, reconstruction relies solely on the visual features within the image. To leverage cross-modal information as much as possible during content reconstruction, our Referring MIM incorporates visual target-relation scores alongside the visual modality content during reconstruction. This modeling approach is more difficult because it requires relying on textual information to reconstruct the two visual branches. Consequently, our model achieves a more comprehensive understanding of both visual and textual information: it not only perceives the information of the image modality itself but also gains a more accurate understanding of the location and correlation of the key object features in different regions.
- Point 2: How and why does the visual target-relation score (i.e., the x-, y-, w-, and h-masks) work?
To facilitate the explanation, we provide a clearer illustration of the four masks in Fig. R1 within the rebuttal PDF file. As mentioned in Sec. 3.2 of the paper, this score represents the spatial distance between the current patch region and the referred region, which enables the grounding capability to be implicitly deployed within each token of the model. When reconstructing the visual features and the target-relation score of each local patch, the model actually needs a global and comprehensive understanding of both the textual modality and the visual information of the image. On this basis, the model needs to rely on the reconstructed visual features of the local patch to implicitly predict the specific location and size of the referred object, and then accurately predict the visual target-relation score. Finally, Referring MIM enhances the model's global, multimodal understanding of textual and visual information, and thereby learns more general visual representations that generalize better when deployed to downstream referring tasks.
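For additional intuition, below is a minimal sketch of one plausible way to compute such per-patch target-relation scores, assuming normalized center offsets for the x-/y-scores and the relative box size for the w-/h-scores. The function name, grid size, and exact normalization are illustrative assumptions and may differ from the precise formulation in Sec. 3.2:

```python
import torch

def visual_target_relation_scores(ref_box, grid=14):
    """Illustrative per-patch target-relation scores (an assumption, not the
    exact Sec. 3.2 formulation): for every patch we compute the normalized
    x/y offsets from the patch center to the referred box center, plus the
    referred box's relative w/h, yielding the four x-, y-, w-, h- scores.

    ref_box: (cx, cy, w, h) of the referred region, normalized to [0, 1].
    Returns a (grid*grid, 4) tensor of scores.
    """
    cx, cy, w, h = ref_box
    # Patch centers on a regular grid, normalized to [0, 1].
    coords = (torch.arange(grid, dtype=torch.float32) + 0.5) / grid
    py, px = torch.meshgrid(coords, coords, indexing="ij")
    # x-/y-scores: signed distance from each patch center to the box center.
    dx = cx - px.reshape(-1)
    dy = cy - py.reshape(-1)
    # w-/h-scores: size of the referred region relative to the whole image,
    # broadcast to every patch so each token carries the target's scale.
    dw = torch.full_like(dx, w)
    dh = torch.full_like(dy, h)
    return torch.stack([dx, dy, dw, dh], dim=-1)

# During Referring MIM, a light regression head on each masked patch token
# would be supervised with these scores (e.g., an L1 loss) in addition to
# the usual content reconstruction target.
scores = visual_target_relation_scores((0.62, 0.40, 0.25, 0.30))
print(scores.shape)  # torch.Size([196, 4])
```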
The proposed Referring MIM is our own design, mainly intended to address the limitations of MAE/BEiT, and we have not found a similar method in existing work. However, the rationale of our method can be found in some classic computer vision papers, such as the YOLO series [1], which predicts the location, size, confidence, and category of the object box corresponding to each grid cell based on a global understanding of the image. The paper [1] also confirmed that an object detection model trained in this way has stronger generalization ability than other detectors when transferred to detection tasks that differ greatly from the training data.
[1] Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR. 2016.
The paper initially received borderline accepts from all reviewers, with one reviewer adjusting their score favorably after the author rebuttal. For the reviewers who did not respond, the authors' rebuttal appears to effectively address the main concerns. Therefore, based on the recommendations, the AC also recommends acceptance of the paper.