SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion
Abstract
Reviews and Discussion
This paper presents SimVG, which decouples multimodal understanding from downstream tasks and uses a pretrained model to perform feature extraction and multi-modal fusion. A dynamic weight-balance distillation (DWBD) module is proposed to enhance the token branch's ability. A text-guided query generation module is developed to integrate text information into queries. Performance results validate the effectiveness of the proposed method.
Strengths
1. Directly using a multi-modality encoder for multi-modal feature extraction and fusion avoids the need to redesign a multi-modal fusion module, which simplifies the model structure.
2. The proposed DWBD module improves the model's inference efficiency through distillation.
3. Experiments on different benchmarks show better performance than previous SOTA methods.
Weaknesses
There is no significant weakness in this paper.
Questions
- Will the multi-modality encoder be trained or are the parameters frozen?
- Visualization of the feature map could be provided.
Limitations
No significant limitations
Q1: Will the multi-modality encoder be trained or are the parameters frozen?
A1: The multi-modality encoder (MME) weights are trainable throughout the training process, but the learning rate of the MME is set to 0.1 times that of the other parameters.
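For illustration, a minimal sketch of how such a two-group learning-rate setup can be configured in PyTorch is shown below. The module prefix (`"mme."`), the helper name, and the base learning rate are hypothetical and only stand in for the actual implementation.

```python
import torch


def build_optimizer(model, base_lr: float = 1e-4):
    """Sketch: train all weights, but give the pretrained MME a 0.1x learning rate."""
    mme_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # "mme." is an assumed module prefix for the multi-modality encoder.
        (mme_params if name.startswith("mme.") else other_params).append(param)

    return torch.optim.AdamW(
        [
            {"params": mme_params, "lr": base_lr * 0.1},  # pretrained MME: smaller LR
            {"params": other_params, "lr": base_lr},      # newly added modules: full LR
        ],
        weight_decay=1e-4,
    )
```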
Q2: Visualization of the feature map could be provided.
A2: Thank you for your suggestion. In Figure 14 of the rebuttal PDF, we add feature map visualizations, including GradCAM-based feature heatmaps in the MME and attention map visualizations in the decoder.
I thank the authors for their efforts and detailed rebuttal. I have read through all the other reviewers' comments and the authors' responses. I decide to raise my score.
This manuscript introduces SimVG, a framework based on BEiT-3 that simultaneously encodes image, text, and object tokens. Additionally, it proposes a dynamic weight-balance distillation (DWBD) method to improve the simpler branch (MLP), thereby enhancing reasoning speed. The effectiveness of the proposed method is demonstrated across multiple visual grounding datasets.
Strengths
- The DWBD method enhances the performance of the lightweight branch by balancing the learning process, thereby improving the overall efficiency and accuracy of the model.
- SimVG achieves competitive results across multiple visual grounding datasets, demonstrating the robustness and effectiveness of the approach.
Weaknesses
- The technical contribution of the proposed method appears insufficient. The approach primarily builds upon BEiT-3 by adding an object token and a fast MLP head with a distillation loss, which may seem more like an application of BEiT-3 rather than a novel contribution.
- Although the paper aims to simplify the structure and improve reasoning speed, the overall architecture and the introduction of multiple new components (e.g., DWBD, TQG) add complexity during training. The process involves two-stage pretraining and fine-tuning steps, which can be cumbersome and resource-intensive.
- Could you clarify whether the model was pretrained from scratch or if existing BEiT-3 weights were used?
- I recommend thorough proofreading to enhance clarity and correctness, improving readability and quality.
Questions
See weakness.
Limitations
None.
Q1: The technical contribution of the proposed method appears insufficient. The approach primarily builds upon BEiT-3 by adding an object token and a fast MLP head with a distillation loss, which may seem more like an application of BEiT-3 rather than a novel contribution.
A1: Please refer to A1 in Response to Reviewer TBfU for comparisons with BEiT-3.
Further Explanation:
As shown in Figure 13(b) of the rebuttal PDF, previous methods (VGTR, TransVG, MDETR, SeqTR) utilize visual/text encoders pretrained on their respective modalities or use alignment-based pretraining models like CLIP (e.g., Dynamic MDETR). However, these methods do not integrate multimodal fusion during the pretraining process. Later methods embed text encoding into visual encoding (e.g., QRNet, VG-LAW), but they still rely on fitting multimodal fusion representations on small-scale downstream data. These methods correspond to Figure 13(b)(1) of the rebuttal PDF.
Our approach diverges from these methods by leveraging upstream multimodal fusion models like ViLT and BEiT-3: we move the multimodal fusion representation to the pretraining phase using a large-scale dataset of image-text pairs. Our architecture corresponds to Figure 13(b)(3). One of the key innovations of this paper is the exploration of the importance of transferring multimodal fusion representation from downstream to upstream. Figure 13(a) illustrates that our method exhibits a superior understanding of multimodal content, including complex details such as relative positional relationships, physical materials, and colors.
We hope this addresses the concerns and highlights the unique contributions and benefits of our approach. Thank you for your valuable feedback.
Q2: Although the paper aims to simplify the structure and improve reasoning speed, the overall architecture and the introduction of multiple new components (e.g., DWBD, TQG) add complexity during training. The process involves two-stage pretraining and fine-tuning steps, which can be cumbersome and resource-intensive.
A2: Thank you for raising this concern.
About Training Resource Consumption: As shown in Table 9 of the rebuttal PDF, the parameter count of the SimVG Head (which contains DWBD and TQG) is significantly lower than that of RefTR, SeqTR, and MDETR. Additionally, Table 8 of the rebuttal PDF demonstrates that the number of training epochs and the total training time of SimVG are notably lower than those of other methods. With a single RTX 4090, training SimVG on the RefCOCO+ dataset takes less than 6 hours.
About Distillation Complexity: Table 6 in the original paper presents two distillation modes. The one-stage mode involves synchronous learning and distillation, where the teacher model is trained and knowledge is distilled to the student model in a single training session. This mode does not require additional pre-training and does not incur extra overhead. The two-stage mode aims to further enhance distillation performance by first training the teacher model and then synchronously training the student branch. While the two-stage mode does increase training complexity, it only requires less than 10 hours on a single RTX 4090 to complete one two-stage training session on RefCOCO, which is still significantly more resource-efficient than most of the existing methods. Additionally, we will release our source code that supports both one-stage and two-stage distillation via GitHub.
| Method | val | testA | testB | Training Time |
|---|---|---|---|---|
| One-stage | 86.57 | 87.80 | 82.71 | ~5.5h |
| Two-stage | 86.96 | 88.22 | 83.16 | ~9h |
Q3: Could you clarify whether the model was pretrained from scratch or if existing BEiT-3 weights were used?
A3: The existing pre-trained weights of BEiT-3 are used, and all weights are trainable (not frozen) during the SimVG training process.
Q4: I recommend thorough proofreading to enhance clarity and correctness, improving readability and quality.
A4: We thank the reviewer for this valuable suggestion. We will do our best to improve the quality of the manuscript. We will also ask a native English speaker to proofread and polish the manuscript.
Thank you for your response. After reviewing other reviews and responses, I have decided to increase my rating to 5. I hope you can carefully integrate the rebuttals into the revised version and thoroughly proofread the entire manuscript. While the technical aspects are borderline acceptable, my main concern remains the writing quality of this manuscript. Thank you.
Visual grounding is a typical task in the vision-and-language domain. Existing methods only use limited downstream data to fit multimodal feature fusion, leading to significant performance degradation on complex texts. Therefore, it is necessary to decouple visual-language feature fusion from downstream tasks to promote deep integration between downstream tasks and pre-training tasks. In this paper, the authors propose the SimVG model framework. They introduce a dynamic distillation method and a query generation module. Experimental results on several datasets demonstrate the effectiveness of the model.
Strengths
- The design of the distillation method is innovative, and TQG enables the model to be extended to GREC, broadening its application scope.
- The model performs well in experiments on several datasets, achieving state-of-the-art results on multiple datasets with relatively few parameters.
Weaknesses
- The writing needs improvement; for example, the motivation is not clearly and concisely described.
- The inference process of the model should be explained in the main text.
- In Table 3, why is there no comparison with PolyFormer-L, OFA-L, LGR-NET, and m-PLUG? For a fair comparison, at least these should be listed.
- There are some typos, such as:
- Line 156, "an caption" should be "a caption".
- Line 165, "R^{H/32 * W/32 * C}".
Questions
In Table 3, why is there no comparison with PolyFormer-L, OFA-L, and m-PLUG? For a fair comparison, at least these should be listed.
Limitations
In Appendix D4, the limitation section is provided.
Q1: The writing needs improvement; for example, the motivation is not clearly and concisely described.
A1: We thank the reviewer for pinpointing this issue. We will try our best to improve the writing of the manuscript in the final version. Also, we outline the main motivations, innovations, and advantages of SimVG in the Common concerns. Last, Figure 13 in the rebuttal PDF further illustrates our motivation:
- Insufficient multimodal understanding: Current approaches that use a small amount of downstream data to fit multimodal representations are insufficient. They perform poorly in scenarios involving complex relative positional relationships, physical characteristics, and detailed color descriptions, as shown in Figure 13(a) of the rebuttal PDF.
- Simple inference design principle: Our model design is centered around making inference simpler, including the use of the MME (eliminating a text encoder like BERT) and DWBD distillation (requiring only a simple MLP in the head during inference).
Due to the page limitation, we have placed the more critical model structure in Figure 1 of the original paper. However, our motivation is expressed in the second and third sentences of the abstract, as well as in Figure 2 and its related discussion of the original paper. We will incorporate the motivation introduction from the rebuttal PDF into the final version of the manuscript.
Q2: The inference process of the model should be explained in the main text.
A2: The inference process is quite similar to the training process. We will include a more detailed description of the inference process in the revised version. The last sentence of the caption of Figure 2 briefly notes that the inference stage can be accelerated by using only the token branch. For the phrase grounding and REC tasks, since there is only one target, the number of queries is set to 1, and no additional post-processing is required. For the GREC task, there may be no target or multiple targets, so we set the number of queries to 10 and the score threshold to 0.7; the post-processing is consistent with the original GREC method.
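For clarity, the sketch below illustrates the kind of score-threshold post-processing described above for GREC. The tensor names, shapes, and the helper function are illustrative assumptions rather than the released implementation.

```python
import torch


def grec_postprocess(pred_boxes: torch.Tensor,    # [num_queries, 4], e.g. 10 queries
                     pred_scores: torch.Tensor,   # [num_queries], confidence per query
                     score_thr: float = 0.7):
    """Keep only queries whose confidence exceeds the threshold.

    Returns no boxes when nothing passes (the no-target case) and several boxes
    when multiple queries pass (the multi-target case).
    """
    keep = pred_scores > score_thr
    return pred_boxes[keep], pred_scores[keep]

# For phrase grounding / REC there is a single query, so its box is used directly.
```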
Q3: In Table 3, why is there no comparison with PolyFormer-L, OFA-L, LGR-NET, and m-PLUG? For a fair comparison, at least these should be listed.
A3: Comparisons are shown in Table 10 of the rebuttal PDF. We will add references and comparisons to these works in the revised version. The original decision not to compare with Swin-Large-based models was made to ensure a fair comparison in terms of FLOPs. According to data from the official PyTorch website, under the same 224x224 input, Swin-B has 15.42 GFLOPs, while ViT-Large/32 has 15.38 GFLOPs.
Q4: There are some typos, such as: Line 156, "an caption" should be "a caption". Line 165, "R^{H/32 * W/32 * C}".
A4: We apologize for the typos. We will do our best to fix them in the final version. We will also ask a native English speaker to proofread and polish the manuscript.
This paper introduces a transformer-based framework called SimVG for the visual grounding task, which, unlike CLIP-based models, decouples multimodal fusion from the downstream task into the model pretraining stage. SimVG modifies a recently proposed multimodal fusion encoder architecture (BEiT-3) to generate the fused feature representation, and adopts a lightweight MLP module instead of a complex encoder-decoder structure for visual grounding prediction. To make the MLP prediction head work, the paper proposes a synchronous distillation learning process that trains the MLP prediction head and a complex decoder branch at the same time w/ dynamic weights between the two branches.
Experiments on six widely used visual grounding datasets show that the proposed SimVG framework not only achieves the state-of-the-art performance, but also brings considerable improvements in efficiency and convergence speed.
Strengths
- This paper successfully adapts a recent multimodal pretraining framework (BEiT-3) to visual grounding, and proposes a few model architecture improvements, making the end-to-end framework more efficient (1x faster) and more accurate (2~3% improvements in prediction accuracy).
- The basic idea of the paper is presented clearly. The experiments performed in this paper are convincing in showing the effectiveness of each of the proposed modules.
Weaknesses
- The novelty of the paper is a bit limited to me since it basically borrows and applies the unified multimodal pretraining framework introduced in BEiT-3 to the visual grounding task. A clearer comparison between the proposed method and the BEiT-3 model is desired.
- The title and module names in Fig 3 are a bit confusing to me (at first sight w/o reading the whole paper) since many abbreviations are used. It's helpful to make them clearer to the readers.
- I noticed some grammar mistakes, e.g. in L167-168 "to interact with A with B".
Questions
Please refer to the points I mentioned in the weakness section. Besides, I am wondering why the proposed synchronous distillation process is needed and its advantages over traditional model distillation process. Does traditional model distillation process work poorly?
Limitations
No obvious limitations noticed by me.
Q1: The novelty of the paper is a bit limited to me since it basically borrows and applies the unified multimodal pretraining framework introduced in BEiT-3 to the visual grounding task. A clearer comparison between the proposed method and the BEiT-3 model is desired.
A1: First, BEiT-3 is a pre-training architecture designed for global multimodal representation; BEiT-3 itself does not directly have the capability for visual grounding (see comparisons in Figure 13(b) of the rebuttal PDF). Most importantly, equipping BEiT-3 with existing structures, such as the standard head in SeqTR, yields even worse downstream performance than SeqTR (see the table below). Notably, with our proposed method, we extend BEiT-3, giving it the capability for downstream detection, and achieve an overall improvement over the existing SOTA. Experiments are conducted with ViT-B/32 on the RefCOCO dataset. Therefore, our contribution involves delving into the potential knowledge embedded in BEiT-3 and developing architectures to utilize this knowledge efficiently through the additional object tokens, several adaptive designs, and token distillation.
| Method | val | testA | testB |
|---|---|---|---|
| SeqTR | 83.72 | 86.51 | 81.24 |
| BEiT-3 + SeqTR-head | 80.92 | 83.63 | 74.75 |
| SimVG (w BEiT-3) | 87.07 | 89.04 | 83.57 |
This paper leverages BEiT-3 while highlighting the importance of decoupling multimodal representation from downstream tasks to upstream pre-training, particularly for understanding complex text. To the best of our knowledge, this is the first exploration and experimental validation of this issue. The discussion in Table 4 and Figure 2 of the original paper provides the rationale behind our adoption of BEiT-3, as well as the core insights that this paper aims to convey. More details about our motivation, contributions, and advantages can be found in the Common concerns section.
Q2: The title and module names in Fig 3 is a bit confusing to me (at first sight w/o reading the whole paper) since many abbreviations are used.
A2: We thank the reviewer for pinpointing this issue. We will make them clearer by adding complete module names and brief descriptions in the revised version.
Q3: I noticed some grammar mistakes, e.g. in L167-168 "to interact with A with B".
A3: We apologize for the typos. We will do our best to fix them in the final version. We will also ask a native English speaker to proofread and polish the manuscript.
Q4: I am wondering why the proposed synchronous distillation process is needed and its advantages over traditional model distillation process. Does traditional model distillation process work poorly?
A4: We thank the reviewer for this insightful question. There are several reasons for adopting the synchronous distillation method:
- Alignment with the "Simple" design principle of this paper: Synchronous distillation eliminates the need for a two-stage process as it does not require pre-preparation of a teacher model. Instead, both the teacher and student models are trained simultaneously in a single training run.
- Inheriting the strong representation of the teacher model: Traditional distillation methods necessitate two independent models. In contrast, the synchronous distillation method shares the feature extraction components between the teacher and student models, only differentiating at the head. This approach allows the student model to inherit the superior representational capacity of the teacher model. The downside is that it can only reduce the model size of the head, so it cannot distill a smaller overall model.
- Experimental validation: We present a set of experimental data with variables including one-stage, two-stage, whether the teacher model is frozen in the two-stage process, and traditional distillation with two independent models. The comparison between DWBD (w/o synchronous) and DWBD (two-stage, DB frozen) demonstrates that the synchronous distillation mode, where the teacher and student models share the MME component, provides performance improvements. Further refining the decoder-branch parameters during the two-stage process can enhance the student model's performance even more.
| Method | val | testA | testB | Training Time |
|---|---|---|---|---|
| Baseline | 85.47 | 86.75 | 81.66 | ~5h |
| DWBD (w/o synchronous) | 85.76 | 87.01 | 81.97 | ~11h |
| DWBD (one-stage) | 86.57 | 87.80 | 82.71 | ~5.5h |
| DWBD (two-stage, DB frozen) | 86.72 | 87.99 | 82.85 | ~8.5h |
| DWBD (two-stage) | 86.96 | 88.22 | 83.16 | ~9h |
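For illustration, the following is a schematic sketch of a single synchronous (one-stage) distillation step with a shared MME and two heads. The module names, the simple L1 losses, and the fixed distillation weight are placeholders; the actual DWBD loss uses dynamically balanced weights and the matching losses described in the paper.

```python
import torch.nn.functional as F


def synchronous_distillation_step(mme, decoder_branch, token_branch,
                                  batch, target_boxes, distill_weight=1.0):
    """Schematic one-stage distillation step (not the exact DWBD formulation).

    Both branches share the same multi-modality encoder, so the lightweight
    token branch (student) inherits the teacher's representation; only the
    heads differ, and both are optimized in a single training run.
    """
    fused = mme(batch["image"], batch["text"])        # shared multimodal features

    teacher_pred = decoder_branch(fused)              # heavier decoder branch (teacher)
    student_pred = token_branch(fused)                # lightweight MLP branch (student)

    loss_teacher = F.l1_loss(teacher_pred, target_boxes)   # supervise the teacher
    loss_student = F.l1_loss(student_pred, target_boxes)   # supervise the student
    # Distill the teacher's current predictions into the student; in DWBD this
    # weight is balanced dynamically over training (a constant is used here).
    loss_distill = F.l1_loss(student_pred, teacher_pred.detach())

    return loss_teacher + loss_student + distill_weight * loss_distill
```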
First of all, we would like to thank all the reviewers for your positive comments and valuable suggestions!
This rebuttal has two parts. First, please find our responses to some common concerns below. Then, we provide the response to each reviewer.
Common concerns
1. Motivation
1.1. Insufficient multimodal understanding
Figure 13(a) of rebuttal pdf shows that existing methods fail to adequately comprehend complex relative spatial relationships, physical materials, and detailed color descriptions. Due to the rich semantics and diversity of text, fitting multimodal fusion representations based on a small amount of downstream data is insufficient. Figure 13(b) highlights the differences between previous methods and our approach, which improves the issue of insufficient multimodal understanding by decoupling multimodal representation to upstream pre-training.
1.2. "Simple" architecture
We adopt the multi-modality encoder (MME) structure, which eliminates the need for an additional text encoder like BERT. By using the dynamic weight-balance distillation (DWBD) method, we enable synchronous learning of the teacher and student models with a single training run. Consequently, the decoder only requires a single MLP to accomplish phrase grounding, REC, and GREC tasks.
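As a rough illustration of how small the remaining head can be, the sketch below shows a minimal MLP-style token-branch head that maps object-token embeddings to normalized boxes and confidence scores. The dimensions, depth, and output parameterization are illustrative assumptions and do not reproduce the released SimVG head.

```python
import torch
import torch.nn as nn


class TokenBranchHead(nn.Module):
    """Minimal MLP head sketch (illustrative, not the released SimVG head)."""

    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.box_mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 4),
        )
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, object_tokens: torch.Tensor):
        # object_tokens: [batch, num_queries, embed_dim]
        boxes = self.box_mlp(object_tokens).sigmoid()   # normalized (cx, cy, w, h)
        scores = self.score(object_tokens).sigmoid()    # per-query confidence
        return boxes, scores
```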
2. Novelty
2.1. Decoupling multimodal fusion to upstream pre-training
- To the best of our knowledge, this paper is the first to emphasize and explore the importance of decoupling multimodal fusion representation from downstream tasks to upstream pre-training.
- Figure 13(b) in the rebuttal PDF specifically describes the differences between the previous frameworks and ours. Given the rich semantics and diversity of text, using a small amount of downstream data to fit fusion representations is certainly insufficient. Our method is the first study to explore the importance of decoupling multimodal fusion to upstream pre-training and validates this through experiments (refer to Table 4 in Section 4.4.1 and Figure 2 of the original paper).
2.2. Focusing on "Simple"
- The MME architecture eliminates the need for additional text encoding, such as BERT.
- The synchronous distillation approach allows for training the teacher and student models in a single training run, unlike traditional distillation methods that require pre-training the teacher model before distilling the student model.
- In the inference phase, a simple MLP suffices to accomplish the phrase grounding, REC, and GREC tasks effectively, with performance close to or exceeding that of the more complex decoder branch.
3. Main advantages of SimVG
- Simplified inference structure: The MME component eliminates the overhead of a separate text encoder. The decoder component only requires a simple MLP to accomplish the phrase grounding, REC, and GREC tasks.
- Faster convergence: Noticeable acceleration in convergence speed (30 epochs vs. 60 epochs+).
- High performance: Maintains high inference speed and accuracy (compared to GroundingDINO: latency: 101ms vs. 120ms / accuracy: 89.55 vs. 84.92).
- Reduced training data: Requires significantly less training data (28K vs. 174K+).
- Lower resource consumption: Training can be completed within 12 hours using a single NVIDIA 3090 GPU.
Dear Reviewers,
I hope this message finds you well. I am writing to kindly remind you to review the detailed rebuttal I have submitted in response to your feedback. I have carefully addressed the points you raised, and your insights would be invaluable in the final evaluation. Given your important role in this process, I would greatly appreciate it if you could provide your response at your earliest convenience.
Thank you for your time and consideration.
Best regards
This paper has received consistent feedback from all four reviewers. The reviewers engaged in thorough discussion and rebuttal, ultimately reaching a consensus. This paper presents a transformer-based framework named SimVG for visual grounding. SimVG has demonstrated SOTA performance across six datasets, with notable gains in efficiency and convergence speed. In light of the overall assessment and improvements made by the authors, the AC has decided to accept this paper.