PaperHub

Overall rating: 5.3/10 (Poster; 4 reviewers; min 4, max 6, std 0.8)
Individual ratings: 6, 5, 4, 6
Confidence: 4.3 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 2.8
NeurIPS 2024

Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model

Submitted: 2024-05-11 · Updated: 2024-11-06

Abstract

Keywords
Visual Anchors; Vision Language Learning; Multimodal Large Language Models

Reviews and Discussion

Review (Rating: 6)

This paper introduces a new method to make a strong connector between a language model and a vision encoder, in order to build a vision-language model. The method is derived from the Perceiver Resampler, but uses a clever initialization for the queries. The authors validate their approach with a series of ablations, comparing the other connectors to theirs.

Strengths

  • This work meets a demand in the field of vision-language models by creating a stronger connector between a language model and a vision encoder.
  • The authors obtain better scores with their connector than with the most widely used connectors, while also being efficient.

Weaknesses

  • The Anchor Selector is to me the meat of the paper. The AcFormer is a Perceiver Resampler with queries benefiting from a better initialization that depends on the input image, instead of the same queries for all images. However, I find that the description of the Anchor Selector algorithm and the justification for why it could work could be better explained in the main paper (the full algorithm is in the appendix). Providing a pseudocode instead of the code could also help.

Questions

  • What is the effect of N, the depth of the AcFormer, on the performance on the benchmarks? Which depth do you recommend taking in practice?
  • In the paper, you mention that the strategy of the Anchor Selector is to take the attention map already computed to avoid having to re-compute something, and use it after modification for the initialization to the queries in your AcFormer. Have you tried other approaches where you are not focusing on efficiency and you are allowed to do additional operations, to see if it performs better?
  • In Figure 2, we see the effect of the layer on the attention map. Which layer are you considering for the final attention map that you are using for your algorithm? Have you tried different layers to see if it performs better?

Limitations

Yes

Author Response

We sincerely thank you for your comments. We address your concerns below.

Q1: About the anchor selection algorithm.

R1: Thanks for your suggestion. We provide the pseudocode below. Assume the visual feature map is $V \in \mathbb{R}^{B \times N \times D}$, the visual attention map is $A \in \mathbb{R}^{B \times H \times N \times N}$, and the desired number of anchor tokens is $T$.

  1. Initialize Variables:

    • result is set to empty list [].
    • Calculate Per_head_num as $(T - 1) / H$; the [CLS] token is included by default (hence the $-1$).
  2. Iterate Over Batches:

    • For each batch $i$ in the range $B$:
      • Initialize max_indices with [0] (the index of the [CLS] token).

      • Iterate Over Heads:

        • For each head $j$ in the range $H$:
          • Extract and sort attention scores for the [CLS] token, excluding the [CLS] token itself.
          • Select top Per_head_num indices and add to max_indices.
          • If duplicates occur, select additional indices to ensure the required number of unique tokens.
      • Update Selected Anchors:

        • Fetch selected_anchor according to max_indices and append it to result
  3. Return:

    • Return result.
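
A minimal PyTorch-style sketch of the pseudocode above, using the notation from this response ($V$, $A$, $T$). The function name and the exact duplicate-handling details are our own reading of the algorithm, not the authors' released code.

```python
import torch

def select_anchors(V: torch.Tensor, A: torch.Tensor, T: int) -> torch.Tensor:
    """V: visual features (B, N, D); A: attention map (B, H, N, N); returns anchors per image."""
    B, H, N, _ = A.shape
    per_head = (T - 1) // H                # the [CLS] token (index 0) is always kept
    result = []
    for i in range(B):
        max_indices = [0]                  # start from the [CLS] token
        for j in range(H):
            # attention of [CLS] over all other tokens (excluding [CLS] itself)
            cls_attn = A[i, j, 0, 1:]
            order = torch.argsort(cls_attn, descending=True) + 1  # shift past [CLS]
            picked, k = 0, 0
            # take the next-best index until this head has contributed enough unique tokens
            while picked < per_head and k < order.numel():
                idx = order[k].item()
                if idx not in max_indices:
                    max_indices.append(idx)
                    picked += 1
                k += 1
        selected_anchor = V[i, sorted(max_indices)]   # gather the anchor features
        result.append(selected_anchor)
    return torch.stack(result)             # (B, 1 + per_head * H, D)
```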

Q2: The effect of the depth of the Anchor Former.

R2: Thanks for your concern. We present an ablation experiment below, conducted using the Vicuna-7b model, in which we select a total of 145 tokens.

| Depth | TextVQA (↑) | GQA (↑) | MMB (↑) | MME (↑) | Param Num (M) |
|-------|-------------|---------|---------|---------|---------------|
| 3     | 57.7        | 60.9    | 68.1    | 1816.1  | 32.1          |
| 6     | 58.0        | 61.3    | 68.4    | 1846.1  | 63.1          |
| 9     | 58.2        | 61.2    | 68.3    | 1856.1  | 125.9         |

From the table above, it is evident that using a depth of 6 with the training data for LLaVA-1.5 is sufficient. Further increasing the depth does not yield significant gains.

Q3: About other approaches for anchor selection.

R3: Thanks for your question. As mentioned in our paper, a PCA of the hidden states can also be employed for anchor selection. Assuming the output of the Vision Transformer is $H \in \mathbb{R}^{N \times D}$, we apply PCA and keep the first 3 components to obtain $H_d \in \mathbb{R}^{N \times 3}$. We then sum these three components and sort the results to identify visual anchors without relying on the attention map. Experimental results show similar outcomes. Additionally, we calculate the Jaccard similarity of the selected anchor indices between the attention-based and PCA-based methods, which reaches 70%, indicating a high degree of consistency between the two approaches. However, due to the computational expense of PCA, we were not able to fully train the model with this method. In future work, we will explore other methods to extract these tokens.
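
As an illustration of this PCA-based alternative, here is a short sketch (scikit-learn assumed; the function name is ours, and the authors' actual implementation may differ):

```python
import numpy as np
from sklearn.decomposition import PCA

def select_anchors_pca(H: np.ndarray, T: int) -> np.ndarray:
    """H: ViT output features (N, D); returns the indices of T visual anchors."""
    H_d = PCA(n_components=3).fit_transform(H)  # (N, 3): first three principal components
    scores = H_d.sum(axis=1)                    # sum the three components per token
    return np.argsort(scores)[::-1][:T]         # highest-scoring tokens become the anchors
```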

Q4: About the layer choice.

R4: Thanks for your concern. In our experiment, we choose the penultimate layer. We observe that before layer 12, the [CLS] token's attention is primarily focused on the object, gradually integrating with other visual anchors. We test the attention maps of the last 10 layers for anchor selection and compute the Jaccard overlap of the selected token indices. Our findings indicate that the selected indices exhibit at least 90% overlap across different attention maps. Therefore, to ensure consistency with feature selection, we choose the penultimate layer in our configuration.
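
The index-overlap comparison used in R3 and R4 reduces to a one-line set computation; a tiny helper of our own, for illustration:

```python
def jaccard_overlap(idx_a, idx_b):
    """Jaccard similarity between two sets of selected anchor indices
    (higher means more consistent selections, as reported in R3/R4)."""
    a, b = set(idx_a), set(idx_b)
    return len(a & b) / len(a | b)

# e.g. jaccard_overlap(attention_based_idx, pca_based_idx) -> roughly 0.70 per R3
```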

Comment

Thank you for providing additional work and answering the questions.

Comment

Thank you for your acknowledgment of our additional work and responses. We appreciate your constructive feedback that has helped refine our research. Please feel free to reach out if you have further queries or need additional clarification on our work.

Review (Rating: 5)

This paper proposes AcFormer, a novel vision-language connector for MLLMs. AcFormer is motivated by the visual anchors observed in the vision transformer's PCA of the feature map and in the attention map of the [CLS] token, where high values appear in both maps. The authors use the attention values of the [CLS] token to select the visual anchors in a progressive way; the selected visual anchors are then used as information-aggregation tokens through cross-attention. Through extensive experiments the authors demonstrate the effectiveness and efficiency of AcFormer compared with other multimodal connectors.

Strengths

  • As shown in the main experimental results, AcFormer achieves roughly 2.3×/1.7× faster pretraining while retaining overall performance.
  • Extensive ablation results are provided to further demonstrate the effectiveness of AcFormer.
  • The writing and experimental setup are clear.

Weaknesses

  • It seems to the reviewer that the main advantage of AcFormer compared with the original LLaVA-1.5 is its efficiency. However, only the training time is reported in the main table. Since all of the reported times are less than 20 hours, as shown in Table 6, training time is apparently not the main bottleneck of developing a LLaVA-1.5 model. To further justify the efficiency of AcFormer, the authors are encouraged to report the inference time per token of the resulting models.
  • The motivation is not clear enough. The visual anchors discovered by the authors are a commonly observed phenomenon in recent vision transformers; for example, [1] finds outliers in ViTs and shows that they emerge because the ViT needs some additional tokens to preserve global information. In this paper, more experiments should be done to unveil the reason behind the emergence of visual anchors; otherwise it is not convincing enough to directly use the anchors to aggregate information, especially given that the performance improvement is limited compared with the original model.
  • AcFormer selects the visual tokens without considering the actual question, which means the token reduction could be harmful when the question is not about the major subject of the image.

[1] Darcet, T., Oquab, M., Mairal, J., & Bojanowski, P. (2023). Vision transformers need registers. arXiv preprint arXiv:2309.16588.

Questions

  • In lines 186-195, what is the difference between AcFormer on the $(i-1)$-th layer and selecting some features from the $i$-th layer?
  • In lines 203-214, how are the 6 layers selected and how are the features fed before the projection layer? This part is not clear enough.

Limitations

yes

Author Response

We sincerely thank you for your review. We address your concerns below.

Q1: About the inference time.

R1: Thanks for your concern about efficiency. We show the inference time and training memory below. We report the inference time for each benchmark using the prompt "Answer the question using a single word or phrase." (The maximum generation length is set to 3.) For inference, we use 8 A100 GPUs, each with a batch size of 1. To test the training memory, we use a batch size of 4 to avoid out-of-memory issues.

| Connector     | Resolution | TextVQA (↓) | DocVQA (↓) | ChartQA (↓) | GPU Mem (↓) |
|---------------|------------|-------------|------------|-------------|-------------|
| MLP           | 336        | 125s        | 198s       | 64s         | 31.24 GB    |
| Anchor Former | 336        | 97s         | 115s       | 36s         | 22.04 GB    |
| MLP           | 672        | 571s        | 803s       | 276s        | 71.58 GB    |
| Anchor Former | 672        | 384s        | 470s       | 141s        | 32.95 GB    |
| MLP           | 1008       | -           | -          | -           | OOM         |
| Anchor Former | 1008       | 505s        | 653s       | 223s        | 50.76 GB    |

From the results above, it is evident that our method accelerates not only training but also inference. It also greatly reduces training memory, which enables us to handle higher-resolution input. (LLaVA-Next supports a maximum resolution of 672 × 672.)

Q2: About the motivation.

R2: Thank you for your suggestion. We appreciate you bringing the paper "Vision Transformers Need Registers" to our attention. However, it was not formally published before the NeurIPS 2024 submission deadline, so we had not reviewed it earlier. After examining the paper, we acknowledge that while both our work and this paper identify similar phenomena, our motivation and methodology are distinct.

Motivation. The paper "Vision Transformers Need Registers" argues that these special tokens can be harmful. It proposes using registers (extra tokens) to eliminate them and thus enhance the vision transformer backbone. In contrast, our work delves deeper into the utility of these tokens and utilizes them to build a Multimodal Large Language Model (MLLM).

Analysis of the effectiveness of AcFormer. We analyze the feature transformation process in Vision Transformers through both the attention map and the feature map. Our experiments reveal that information tends to aggregate around these anchors. In AcFormer, we use these anchors as queries and all visual tokens as keys and values. In this way, the anchors are trained to extract information from the whole image. This differs from Q-Former and Perceiver Resampler, which use learnable queries to aggregate visual information.

Through extensive experiments, we evaluate our method. The results demonstrate comparable performance while significantly reducing costs, as shown above in Q1.

Q3: About the potential harm of token reduction on different questions.

R3: Thanks for your concern. In our work, the Anchor Former consists of a stack of attention and feed-forward layers. Instead of using anchors directly as image features, we use them as queries for the Anchor Former, with all image tokens serving as keys and values. During training, the model optimizes the Anchor Former parameters to implicitly extract visual features that are sufficient to answer different questions (covering both fine-grained and coarse-grained information).

This is also evident from our experiments on DocVQA and ChartQA. These two benchmarks focus on document question answering, which relies on fine-grained visual understanding. The results below show that our method can also handle fine-grained visual question answering. While sacrificing a little performance, it greatly accelerates inference.

| Connector     | Resolution | DocVQA (↑) | ChartQA (↑) | Inference Speed (↑) |
|---------------|------------|------------|-------------|---------------------|
| MLP           | 672        | 65.4       | 57.2        | 1×                  |
| Anchor Former | 672        | 64.9       | 56.7        | 1.66×               |
| MLP           | 1008       | OOM        | OOM         | -                   |
| Anchor Former | 1008       | 68.8       | 58.1        | 1.21×               |

Q4: About the difference between AcFormer at the $(i-1)$-th layer and selecting some features from the $i$-th layer.

R4: Thank you for your question. AcFormer is a stack of cross-attention layers, so within the Anchor Former we do not directly select specific features. The anchor selection occurs only after the Vision Transformer stage. We use the selected anchors as queries and all visual tokens as keys and values to perform cross-attention. Assuming the anchors are $IA \in \mathbb{R}^{M \times D}$ and the vision tokens are $V \in \mathbb{R}^{N \times D}$, the computation of the $i$-th layer of AcFormer is as follows.

$IA_i' = IA_i + \mathrm{Attn}_i(\text{query} = IA_i,\ \text{key} = \text{value} = V)$

$IA_{i+1} = IA_i' + \mathrm{FFN}_i(IA_i')$

$IA_0$ is the set of selected anchors after the vision transformer, and $V$ is the full set of visual tokens. The final vision representation is $IA_6$ from the last layer (the Anchor Former has 6 layers). We will revise the relevant sections of the paper to provide a clearer description.

Q5: How are the 6 layers selected, and how are the features fed before the projection layer?

R5: Thanks for your question. There are six layers in AcFormer, each containing an attention module and a feed-forward module, so there is no layer selection within AcFormer. There is only one anchor selection step, between the Vision Transformer and AcFormer. The selected anchors serve as queries while all visual tokens serve as keys and values; they are integrated through the cross-attention mechanism in AcFormer, as illustrated in Q4 above. AcFormer is trained during all stages (pretraining and IFT).

Following the Anchor Former, the output features have a shape of (number of anchors) × (feature dimension). We then use an MLP to project this dimension to match the LLM's hidden size.
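
To make the description in R4 and R5 concrete, below is a minimal PyTorch sketch of such a connector: a stack of 6 residual cross-attention + feed-forward layers applied to the selected anchors, followed by an MLP projection to the LLM hidden size. The module names, head count, and exact MLP shape are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class AcFormerLayer(nn.Module):
    def __init__(self, dim: int, num_heads: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, anchors: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # IA' = IA + Attn(query=IA, key=value=V);  IA_next = IA' + FFN(IA')
        attn_out, _ = self.attn(anchors, visual_tokens, visual_tokens)
        anchors = anchors + attn_out
        return anchors + self.ffn(anchors)

class AcFormerConnector(nn.Module):
    def __init__(self, dim: int, llm_dim: int, depth: int = 6):
        super().__init__()
        self.layers = nn.ModuleList([AcFormerLayer(dim) for _ in range(depth)])
        self.proj = nn.Sequential(nn.Linear(dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, anchors: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # anchors: (B, M, dim) selected after the ViT; visual_tokens: (B, N, dim)
        for layer in self.layers:
            anchors = layer(anchors, visual_tokens)
        return self.proj(anchors)   # (B, M, llm_dim), fed to the LLM
```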

Comment

Thanks for the detailed rebuttal. I appreciate the additional report of inference speed and the additional performance results. And thanks for the clarification of my questions. These experiments address part of my concerns and I would like to increase my rating. A minor correction: the paper 'Vision Transformers Need Registers' was published at ICLR 2024 (May 7th to May 10th, 2024), which is earlier than the NeurIPS 2024 deadline of May 22nd, 2024. Not to mention this paper was submitted to arXiv on Sep 23rd, 2023.

Comment

Dear reviewer,

We sincerely thank you for your reply and correction. We will include a discussion with ‘Vision Transformers Need Registers’ in the next version of our paper.

Sincerely, The Authors of Submission 4497

Review (Rating: 4)

This paper introduces a novel vision-language connector, Anchor Former (AcFormer), designed to enhance the efficiency and accuracy of multimodal models. By identifying visual anchors within Vision Transformers and utilizing a cost-effective progressive search algorithm, AcFormer leverages these anchors to aggregate information more effectively.

Extensive experiments demonstrate that AcFormer significantly reduces computational costs while maintaining or improving performance across various vision-language tasks compared to existing methods like Q-Former and Perceiver Resampler.

Strengths

  1. The motivation of this paper is strong, attempting to improve a fundamental building block in recent multimodal models.
  2. The paper is well written and easy to follow. Tables and figures are helpful for readers to quickly understand.
  3. Experiments are extensive and solid. They cover various types of connectors (linear projection (LLaVA), Q-Former (BLIP-2), and Perceiver Resampler) and datasets (POPE, MME, MMB, MM-Vet, TextVQA, GQA, VQAv2, VisWiz and SQAimg).
  4. This paper is insightful and can be useful to the community.

Weaknesses

  1. The proposed method is not very effective. The key result comparison for this paper is AcFormer vs LLaVA-1.5 (linear), but in Tables 1 and 2 the accuracy results are mixed (slightly in favor of AcFormer). The proposed method is not significantly better than the linear baseline.
  2. Linear connector results are missing in Table 3.
  3. Minor: The term Multimodal Large Language Model (MLLM) is self-conflicting. I suggest authors use Large Multimodal Model (LMM).

Questions

  1. Besides slight accuracy gain and training time speedup, are there other benefits of the proposed method over the baseline connectors?
  2. Only pretraining time results are provided. Is the proposed method faster at inference time as well?
  3. It seems that the proposed method is more efficient in terms of number of visual tokens. Will this make a big accuracy difference on tasks with high resolution images (documents for example)?

I will raise my rating if the authors can convince me of the significance of the key technical contribution of this paper.

Limitations

The authors discussed them in the limitation section.

Author Response

We are grateful for your review, and we address your concerns below.

Q1: About the effectiveness.

R1: Thank you for highlighting this concern. Our primary motivation in this work is to enhance the efficiency of Large Multimodal Models (LMMs). While various token reduction methods exist, they commonly suffer from performance degradation. Our ablation study consistently indicates that our method outperforms other token reduction techniques, such as C-Abstractor and Perceiver Resampler.

Moreover, in Tables 1 and 2, while the results between LLaVA (Linear projection) and our method are mixed, our method shows an overall improvement when averaged across each benchmark. Although there are slight performance drops on some benchmarks (e.g., MM-Vet ~-0.3, VQAv2 ~-0.1), the speed is greatly improved. Overall, our method reduces computation cost while maintaining comparable performance.

| Connector           | TextVQA (↑) | GQA (↑) | MMB (↑) | MME (↑) |
|---------------------|-------------|---------|---------|---------|
| C-Abstractor        | 53.4        | 60.2    | 67.8    | 1775.4  |
| Perceiver Resampler | 52.1        | 56.4    | 65.4    | 1720.8  |
| Anchor Former       | 58.0        | 61.3    | 68.4    | 1846.1  |

Q2: About missing the linear connector result in Table 3.

R2: Thanks for pointing out this issue. We have listed the results of the linear connector (i.e., the results of LLaVA, which uses a linear connector) in Tables 1 and 2. We will add them to Table 3 to make it more complete.

Q3: About the name.

R3: Thanks for your suggestion. We will use Large Multimodal Model to replace MLLM.

Q4: About other benefits (inference time, training cost).

R4: Thanks for your concern. Our proposed method not only improves accuracy and reduces training time but also significantly decreases training expenses (GPU memory) and accelerates inference. This is particularly important for building high-resolution large multimodal models.

We show the inference time and training memory needed below. We report the inference time for each benchmark using the prompt "Answer the question using a single word or phrase." For inference, we use 8 A100 GPUs, each with a batch size of 1. To test the training memory, we use a batch size of 4.

| Connector     | Resolution | TextVQA (↓) | DocVQA (↓) | ChartQA (↓) | GPU Mem (↓) |
|---------------|------------|-------------|------------|-------------|-------------|
| MLP           | 336        | 125s        | 198s       | 64s         | 31.24 GB    |
| Anchor Former | 336        | 97s         | 115s       | 36s         | 22.04 GB    |
| MLP           | 672        | 571s        | 803s       | 276s        | 71.58 GB    |
| Anchor Former | 672        | 384s        | 470s       | 141s        | 32.95 GB    |
| MLP           | 1008       | -           | -          | -           | OOM         |
| Anchor Former | 1008       | 505s        | 653s       | 223s        | 50.76 GB    |

From the table, it is evident that our proposed method significantly reduces both inference and training costs without significant performance loss (refer to Q6 for a more detailed analysis). In addition, this reduction allows us to develop higher-resolution models with limited resources. For example, with 8 A100 GPUs, our method can train a model for an input resolution of 1008 × 1008, while LLaVA-Next may suffer from out-of-memory issues.

Q5: About the inference time.

R5: We are grateful for your concern. We list the inference time in the table in Q4. The results demonstrate our method's effectiveness in accelerating inference.

Q6: About performance on high resolution image.

R6: We present the performance of the MLP and our method on DocVQA and ChartQA below. For pretraining, we use LLaVA-558k. As LLaVA-Next's dataset is not publicly available, we augment LLaVA-665k with the DocVQA and ChartQA document datasets for Instruction-Finetuning (IFT).

| Connector     | Resolution | DocVQA (↑) | ChartQA (↑) | Inference Speed (↑) |
|---------------|------------|------------|-------------|---------------------|
| MLP           | 672        | 65.4       | 57.2        | 1×                  |
| Anchor Former | 672        | 64.9       | 56.7        | 1.66×               |
| MLP           | 1008       | OOM        | OOM         | -                   |
| Anchor Former | 1008       | 68.8       | 58.1        | 1.21×               |

From the table above, it is evident that our method incurs only a slight performance drop compared to the baseline method. Additionally, our method can handle input resolutions up to 1008 × 1008 with less inference time than an MLP processing 672 × 672 inputs. By moving to higher resolutions such as 1008 × 1008, we achieve better accuracy at still-faster speeds, which more than compensates for the small drop observed at 672 × 672.

Comment

I've read all reviews and rebuttal responses. Thank you for providing detailed responses.

Comment

Dear reviewer i5LS,

Thanks for your reply! If there are any other concerns, feel free to tell us. We are grateful for your response.

Sincerely, The Authors of Submission 4497

Review (Rating: 6)

This paper proposes a way to select visual tokens, e.g. token pruning, by using attention map scores. This reduces the number of tokens needed in the network which saves compute costs. The method is evaluated on multiple datasets and shows reduced compute while maintaining performance.

Strengths

The paper is fairly well written and the experiments are good and show the benefit from the approach.

Weaknesses

The approach isn't especially novel; many works have explored token pruning, token learning, and token reduction before. While this approach differs from previous ones, the differences are small.

Questions

The compute cost overall seems to go down, but I'm curious about the implementation of the anchor selection. How long does that part take? How optimized is that algorithm?

Limitations

N/A

Author Response

Thanks for your hard work in reviewing! We address your concerns below.

Q1: About the novelty.

R1: Thank you for highlighting this issue. Though there are indeed methods for token reduction or token pruning, such as C-Abstractor and Perceiver Resampler, ours differs from them in several aspects.

  1. Motivation. C-Abstractor mainly focuses on maintaining locality information, while Perceiver Resampler uses learnable queries to aggregate visual information. In our work, we propose image-specific information aggregation.
  2. Method. In C-Abstractor, the output tokens of the vision transformer are aggregated with convolutional pooling. In Perceiver Resampler, they are aggregated with learnable queries (all images share the same queries). In AcFormer, we use the anchors as queries and all visual tokens as keys and values to perform cross-attention, which better preserves important information.
  3. Performance. Thanks to the visual anchors, we can aggregate visual information with nearly no performance loss, as is evident from the table below (part of Table 3 in our paper).

| Connector           | TextVQA (↑) | GQA (↑) | MMB (↑) | MME (↑) |
|---------------------|-------------|---------|---------|---------|
| C-Abstractor        | 53.4        | 60.2    | 67.8    | 1775.4  |
| Perceiver Resampler | 52.1        | 56.4    | 65.4    | 1720.8  |
| Anchor Former       | 58.0        | 61.3    | 68.4    | 1846.1  |

Through the analysis above, we believe this work provides valuable insights into aligning visual and language modalities, and it offers a foundation for further exploration of the interpretability of the vision-language integration process.

Q2: About the implementation of anchor selection.

R2: Thanks for pointing out this point of confusion. We employ a top-p method for anchor selection. Specifically, we consider the attention map of the [CLS] token in the corresponding layer of the Vision Transformer (the penultimate layer in our implementation), denoted as $A \in \mathbb{R}^{H \times 1 \times N}$, where $H$ is the number of heads and $N$ the number of visual tokens (excluding the [CLS] token itself). Our goal is to select $M$ tokens. To achieve this, we calculate the number of tokens each head should contribute as $p = (M-1)/H$ (we include the [CLS] token by default, so we subtract one). For each head, we sort the token indices based on $A$ and select the top-$p$ indices. In cases of duplication, we iteratively take the next-best indices until the desired number of unique tokens is reached. In this way, we finally obtain the chosen visual anchors.

For example, suppose we aim to select 145 tokens and use the OpenAI CLIP-L model, which has 16 attention heads. Each head's contribution is (145-1)/16 = 9 tokens. We initialize the selected-index list with the [CLS] token, res = [0]. For head 0, assuming the token indices sorted by attention are [1, 2, 4, 6, 21, 64, 33, 78, 24, 98, 23, 99, ...], we append the top-9 indices to the result list, giving res = [0, 1, 2, 4, 6, 21, 64, 33, 78, 24]. We then process head 1; assuming its sorted indices are [1, 24, 2, 4, 6, 98, 23, 99, 45, 32, 75, 34, 38, 70, ...], the indices 1, 24, 2, 4, and 6 are already in res, so to avoid duplication we append the next 9 unique indices [98, 23, 99, 45, 32, 75, 34, 38, 70]. After processing every head, we have a list of 145 unique token indices. Finally, we sort this list and fetch the anchors according to these indices.
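
A tiny pure-Python illustration of this per-head merge, using the orderings from the example above (a didactic sketch only, not the authors' code):

```python
def merge_head(res, sorted_idx, per_head=9):
    # append the top per_head indices of this head that are not already selected
    added = 0
    for idx in sorted_idx:
        if added == per_head:
            break
        if idx not in res:
            res.append(idx)
            added += 1
    return res

res = [0]  # start with the [CLS] token
res = merge_head(res, [1, 2, 4, 6, 21, 64, 33, 78, 24, 98, 23, 99])
res = merge_head(res, [1, 24, 2, 4, 6, 98, 23, 99, 45, 32, 75, 34, 38, 70])
print(res)  # [0, 1, 2, 4, 6, 21, 64, 33, 78, 24, 98, 23, 99, 45, 32, 75, 34, 38, 70]
```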

Q3: About the cost of anchor selection.

R3: Thanks for your concern about the cost of anchor selection. In our experiment, we employ the OpenAI CLIP-L-336 model, which features 16 attention heads in its Vision Transformer and processes 576 tokens (excluding the [CLS] token). The anchor selection primarily involves a single sort over a list of length 576. Evaluating this selection process over 1000 iterations, we observe an average execution time of 7.5 ms, compared with 800 ms to generate one token with our method and 1300 ms with the baseline. Although anchor selection adds an extra 7.5 ms, it reduces the attention computation by nearly 500 ms per token, leading to a net decrease in total computation time and highlighting its efficiency and lightweight nature.

| Connector     | One-token generation time | Anchor selection time |
|---------------|---------------------------|-----------------------|
| MLP           | 1300 ms                   | N/A                   |
| Anchor Former | 800 ms (−500 ms)          | 7.5 ms                |

Comment

Dear AC and Reviewers,

Thank you for taking the time to review our paper. We are delighted to see that the four reviewers recognized the strengths of our work. Specifically, Reviewer i5LS and Reviewer Ld45 found our method to be effective and insightful, and Reviewer r5Ke acknowledged the extensive and meaningful ablation experiments we conducted. However, the reviewers also raised some concerns. For instance, Reviewer i5LS and Reviewer r5Ke asked for a comparison of inference time, while Reviewer U61n was interested in the implementation of the anchor selection. We have provided a detailed rebuttal addressing these concerns and kindly ask the reviewers to review our responses when convenient.

Thank you once again for your valuable feedback.

Sincerely, The Authors of Submission 4497

Final Decision

This paper introduces a novel approach, AcFormer, which leverages visual anchors in vision transformers to improve the efficiency and accuracy of multimodal models. The paper addresses the challenge of integrating vision and language models by introducing a strong vision-language connector that reduces computational costs while maintaining or even improving performance. The authors provide extensive experiments, demonstrating significant efficiency gains, especially in inference speed and memory usage, without compromising performance across various benchmarks. While some reviewers raised concerns about the novelty and motivation behind using visual anchors, the authors effectively addressed these concerns by providing detailed explanations and additional experimental results in their rebuttal. The responses clarified the method's advantages and demonstrated its broader applicability, particularly in handling high-resolution images. Given the strong empirical results, the thoroughness of the experiments, and the authors' satisfactory responses to reviewer concerns, I recommend accepting this paper.