PaperHub

Rating: 4.6 / 10 (withdrawn; 5 reviewers)
Scores: 3, 6, 3, 5, 6 (min 3, max 6, std 1.4)
Confidence: 3.8 | Correctness: 2.6 | Contribution: 1.8 | Presentation: 2.6
Venue: ICLR 2025

Unleashing the Power of Selective State Space Models in Vision-Language Models

OpenReview | PDF
Submitted: 2024-09-25 | Updated: 2024-11-13

Abstract

Keywords
Vision-Language Models; Mamba;

Reviews and Discussion

Official Review (Rating: 3)

This paper proposes a new vision-language model (VLM) variant called MambaVLM, which introduces several improvements over the previous Mamba-based VLM, Cobra. Specifically, the paper proposes concatenating the visual features from DINOv2 and SigLIP along the sequence axis instead of the channel axis used in Cobra, followed by a new Mamba-based projector. Performance is validated on various VLM benchmarks.
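To make the distinction concrete, here is a minimal PyTorch sketch of the two concatenation schemes (the tensor shapes are illustrative assumptions, not values from the paper):

```python
import torch

# Illustrative shapes only (assumed): 576 patch tokens and 1024 channels per encoder.
dino_feats   = torch.randn(1, 576, 1024)  # DINOv2 features  (batch, tokens, channels)
siglip_feats = torch.randn(1, 576, 1024)  # SigLIP features  (batch, tokens, channels)

# Cobra-style channel-axis concatenation: same token count, wider channel dimension.
channel_concat  = torch.cat([dino_feats, siglip_feats], dim=-1)  # (1, 576, 2048)

# Sequence-axis concatenation as described for MambaVLM: twice the tokens, same width.
sequence_concat = torch.cat([dino_feats, siglip_feats], dim=1)   # (1, 1152, 1024)
```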

Strengths

  1. The paper is easy to understand, with clear illustrations of the proposed method.
  2. According to the experiments, MambaVLM achieves overall better performance than previous VLMs such as Qwen-VL, LLaVA-1.5, and Cobra.

Weaknesses

  1. The novelty is limited, for the following reasons: (1) the model is based on Cobra, with only minor changes to the concatenation of visual features and the projector; (2) the scan directions are taken from VMamba, and only the stitch-scan is novel.

  2. Although the sequence-level concatenation improves performance, it raises a serious concern about the efficiency of the model, yet the authors do not provide comparisons of inference speed, computational cost, or memory cost. Even though Mamba has linear computational complexity, a longer sequence still increases FLOPs and memory consumption, and the heavy projector introduces additional cost (see the rough scaling sketch after this list). As a result, directly comparing the model with existing methods such as Cobra without also comparing efficiency is unfair.

  3. In Figure 1, directly comparing LLaVA-1.5 with MambaVLM to demonstrate the effectiveness of Mamba and its advantage in training time is unfair, since MambaVLM uses the stronger DINOv2 + SigLIP encoder combination.

  4. In lines 215~235, the claim "regardless of how many channels ... loss of visual information" is overstated and lacks precise theoretical evidence. Bottleneck structures are widely used in networks such as ResNet, and according to the information bottleneck principle there is no clear evidence that compressing channels necessarily loses valuable information. Please reword.

  5. In Table 1, some results of MambaVLM (62.6, 76.3) are not the best and should not be bolded. Please correct them.
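As a rough illustration of the scaling concern raised in point 2 (back-of-the-envelope reasoning only; the token counts are assumptions, not measurements from the paper):

```python
# Per-layer cost as a function of visual sequence length L, at the order-of-magnitude level:
# a Mamba layer scales roughly as O(L), while self-attention carries an O(L^2) term.
def relative_cost(length: int, model: str = "mamba") -> int:
    return length if model == "mamba" else length ** 2

base, doubled = 576, 1152  # e.g. one encoder's tokens vs. two encoders concatenated by sequence
print(relative_cost(doubled) / relative_cost(base))                  # 2.0: Mamba cost still doubles
print(relative_cost(doubled, "attn") / relative_cost(base, "attn"))  # 4.0: the attention term quadruples
```

Even under linear scaling, FLOPs and activation memory grow with sequence length, which is why explicit latency, FLOPs, and memory comparisons against Cobra would strengthen the paper.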

Questions

See weaknesses.

Official Review (Rating: 6)

MambaVLM is a highly efficient multi-modal large language model framework that integrates Mamba’s linear complexity with a novel cross-stitch scanning approach to improve both visual information interaction and vision-language alignment. Achieving competitive benchmark results with only 0.66 million data points and 14 hours of training on a single A800 node, MambaVLM significantly outperforms LLaVA-1.5 and rivals the performance of Qwen-VL, demonstrating Mamba’s potential in enhancing MLLM efficiency and effectiveness.

Strengths

  1. The proposed method performs very well, achieving better performance than LLaVA-1.5 with only half the training time.
  2. The approach is ingenious; using Mamba for long-context vision-language modeling is a promising avenue worth exploring.
  3. The paper is written with a clear structure.

Weaknesses

  1. The performance comparison with the original LLaVA is somewhat unfair, as the method in the paper uses two visual encoders. It would be better if a version with only ViT-CLIP could be provided.
  2. The method description in the paper is unclear; perhaps I missed where it explains how Mamba-VLM + Vicuna is implemented. It seems that if Vicuna is used, only the Mamba projector is related to Mamba. Of course, I also understand that the performance of VLMs is highly dependent on the performance of the LLM, and Mamba as an LLM is still relatively weak.

Questions

Please see the section on weaknesses.

Official Review (Rating: 3)

The paper introduces a customized version of the Mamba framework within multimodal large language models (MLLMs). This framework has three core components: a visual long sequence, a Mamba projector, and a Mamba LLM. Experimental results on various benchmarks suggest improved performance and speed compared to several existing methods.

Strengths

The framework is concise and clear, making the proposed approach easy to understand.

Weaknesses

1): Limited Novelty in Technical Contribution: While the paper proposes a "visual long sequence" as part of the framework, a significant body of literature already exists on augmenting visual features using ensembles of different visual encoders, as demonstrated in works such as [A-D]. The design of the Mamba projector, specifically its cross-stitch scanning scheme that concatenates four scanning paths, seems heuristic rather than theoretically grounded.

2): Unclear Motivation for the Mamba Projector: The Mamba projector, the primary technical contribution of this paper, has an unclear motivation. The 1x1 convolutional MLP layer can be treated as a full attention layer, suggesting that the Mamba projector is an approximation. Lines 250–253 argue that "a simple MLP layer may not be able to accomplish sufficient vision-language alignment and interaction of different visual features. Therefore, we devise a lightweight mamba projector…" However, this rationale does not sufficiently justify the addition of the Mamba projector (a sketch of the kind of simple MLP projector meant here is given after point 4 below).

3): Unfair Experimental Comparisons: For instance, in Table 4, using a longer visual sequence generally increases latency, and the LLMs in baselines such as TinyLLaVA and MobileVLMv2 should be substituted with the Mamba LLM for a controlled comparison. In Table 2, MambaVLM shows superior performance, which is largely attributable to encoder ensembling, a common approach in the literature.

4): Presentation Quality: The paper’s overall clarity and presentation could benefit from further refinement.
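For context on point 2), the sketch below shows the kind of "simple MLP layer" projector (LLaVA-1.5 style, applied token-wise and therefore equivalent to a 1x1 convolution) that the discussion assumes; the dimensions are hypothetical and this is not the paper's actual projector:

```python
import torch.nn as nn

# Minimal sketch of a LLaVA-1.5-style two-layer MLP projector. It acts on each visual
# token independently, which is why it can also be written as a 1x1 convolution.
class MLPProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2560):  # hypothetical dims
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens):    # (batch, seq_len, vision_dim)
        return self.proj(visual_tokens)  # (batch, seq_len, llm_dim)
```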

References:

[A]: BRAVE: Broadening the Visual Encoding of Vision-Language Models, arXiv.

[B]: Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs, CVPR 2024.

[C]: Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders, arXiv.

[D]: Law of Vision Representation in MLLMs, arXiv.

Questions

See Weaknesses.

Official Review (Rating: 5)

This paper introduces MambaVLM, a novel framework that utilizes the Mamba model, a state-of-the-art selective structured state space model renowned for its linear computational complexity and its efficiency in managing long sequences. The authors enhance the Mamba model by incorporating visual long sequences and a cross-stitch scanning mechanism, specifically tailored to boost interaction and alignment between visual and linguistic data. Through extensive experiments and qualitative analyses, they establish MambaVLM not only as a powerful tool for MLLM tasks but also as a pioneering approach that sets a new benchmark for future research in the field.

Strengths

  • The authors develop visual long sequences that enhance representation capabilities, ensuring more robust and detailed visual data processing.
  • The authors introduce an innovative cross-stitch scanning mechanism designed to improve the interaction between visual and linguistic data, optimizing vision-language alignment.
  • The authors present MambaVLM, a robust and streamlined MLLM framework. Their extensive testing across various benchmarks validates the effectiveness of their approach.

Weaknesses

  • The contributions are vague; it would be better to clearly summarize the contributions of this paper at the end of the Introduction. This article simply replaces the traditional MLLM with the Mamba model, and the proposed Stitch-Scan is merely a data augmentation stitching method.

  • The experiments are insufficient. The core argument of this article is: "we first construct visual long sequences with multiple vision encoders, which not only enrich visual representations but also leverage the advantages of Mamba in handling long sequences. Notably, this design will not undermine the efficiency obviously, which is in stark contrast with the common cognition of Transformer-based MLLMs." Is there any experimental or theoretical support for this conclusion? Specifically, how much is "not undermining the efficiency obviously"? It is recommended that a row be added to Table 4 so that the visual token counts of MambaVLM and MobileLLaMA-2.7B are both 144, which would support the above point.

  • Formula 7 is not expressed in standard notation; mathematical symbols should not be mixed with code.

  • In Formula 8, Hv = Merge(Hv1, Hv2, Hv3, Hv4), the Merge method is not explained in the text. What specific merging technique is used? Is it just a simple concatenation? (A hypothetical sketch of one possible reading is given after this list.)

  • In Table 1, the Qwen-VL model outperforms MambaVLM on TextVQA and VQAv2 at the 665K data scale. Typically, bold numbers in a paper indicate the best results obtained by any model, but this is not the case in your table. If the bold numbers have a special meaning, please explain it in the text. The same issue occurs in Table 2.
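Regarding the Merge question above, the following is a hypothetical sketch of one possible reading, with four VMamba-style scan orders of a 2D feature map merged by plain concatenation; it is not necessarily the paper's actual Stitch-Scan or Merge operation, and the shapes are assumed:

```python
import torch

# Assumed feature-map shape; the actual sizes are not specified here.
B, H, W, C = 1, 24, 24, 1024
feat = torch.randn(B, H, W, C)

h_fwd = feat.reshape(B, H * W, C)                      # row-major scan, forward
h_bwd = h_fwd.flip(dims=[1])                           # row-major scan, backward
v_fwd = feat.permute(0, 2, 1, 3).reshape(B, H * W, C)  # column-major scan, forward
v_bwd = v_fwd.flip(dims=[1])                           # column-major scan, backward

# One plausible Merge: concatenate the four scanned sequences along the sequence axis.
# It could equally be an element-wise sum or a channel-wise concat; the text should specify.
merged = torch.cat([h_fwd, h_bwd, v_fwd, v_bwd], dim=1)  # (B, 4*H*W, C)
```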

Questions

Please see the weaknesses above.

Official Review (Rating: 6)

This paper proposes a new way to integrate the Mamba architecture into multi-modal large language models (MLLMs). The technical contributions include: 1) using visual long sequences to exploit Mamba's linear complexity; 2) a cross-stitch scanning approach that extracts and combines spatial and semantic features simultaneously. The proposed method outperforms LLaVA-1.5 with less training time and better inference efficiency, and achieves performance comparable to models trained on larger datasets, such as Qwen-VL.

Strengths

  1. The proposed method achieves competitive results on benchmarks such as open-ended VQA and challenge sets. It outperforms LLaVA-1.5 with less training time.

  2. The proposed method is based on a good intuition about how to better utilize Mamba's efficiency, and it comes with good visualizations.

Weaknesses

  1. The proposed method is likely to depend on the choice of vision encoders. The work would be more solid if the authors conducted additional experiments with encoders other than DINOv2 + SigLIP. The authors also do not show how the proposed method performs in a single-vision-encoder MLLM.

  2. There are not enough ablation experiments on the scanning orders; for example, there is no comparison with using only Hv1.

Questions

  1. In the introduction, the authors mention that the proposed framework is also compatible with Transformer-based LLMs, but there seem to be no experiments applying the proposed method to Transformer-based LLMs?

  2. What is the Merge operator in the equation (8)?

Withdrawal Notice

We thank the reviewers for their valuable comments and we will revise accordingly.