PaperHub

Rating: 7.3/10 · Spotlight · 3 reviewers (min 7, max 8, std 0.5)
Individual ratings: 7, 8, 7
Confidence: 4.0
Correctness: 3.7 · Contribution: 3.0 · Presentation: 3.7
NeurIPS 2024

VMamba: Visual State Space Model

Links: OpenReview · PDF
Submitted: 2024-05-09 · Updated: 2024-12-29
TL;DR

This paper introduces a new Mamba-based architecture with promising performance on visual perception tasks.

Keywords
State Space Model · Transformer · Computer Vision · Foundation Model

Reviews and Discussion

Official Review (Rating: 7)

This paper presents VMamba, a novel vision backbone model inspired by the famous Mamba state-space sequence model. The main contribution of VMamba is its ability to achieve efficient visual representation learning with linear computational complexity. The core of VMamba is the VSS block, which incorporates the 2D-Selective-Scan module (SS2D), thereby extending the Mamba model, whose 1D selective scan is well suited to NLP tasks, to 2D vision data. With SS2D, the model can accommodate the inductive biases associated with 2D image space.
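For concreteness, the sketch below illustrates the kind of four-direction cross-scan that SS2D performs: flattening a 2D feature map along complementary traversal routes, running a 1D selective scan on each, and merging the results. This is a minimal illustrative reconstruction, not the authors' implementation; the generic `selective_scan_1d` callable is a hypothetical stand-in.

```python
import torch

def cross_scan(x):
    """Flatten a (B, C, H, W) feature map into four 1D sequences:
    row-major, column-major, and their reversals."""
    B, C, H, W = x.shape
    row = x.flatten(2)                                  # (B, C, H*W), row-major
    col = x.transpose(2, 3).flatten(2)                  # (B, C, H*W), column-major
    return torch.stack([row, col, row.flip(-1), col.flip(-1)], dim=1)  # (B, 4, C, L)

def cross_merge(ys, H, W):
    """Undo the four traversals and sum them back into a (B, C, H, W) map."""
    B, K, C, L = ys.shape
    row, col, row_r, col_r = ys.unbind(dim=1)
    row = row + row_r.flip(-1)                          # re-align the reversed row scan
    col = (col + col_r.flip(-1)).view(B, C, W, H).transpose(2, 3).reshape(B, C, L)
    return (row + col).view(B, C, H, W)

def ss2d(x, selective_scan_1d):
    """SS2D-style token mixing: scan each of the four sequences with a 1D
    selective scan (hypothetical callable), then merge."""
    B, C, H, W = x.shape
    seqs = cross_scan(x)                                # (B, 4, C, L)
    outs = torch.stack([selective_scan_1d(seqs[:, k]) for k in range(4)], dim=1)
    return cross_merge(outs, H, W)
```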

VMamba's architecture consists of multiple stages with hierarchical representations (similar to ViT). The authors introduce three model sizes: Tiny, Small, and Base. The VSS blocks replace the S6 module from Mamba with the SS2D module, and further enhancements are made by eliminating unnecessary components and optimizing the architecture for better computational efficiency, using the Triton language.

Extensive experiments demonstrate VMamba's promising performance across various visual perception tasks, including image classification on ImageNet-1K, object detection, instance segmentation on MSCOCO, and semantic segmentation on ADE20K. VMamba consistently achieves superior accuracy and throughput compared to existing benchmark models, showcasing its scalability and adaptability to different input resolutions and downstream tasks.

Strengths

  • 2D-Selective-Scan Module: The introduction of the 2D-Selective-Scan (SS2D) module is a creative solution to bridge the gap between 1D selective scan and 2D vision data.
  • Comprehensive Experiments: The paper provides extensive experimental results on multiple benchmarks, including ImageNet-1K, MSCOCO, and ADE20K, demonstrating the effectiveness and robustness of VMamba across various tasks.
  • Clear Explanation: The paper is well-written, with clear explanations. The authors provide detailed descriptions of the architecture, modules, and experimental setups, making it accessible to readers.
  • Visualization: The use of visualizations, such as activation maps and effective receptive fields (ERF), helps in understanding the SS2D mechanism and the model's behavior, which is an important part of the ablation studies.
  • Impact on Visual Representation Learning: VMamba addresses a critical issue in vision models by reducing computational complexity from quadratic to linear, which can significantly impact the field of visual representation learning.

Weaknesses

  • Limited Comparison with Other SSM-based Models: While the paper does compare VMamba with several benchmark models, it would benefit from a more detailed comparison with other state-space models (SSM) in the vision domain. Specifically, models like S4ND and Vim are mentioned, but the comparisons are somewhat brief. Providing more in-depth analysis and results would strengthen the argument for VMamba's superiority.
  • Additional related work: There are interesting works on neuromorphic vision and processing with SSMs that the authors should cite and mention:

[1] State Space Models for Event Cameras. Nikola Zubić, Mathias Gehrig, Davide Scaramuzza - CVPR 2024, Spotlight

[2] Scalable Event-by-event Processing of Neuromorphic Sensory Signals With Deep State-Space Models. Mark Schöne, Neeraj Mohan Sushma, Jingyue Zhuge, Christian Mayr, Anand Subramoney, David Kappel - ICONS 2024

  • Generalization to Other Tasks: The experiments focus mainly on standard benchmarks for image classification, object detection, and segmentation. However, it is not clear how well VMamba generalizes to other types of visual tasks such as video analysis, 3D vision, or more complex scene understanding. Including some preliminary results or at least discussions on these aspects could highlight the versatility of VMamba further.
  • Clarity in Mathematical Derivations: Some of the mathematical derivations, especially in the relationship between SS2D and self-attention, are complex and may not be easily accessible to all readers. Simplifying the explanations or providing more intuitive visual insights alongside the formal derivations could enhance understanding. Also, they are not rigorously mathematically proven.

Questions

  • Could you provide more detailed comparisons with other state-space models (SSMs) used in the vision domain, such as S4ND and Vim? Specifically, how does VMamba perform in terms of accuracy, computational efficiency, and memory usage compared to these models?
  • How well does VMamba generalize to other types of visual tasks beyond image classification, object detection, and segmentation? Have you considered evaluating VMamba on tasks such as video analysis, 3D vision, or more complex scene understanding? How does it scale on these tasks?
  • How sensitive is VMamba to various hyperparameters? It would be helpful to know if specific hyperparameters are critical to achieving the reported performance and if there are guidelines or best practices for tuning them.
  • How easily can VMamba be integrated into existing deep learning frameworks and pipelines? Are there any specific requirements or modifications needed for seamless integration?

Limitations

The authors addressed everything regarding the limitations section.

Author Response

Response to Reviewer c4kg

We appreciate the reviewer’s thoughtful review and constructive comments. In our responses, we address the following concerns: a detailed comparison with SSM-based methods, the generalizability of VMamba, sensitivity to hyper-parameters, and potential for integration into various frameworks.

Detailed Comparison with SSM-based Methods

In Table 1 of the main submission, we have already compared our method to S4ND [2] and Vim [4] in terms of the number of parameters, training throughput, and Top-1 accuracy on ImageNet-1K. To provide a more comprehensive evaluation, we additionally compare FLOPs and memory usage; the results are reported in the following table.

Moreover, we also compare the performance (both effectiveness and efficiency) change with increasing input resolution in Figure 1 in the attachment. For qualitative comparison, we visualize the ERF of S4ND and Vim, and the results are shown in Figure 2 in the attachment. We will include these results and the associated analysis in the revised manuscript.

| Model | Hierarchical | Params (M) | FLOPs (G) | TP. (img/s) | Test Mem. (M) | Train TP. (img/s) | Train Mem. (M) | Top-1 (%) |
|---|---|---|---|---|---|---|---|---|
| DeiT-S | False | 22M | 4.6G | 1761 | 582 | 2404 | 4562 | 79.8 |
| DeiT-B | False | 86M | 17.5G | 503 | 1032 | 1404 | 9511 | 81.8 |
| S4ND-ViT-B | False | 89M | 17.1G | 398 | 2221 | 400 | 15868 | 80.4 |
| Vim-S | False | 26M | 5.3G | 811 | 1055 | 344 \dagger (232) | 9056 \dagger (16150) | 80.5 |
| Swin-T | True | 28M | 4.5G | 1244 | 3092 | 987 | 9798 | 81.3 |
| ConvNeXt-T | True | 29M | 4.5G | 1198 | 2498 | 702 | 9450 | 82.1 |
| S4ND-Conv-T | True | 30M | 5.2G | 683 | 3945 | 369 | 18843 | 82.2 |
| Vanilla-VMamba-T | True | 23M | 5.6G | 638 | 6042 | 195 | 16452 | 82.2 |
| VMamba-T | True | 30M | 4.9G | 1686 | 3064 | 571 | 12394 | 82.6 |

[Performance comparison between VMamba and benchmark methods. \dagger indicates the value is measured with mix-resolution while Vim does not support training with mix-resolution (values in the brackets are results obtained with fp32).]

Additional Related Studies

We thank the reviewer for bringing these inspiring studies to our attention. We will include references to these papers in the revised version.

Versatility of VMamba

Due to our limited computational resources, we have focused on conducting experiments on benchmark tasks in vision modeling. However, we recognize the importance of illustrating the potential of the proposed method in more generalized tasks.

A preliminary literature review of recently proposed SSM-based approaches in vision tasks, along with our private communications with researchers in the field, highlights the potential of the 2D selective scan technique (SS2D) introduced in this study. SS2D does not make specific assumptions about the layout or modality of the input data, which allows it to be generalized to various tasks. For example, SS2D can process video data by traversing a spatial-temporal plane of frame patches. To our knowledge, recent studies leveraging scanning patterns analogous to SS2D have shown success in various tasks, including image restoration and multimodal data understanding, in addition to those mentioned in the question. We will add these results to the final version and cite their works if they are published by then.
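As a purely illustrative example of the video case mentioned above (tensor shapes and traversal order are assumptions, not results from the paper), the patch grid of a clip can be flattened into spatio-temporal sequences amenable to the same 1D scanning machinery:

```python
import torch

# Hypothetical video feature tensor: batch, channels, frames, height, width.
x = torch.randn(2, 96, 8, 14, 14)        # (B, C, T, H, W)

# One possible spatio-temporal traversal: frame by frame in row-major order,
# plus its reversal, giving two 1D sequences of length T*H*W to scan.
forward_seq = x.flatten(2)               # (B, C, T*H*W)
backward_seq = forward_seq.flip(-1)
print(forward_seq.shape, backward_seq.shape)   # torch.Size([2, 96, 1568]) x2
```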

Despite not being inherently prohibited, we anticipate challenges in directly migrating SS2D to diverse downstream tasks due to varying requirements. Bridging the gap between SS2D and these tasks, along with proposing a more generalized scanning pattern for vision tasks, is a promising research direction. We will include this discussion in the revised version, hopefully to provide readers with some inspiration.

Clarity in Mathematical Derivations

Due to limited space, we have included detailed proofs in the appendix and will provide more rigorous and clearer derivations in the revised version. We also recognize the significance of providing more intuitive and accessible explanations, and will include them in the revised version.

Sensitivity to Hyper-parameters

According to our experience, we have not found any hyperparameter to which VMamba is particularly sensitive. This observation is also supported by the ablation results on single hyper-parameters (initialization approach in Table 11 and activation function in Table 15) as well as different combinations (Tables 12, 13, and 14) included in the Appendix.

We conducted additional experiments on the influence of the learning rate, and the results are reported in the following table. We will include this discussion in the revised version.

| Model | Params (M) | FLOPs (G) | lr | Top-1 (%) |
|---|---|---|---|---|
| VMamba-Tiny | 30M | 4.91G | 2e-3 | 82.70 |
| VMamba-Tiny \dagger | 30M | 4.91G | 1e-3 | 82.62 |
| VMamba-Tiny | 30M | 4.91G | 5e-4 | 82.16 |

[The performance of VMamba-T with different learning rates. Results marked by \dagger correspond to the default setting used in the submission. All models here are trained on [SERVER 2].]

Potential of Integrating into Various Frameworks

The core of VMamba lies in the design of the SS2D module, which aims to bridge the gap between 1D sequence scanning and 2D plane traversing, rather than specific architectural configurations. SS2D can function as an end-to-end token mixer, allowing it to be integrated into various mainstream backbone networks in computer vision.

Indeed, integrating SS2D into existing frameworks requires additional considerations. One critical aspect is the numerical precision settings in the model, which significantly impact performance and computational speed. Another important factor is the inclusion of normalization layers to stabilize the training process. We will include these points in the revised version to assist researchers who may want to build upon our work.
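For illustration, here is a minimal sketch of the integration pattern described above, with an SS2D-style module used as a drop-in token mixer inside a pre-norm residual block. The class name, dimensions, and MLP sub-block are assumptions rather than the authors' exact configuration, and the mixer itself is treated as a black box.

```python
import torch
import torch.nn as nn

class VSSStyleBlock(nn.Module):
    """Generic pre-norm residual block with an SS2D-like token mixer.
    `token_mixer` is any module mapping (B, H, W, C) -> (B, H, W, C)."""
    def __init__(self, dim, token_mixer, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)      # normalization to stabilize training
        self.mixer = token_mixer
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                   # x: (B, H, W, C)
        x = x + self.mixer(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

# Usage with a placeholder mixer (identity) just to show the interface.
block = VSSStyleBlock(dim=96, token_mixer=nn.Identity())
out = block(torch.randn(2, 14, 14, 96))    # -> (2, 14, 14, 96)
```

Any module mapping (B, H, W, C) to (B, H, W, C) can be swapped in as the token mixer, which is what makes this pattern straightforward to graft onto existing backbone networks.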

Comment
  1. The authors provided a detailed comparison with SSM-based methods, supported by experiments.
  2. They will include the works of Zubić et al. and Schöne et al. in the related work section.
  3. The authors said they will discuss the more generalized scanning pattern for vision tasks as future work, which is very interesting.
  4. The authors "have not found any hyperparameter to which VMamba is particularly sensitive".
  5. The model is quite robust to changes in hyperparameters; they conducted experiments, for example on learning rates, which is great.

Given that the authors have addressed all my concerns with clear and effective experimental evidence, I am updating my score from Accept (7) to Strong Accept (8).

Official Review (Rating: 8)

This paper transplants Mamba (the selective state space model), a linear-complexity model originally designed for 1D language processing, into the vision domain as VMamba to process image data. It introduces the 2D selective scan and various acceleration techniques to facilitate the modeling of 2D data and enhance the speed of the network. The proposed VMamba model is trained and evaluated on a number of representative downstream tasks, including ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation, and it is compared with strong baselines. A range of analyses and visualizations on the theoretical perspectives, design choices, and behavior of the model are also presented.

Strengths

  1. VMamba is one of the first papers to attempt using Mamba, one of the most efficient and performant linear complexity models to date, to learn visual data and demonstrate effectiveness.
  2. The paper proposes a series of innovations to adapt the original Mamba's 1D sequential scanning to process 2D image data (SS2D) and increase the model's processing speed (image throughput) without compromising performance.
  3. In-depth deductions, comprehensive analyses, experiments, and visualizations on design choices, theoretical aspects (e.g., the relationship between SSM and Self-Attention), and model behaviors have been presented, carrying a huge volume of insightful findings that are valuable for future research.
  4. The proposed VMamba is evaluated on representative downstream tasks, including ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation. It shows comparable or better (and consistent) results relative to strong baselines (e.g., Swin, DeiT, and the concurrent Vim) and superior efficiency.
  5. As a general and simple visual model, the proposed VMamba potentially carries huge extension and generalization potential, which could inspire and impact a wide range of visual research.

Weaknesses

I didn’t find any critical weakness in this paper. Apart from some limitations that have already been mentioned by the authors, such as large-scale experiments, training strategies, and hyperparameter search, the only area where I hope the paper can show more results is the ablation of certain design choices. For instance, the performance change from removing the entire multiplicative branch: Table 5 does not show a direct ablation because more than one variable changes at a time. This problem also exists in some other tables for other design choices and hyperparameters. But again, I think neither of these weaknesses is critical.

Questions

Could the authors explain more on the statement in Lines 153-154, “such modification prevents the weights from being input-independent, resulting in a limited capacity for capturing contextual information”?

Others: Repetitive reference entries: [50] and [51], [50] and [60].

Limitations

The authors adequately discuss the limitations and potential societal impact of this work. This paper also points out several potential improvements and future directions, with which I highly agree.

Author Response

Response to Reviewer ZfkW

We appreciate the reviewer’s thoughtful review and positive comments about our study. In the following sections, we address the reviewer’s primary concern regarding the lack of ablation on design choices and clarify several other issues raised.

More Ablation on Design Choices

First of all, we would like to clarify that our primary reason for modifying multiple hyper-parameters simultaneously is to ensure that the number of parameters and FLOPs remain comparable, facilitating a fair comparison between different model variants. In Table 5 (corresponding to Figure 3 (e) in Section 4.3 of the main submission), we detail the configurations used to optimize the overall performance of VMamba, balancing both effectiveness and efficiency rather than isolating the impact of each hyperparameter.

However, we sincerely acknowledge the importance of analyzing the significance of each individual hyperparameter and architectural design choice for the overall performance. We plan to conduct more comprehensive experiments isolating each hyperparameter in Table 5 and to include those results in future versions of this study.

To address the issue mentioned in the reviewer's comment, we have conducted additional experiments to analyze the influence of changing a single variable. The results are reported in the following table. Values for Step (e.1) and Step (e.2) are copied from Table 12 and Table 14 in the appendix, respectively, while Step (d.1) and Step (d.2) present new results obtained during the rebuttal process.

| Model | d_state | ssm_ratio | DWConv | Multiplicative Branch | Layers | FFN | Params (M) | FLOPs (G) | TP. (img/s) | Train TP. (img/s) | Top-1 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla-VMamba-T | 16 | 2.0 | True | True | [2,2,9,2] | False | 22.9M | 5.63G | 426 | 138 | 82.17 |
| Step(a) | 16 | 2.0 | True | True | [2,2,9,2] | False | 22.9M | 5.63G | 467 | 165 | 82.17 |
| Step(b) | 16 | 2.0 | True | True | [2,2,9,2] | False | 22.9M | 5.63G | 464 | 184 | 82.17 |
| Step(c) | 16 | 2.0 | True | True | [2,2,9,2] | False | 22.9M | 5.63G | 638 | 195 | 82.17 |
| Step(d) | 16 | 2.0 | False | True | [2,2,2,2] | True | 29.0M | 5.63G | 813 | 248 | 81.65 |
| Step(d.1) | 16 | 1.0 | False | True | [2,2,2,2] | True | 22.9M | 4.02G | 1336 \dagger | 405 \dagger | 81.05 \ddagger |
| Step(d.2) | 16 | 1.0 | False | True | [2,2,5,2] | True | 28.2M | 5.18G | 1137 \dagger | 348 \dagger | 82.24 \ddagger |
| Step(e) | 16 | 1.0 | False | False | [2,2,5,2] | True | 26.2M | 4.86G | 1179 | 360 | 82.17 |
| Step(e.1) | 16 | 1.0 | True | False | [2,2,5,2] | True | 26.3M | 4.87G | 1164 | 358 | 82.31 |
| Step(e.2) | 1 | 1.0 | True | False | [2,2,5,2] | True | 25.6M | 3.98G | 1942 | 647 | 81.87 |
| Step(f) | 1 | 2.0 | True | False | [2,2,5,2] | True | 30.7M | 4.86G | 1340 | 464 | 82.49 |
| Step(g) | 1 | 1.0 | True | False | [2,2,8,2] | True | 30.2M | 4.91G | 1686 | 571 | 82.60 |

Details of accelerating VMamba. \dagger and \ddagger indicate the value is obtained from [SERVER 1] and [SERVER 2], respectively. All other experiments are conducted on [SERVER 0].

Clarification of the Statement

There is a typo in the mentioned statement, and the correct version is "such modification prevents the weights from being input-dependent, resulting in a limited capacity for capturing contextual information" (i.e., change from "input-independent" to "input-dependent"). We will fix this typo and conduct thorough proofreading to prevent further errors in the revised version.

Detailed Explanation of the Referred Statement. S4ND [2] extends S4 [1] to higher-dimensional contexts through a straightforward outer product, with the essential condition being that the SSM in S4 is implemented using 'accelerated convolution'. Specifically, S4 utilizes a global convolutional operation to compute the output of the SSM, denoted as \mathbf{y}, given the input data \mathbf{u} and the kernel function \mathbf{K} = \mathbf{C} e^{\mathbf{A}\Delta} \mathbf{B}.

Efficient computation is achieved if \mathbf{A} has a 'Normal Plus Low-Rank' (NPLR) form and \Delta is constant, enabling the low-rank approximation of \mathbf{K} in the spectral domain and allowing the convolution to be computed efficiently with the Fast Fourier Transform (FFT) and its inverse (IFFT). Conversely, if \Delta is input-dependent or context-aware, the kernel function no longer maintains a low-rank form in the spectral domain, leading to a substantial increase in the convolution computation time.
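As a minimal numerical sketch of this convolutional mode (scalar state, Euler-style discretization, all values hypothetical): with a fixed \Delta the kernel can be materialized once and the SSM output computed as a single FFT-based global convolution, matching the step-by-step recurrence.

```python
import numpy as np

L = 16
u = np.random.randn(L)                        # input sequence
A, B, C, delta = -0.5, 1.0, 1.0, 0.1          # fixed (input-independent) SSM parameters

# Discretized kernel K[k] = C * exp(A*delta)^k * (delta*B), truncated to length L.
Abar = np.exp(A * delta)
K = C * (Abar ** np.arange(L)) * (delta * B)

# Global causal convolution y = K * u via FFT (zero-padded to avoid wrap-around).
n = 2 * L
y = np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(K, n), n)[:L]

# Reference: the direct recurrence gives the same result.
h, y_ref = 0.0, []
for u_t in u:
    h = Abar * h + delta * B * u_t
    y_ref.append(C * h)
assert np.allclose(y, np.array(y_ref))
```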

The efficacy of a recurrent model is significantly limited by its capacity to effectively compress context [3]. By leveraging the task of selective copying and employing Induction heads, Mamba [3] illustrates that LTI models lack content awareness. Consequently, it is concluded that a fundamental principle in developing sequence models is selectivity: the context-aware capability to emphasize or disregard specific inputs within a sequential state.
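By contrast, once \Delta depends on the input, no fixed kernel exists and the model must be evaluated as a recurrence (scan). A toy scalar sketch of this selective update follows; the parameter values and the softplus parameterization are illustrative assumptions only.

```python
import numpy as np

L, rng = 16, np.random.default_rng(0)
u = rng.standard_normal(L)
A, B, C = -0.5, 1.0, 1.0

def delta(u_t):
    # Input-dependent step size: the model can "choose" how strongly to absorb each token.
    return np.log1p(np.exp(0.2 * u_t + 0.1))   # softplus keeps delta positive

h, y = 0.0, []
for u_t in u:
    d = delta(u_t)                             # depends on the current input
    h = np.exp(A * d) * h + d * B * u_t        # content-aware state update
    y.append(C * h)
```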

Repetitive Reference Entries

We will address the issues mentioned in the comment and conduct thorough proofreading to prevent further errors in the revised version.

Comment

Thanks to the authors for providing a thorough and solid response to all my concerns. Based on all the reviewers' comments and the rebuttal, I am happy to keep my rating as a Strong Accept (8).

Official Review (Rating: 7)

Summary

This paper proposes VMamba, which adopts the recently proposed selective linear state space model, Mamba, in the domain of computer vision. The paper evaluates variants of VMamba on tasks such as image classification, object detection, and semantic segmentation. To improve performance and efficiency, VMamba incorporates several architectural and implementation enhancements.


post-rebuttal: score 6 -> 7

Strengths

  • The writing is simple and clear, quite accessible to readers.
  • After implementing the enhancements, VMamba achieves good performance: it is computationally efficient and performs well quantitatively.
  • Additional analyses, such as the effective receptive field and the relationship between attention and the state-space updates, are insightful.

Weaknesses

  • As an architecture exploration paper, I don’t see many weaknesses.

Questions

  1. Is positional embedding used when encoding the patches? Apologies if this is already stated somewhere in the paper.
  2. If I understand correctly, Figure 3 for section 4.3 shows that performance improved with smaller d_state and expand ratio. This is quite surprising since one might expect degrading performance when network capacity is reduced. Could you provide any insights into this phenomenon?

Limitations

Yes, the authors adequately addressed the limitations.

Author Response

Response to Reviewer dvTH

We thank the reviewer for the constructive comments and are glad they appreciate the performance of VMamba. Below, we clarify the reviewer’s concerns regarding the detailed structure and the influence of hyper-parameters on VMamba.

Usage of Positional Embedding

To clarify, VMamba does not use positional embedding. Sorry for any confusion caused, and we will make this clear in Section 4.1 Network Architecture (lines 129-136) of the revised version as follows:

"Subsequently, multiple network stages are employed to create hierarchical representations" \rightarrow "Without further incorporating positional embedding, multiple network stages are employed to create hierarchical representations."

Explanation of Performance Improvement

In step (e) shown in Figure 3 for Section 4.3, we manage to save parameters and FLOPs by reducing the expansion ratio and eliminating the entire multiplicative branch. This allows us to increase the number of layers from [2,2,2,2] to [2,2,5,2], resulting in the observed performance improvement. Similarly, in step (g), lowering the expansion ratio enables us to increase the depth of the model with additional layers. For step (f), the performance improvement is due to the larger expansion ratio and the addition of extra DWConv blocks. By using a smaller d_state value, we keep parameters and FLOPs comparable. We will provide more details on these points in the revised version.

Influence of d_State. In Section H.3 of the Appendix, we explore the impact of adjusting the d_state parameter on VMamba. Table 12 shows that increasing d_state from 1 to 4 yields only marginal performance gains while significantly reducing throughput, indicating a substantial negative impact on VMamba's computational efficiency. To mitigate this, we propose lowering the ssm_ratio parameter to reduce overall network complexity. We find the best performance at (d_state=8, ssm_ratio=1.5).

Influence of ssm_ratio. We also analyze VMamba's sensitivity to the ssm_ratio parameter, with results presented in Table 13 of Appendix H.4. The results clearly indicate that lowering the ssm_ratio significantly reduces performance but also greatly increases the inference speed. On the other hand, adding more layers boosts performance but also decelerates the model.

Comment

Thank you for the rebuttal. I will raise my score; this is a good paper.

Author Response

Response to all

We thank the reviewers for their thoughtful reviews and constructive suggestions. We’re glad that the reviewers recognized the innovation and influence of the proposed 2D-Selective-Scan (SS2D) module, as well as the extensive experiments and thorough analysis supporting VMamba. In the following, we provide a shared response to common concerns raised by the reviewers, and also include a PDF file (referred to as the attachment) with additional experimental results to support our discussion. Additional results are included in the attachment (figures) as well as in the separate responses to each reviewer (tables).

Ablation Study on Hyper-parameters

All reviewers have raised concerns regarding the influence of hyper-parameters. Due to the mismatch between our limited computational resources and the extensive range of design choices, we did not initially conduct a comprehensive ablation study on all hyper-parameters, focusing instead on a subset included in the appendix. As suggested by the reviewers, we have now conducted additional experiments on this topic.

Comparison with SSM-based Models

Another focus of the reviewers is the need for a more in-depth comparison between VMamba and SSM-based models, such as S4ND [2] and Vim [4]. We recognize the importance of these comparisons and have conducted additional experiments as suggested. The results include comparisons of FLOPs, visualizations of the Effective Receptive Fields (ERFs), and analyses of the changes in performance (both effectiveness and efficiency) with increasing input resolution.

Statement on Experiment Platforms

Please note that there are slight differences between the platforms we used for the original study and this rebuttal.

| Usage | CPU | GPU | Notation |
|---|---|---|---|
| Original Work | AMD EPYC 7542 | 8 × Tesla A100 GPUs | [SERVER 0] |
| Rebuttal (Testing) | Intel Xeon Platinum 8358 | Tesla A800 GPUs | [SERVER 1] |
| Rebuttal (Training) | Intel Xeon Platinum 8480C | 8 × Tesla H100 GPUs | [SERVER 2] |

We investigate the influence of computational platforms on evaluation results as follows. For [SERVER 0] and [SERVER 1], we test the generalizability to inputs with increased spatial resolutions, and the results are shown in the following table. Both training and inference throughput values are measured with a batch size of 32 using PyTorch 2.2. The training throughput calculations include only the model forward pass, loss forward pass, and backward pass.

| Model | Image Size | Params (M) | FLOPs (G) | [SERVER 0] TP. (img/s) | [SERVER 0] Train TP. (img/s) | [SERVER 1] TP. (img/s) | [SERVER 1] Train TP. (img/s) |
|---|---|---|---|---|---|---|---|
| VMamba-Tiny | 224^2 | 30M | 4.91G | 1490 | 418 | 1463 | 453 |
| VMamba-Tiny | 288^2 | 30M | 8.11G | 947 | 303 | 952 | 305 |
| VMamba-Tiny | 384^2 | 30M | 14.41G | 566 | 187 | 563 | 187 |
| VMamba-Tiny | 512^2 | 30M | 25.63G | 340 | 121 | 339 | 120 |
| VMamba-Tiny | 640^2 | 30M | 40.04G | 214 | 75 | 216 | 75 |
| VMamba-Tiny | 768^2 | 30M | 57.66G | 149 | 53 | 149 | 53 |

We also compare the differences between VMamba-T trained on [SERVER 0] and [SERVER 2] in the following table.

| Model | Params (M) | FLOPs (G) | LR | Top-1 (%) |
|---|---|---|---|---|
| VMamba-Tiny [SERVER 0] | 30M | 4.91G | 1e-3 | 82.60 |
| VMamba-Tiny [SERVER 2] | 30M | 4.91G | 1e-3 | 82.62 |

According to the results shown in the above two tables, there is only a subtle difference between the results obtained on [SERVER 0] and [SERVER 1]/[SERVER 2]. Therefore, we disregard the influence of computational platforms and will include results obtained with consistent machines in the revised version.
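For reference, the following is a minimal sketch of a throughput measurement consistent with the protocol described above (batch size 32; training throughput covers only the model forward, loss forward, and backward passes). It is an illustrative benchmark loop, not the authors' script, and the placeholder model is hypothetical.

```python
import time
import torch
import torch.nn as nn

def throughput(model, batch, steps=50, train=False):
    """Images per second; with train=True, each step runs model forward,
    a dummy loss forward, and backward (no optimizer step)."""
    model.train(train)
    ctx = torch.enable_grad() if train else torch.no_grad()
    with ctx:
        for i in range(steps + 5):            # first 5 iterations are warm-up
            if i == 5:
                torch.cuda.synchronize()
                start = time.time()
            out = model(batch)
            if train:
                out.float().mean().backward()
                model.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    return steps * batch.shape[0] / (time.time() - start)

# Hypothetical usage with a placeholder model standing in for VMamba-T.
model = nn.Sequential(nn.Conv2d(3, 96, 4, 4), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(96, 1000)).cuda()
images = torch.randn(32, 3, 224, 224, device="cuda")
print(f"inference: {throughput(model, images):.0f} img/s")
print(f"training:  {throughput(model, images, train=True):.0f} img/s")
```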

Citations:

[1] Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In ICLR, 2021.

[2] Eric Nguyen, Karan Goel, Albert Gu, Gordon Downs, Preey Shah, Tri Dao, Stephen Baccus, and Christopher Ré. S4nd: Modeling images and videos as multidimensional signals with state spaces. NeurIPS, 35:2846–2861, 2022.

[3] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

[4] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. In ICML, 2024.

[5] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 10012–10022, 2021.
Final Decision

This paper introduces VMamba, a new vision backbone model inspired by the Mamba state-space model, designed for efficient visual representation learning with linear computational complexity. The key innovation is the VSS block, which incorporates a 2D-Selective-Scan (SS2D) module, adapting the 1D selective scan of Mamba for 2D image data. Experiments show that VMamba outperforms existing models in image classification, object detection, instance segmentation, and semantic segmentation, demonstrating superior accuracy, scalability, and adaptability. All reviewers are satisfied with the authors' responses.