PointMamba: A Simple State Space Model for Point Cloud Analysis
Abstract
Reviews and Discussion
This paper proposes a simple but effective Mamba-based method named PointMamba for point cloud analysis. It is the first paper to study a Mamba-based method for point clouds. The experiments are comprehensive, and the paper is in very good shape.
Strengths
- This paper demonstrates excellent writing quality, with a clear and evident motivation.
- Figure 1 is highly comprehensible, particularly in its comparisons with Point-MAE.
Weaknesses
- (No need for experiments) The use of validation datasets is prevalent in the field, with consistently high metrics reported. I encourage the inclusion of indoor segmentation and detection tasks, as exemplified in Point-M2AE and MaskPoint, in future iterations of your work. Additionally, tackling the classification task on Objaverse-LVIS appears to be a more demanding and stimulating challenge.
- To be honest, there are numerous papers that adhere to the evaluation paradigm established by Point-BERT. However, I am genuinely interested in exploring novel discoveries in self-supervised learning specifically applied to point clouds.
Questions
- It has been experimentally determined that Point-MAE, Point-M2AE, and MaskPoint do not accurately replicate their reported performance, as observed in this study. It is important to investigate whether PointMamba exhibits consistent stability across all of these tasks. If not, providing mean and standard deviation values similar to those presented in Table 3 is crucial.
- Additionally, it has been observed that conducting experiments with a larger number of points using Mamba results in an increase in training time. Have you encountered this phenomenon in your own experiment? If so, kindly elaborate on possible solutions to enhance its suitability for real-world applications, such as Auto-Driving.
- Could you please elaborate on the reasons why incorporating Hilbert and Trans-Hilbert techniques in your current version has led to improved performance compared to the reordering strategy implemented in the initial version? Moreover, Hilbert and Trans-Hilbert techniques seem to borrow from Point Transformer. If so, the main contribution lies in integrating Mamba into Point-MAE.
- (No need for experiments, just discuss.) There are several Mamba-based methods after this paper. Please just discuss the advantages and disadvantages compared with them.
Limitations
- The authors have addressed the limitations and potential negative societal impact of their work.
- To question 4: “…several Mamba-based methods after this paper…”
Reply: Good suggestion! Indeed, several Mamba-based papers have appeared after ours; we list comprehensive comparisons as follows:
PCM [1] combines Vision Mamba with PointMLP and incorporates consistent traverse serialization at each stage. To enhance Mamba’s capability in managing point sequences with varying orders, PCM introduces point prompts that convey the sequence’s arrangement rules.
Point Mamba [2] uses an octree-based ordering scheme and combines Mamba with PCT and OctFormer as their baseline. Mamba blocks with bi-directional scanning extract hierarchical point features, and a Feature Pyramid Network (FPN) is utilized for classification or segmentation tasks.
Mamba3D [3] introduces an enhanced Vision Mamba block, which includes both a token forward SSM and a backward SSM that operates on the feature channel. It proposes a Local Norm Pooling block to extract local geometric features.
PoinTramba [4] introduces a hybrid approach that integrates Transformers and Mamba. It segments point clouds into groups and utilizes Transformers to capture intra-group dependencies, while Mamba models inter-group relationships using a bi-directional, importance-aware ordering strategy.
Note that some methods (e.g., [4]) are clearly based on our code or baseline, which proves the value of our approach to the community. We will add these discussions in the revised version.
[1] Point Cloud Mamba: Point Cloud Learning via State Space Model
[2] Point Mamba: A Novel Point Cloud Backbone Based on State Space Model with Octree-Based Ordering Strategy
[3] Mamba3D: Enhancing Local Features for 3D Point Cloud Analysis via State Space Model
[4] PoinTramba: A Hybrid Transformer-Mamba Framework for Point Cloud Analysis
- To weakness 1: “(No need for experiments)…encourage the inclusion of indoor segmentation and detection tasks…”
Reply: Thanks for this very constructive suggestion, and we will explore a unified Mamba-based foundation model for various 3D vision tasks (including indoor segmentation, detection tasks, and classification on Objaverse-LVIS) in the future. We also appreciate the reviewer's understanding that finishing such experiments in a limited time is difficult.
- To weakness 2: “…Point-BERT…genuinely interested in exploring novel discoveries in self-supervised learning…”
Reply: We totally agree with this insightful point. Actually, this paper also explores a new self-supervised pre-training paradigm named serialization-based mask modeling (Sec. 4.2). Specifically, unlike previous methods (e.g., Point-BERT) that use random masking, we account for the unidirectional modeling of Mamba and propose to randomly choose one space-filling curve to generate the serialized point tokens for mask modeling. In addition, the proposed extremely simple order indicator effectively maintains the distinct characteristics of these different scanning strategies. As a starting point for this area, we hope this paper can provide useful insight for the community and encourage researchers to focus on the potential of Mamba in 3D vision tasks.
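As a rough illustration of this paradigm, the Python sketch below randomly picks one pre-computed space-filling-curve ordering per iteration, serializes the point tokens with it, and then samples a random mask; all names, shapes, and the mask ratio are illustrative assumptions rather than the paper's exact code.

```python
import random
import torch

def serialization_based_masking(tokens, curve_orders, mask_ratio=0.6):
    """Illustrative sketch. `tokens` is (B, N, C); `curve_orders` maps a curve
    name (e.g., 'hilbert', 'trans_hilbert') to a length-N index tensor that
    sorts the point patches along that space-filling curve."""
    # Randomly choose one scanning curve for this pre-training iteration.
    curve = random.choice(list(curve_orders.keys()))
    order = curve_orders[curve]
    serialized = tokens[:, order]                  # reorder tokens along the chosen curve

    # Mask a fixed ratio of the serialized tokens.
    num_tokens = serialized.shape[1]
    num_mask = int(mask_ratio * num_tokens)
    perm = torch.randperm(num_tokens)
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    mask[perm[:num_mask]] = True
    return serialized, mask, curve                 # `curve` selects the matching order indicator
```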
- To question 1: “… consistent stability across all of these tasks…providing mean and standard deviation…”
Reply: Thanks! We promise that all code will be released and the results can be easily reproduced. Following your suggestion, we report the results of three runs on all datasets, as shown below. We can see that the proposed method achieves consistent stability across different tasks/datasets.
| Method | OBJ-BG | OBJ-ONLY | PB-T50-RS | ModelNet40 | ShapeNetPart |
|---|---|---|---|---|---|
| PointGPT (NeurIPS 23) | 93.39 | 92.43 | 89.17 | 93.3 | 86.2 |
| PointMamba (Ours) | 3 runs: [93.76, 93.98, 94.32] | 3 runs: [92.31, 92.60, 92.47] | 3 runs: [89.44, 89.31, 89.25] | 3 runs: [93.6, 93.6, 93.5] | 3 runs: [86.18, 86.12, 86.16] |
- To question 2: “… a larger number of points…training time…possible solutions…real-world applications…”
Reply: Thanks! Following your suggestion, we evaluate the training time (seconds/epoch) with different input lengths (from 128 to 2048) and a batch size of 16, as shown below. We find that as the number of point tokens increases, the advantage of our approach over Point-MAE becomes even more pronounced, further indicating the efficiency of our PointMamba. In addition, to apply PointMamba to autonomous driving, a possible direction is to design an efficient voxel-based tokenizer that can effectively transform a large number of points into a long sequence of point/voxel tokens, which is promising future work.
| Sequence length | 128 | 256 | 512 | 1024 | 2048 |
|---|---|---|---|---|---|
| Point-MAE | 97.24 s/epoch | 128.99 s/epoch | 194.56 s/epoch | 351.64 s/epoch | 797.78 s/epoch |
| PointMamba | 89.18 s/epoch | 112.57 s/epoch | 150.55 s/epoch | 225.56 s/epoch | 374.89 s/epoch |
- To question 3: “…improved performance compared to…the initial version? …borrow from Point Transformer…the main contribution lies in integrating Mamba into Point-MAE”
Reply: Thanks! In fact, the improved performance over the initial version does not come from simply replacing the reordering strategy, but from a set of newly proposed designs. Specifically, compared with the initial version, we make the following new contributions: 1) The proposed point scanning strategy transforms the unstructured 3D point clouds into a regular sequence, providing diverse perspectives on spatial locality via scans along different curves; 2) The order indicator maintains the distinct spatial characteristics of different scans during training, preserving the integrity of the spatial information, which is crucial for unstructured point clouds; 3) Serialization-based mask modeling randomly chooses one space-filling curve for masking, allowing the model to extract general local relationships from different scanning clues, better matching the unidirectional modeling of Mamba.
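To make the order indicator concrete, below is a minimal PyTorch sketch that assumes the indicator is a learnable per-curve embedding added to the tokens of the corresponding serialized sequence; the module name, shapes, and the additive design are our illustrative assumptions, not necessarily the exact implementation in the paper.

```python
import torch
import torch.nn as nn

class OrderIndicator(nn.Module):
    """Illustrative order indicator: one learnable embedding per scanning curve,
    added to the tokens of the corresponding serialized sequence so the encoder
    can tell which space-filling curve produced them (exact design may differ)."""
    def __init__(self, num_curves: int = 2, dim: int = 384):
        super().__init__()
        self.indicators = nn.Parameter(torch.randn(num_curves, dim) * 0.02)

    def forward(self, token_groups):
        # token_groups: list of (B, N, dim) tensors, one per scanning curve
        tagged = [tokens + self.indicators[i] for i, tokens in enumerate(token_groups)]
        return torch.cat(tagged, dim=1)  # concatenate into one long serialized sequence

# Usage with hypothetical Hilbert / Trans-Hilbert token groups
hilbert_tokens = torch.randn(2, 64, 384)
trans_hilbert_tokens = torch.randn(2, 64, 384)
seq = OrderIndicator()([hilbert_tokens, trans_hilbert_tokens])  # (2, 128, 384)
```

In this sketch, the indicator is what allows the encoder to distinguish Hilbert-ordered tokens from Trans-Hilbert-ordered tokens after concatenation.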
Furthermore, we would like to clarify that our method is substantially different from Point-MAE: 1) For the processing steps, Point-MAE directly generates random tokens via a lightweight PointNet. In contrast, we propose a point scanning strategy that leverages space-filling curves to transform the unstructured point clouds into a regular sequence. Although Point Transformer also adopts space-filling curves, the motivation and objective are different: Point Transformer utilizes space-filling curves to partition the point cloud and capture spatial contexts, whereas our work focuses on transforming the point clouds into serialized sequences and combining them with Mamba for global modeling. 2) For feature extraction, Point-MAE employs a vanilla Transformer, while we merge the vanilla Mamba with the proposed order indicator, enabling Mamba to preserve the integrity of the spatial information from different scanning curves; 3) For pre-training, Point-MAE follows the traditional random mask modeling paradigm, whereas our PointMamba proposes serialization-based mask modeling combined with the proposed order indicator; 4) We also provide detailed theoretical analyses of why PointMamba works well on point cloud tasks, which Point-MAE appears to lack.
Overall, the current version makes substantive improvements compared with Point-MAE and our initial version. Extensive experiments also validate the effectiveness of our method (See Fig.1 of the manuscript).
For question 4, please see the reply presented in the “comment” part.
Thanks for the authors' responses, most of my questions have been solved. Therefore, I will keep my rating as positive. I look forward to your future work exploring PointMamba on more challenging datasets.
Thank you for your recognition of our responses and for maintaining a positive rating. We value your suggestion and will incorporate this into our future research plans.
In summary, PointMamba, an innovative state space model tailored for point cloud analysis, successfully harnesses the global modeling prowess of Mamba, a representative SSM from the NLP domain. By adopting a linear complexity algorithm, PointMamba addresses the computational challenges posed by traditional Transformer-based methods while maintaining their global modeling capabilities. Its key techniques include utilizing space-filling curves for effective point tokenization and employing a non-hierarchical Mamba encoder as the foundation. Extensive experiments across multiple datasets validate the superior performance of PointMamba while significantly reducing GPU memory usage and FLOPs. This work not only demonstrates the vast potential of SSMs in 3D vision tasks but also provides a simple yet effective baseline for future research in this domain.
Strengths
- Linear Complexity with Global Modeling: PointMamba leverages state space models to achieve linear complexity while maintaining global modeling capabilities, overcoming the computational challenges of traditional Transformers.
- Efficient Point Cloud Representation: The use of space-filling curves for point tokenization enables efficient representation of point clouds, capturing spatial structure while facilitating global feature extraction.
- Simple and Effective Mamba Encoder: The non-hierarchical Mamba encoder provides a simple yet powerful backbone for PointMamba, enabling fast and accurate global feature modeling.
- Superior Performance: Comprehensive evaluations show that PointMamba achieves state-of-the-art performance across multiple datasets, demonstrating its effectiveness for point cloud analysis tasks.
Weaknesses
- Although PointMamba borrows structurally from Mamba, it may not take full advantage of the unique characteristics of point cloud data, such as spatial distribution, density variations, and local geometry.
- While inheriting the strengths of Mamba, PointMamba may also have inherited some limitations of a design intended for natural language processing, which may not suit point cloud analysis, such as the lack of preprocessing steps, feature extraction methods, or post-processing techniques specific to point cloud data. The feature extraction and processing steps in the paper are largely similar to those of the previous Point-MAE.
- Although PointMamba achieves global modeling with linear complexity by introducing state space models, there may be trade-offs between model complexity and performance in real-world applications, because, according to the paper's experimental results, some settings that use PointMamba actually perform worse.
- According to the ablation experiments in Table 5, the serialization operations Hilbert and Trans-Hilbert have the most obvious impact on the experimental results. Meanwhile, the core framework diagram of the paper, i.e., Figure 4, shows that the only change is the Hilbert ordering, taken directly from the cited literature [27], with almost no additional description or setting in the paper.
Questions
- Based on weakness 4, is it possible that PointMamba actually only works well under Hilbert ordering? This very much affects the final results and conclusions of the paper.
- Intuitively, PointMamba seems to be just a replacement of the previous Transformer block with the Mamba block (Equation 4), and even if this results in good performance (Figure 1), I am not convinced. The paper is theoretically and experimentally fleshed out, but I still do not find an obvious innovative contribution. Therefore, I would like to confirm whether the authors have designed a specialized Mamba technique for 3D point clouds.
Limitations
Content-wise, the paper has no obvious limitations, and its core technology builds on Mamba, which is popular in other fields. Formatting-wise, the paper shows the Mamba icon several times; please confirm whether this representation is appropriate in an academic paper.
- To weakness 1&2: “… may not take full advantage of the unique characteristics of point cloud data…/…may not be applicable to point cloud…are largely similar to…Point-MAE.”
Reply: Thanks. We respect the reviewer's opinion. However, we believe this paper does consider the characteristics of point clouds. Specifically, through a set of pilot experiments, we first demonstrate that simply replacing the Transformer with Mamba cannot achieve ideal performance due to the unidirectional modeling. Based on this, we customize key innovative designs to handle point cloud data: 1) The proposed point scanning strategy transforms the unstructured 3D point clouds into a regular sequence, providing diverse perspectives on spatial locality via scans along different curves; 2) The order indicator maintains the distinct spatial characteristics of different scans during training, preserving the integrity of the spatial information, which is crucial for unstructured point clouds; 3) Serialization-based mask modeling randomly chooses one space-filling curve for masking, allowing the model to extract general local relationships from different scanning perspectives, better matching the unidirectional modeling of Mamba. Thanks to these simple yet innovative key designs, our PointMamba works well and outperforms its Transformer-based counterparts on point cloud tasks. We will make these technical contributions clearer in the revised version.
Besides, compared with Point-MAE, the differences are also distinct: 1) For the processing steps, Point-MAE directly generates random tokens via a lightweight PointNet. In contrast, we propose a point scanning strategy that leverages space-filling curves to transform the unstructured point clouds into a regular sequence; 2) For feature extraction, Point-MAE employs a vanilla Transformer. In contrast, we merge the vanilla Mamba with the proposed order indicator, enabling Mamba to preserve the integrity of the spatial information from different scanning curves; 3) For pre-training, Point-MAE follows the traditional random mask modeling paradigm, while our PointMamba proposes serialization-based mask modeling, allowing the model to extract general local relationships from different scanning curves via the proposed order indicator; 4) We also provide detailed theoretical analyses of why PointMamba works well on point cloud tasks, which Point-MAE appears to lack.
Overall, although our method presents a simple pipeline, the processing steps, feature extraction, pre-training, etc., are significantly different from previous methods. As R2 said, “…the proposed method is simple, elegant, and effective, establishing a solid Mamba-based baseline for point cloud analysis tasks…”, and we believe our method will be interesting and valuable for people in this area. We will discuss these differences in depth in the revised version.
- To weakness 3: “…trade-offs between model complexity and performance…the effect of partially using PointMamba becomes worse instead.”
Reply: Thanks. Please refer to Tab. 1, Tab. 2, and Tab. 3 of the manuscript, where our method consistently performs better on all conducted datasets compared with single-modal SOTA Transformer-based methods. Note that some methods (e.g., ACT) adopt multi-modal information (e.g., text descriptions or 2D images), which makes the comparison unfair, but we still surpass them on most datasets.
- To weakness 4: “…have the most obvious impact on the experimental results…no additional description or setting in the paper.”
Reply: Thanks. We show that directly replacing the Transformer with Mamba in Point-MAE cannot achieve ideal performance due to Mamba's unidirectional modeling on point clouds. Therefore, we propose the serialization operations, including the point scanning strategy, the order indicator, and serialization-based mask modeling. The performance gains from these serialization operations thus naturally verify the effectiveness of our proposed key designs.
Besides, we would like to clarify that Fig. 4 is not the only core design; it illustrates the proposed serialization-based mask modeling paradigm, while the comprehensive overall design is presented in Fig. 2. Compared with Transformer-based methods (e.g., Point-MAE, PointGPT), we present substantive differences: the point scanning strategy, the order indicator, and serialization-based mask modeling (please see the reply to weaknesses 1 & 2). Furthermore, the Hilbert curve is a representative space-filling curve, which we adopt in our proposed key designs. The preliminaries of the Hilbert curve are listed in Sec. 3 of the manuscript, and we provide further descriptions in the reply to the minor problem of Reviewer 2 (please refer to it).
- To question 1: “…only works well in Hilbert conditions?…”
Reply: Thanks. We have conducted experiments to analyze the effect of different scanning curves in our key designs, as shown in Tab. 6 of the original manuscript. Specifically, we argue that serializing points along a specific space-filling pattern of spatial locations offers a more logical sequence modeling order for the SSM. The experiments also show notable performance gains compared with the random ordering paradigm. Among the curves, Hilbert achieves the best results due to its superior locality-preserving properties.
- To question 2: “…obvious innovative contribution…specialized Mamba technique for 3D point clouds?”
Reply: Good question! This paper is not just a replacement of the previous Transformer block with a Mamba block (Equation 4); please see the reply to weaknesses 1 & 2.
Thanks to the authors for the reply. My concerns have been partially addressed, mainly because I am not completely sold on PointMamba's results.
As a result, after the authors' rebuttals, I still believe it has merit, so I'm willing to change my rating to positive.
We appreciate your thought-provoking reviews and are pleased to see your positive decision. To substantiate our results, we will release the code. Thank you once again for your positive rating.
This paper introduces PointMamba, an interesting method for point cloud analysis that utilizes a linear complexity state space model (SSM) instead of traditional Transformer architectures. PointMamba employs space-filling curves for point tokenization and features a simple, non-hierarchical Mamba encoder. Additionally, the authors propose an effective order indicator and serialization-based mask modeling strategy. Comprehensive evaluations across multiple datasets demonstrate PointMamba's superior performance and significantly reduced computational costs compared to Transformer-based methods.
Strengths
- The paper reads very well. I appreciate the motivation of the paper and agree that we should make an effective method as simple as possible. As the first Mamba-based work for point cloud tasks, the proposed method is simple, elegant, and effective, establishing a solid Mamba-based baseline for point cloud analysis tasks.
- The paper considers the limitations of Mamba's unidirectional modeling and proposes practical solutions like serialization-based mask modeling strategies and order indicators.
- The theoretical analysis of Mamba applied to point clouds is reasonable, and the paper is easy to reproduce.
- The experiments are convincing and support the main idea of the paper. The authors provide thorough evaluations, including comparisons with SOTA methods, ablation studies, and analyses of each component.
Weaknesses
- Some experiments are missing. For example, from Table 11, it would be beneficial to include more masking ratios, especially the performance when masking 90% of point patches. Besides, from Table 12, could the authors provide a more detailed analysis of why average pooling performs better? Additionally, what about the performance of max pooling?
- Have the authors considered the potential benefits of integrating PointMamba with Transformer architectures? For example, a Transformer layer could be used as the final prediction head. Such a combination could take advantage of both architectures and would not introduce many computational costs.
- Point clouds have complex structures, and this paper only considers two orders. In my view, introducing more orders (e.g., introducing three serialization methods and tripling the input length) could better capture the geometric information of the point clouds, which might be beneficial for learning.
- During the pre-training strategy, do the authors use different order indicators or the same indicator? Figure 4 is somewhat ambiguous.
Minor: Using space-filling curves to scan point clouds is an interesting attempt. Although it is common knowledge for some readers, it still would be helpful if the authors provided a more detailed introduction to space-filling curves in the preliminaries part.
typos: PointMAE -> Point-MAE
Questions
See weakness.
Limitations
The authors have adequately discussed the limitations of their work.
- To Weakness 1: “Some experiments are missing…masking 90% point patches…the performance of max pooling?”
Reply: Good question! We conduct the mentioned missing experiments:
(1) As shown in the table below, masking 90% point patches may harm performance, as a higher masking ratio makes reconstruction tasks too difficult during pre-training.
| Masking ratio | Loss | OBJ-BG | OBJ-ONLY |
|---|---|---|---|
| 0.9 | 2.00 | 92.43 | 91.05 |
(2) Average pooling takes the entire sequence into account, while max pooling selects, for each feature channel, the maximum value across tokens through a nonlinear process. We empirically find that using only max pooling may discard global information and cause a performance drop, as shown in the table below (a minimal pooling sketch follows the table).
| Pooling | OBJ-BG | OBJ-ONLY |
|---|---|---|
| Average pooling | 94.32 | 92.60 |
| Max pooling | 93.39 | 91.36 |
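As a minimal illustration of the two pooling choices (the shapes below are assumptions, not the paper's exact configuration):

```python
import torch

# Hypothetical Mamba encoder output: batch of 8, 128 serialized tokens, 384 channels.
tokens = torch.randn(8, 128, 384)

# Average pooling aggregates every token, retaining information from the whole sequence.
avg_feat = tokens.mean(dim=1)          # (8, 384)

# Max pooling keeps only the per-channel maximum over tokens, a nonlinear selection
# that can discard much of the sequence-wide context.
max_feat = tokens.max(dim=1).values    # (8, 384)
```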
- To Weakness 2: “… benefits of integrating PointMamba with Transformer architectures … a transformer layer can be used as the final prediction head …”
Reply: Good suggestion! Based on your advice, we kept the total number of blocks the same and replaced the last layer with a Transformer block. As shown in the table below, we observed a clear performance drop. We argue that optimizing two kinds of modules simultaneously can be challenging, and developing an effective hybrid architecture is a promising topic for future research.
| Method | #TP (M) | FLOPs (G) | OBJ-BG | OBJ-ONLY |
|---|---|---|---|---|
| 12 Mamba blocks | 12.3 | 3.1 | 94.32 | 92.60 |
| 11 Mamba blocks + 1 Transformer block | 13.1 | 3.4 | 91.98 | 89.64 |
- To Weakness 3: “… introducing more orders …”
Reply: Thanks! Based on your suggestion, we conducted an experiment by adding an additional z-order sequence to triple the input length. As shown in Fig. 3 and Appendix A of our paper, using two orders enables global modeling for PointMamba; however, adding redundant information may lead to performance degradation.
| Ordering | FLOPs (G) | OBJ-BG | OBJ-ONLY |
|---|---|---|---|
| Hilbert + Trans-Hilbert | 3.1 | 94.32 | 92.60 |
| Hilbert + Trans-Hilbert + Z | 3.6 | 93.43 | 91.36 |
- To Weakness 4: “…Figure 4 is somewhat ambiguous.”
Reply: We apologize for the ambiguity. During pre-training, different serialized point tokens have different order indicators. We will clarify Fig.4 in the next version.
- To Minor & Typos:
Reply: Thanks for the comments! Here is a brief introduction to Hilbert space-filling curves. Denote the coordinates of a voxelized point as (x, y, z) and convert each coordinate into its binary form using k bits (e.g., x = x_{k-1} … x_1 x_0). The bits are then iterated from the most significant to the least significant, making exchanges when the current bit is 0 and inversions otherwise. By concatenating the processed bits and applying a 3-fold Gray decoding, the traversal position d along the Hilbert curve can be obtained. All voxels are then sorted into a single sequence based on their traversal position d. By recording the traversal position for all potential voxel coordinates, the points can be serialized. We will add this introduction and carefully revise our paper.
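For readers who prefer code, the following is a generic reference sketch (based on the standard Skilling transform) of how a Hilbert traversal position can be computed and used to sort voxel coordinates; it is meant for illustration only and is not necessarily the exact implementation used in the paper.

```python
def axes_to_transpose(coords, bits):
    """Skilling (2004): convert n-D integer coordinates (each < 2**bits) into the
    'transposed' Hilbert representation via bit exchanges/inversions and Gray coding."""
    x = list(coords)
    n = len(x)
    m = 1 << (bits - 1)
    q = m
    while q > 1:                          # exchange / invert low bits, per coordinate bit
        p = q - 1
        for i in range(n):
            if x[i] & q:
                x[0] ^= p                 # invert
            else:
                t = (x[0] ^ x[i]) & p     # exchange
                x[0] ^= t
                x[i] ^= t
        q >>= 1
    for i in range(1, n):                 # Gray code step
        x[i] ^= x[i - 1]
    t, q = 0, m
    while q > 1:
        if x[n - 1] & q:
            t ^= q - 1
        q >>= 1
    return [xi ^ t for xi in x]

def hilbert_position(coords, bits):
    """Interleave the transposed bits (most significant first) into a single integer:
    the traversal position of the voxel along the Hilbert curve."""
    tx = axes_to_transpose(coords, bits)
    d = 0
    for b in range(bits - 1, -1, -1):
        for xi in tx:
            d = (d << 1) | ((xi >> b) & 1)
    return d

# Usage: serialize voxelized key points by their Hilbert traversal position.
voxels = [(0, 0, 0), (3, 1, 2), (1, 1, 0), (2, 3, 3)]
order = sorted(range(len(voxels)), key=lambda i: hilbert_position(voxels[i], bits=2))
```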
Thank you for the rebuttal! Most of my concerns have been addressed. I have one more question regarding the Hilbert space-filling curve: How is the starting point of the curve determined?
We appreciate the reviewer acknowledging that our rebuttal addressed the concerns. In response to the new question, specifically, we voxelize the key points of the point cloud and shift the minimum coordinate to the origin, with the voxel at (0,0,0) serving as the starting point of the Hilbert curve. We will add this explanation in the revised version.
Thank you for the further response. Now, all of my concerns have been addressed well. After considering other reviewers' comments and the authors' rebuttal, I am now more convinced about the value of the paper. I believe this work has great potential to contribute to the NeurIPS community. Therefore, I tend to accept this paper and encourage the authors to incorporate the above discussions.
We sincerely thank the reviewer for the constructive feedback and support. Your comments are valuable for us to improve the quality of this work. We will incorporate your suggestions and clarify key points in the revision. Many thanks again!
This work utilizes the Mamba architecture for point clouds. It employs Hilbert and Trans-Hilbert curves to order the point clouds, thus addressing the unidirectional modeling nature of Mamba. Additionally, it replaces transformer blocks with Mamba blocks. The proposed PointMamba demonstrates reasonable performance following pre-training.
Strengths
- This work tries introducing a new network architecture to the point cloud domain, which is appreciated.
- The presentation is clear, accompanied by well-crafted figures.
Weaknesses
- While it is worthwhile to explore how the Mamba architecture performs when applied to point clouds, this work exhibits limited novelty or insights in architectural innovation for point clouds. The authors themselves acknowledge this by stating, “It should be noted that this paper does not claim algorithmic novelty but rather presents a simple and solid Mamba-based baseline.” In my opinion, works that directly adopt an existing network from other domains to point clouds without sufficient insights and innovation should not be accepted by top-tier conferences like NeurIPS.
- The authors only demonstrate the performance of pre-trained PointMamba when comparing it with other widely used baselines like PointNeXt. This implies that PointMamba cannot surpass previous methods when trained from scratch. To effectively showcase the solid merits of PointMamba, it would be more reasonable to also provide comparisons without pre-training.
- In Table 4, an important baseline, PointNeXt, is omitted. According to the PointNeXt paper, PointNeXt exhibits significantly better performance than the proposed PointMamba (87% without pre-training vs. 84.4% with pre-training). This omission raises doubts about the effectiveness of PointMamba. To provide a "solid Mamba-based baseline", comprehensive comparisons with previous methods should not be omitted.
- In Figure 1, it would be beneficial to include comparisons with widely used methods like PointNeXt to convincingly demonstrate the necessity of introducing Mamba for point clouds. While PointMamba likely offers inference advantages, this contribution stems from the original Mamba paper, not this work. Additionally, does PointMamba have lower training efficiency?
- The paper claims that the proposed Hilbert and Trans-Hilbert ordering is advantageous based on results from ScanObjectNN. However, I am not entirely convinced. I suggest the authors also present results on ShapeNet-Part and ModelNet40, for which comparing against only a random baseline is enough.
Questions
See the weakness section.
Limitations
See the weakness section.
- To Weakness 1:
Reply: Thanks. We would like to emphasize that simply replacing the Transformer with Mamba cannot achieve ideal performance due to the unidirectional modeling (as shown in the table below). Thus, we customize key designs to handle point cloud data: 1) The proposed point scanning strategy transforms the unstructured 3D point clouds into a regular sequence, providing diverse perspectives on spatial locality via scans along different curves; 2) The order indicator maintains the distinct spatial characteristics of different scans during training, preserving the integrity of the spatial information, which is crucial for unstructured point clouds; 3) Serialization-based mask modeling randomly chooses one space-filling curve for masking, allowing the model to extract general local relationships from different scanning perspectives, better matching the unidirectional modeling of Mamba.
Considering that this paper is the first work to discuss the potential of Mamba in point cloud tasks, it brings new insights to the community and should interest many people. A similar example is PointNeXt, the paper you mentioned: it is an empirical study that mainly investigates data augmentation for point clouds without specific new designs (90% of its performance gains come from data augmentation and hyper-parameter adjustment, while the remaining 10% comes from micro-designs such as residual connections), yet many people still appreciate such a simple yet effective paper. We hope the reviewer can reconsider the motivation and insights of our method (also see the reply to weaknesses 2 & 3).
| Method | OBJ-BG | OBJ-ONLY | PB-T50-RS | ModelNet40 |
|---|---|---|---|---|
| Point-MAE | 92.77 | 91.22 | 89.04 | 93.2 |
| Simply replace Transformer with Mamba in Point-MAE | 92.25 | 90.69 | 87.11 | 92.4 |
| PointMamba (ours) | 94.32 | 92.60 | 89.31 | 93.6 |
- To Weaknesses 2 & 3:
Reply: Thanks! It seems that the reviewer might be confusing different performance metrics on ShapeNetPart (the correct Inst. mIoU values are 86.2% for PointMamba and 87.0% for PointNeXt). Note that the performance of PointNeXt is somewhat misleading and should not be directly compared, for the following reasons: 1) PointNeXt adopts extensive data augmentation, while our method and other Transformer-based methods only adopt random rotation for ScanObjectNN and random scaling for ShapeNetPart; 2) PointNeXt adopts a voting strategy and manual refinement on ShapeNetPart, a post-processing step that Transformer-based methods usually avoid and that makes the comparison unfair.
As such, no vanilla Transformer-based method can outperform PointNeXt when these unfair practices are applied, and we believe a series of good Transformer-based works should not be dismissed for these reasons. For a fair comparison, we remove all tricks and pre-training, as shown below, where we perform better than PointNeXt on most datasets.
Overall, PointNeXt follows the MLP paradigm rather than the Transformer-based one, and it mainly studies the effect of various data augmentations on vanilla PointNet++. In this paper, we aim to unlock the potential of Mamba in point cloud tasks and discuss whether it can be a viable alternative to Transformers. Thus, the direct counterparts of our PointMamba are vanilla Transformer-based methods. Through extensive experiments, we clearly outperform the SOTA Transformer-based counterparts and even the cross-modal method (ACT, ICLR 23).
| Method | Data augmentation | Voting | ShapeNetPart (scratch/pre-training) |
|---|---|---|---|
| PointNeXt (NeurIPS 22) | Random rotation, Random scaling, Random translation, Random jittering, Normal Drop, Height appending | Yes | 87.0/- |
| PointNeXt (NeurIPS 22) | Random scaling | No | 85.7/- |
| ACT (ICLR 23) | Random scaling | No | 85.8/86.1 |
| PointGPT (NeurIPS 23) | Random scaling | No | 85.6/86.2 |
| PointMamba | Random scaling | No | 85.8/86.2 |
| Method | Data augmentation | OBJ-BG (scratch/pre-training) | OBJ-ONLY (scratch/pre-training) | PB-T50-RS (scratch/pre-training) |
|---|---|---|---|---|
| PointNeXt (NeurIPS 22) | Random rotation, Random scaling, Random translation | 90.88/- | 90.36/- | 87.7/- |
| PointNeXt (NeurIPS 22) | Random rotation | 90.71/- | 89.50/- | 87.20/- |
| ACT (ICLR 23) | Random rotation | -/93.29 | -/91.91 | -/88.21 |
| PointGPT (NeurIPS 23) | Random rotation | -/93.39 | -/92.43 | -/89.17 |
| PointMamba | Random rotation | 91.74/94.32 | 90.19/92.60 | 87.27/89.31 |
- To weakness 4:
Reply: Thanks! 1) We will include the suggested methods in Figure 1 in the revised version; the clarification regarding PointNeXt is presented in the reply to weaknesses 2 & 3. 2) Although Mamba offers inference advantages, it cannot work well on point clouds by simply replacing the Transformer with Mamba; PointMamba not only keeps the efficiency of Mamba but also makes it achieve promising performance on point clouds. 3) Besides, as shown in the table below, PointMamba presents superior training efficiency (seconds/epoch) compared with a representative Transformer-based method, especially as the length of the point token sequence increases.
| Sequence length | 128 | 256 | 512 | 1024 | 2048 |
|---|---|---|---|---|---|
| Point-MAE | 97.24 | 128.99 | 194.56 | 351.64 | 797.78 |
| PointMamba | 89.18 | 112.57 | 150.55 | 225.56 | 374.89 |
- To weakness 5:
Reply: Good suggestion! We now provide the analysis of our point scanning strategy on ShapeNet-Part and ModelNet40. These results clearly demonstrate the effectiveness of our proposed key designs when adopting Mamba for point clouds. We will release the code so that the results can be verified.
| Ordering | ModelNet40 | ShapeNetPart |
|---|---|---|
| Random | 92.7 | 85.7 |
| Hilbert + Trans-Hilbert | 93.6 | 86.2 |
Hi authors,
Thanks for your time.
For the performance of "Simply replace Transformer with Mamba in Point-MAE", did you pre-train this model or train it from scratch?
Dear reviewer,
Thanks for your kind reply! To ensure a fair comparison, the setting of "Simply replace Transformer with Mamba in Point-MAE" refers to swapping the Transformer with Mamba in the architecture of Point-MAE, followed by applying the same pre-training strategy as the default Point-MAE. Please note that the data augmentations used are also kept the same. To further compare the results under the from-scratch setting, we are currently conducting experiments and will update you within the next few hours. We appreciate your patience and will provide the results as soon as possible.
Thank you once again for your kind reply!
Dear Reviewer Daqu,
Thank you for your patience. We have now provided a comprehensive comparison, including Point-MAE, simply replacing the Transformer with Mamba in Point-MAE, and our proposed PointMamba. As shown in the table, the results clearly demonstrate that simply replacing the Transformer with Mamba does not perform well, whether trained from scratch or with pre-training. In contrast, our proposed PointMamba achieves promising results, benefiting from our proposed key designs. We will be happy to include the discussion of from-scratch and pre-trained results in the revised version, and we hope this addresses the raised concerns.
| Method | OBJ-BG (scratch/pre-training) | OBJ-ONLY (scratch/pre-training) | PB-T50-RS (scratch/pre-training) | ModelNet40 (scratch/pre-training) |
|---|---|---|---|---|
| Point-MAE | 91.05/92.77 | 90.02/91.22 | 86.05/89.04 | 92.3/93.2 |
| Simply replace Transformer with Mamba in Point-MAE | 90.36/92.25 | 89.50/90.69 | 85.58/87.11 | 91.8/92.4 |
| PointMamba (ours) | 91.74/94.32 | 90.19/92.60 | 87.27/89.31 | 92.4/93.6 |
Given that the discussion phase is quickly passing, we look forward to your reply and thank you once again for your time.
Best regards,
Paper 940 Authors
Dear Reviewer Daqu,
We sincerely appreciate your time and effort in reviewing our paper. We hope our explanations have addressed your concerns. As we are in the discussion phase, we welcome any additional comments or questions regarding our response or the main paper. If further clarification is needed, please do not hesitate to mention it, and we will promptly address your inquiries. We look forward to receiving your feedback.
Best regards,
Paper 940 Authors
Dear Reviewer Daqu,
Thank you for your time and valuable feedback. As the discussion phase is nearing its end, we remain open to addressing any remaining questions or concerns. We would greatly appreciate it if you could consider improving the evaluation after reviewing our responses. Thank you very much for your consideration.
Sincerely, Paper 940 Authors
Dear Authors,
Thank you very much for providing additional experimental results. I have no further questions at the moment and would like to decide on my final score after discussing it with AC and the other reviewers.
Have a good day!
Best, Reviewer
Dear reviewer Daqu,
We sincerely thank you for the time and feedback. We hope our existing rebuttal has addressed your previous concerns well. If you have any further questions during the next discussion period, please let us know, and we would be happy to answer them. Thank you once again!
Sincerely,
Paper 940 Authors
Dear Reviewers,
We are grateful to the reviewers for their invaluable feedback and the time they dedicated to evaluating our work. We are excited to see that the reviewers recognized the novelty of our technical contribution (R2), the clear motivation (R2, R3), the convincing experiments (R2), the superior performance (R2, R3, R4), and the good writing quality (R1, R2, R3).
We respond to each reviewer separately with a detailed analysis to answer all questions. We believe our rebuttal sufficiently addresses the concerns raised by the reviewers. Please reply if you have any further questions. Thank you again for your insightful feedback, and we look forward to continuing the discussion.
Best regards,
Paper 940 Authors
This paper received mixed ratings of Reject, Strong Accept, and two Borderline Accepts. Three reviewers see the merits of this work and share positive opinions. Reviewer Daqu (who suggested Reject) has concerns about the insights of this work regarding architectural innovation for point clouds and about the comparison to the PointNeXt baseline. After reading the authors' rebuttal and the reviewers' comments, the AC agrees with the other three reviewers that introducing Mamba to point clouds is non-trivial and useful to the community. Comparative results against PointNeXt have been provided in the rebuttal. The AC suggests Accept and encourages the authors to include these additional important results provided in the rebuttal in the camera-ready version.