Hamba: Single-view 3D Hand Reconstruction with Graph-guided Bi-Scanning Mamba
Abstract
Reviews and Discussion
The paper introduces Hamba, a novel technique for reconstructing 3D hand models from a single RGB image. This technique addresses the limitations of previous transformer-based methods, which struggle with occlusion, truncation, and capturing the intricate spatial relationships between hand joints. Hamba combines graph learning with Mamba state space modeling to create a Graph-guided State Space (GSS) block. This block effectively learns the structured relationships among hand joints and leverages both local and global features through a fusion module. The proposed framework utilizes significantly fewer tokens than traditional methods, improving both efficiency and precision.
Hamba has demonstrated its effectiveness through extensive benchmarking, outperforming state-of-the-art methods on metrics such as PA-MPVPE and F@15mm on the FreiHAND benchmark. Additionally, the approach promises scalability and adaptability, as the GSS block can be integrated into other tasks, indicating potential applications beyond hand modeling.
Strengths
- The proposed network structure effectively improves prediction accuracy, as evidenced by results on the FreiHAND and HO3D leaderboards.
- The experiments related to accuracy in hand pose estimation are thorough and detailed.
- The method is easy to follow.
- The implementation code is provided in the supplementary materials.
Weaknesses
- The authors emphasize the efficiency of their method, but the paper includes only accuracy-related experiments. There are no ablation studies to demonstrate the claimed efficiency.
- The authors assert that their method can serve as a “plug-and-play module for other tasks.” However, the paper only includes experiments related to hand pose estimation. The authors should at least attempt to apply their module to full-body pose estimation.
- The paper contains several typographical errors, such as the citation error on Line 92.
Questions
My questions based on the weaknesses are:
- Can the authors provide evidence of their method’s efficiency in terms of inference time and GPU memory usage?
- Can the authors attempt to transfer their module to full-body pose estimation or other scenarios to validate its plug-and-play capability?
Limitations
The authors discuss limitations but neglect the social impacts in the paper.
Q1. Efficiency of the model
R: We did not claim the "efficiency" of the model in terms of inference time or GPU memory in our manuscript. By the line "GSS block uses 88.5% less tokens", we meant that, compared to transformer-based models that utilize a large number of tokens for 3D hand reconstruction, the proposed GSS block uses fewer tokens and is therefore token-efficient. As requested by the reviewer, we provide an ablation on the method's efficiency in terms of inference time, FLOPs, and GPU memory usage, which shows that our model is also more lightweight than Transformer-based models.
Table: Comparison of the Model efficiency
| Model | Tokens↓ | Param (Backbone) | Param (JR) | Param (Decoder)↓ | Param (All)↓ | FLOPs (Decoder)↓ | Runtime (Backbone) | Runtime (JR) | Runtime (Decoder)↓ | GPU Memory↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| GCN + Transformer | 192 | 630 M | 27.6 M | 149 M | 782 M | 830 MFLOPs | 18.7 ms | 9 ms | 21.9 ms | 20947 MB |
| GCN + SS2D (OUR) | 22 | 630 M | 27.6 M | 71.8 M | 733 M | 649 MFLOPs | 18.7 ms | 9 ms | 11.8 ms | 3413.2 MB |
| Reduction | 88.5%↓ | - | - | 51.8%↓ | 6%↓ | 21.8%↓ | - | - | 46.1%↓ | 83.7%↓ |
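For context on how runtime and memory numbers of this kind are typically collected, below is a minimal PyTorch sketch. The function name, the `decoder`/`tokens` arguments, and the shapes in the comment are illustrative assumptions, not the authors' measurement code.

```python
import time
import torch


@torch.no_grad()
def profile_decoder(decoder, tokens, n_warmup=10, n_runs=100):
    """Measure average forward runtime (ms) and peak GPU memory (MB)
    for a decoder module on a fixed token tensor."""
    decoder = decoder.eval().cuda()
    tokens = tokens.cuda()
    torch.cuda.reset_peak_memory_stats()
    for _ in range(n_warmup):          # warm-up runs exclude CUDA init overhead
        decoder(tokens)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        decoder(tokens)
    torch.cuda.synchronize()
    runtime_ms = (time.time() - start) / n_runs * 1e3
    peak_mem_mb = torch.cuda.max_memory_allocated() / 2**20
    return runtime_ms, peak_mem_mb

# e.g., a (1, 22, C) token tensor for the GSS decoder vs. a (1, 192, C)
# tensor for a Transformer decoder (shapes here are illustrative assumptions).
```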
Q2. Transfer to full body human reconstruction
R: We adapted our proposed model to the body mesh recovery task. It achieved performance comparable to 4D-Humans (also called HMR2.0b) (ICCV 2023). We trained our model on the same mixed datasets as 4D-Humans, but only on a single A100 GPU for 300K steps due to the rebuttal time constraint. The metrics and results on various datasets are shown in the table below. Hamba showed improvements on the LSP-Extended and COCO datasets and achieved comparable results on the 3DPW dataset, even though it was trained for fewer steps. The performance of our model may be further improved by training for more iterations, as HMR2.0b did. This confirms that our proposed module can serve as a plug-and-play component for similar or downstream tasks. We have also included visual results for in-the-wild scenarios in Figure 1 of the rebuttal PDF.
Table: Transfer results of the full-body mesh recovery task compared to HMR2.0b (ICCV 2023). It confirms that the GSS Block acts as a plug-and-play module for 3D body reconstruction.
| Model | Training Details | LSP-Ext @0.05 ↑ | LSP-Ext @0.1 ↑ | COCO @0.05 ↑ | COCO @0.1 ↑ | 3DPW MPJPE ↓ | 3DPW PA-MPJPE ↓ |
|---|---|---|---|---|---|---|---|
| HMR2.0b [18] | 8 x A100s, 1 Million Steps | 0.530 | 0.820 | 0.860 | 0.960 | 81.3 | 54.3 |
| Hamba (OUR) | 1 x A100, 300K Steps | 0.539 | 0.832 | 0.856 | 0.966 | 81.7 | 54.7 |
Q3. Typographical error
R: We thank the reviewer for pointing out the typographical error in the citation. For the final version, we will again proofread the manuscript for typographical and grammatical errors.
Q4. Social Impact of the Paper
R: We have already discussed the Broader Impacts in line 334. We will provide more discussion of social impacts in the revision. The proposed Hamba framework for 3D hand reconstruction from a single RGB image can significantly enhance human-computer interaction, medical diagnostics, and rehabilitation by providing more accurate 3D hand estimation. It holds promise for improving sign language recognition and robotic dexterity. The technology can also contribute to economic growth in various tech industries. However, it raises potential privacy concerns that need to be addressed to ensure ethical use.
I appreciate the authors’ professional and comprehensive discussion in the rebuttal. Compared to previous methods, the authors’ network framework significantly improves efficiency. Additionally, the results in human body reconstruction demonstrate the extensibility of their approach. Therefore, I am inclined to maintain my original score.
Thank you very much for your positive feedback and for acknowledging the efficiency and extensibility of our approach. We will include your suggestions and polish the manuscript in the revision. We would like to highlight that:
- The proposed Hamba is the first to apply the Mamba framework to 3D reconstruction. Specifically, our core idea is to reformulate Mamba's scanning into a graph-guided bidirectional method for 3D hand reconstruction.
- We also designed a simple yet effective Graph-guided State Space (GSS) block, which bridges graph learning and state space modeling, offering significant value to the community. To demonstrate the plug-and-play versatility of our GSS block, we provided ablation studies for 3D body reconstruction. It holds great potential for advancing 3D human reconstruction.
- The proposed Hamba outperforms current SOTAs across all datasets (FreiHAND, HO3Dv2, HO3Dv3, HInt-VISOR, NewDays, Ego4D), notably achieving a PA-MPVPE of 5.3mm and F@15mm of 0.992 on FreiHAND, and ranks 1st in two 3D hand reconstruction leaderboards.
This paper proposes Hamba, a Mamba-based framework for single-view 3D hand reconstruction. Its main contribution is to introduce a graph-guided bidirectional scanning mechanism to fully exploit the joint relations and spatial sequences for accurate hand reconstruction. It additionally fuses global spatial tokens with local graph-based features to further improve the performance. The proposed state space block uses 88.5% fewer tokens than attention-based methods, while achieving new state-of-the-art hand reconstruction accuracy on various benchmarks for single-image 3D hand reconstruction.
Strengths
(1) Technical novelty. This paper proposes a quite novel image feature extractor that is particularly effective for hand reconstruction; it modifies the Mamba framework to further exploit the graph-based joint relations. To the best of my knowledge, this is the first work to demonstrate that visual Mamba can be effective for image-based articulated shape reconstruction tasks. I wonder if this method works well for, e.g., human body reconstruction as well.
(2) Strong experimental results. The proposed method achieves strong experimental results across various widely-used hand reconstruction benchmarks. The comparisons are also done against the competitive baselines that are very recently proposed (e.g., HaMeR [57]).
(3) Good presentation. Overall, the paper is well organized and easy to read. The figures are also clearly presented.
Additionally, I appreciate the authors for also submitting the code for the reproducibility of the proposed method.
Weaknesses
(1) Ablation study with Transformer + GCN. I’ve found that the motivation for using Mamba and GCN - to capture both long-range dependencies and graph-based local dependencies - is very similar to the motivation for Graformer [R1, R2], where Transformer (instead of Mamba) and GCN are used together. Additional ablation study with Graformer-like architecture (e.g., replacing the state space block with self-attention block in the proposed architecture) would be informative.
[R1] Gong et al., DiffPose: Toward More Reliable 3D Pose Estimation, In CVPR, 2023.
[R2] Zhao et al., Graformer: Graph-oriented transformer for 3d pose estimation, In CVPR, 2022.
Questions
(1) Justification for the intermediate 3D reconstruction. To obtain the 2D joint positions used for image feature extraction, the model intermediately performs the full 3D reconstruction via regressing MANO parameters (Equation 4). I am not sure why this is less overhead compared to employing off-the-shelf 2D joint detectors (lines 151-152).
Limitations
The authors have discussed the limitations.
Q1. Ablation study with Transformer + GCN.
R: As requested by the reviewer, we replaced the state space block with GCN + Attention (from Graformer [R2]) in our Hamba model and evaluated it on the FreiHAND benchmark dataset. Furthermore, we compared it with our GCN + SS2D (see the table below). Both models were trained with the same dataset setting on a single A6000 GPU for 60K steps. Our GCN + SS2D model shows improvements on all metrics compared to the “Graformer-like Transformer + GCN” architecture. This confirms that our state-space model has a better capability to learn the relationships between hand joints.
Table: Additional Ablation w GCN+Attention. Illustrates that GCN + state space modeling outperforms GCN + Attention.
| Method | PA-MPJPE ↓ | PA-MPVPE ↓ | F@5mm ↑ | F@15mm ↑ |
|---|---|---|---|---|
| w GCN + Attention | 7.0 | 6.6 | 0.730 | 0.985 |
| w GCN + SS2D (OUR) | 6.6 | 6.3 | 0.738 | 0.988 |
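To make the ablation concrete, here is a minimal, self-contained sketch (not the authors' implementation) of how such a swap can be wired: the graph-convolution front-end stays fixed and only the token mixer changes between a state-space module and self-attention. `GraphGuidedBlock`, `SelfAttentionMixer`, and all shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SelfAttentionMixer(nn.Module):
    """Drop-in attention mixer used for the "GCN + Attention" ablation."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out


class GraphGuidedBlock(nn.Module):
    """A small GCN front-end followed by a pluggable token mixer, so the same
    block can be instantiated with a state-space mixer (e.g., an SS2D module)
    or with self-attention for the ablation."""
    def __init__(self, dim, adjacency, mixer: nn.Module):
        super().__init__()
        self.register_buffer("adj", adjacency)   # (J, J) row-normalized joint adjacency
        self.gcn_proj = nn.Linear(dim, dim)      # one graph-convolution step: A X W
        self.mixer = mixer
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):                   # tokens: (B, J, dim)
        x = tokens + torch.einsum("ij,bjd->bid", self.adj, self.gcn_proj(tokens))
        return x + self.mixer(self.norm(x))      # residual token mixing


# Ablation usage: swap only the mixer, keep the rest of the block identical.
J, dim = 21, 256
adj = torch.eye(J)                               # placeholder adjacency
attn_block = GraphGuidedBlock(dim, adj, SelfAttentionMixer(dim))
out = attn_block(torch.randn(2, J, dim))         # (2, 21, 256)
```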
Q2. Transfer to full body mesh recovery.
R: We adapted our proposed model to the body mesh recovery task. It achieved performance comparable to 4D-Humans (also called HMR2.0b) (ICCV 2023). We trained our model on the same mixed datasets as 4D-Humans, but only on a single A100 GPU for 300K steps due to the rebuttal time constraint. The metrics and results on various datasets are shown in the table below. Hamba showed improvements on the LSP-Extended and COCO datasets and achieved comparable results on the 3DPW dataset, even though it was trained for fewer steps. The performance of our model may be further improved by training for more iterations, as HMR2.0b did. This confirms that our proposed module can serve as a plug-and-play component for similar or downstream tasks. We have also included visual results for in-the-wild scenarios in Figure 1 of the rebuttal PDF.
Table: Transfer results of the full-body mesh recovery task compared to HMR2.0b (ICCV 2023). It confirms that the GSS Block acts as a plug-and-play module for 3D body reconstruction.
| Model | Training Details | LSP-Ext @0.05 ↑ | LSP-Ext @0.1 ↑ | COCO @0.05 ↑ | COCO @0.1 ↑ | 3DPW MPJPE ↓ | 3DPW PA-MPJPE ↓ |
|---|---|---|---|---|---|---|---|
| HMR2.0b [18] | 8 x A100s, 1 Million Steps | 0.530 | 0.820 | 0.860 | 0.960 | 81.3 | 54.3 |
| Hamba (OUR) | 1 x A100, 300K Steps | 0.539 | 0.832 | 0.856 | 0.966 | 81.7 | 54.7 |
Q3. Justification for the intermediate 3D reconstruction
R: In Hamba, the Joints Regressor (JR) performs the intermediate 3D reconstruction, and the 2D joints (obtained by re-projecting this 3D estimate) serve as the input to the Token Sampler (TS); an illustrative sketch of this sampling step is given after the list below. This particularly helps in selecting effective tokens that encode strong local context. The JR is necessary; otherwise, the GSS block might learn from irrelevant tokens, especially during the early training stage, be influenced by the background, and keep making random guesses.
- For the Joints Regressor, we use stacked SS2D layers with a one-layer MLP head. Compared to this simple architecture, employing heavy off-the-shelf joint detectors like MediaPipe [49] or OpenPose [4] would increase the model complexity.
- Instead of regressing 2D hand joints, we perform an intermediate 3D reconstruction to obtain the initial MANO parameters. This serves as an effective initialization for the Hamba model. Using an off-the-shelf 2D joints estimator could not provide this MANO parameter initialization, and we would not be able to obtain a good hand reconstruction.
- Popular existing off-the-shelf 2D hand detectors are not trainable (e.g., MediaPipe) and cannot take advantage of the features extracted by the backbone. Our JR design is simple yet effective, thus providing robust results.
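As referenced above, a minimal sketch of the joint-guided token sampling idea. The function name, shapes, and the `grid_sample` choice are illustrative assumptions, not necessarily the authors' exact Token Sampler.

```python
import torch
import torch.nn.functional as F


def sample_joint_tokens(feat_map, joints_2d):
    """Gather one local feature token per projected 2D joint via bilinear
    sampling of the backbone feature map.

    feat_map:  (B, C, H, W) backbone features
    joints_2d: (B, J, 2) re-projected joint locations, normalized to [-1, 1]
    returns:   (B, J, C) joint-centred tokens for the downstream GSS block
    """
    grid = joints_2d.unsqueeze(2)                                # (B, J, 1, 2)
    tokens = F.grid_sample(feat_map, grid, align_corners=False)  # (B, C, J, 1)
    return tokens.squeeze(-1).transpose(1, 2)                    # (B, J, C)


# Illustrative: ~21 joint-centred tokens instead of a full 192-token grid.
tokens = sample_joint_tokens(torch.randn(2, 1280, 16, 12),
                             torch.rand(2, 21, 2) * 2 - 1)       # (2, 21, 1280)
```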
Thank you for the authors' efforts to address my concerns, especially for showing the ablation study results with Transformer + GCN. Most of my questions have been addressed.
Thank you for your positive feedback and for acknowledging our efforts in addressing your concerns. We will incorporate your suggestions into the revision.
This paper presents an approach for 3D hand reconstruction from a single view. The main idea is to introduce a graph-guided Mamba framework in the model for hand reconstruction, by bridging graph learning and state space modeling. Building on top of the recent Hamer approach, the final proposed model, Hamba, is evaluated on various datasets where it demonstrates strong performance.
Strengths
- The quantitative results show improvements over the baselines.
Weaknesses
- I am not sure I follow the motivation of the paper. We read in the paper that "existing [..] methods fail to capture the semantic relations between different joints" (Ln 4), which is mentioned again later as "lack of understanding the semantic relations between hand joints" (Ln 38) or "applying attention to all tokens does not fully utilize the joint spatial sequences, resulting in an inaccurate 3D hand mesh in real-world scenarios" (Ln 41). I am not sure how the proposed approach improves over these observed weaknesses. I can see some quantitative improvements, but I do not think that we get enough support for these arguments, i.e., how Hamba improves on these issues, more specifically over the baseline approach (Hamer [57]).
- The improvements over the baseline are consistent, but relatively minor, particularly on the HInt dataset.
- Based on the training details of the supplementary, it looks like the method starts from the Hamer checkpoint and finetunes it for 170k iterations. What is the performance of the Hamer model if it is also finetuned for 170k iterations? It would be interesting and fair to see this comparison as well.
- There are very few qualitative comparisons with the baseline Hamer model.
Questions
- Why are there no results on the Ego4D subset of HInt?
- The paper mentions that the GSS block uses 88.5% less tokens. How does that affect the final system in terms of runtime, number of parameters, FLOPs, etc.?
- Could you clarify the motivation that I discuss in the weaknesses?
- Is it possible to see the additional results of the Hamer model when finetuned the same way that Hamba is finetuned?
Limitations
I think the discussion on limitations could be longer. The paper shows a few failure cases, but the limitation section reads like an afterthought. It could be extended with more of the observed failure cases, as well as other limitations/weaknesses (potentially runtime? number of parameters? reliance on a hand detector? etc.).
Q1. Clarification of motivation?
R: Our main motivation is to improve SOTA methods (e.g., HaMeR [57]) by modeling the structural relations in the hand skeleton, which leads to performance improvements. HaMeR [57] designed a ViT-based model, using ViTPose weights and large datasets to achieve good performance. However, HaMeR requires a large number of tokens per image for reconstruction, and applying attention to all image tokens does not fully utilize the joint spatial sequences, which often results in an inaccurate 3D hand mesh in real-world scenarios.
This raises concerns about whether such model structures can effectively capture the relationships between joints. To address this limitation of the previous model, we utilize the emerging Mamba model, known for its sequence processing capabilities. However, since the Mamba model is primarily designed for vector-sequence inputs, such as those used in NLP or time-series tasks, we adapted it by integrating GCN layers. This adjustment aims to enhance the model's ability to learn the fixed skeletal structure of hands. Our key idea is to reformulate the Mamba scanning into graph-guided bidirectional scanning for 3D reconstruction using a few effective tokens.
All of these design choices are verified in our ablation study, as shown in Table 5 of the manuscript: when we remove the Mamba blocks or GCN layers, performance drops significantly.
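To make the graph-guided bidirectional scanning idea concrete, here is a minimal, hedged sketch of bidirectional scanning over a graph-ordered joint-token sequence. `BidirectionalScan`, the GRU placeholder, and all shapes are illustrative assumptions rather than the paper's actual GSS block.

```python
import torch
import torch.nn as nn


class BidirectionalScan(nn.Module):
    """Process a graph-ordered joint-token sequence in both directions with a
    shared sequence model and fuse the two passes."""
    def __init__(self, dim, scan_layer: nn.Module):
        super().__init__()
        self.scan = scan_layer                 # any causal sequence model, e.g. a Mamba/SSM layer
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, tokens):                 # tokens: (B, J, dim), ordered along the kinematic tree
        fwd = self.scan(tokens)                                # root -> fingertips
        bwd = self.scan(tokens.flip(dims=[1])).flip(dims=[1])  # fingertips -> root
        return self.fuse(torch.cat([fwd, bwd], dim=-1))


class GRUScan(nn.Module):
    """Placeholder unidirectional scan layer; a real setup would plug in a Mamba/SSM block."""
    def __init__(self, dim):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x):
        return self.rnn(x)[0]


bi = BidirectionalScan(256, GRUScan(256))
out = bi(torch.randn(2, 21, 256))              # (2, 21, 256)
```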
Q2. Results on the Ego4D?
R: We were unable to download the Ego4D dataset before the NeurIPS deadline because it required signed consent and agreement from the official dataset maintainers. Additionally, we encountered technical challenges due to the dataset's large size (1 TB) and the additional 4 TB of storage needed for frame extraction from the Ego4D clips. Now that we have successfully downloaded the dataset, we present our results in Table 3 (included in the submitted PDF). Our Hamba achieves SOTA performance, surpassing other models on the Ego4D dataset.
Q3. The effectiveness of the GSS block with 88.5% fewer tokens.
R: We provide an additional ablation to show the effectiveness of the GSS Block. For this ablation, we replaced the GSS Block with a self-attention Transformer taking all 192 image tokens. The comparison is shown in the table below. The ablation confirms the effectiveness of our proposed GSS Block in reconstructing hands in 3D while utilizing fewer tokens.
Table: Ablation comparing the tokens, parameters, FLOPs, and runtime, demonstrating the effectiveness of the SS2D-based GSS Block compared to Transformers.
| Model | Tokens↓ | Param (Backbone) | Param (JR) | Param (Decoder)↓ | Param (All)↓ | FLOPs (Decoder)↓ | Runtime (Backbone) | Runtime (JR) | Runtime (Decoder)↓ | GPU Memory↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| GCN + Transformer | 192 | 630 M | 27.6 M | 149 M | 782 M | 830 MFLOPs | 18.7 ms | 9 ms | 21.9 ms | 20947 MB |
| GCN + SS2D (OUR) | 22 | 630 M | 27.6 M | 71.8 M | 733 M | 649 MFLOPs | 18.7 ms | 9 ms | 11.8 ms | 3413.2 MB |
| Reduction | 88.5%↓ | - | - | 51.8%↓ | 6%↓ | 21.8%↓ | - | - | 46.1%↓ | 83.7%↓ |
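For reference, the quoted 88.5% token reduction follows directly from the token counts in the table:

$$1 - \frac{22}{192} = \frac{170}{192} \approx 0.885.$$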
Q4. Fine-tuning HaMeR for 170K iterations.
R: We want to clarify that our Hamba does not use the HaMeR (CVPR 2024) checkpoint as a starting point. Our Hamba and HaMeR have significantly different decoders and overall model architectures, making it impossible to directly load or start from HaMeR's checkpoint. What we intended to convey is that Hamba uses the same encoder (the ViTPose backbone) as HaMeR, and we only loaded the backbone's weights. We appreciate the reviewer pointing out this confusion, as it could be confusing to other readers. We will revise the sentence in the supplementary material to make this distinction clearer.
As requested, we fine-tuned the official HaMeR checkpoint for 170K more steps and compared it with Hamba (see Tables 1, 2, and 3 in the submitted PDF for comparisons on the FreiHAND, HO3Dv2, and HInt datasets). This confirms that merely fine-tuning HaMeR for 170K more steps does not improve performance. We will provide more discussion about it in the revision.
Q5. About the relatively minor improvements on the HInt dataset.
R: We would like to emphasize that the HInt dataset is used exclusively as an 'in-the-wild' test set, meaning that none of the models have been trained or fine-tuned on the HInt training set. This makes the evaluation particularly challenging. Despite this, our model significantly outperforms other popular models such as MeshGraphormer (ICCV 2021), METRO (CVPR 2021), and HandOccNet (CVPR 2022). Additionally, when compared to HaMeR (CVPR 2024), Hamba demonstrates consistent improvements. Specifically, it achieves a notable 3% to 6% increase in performance over HaMeR on the Ego4D-VIS@0.05, Ego4D-ALL@0.05, and VISOR-VIS@0.05 subsets.
Q6. More visual comparisons
R: Due to the submission length constraints (9 pages for NeurIPS and 1 page for the rebuttal), we limited the number of visual comparisons. We have included four more comparison images in the rebuttal PDF, and we plan to include additional visual comparisons in the revision's appendix.
Q7. Response to Limitations
R: We agree with the reviewer’s suggestion. Like other SOTA models, such as HaMeR (CVPR 2024), MeshGraphormer (ICCV 2021), and METRO (CVPR 2021), our model also relies on a hand detector to crop the hand image. We will include more failure cases with detailed discussion in our final manuscript.
I want to thank the authors for the additional analysis and results. These are very helpful to better contextualize their contribution. I want to acknowledge that I read the rebuttal and I add a few comments and a question I have.
- The additional results on Ego4D are welcome.
- The answer to Q3 above is very helpful, and I think that this table should be included in the final version as well. I think that the other metrics besides the number of tokens (e.g., FLOPs, runtime, number of parameters) are more helpful to highlight the relative benefit of the proposed method.
- The additional evaluation after further finetuning Hamer for 170k iterations is also a welcome addition, although the gap between HaMeR-170K and Hamba is definitely more marginal. Regardless, I think it would be helpful to include these comparisons in the final version too.
- I originally hoped that some of the qualitative comparisons with HaMeR would have been included in the supplementary video as well, since it does not have any length constraints. That's something that can be added in the final version.
- Something that confused me in the rebuttal is that you mention that "Hamba does not use the HaMeR (CVPR 2024) checkpoint as a starting point". But if I understand correctly, you are actually using the weights from the HaMeR backbone (ViT-H). Or do you initialize the backbone with different weights (from ViT MAE? from ViTPose?)? I found that statement confusing given what is written in the Appendix.
Thank you for your positive feedback. We’re glad the additional results and analysis have been helpful.
- We will include the additional results from the rebuttal in the final manuscript, including the efficiency metrics beyond the number of tokens (e.g., FLOPs, runtime, number of parameters).
- We appreciate your feedback on the additional evaluation after further fine-tuning HaMeR. We will include these comparisons in the final version to provide a more comprehensive discussion.
- We understand the importance of more qualitative visual comparisons with HaMeR. We will include these in the supplementary video as suggested.
- Yes, your understanding is correct. (1) We used the pre-trained HaMeR backbone (ViT-H) weights to initialize our backbone, which is a common practice to accelerate training; for example, TRAM [2], TokenHMR [3], and WHAM [4] also used the backbone weights from 4D-Humans [1] to accelerate training. However, our network architecture differs significantly from HaMeR, so this is not a typical fine-tuning process. (2) What we meant is that we did not load the HaMeR Transformer decoder weights (a small illustrative sketch of this backbone-only initialization follows the reference list below). We will revise the manuscript to clarify this distinction and include comparisons with ViT MAE and ViTPose, as suggested.
[1] Humans in 4D: Reconstructing and Tracking Humans with Transformers. ICCV23
[2] TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos. ECCV24
[3] TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation. CVPR24
[4] WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion. CVPR24
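As mentioned above, here is a minimal sketch of a "backbone weights only" initialization; the `backbone.` key prefix and checkpoint layout are assumptions for illustration, not the actual HaMeR/Hamba checkpoint format.

```python
import torch


def load_backbone_only(model, ckpt_path, backbone_prefix="backbone."):
    """Initialize only the ViT backbone from a pretrained checkpoint and leave
    the decoder at its random initialization."""
    state = torch.load(ckpt_path, map_location="cpu")
    state = state.get("state_dict", state)        # unwrap Lightning-style checkpoints if present
    backbone_state = {k: v for k, v in state.items() if k.startswith(backbone_prefix)}
    missing, unexpected = model.load_state_dict(backbone_state, strict=False)
    print(f"loaded {len(backbone_state)} backbone tensors; "
          f"{len(missing)} parameters remain randomly initialized; "
          f"{len(unexpected)} checkpoint tensors unused")
    return model
```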
We want to highlight that:
- It is the first work to demonstrate that visual Mamba can be effective for image-based articulated 3D reconstruction tasks. The proposed Hamba is the first to incorporate graph learning and state space modeling (SSM) for 3D hand reconstruction, which is significant to the community.
- The proposed GSS block makes our model SOTA on all FreiHAND, HO3Dv2, HO3Dv3, HInt-VISOR, NewDays, and Ego4D benchmarks.
Looking forward to a possible score improvement from your end for our paper!
Thank you very much for your detailed comments which helped us improve our manuscript!
Dear Reviewers and ACs,
We appreciate the insightful review and constructive feedback that has helped us enhance our manuscript. Through our comments, we have tried to clarify the confusion and effectively address all questions asked by the reviewers.
- It is the first work to demonstrate that visual Mamba can be effective for image-based articulated shape reconstruction tasks (acknowledged by shEo). The proposed Hamba is the first to incorporate graph learning and state space modeling (SSM) for 3D hand reconstruction, which is significant to the community.
- This paper proposes "a quite novel ... for hand reconstruction" (acknowledged by shEo).
- Strong performance with rigorous comparisons to recent models (HaMeR, CVPR 2024) (acknowledged by i6yS, shEo, TVui).
- Good presentation and easy to follow (acknowledged by shEo, TVui).
- To further show that our GSS block can act as a plug-and-play module, we have provided an ablation in the rebuttal (requested by TVui, shEo) for full human body reconstruction, which has greatly supported our approach.
- The proposed GSS block makes our model SOTA on all FreiHAND, HO3Dv2, HO3Dv3, HInt-VISOR, NewDays, and Ego4D benchmarks (acknowledged by shEo, TVui).
- Furthermore, we have included the results on the Ego4D dataset as well (requested by i6yS).
We have already submitted the source code (acknowledged by TVui) and model hyperparameters as supplementary material. We are eager to engage in further discussions to improve the quality of the paper.
We would deeply appreciate it if you could reconsider the score accordingly. We are always willing to address any further concerns.
Best Regards,
the Authors
This paper is a borderline case, with three reviewers giving scores centered around 5 (more precisely: 4, 5, 6). The paper essentially combines the strengths of Mamba and Hamer to make a better single-view 3D hand fitter. While the improvements are moderate, the technical formulation of the work is rather interesting, two of the reviewers appreciated it from the beginning, and the third reviewer seemed to have improved his opinion of the paper after the rebuttal (albeit without this leading to a raise in the score). Overall the Area Chair recommends acceptance.