Block-to-Scene Pre-training for Point Cloud Hybrid-Domain Masked Autoencoders
Abstract
We propose to pre-train a universal point cloud hybrid-domain model via a block-to-scene pre-training strategy.
Reviews and Discussion
This paper addresses the limitations of existing masked autoencoder (MAE) methods for point cloud self-supervised learning, which are often domain-specific and hinder generalization. The authors introduce a novel Point cloud Hybrid-Domain Masked AutoEncoder (PointHD-MAE) that employs a block-to-scene pre-training strategy. The hybrid model comprises separate encoders for the object and scene domains, allowing specialized handling of both object and scene point clouds. Pre-training selects point blocks from scenes and transforms their coordinates into the object space, using an object-level mask-and-reconstruction pipeline to recover the masked points and learn universal object representations. Additionally, a scene-level block position regression pipeline learns scene representations by utilizing features from the object space. Extensive experiments across various datasets and tasks demonstrate the model's generalization capability and overall superiority. The authors plan to release the code.
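To make the described pipeline concrete, below is a minimal sketch of the block-to-scene data flow in plain NumPy. All function names are hypothetical (not from the paper), and the mask-and-reconstruct and position-regression networks are only indicated in comments.

```python
# Hypothetical sketch of the block-to-scene data flow; the geometric steps
# are concrete, the networks are stubbed out.
import numpy as np

def sample_blocks(scene, num_blocks=4, block_size=256, rng=None):
    """Pick random block centers in the scene and gather their k nearest points."""
    rng = np.random.default_rng() if rng is None else rng
    centers = scene[rng.choice(len(scene), num_blocks, replace=False)]
    blocks = []
    for c in centers:
        idx = np.argsort(((scene - c) ** 2).sum(-1))[:block_size]  # k-NN crop
        blocks.append(scene[idx])
    return np.stack(blocks), centers  # (B, k, 3), (B, 3)

def to_object_space(blocks):
    """Center each block and scale it into a unit sphere (object coordinates)."""
    centered = blocks - blocks.mean(axis=1, keepdims=True)
    scale = np.linalg.norm(centered, axis=-1).max(axis=1, keepdims=True)
    return centered / scale[..., None]

scene = np.random.rand(8192, 3) * 10.0       # stand-in for a real scan
blocks, centers = sample_blocks(scene)
obj_blocks = to_object_space(blocks)         # fed to the object-domain encoder
# 1) Object-level pipeline: mask a ratio of each block's points and
#    reconstruct them (standard MAE recipe, not shown here).
# 2) Scene-level pipeline: from the object-space block features, regress
#    `centers` back in scene coordinates (block position regression).
```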
Strengths
- The writing is clear and easy to understand.
- The problem the paper tries to address is both interesting and meaningful.
Weaknesses
- In the related work section, only object-level self-supervised methods are summarized; a summary of scene-level self-supervised methods is missing.
- In the experimental section, comparisons are made only with object-level self-supervised methods, missing comparisons with scene-level self-supervised methods such as "Masked Scene Contrast: A Scalable Framework for Unsupervised 3D Representation Learning" and "GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding." This is important because the paper focuses on pre-training for the point cloud hybrid domain.
Questions
- Regarding Figure 1 (c), the description in lines 72 to 74 is unclear. It would be best to clearly explain the settings for each data point.
- In line 208, the normalization process may deform the local structure of objects, e.g., compressing or stretching their height. Will such deformations affect semantic understanding? Since the degree of deformation may differ for two chairs in different scenes, could this interfere with semantic comprehension? (A small numerical sketch of this concern follows this list.)
- In line 243, what exactly are the content embedding and the position embedding of the queries? My understanding is that the transformed point-block features are added to a randomly initialized content embedding; where, then, does the position embedding come into play? (See the query-composition sketch after this list.)
- The paper lacks comparisons with other object-level and scene-level self-supervised methods in terms of inference time and FLOPs.
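On the normalization question above, here is a tiny NumPy illustration of the concern (our construction, not the paper's code): per-axis normalization distorts an object's aspect ratio, whereas isotropic scaling preserves it.

```python
# Toy illustration: per-axis vs. isotropic normalization of a point cloud.
import numpy as np

chair = np.random.rand(1024, 3) * np.array([0.5, 0.5, 1.2])  # tall, thin object

# Per-axis min-max normalization: each axis is stretched to [0, 1],
# so the height/width ratio is destroyed (anisotropic deformation).
per_axis = (chair - chair.min(0)) / (chair.max(0) - chair.min(0))

# Isotropic normalization: one global scale, aspect ratio preserved.
centered = chair - chair.mean(0)
iso = centered / np.linalg.norm(centered, axis=1).max()

print(per_axis.max(0) - per_axis.min(0))  # ~[1, 1, 1] -> shape deformed
print(iso.max(0) - iso.min(0))            # extents keep the ~0.5 : 0.5 : 1.2 ratio
```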
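On the query-embedding question, here is a sketch of the standard DETR-style query composition the question presumably refers to (hypothetical shapes; PointHDMAE's exact design may differ): each query carries a learned content embedding, while a separate position embedding is added when forming the attention inputs.

```python
# DETR-style query composition (a guess at the design, not the paper's code).
import torch
import torch.nn as nn

num_queries, dim = 16, 384
content = nn.Embedding(num_queries, dim)   # randomly initialized content embedding
position = nn.Embedding(num_queries, dim)  # learned per-query position embedding

block_feats = torch.randn(num_queries, dim)  # transformed point-block features
queries = content.weight + block_feats       # the addition the reviewer describes
attn_in = queries + position.weight          # position embedding enters here,
                                             # typically re-added at every layer
```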
The paper presents PointHD-MAE, a model designed to bridge the gap between object and scene point cloud domains. The model uses a hybrid-domain structure with distinct encoders for each domain, employing a novel block-to-scene pre-training strategy to enhance generalization. This strategy involves reconstructing random point blocks in object space and regressing their positions in scene space to learn both object and scene-level features.
Strengths
- The paper proposes a new framework to combine scene- and object-level point cloud self-supervised learning.
- The writing is well-organized, clear, and easy to follow.
Weaknesses
1. Although the idea of combining scene and object levels for pre-training is well-motivated, it does not introduce any novel elements.
2. The detection results on ScanNetV2 are impressive. However, the results on SUN RGB-D appear incremental, which is puzzling given that SUN RGB-D is generally easier than ScanNetV2. The performance on SUN RGB-D is still far from the state of the art, suggesting significant room for improvement.
3. I believe the paper leverages the Point-MAE baseline from ReCon [1], but this is not mentioned, raising concerns. Consequently, the improvement on the classification task appears incremental.
4. The ablation study is insufficient. The paper does not adequately demonstrate the effectiveness of object-level, scene-level, and combined pre-training strategies.
5. No visualization results are provided.
[1] Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining
Questions
See the weaknesses above.
The paper introduces PointHDMAE, a hybrid-domain masked autoencoder designed for robust self-supervised learning on 3D point clouds across both object and scene domains. By implementing a "block-to-scene" pretraining strategy, the model uses randomly selected point blocks within scenes to train separate encoders for object and scene representation, enhancing the model's generalizability without additional domain-specific adaptation. Experimental results reveal the effectiveness of PointHDMAE in various downstream tasks, including object classification, scene detection, part segmentation, and point cloud completion.
Strengths
- The hybrid-domain architecture effectively addresses the challenge of generalizing across different point cloud types, a limitation in many prior models.
- The block-to-scene pretraining strategy is innovative, offering a structured approach to learning detailed object and scene representations concurrently.
- Extensive experimentation demonstrates the model's robustness and improved accuracy across multiple tasks and datasets, affirming the approach’s versatility.
Weaknesses
- The approach relies on computationally intensive pretraining, especially with multiple encoders and the need for numerous point blocks, which could limit scalability.
- The fixed number of selected point blocks might restrict adaptability to varied data densities and contexts within scenes.
Questions
- How sensitive is the model's performance to the number of point blocks used during training, and could an adaptive block selection strategy improve efficiency? (One possible strategy is sketched after this list.)
- In real-world applications, how does PointHDMAE handle varying scene complexities and point cloud densities, particularly in less structured environments?
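To make the adaptive-selection question concrete, one plausible scheme (our construction, not something proposed in the paper) biases block centers toward sparse regions so that low-density areas of a scene are still covered:

```python
# Density-aware block-center sampling: sparser regions are likelier picks.
import numpy as np

def adaptive_centers(scene, num_blocks=4, knn=16, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    d2 = ((scene[:, None, :] - scene[None, :, :]) ** 2).sum(-1)  # pairwise dists
    knn_d = np.sort(d2, axis=1)[:, 1:knn + 1].mean(1)  # mean k-NN distance
    prob = knn_d / knn_d.sum()                          # sparser -> higher prob
    return scene[rng.choice(len(scene), num_blocks, replace=False, p=prob)]

scene = np.random.rand(512, 3)  # small N: the pairwise matrix is O(N^2)
print(adaptive_centers(scene).shape)  # (4, 3)
```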
In this paper, a point cloud hybrid-domain masked autoencoder is proposed to improve generalization ability. Specifically, the authors address the challenge of inconsistent input data by using domain-specific encoders to process data from their respective domains. A block-to-scene pretraining strategy is proposed to regress the block positions and reconstruct the block points. Results on several datasets and tasks demonstrate the effectiveness of the proposed method.
Strengths
- This paper is well motivated.
- The proposed method outperforms existing methods under different tasks and datasets.
Weaknesses
- In the pretraining stage, it is strange that only scene-level point cloud data is used. Given the paper's motivation, a more natural approach would be to utilize both scene-level and object-level point clouds for pretraining; for example, the block reconstruction could also be applied to object-level point clouds.
- Some details are unclear. In Section 4.1, the authors claim that "we can also leverage pre-trained object point cloud models on the ShapeNet55 (Chang et al., 2015) dataset to initialize our object-level models", but it is unclear whether this strategy is actually used.
- Lack of discussion of related work. The key component, the block-to-scene strategy, is very similar to UP-DETR [1]. The proposed strategy seems to be a simple extension of UP-DETR from images to point clouds, without carefully considering the problems specific to this task. (The shared recipe is sketched after the reference below.)
- Comparison with existing methods: the proposed method is considerably more complicated, so training cost, parameter counts, etc. should be reported for a fair comparison.
[1] UP-DETR: Unsupervised Pre-training for Object Detection with Transformers. TPAMI 2022.
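For concreteness, the recipe the two methods arguably share can be written schematically (our paraphrase of both papers, not actual code from either):

```python
# Schematic of the shared "crop -> query -> localize" pre-training recipe.
def pretrain_step(full_input, crop_fn, encode_fn, decoder, loc_loss):
    crops, true_pos = crop_fn(full_input)    # UP-DETR: image patches; here: point blocks
    queries = encode_fn(crops)               # crop/block features become decoder queries
    pred_pos = decoder(full_input, queries)  # decoder localizes the crops in the input
    return loc_loss(pred_pos, true_pos)      # position regression supervises pre-training
```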
Questions
Please address my concerns listed under "Weaknesses".
This paper proposes PointHDMAE, a hybrid-domain masked autoencoder pre-trained with a block-to-scene strategy. The method uses a hybrid architecture with separate encoders for scene and object data, aiming to address the generalization limitations of existing domain-specific MAE models.
The model shows promising results across various downstream tasks, demonstrating its generalization capability in handling both object and scene point cloud data without additional domain adaptation training.
Strengths
- The motivation is clear and valid. Developing a general model for 3D point cloud representation learning is indeed valuable.
- The experimental results are comprehensive, covering a range of tasks such as point cloud classification, part segmentation, object detection, and completion.
Weaknesses
- A good motivation alone is not enough. While the motivation is reasonable, the proposed "general autoencoder" is essentially a combination of separate scene-level and object-level encoders, which lacks true innovation. If that is the approach, why not simply combine two pre-trained branches with a switch to create a general model? (A sketch of such a switch baseline follows this list.)
- Why are the '#Params (M)' values the same with and without PointHDMAE in Table 9?
- The result for Point-M2AE is missing in Tables 2 and 3.
- Separating multimodal and single-modal self-supervised learning seems unnecessary. While some methods benefit from the added knowledge of other modalities, PointHD-MAE could also gain from additional training data. Comparisons should be made on performance, not restricted by modality or data differences.
- More qualitative results are needed.
- Presentation consistency, such as uniform highlighting and bolding of text in tables, would improve readability.
- The manuscript does not discuss any limitations.
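A minimal sketch of the "switch" baseline suggested in the first weakness, assuming hypothetical module names (PyTorch):

```python
# Two pre-trained branches behind a domain switch (hypothetical baseline).
import torch.nn as nn

class SwitchedEncoder(nn.Module):
    def __init__(self, object_encoder: nn.Module, scene_encoder: nn.Module):
        super().__init__()
        self.object_encoder = object_encoder
        self.scene_encoder = scene_encoder

    def forward(self, points, domain: str):
        # The "switch": route each input to the branch pre-trained on its domain.
        enc = self.object_encoder if domain == "object" else self.scene_encoder
        return enc(points)
```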
Questions
Please kindly see the weaknesses above.
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.