Geometric Feature Embedding for Effective 3D Few-Shot Class Incremental Learning
Abstract
Reviews and Discussion
This paper investigates few-shot class incremental learning for 3D object classification using foundation models. Building on the work of FoundationModel (Ahmadi et al.), the authors employ a frozen, pre-trained large-scale 3D encoder (Uni3D) to extract generalizable features for each point. They then construct enhanced text embeddings based on prompts generated from category names, which are combined with geometric features to calculate similarity and assign the final label. A key contribution of the paper is the method for constructing abstract geometric features using spectral clustering and Laplacian eigenmaps, as well as the way to fuse the text embeddings from the prompts with the geometric features through transformer. Experimental evaluation across multiple datasets and settings demonstrates that the proposed method achieves clear improvements in performance.
Questions for the Authors
- What principles guided the design of the five settings presented in Figure 1?
- How does the observation in lines 87-93 lead to the conclusion that "one challenge to address to advance FSCIL on 3D point clouds" is "enhancing the model’s ability to learn robust feature representations"?
Claims and Evidence
The evidence provided to support the claims in the paper is insufficient in certain areas, as some details are lacking. For instance, the specific configuration used when removing a module in Table 3 is not clearly explained, leaving this aspect of the experiment unclear.
Methods and Evaluation Criteria
Yes, the proposed method is evaluated in both within-dataset and cross-dataset incremental learning, in line with the protocols established by existing methods.
Theoretical Claims
The paper does not present any theoretical claims.
Experimental Design and Analysis
Yes, the overall experimental settings are valid, as they follow established protocols from existing methods. However, the ablation study in Table 3 lacks sufficient support, as the detailed configuration for removing a module is not clearly provided.
Supplementary Material
Yes. I have reviewed all parts.
Relation to Prior Literature
The key idea of this paper is closely related to the concept of cross-modal feature fusion, specifically integrating text and visual cues, which has been explored in prior research.
Missing Important References
N.A.
Other Strengths and Weaknesses
Strengths:
- The proposed method is sound and well-constructed.
- The paper demonstrates clear improvements across multiple datasets and FSCIL settings.
- The ablation studies are comprehensive.
Weaknesses:
- It would be beneficial to include visualizations of the basis vectors (and their evolution as new classes are introduced) or the distribution of geometric features.
- Implementation details of the transformer encoder are not provided.
- In Table 3, the detailed configuration for removing one module is not clearly explained.
Other Comments or Suggestions
N.A.
We sincerely appreciate your insightful feedback, which has guided us in refining the manuscript and addressing key concerns. Below, we provide detailed responses to each of your questions, supported by additional analyses and clarifications.
Q1: Clarification of Ablation Studies in Table 3
A1: Thank you for prompting us to clarify this section. Table 3 systematically evaluates the contributions of two critical components in the geometric feature extraction module:
- Dynamic Geometric Projection Clusters: These clusters are constructed via spectral clustering and Laplacian eigenmaps to encode shared geometric structures.
- Attention Weights for Basis Vectors: Learnable weights prioritize cluster centers relevant to incremental tasks.
The ablation settings are:
- Row 1: Raw point cloud features without DGPC.
- Row 2: DGPC with equal weights for all basis vectors.
- Row 3: DGPC + learnable attention weights.
The key findings are:
- DGPC Necessity: Removing DGPC (Row 1) degrades average accuracy by 2.7%, as raw features fail to integrate geometric-textual semantics.
- Attention Mechanism: Adding learnable weights (Row 3) improves harmonic accuracy by 2.1% over Row 2, demonstrating adaptive weighting’s role in suppressing noise and outliers.
The results validate that DGPC and attention weights synergistically enhance stability and discriminability. These results are now contextualized in Section 4.4, and additional visualization examples are available in Figure 2 of the anonymous link. We recognize this was not expressed clearly before; guided by your feedback, we will restructure the table and discussion for better interpretability.
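As a rough illustration of the two ablated components, here is how re-projection onto cluster-center basis vectors with learnable attention weights could operate. This is our hypothetical reading of the mechanism, not the paper's actual formulation or code; the function name, the softmax-over-logits weighting, and the unit-normalized basis are all assumptions.

```python
import numpy as np

def dgpc_project(feat, basis, attn_logits):
    # basis (k, d): cluster centers used as basis vectors (unit-normalized here).
    # attn_logits (k,): hypothetical learnable scores; softmax gives per-basis weights.
    b = basis / np.linalg.norm(basis, axis=1, keepdims=True)
    coords = b @ feat                         # projection coefficient per basis vector
    w = np.exp(attn_logits - attn_logits.max())
    w /= w.sum()                              # attention weights sum to 1
    return (w * coords) @ b                   # attention-weighted reconstruction in R^d

feat = np.array([3.0, 1.0, 0.0])
# Equal weights for all basis vectors (a Row-2 analogue):
equal = dgpc_project(feat, np.eye(3), np.zeros(3))
# Strong learned focus on the first basis vector (a Row-3 analogue):
focused = dgpc_project(feat, np.eye(3), np.array([10.0, -10.0, -10.0]))
```

With equal logits every basis direction contributes 1/3 of its projection, while a peaked attention distribution keeps almost only the dominant component, which is the behavior contrasted between Rows 2 and 3.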
Q2: Visualization of Dynamic Projection Clusters
A2: To validate the effectiveness of dynamic geometric feature projection clusters during incremental learning, we visualize the geometric features extracted by DGPC at each incremental stage, as shown in Figure 2 of the anonymous link. Visualizations reveal that DGPC-enhanced features exhibit tighter intra-class clustering and clearer inter-class separation. These results demonstrate that DGPC effectively encodes task-invariant geometric priors, enabling robust feature extraction across incremental phases. We have integrated key examples into the main text in our revised version.
Q3: Implementation Details of the Transformer Encoder
A3: The Transformer encoder comprises 2 standard layers, each with 8-head self-attention. We have enhanced the implementation details section in the revised version to include more comprehensive information.
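The stated configuration (2 layers, 8-head self-attention) can be sketched as follows. This is a minimal NumPy illustration of a standard encoder, not the authors' implementation: the toy width `d = 64` stands in for the real feature dimension, LayerNorm and dropout are omitted for brevity, and both layers reuse one set of random weights.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(x, Wq, Wk, Wv, Wo, n_heads=8):
    # Multi-head self-attention over a sequence x of shape (n, d).
    n, d = x.shape
    dh = d // n_heads
    split = lambda t: t.reshape(n, n_heads, dh).transpose(1, 0, 2)
    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))  # (heads, n, n)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)       # merge heads
    return out @ Wo

def encoder_layer(x, p):
    # One simplified standard layer: self-attention + ReLU feed-forward,
    # each wrapped in a residual connection (LayerNorm/dropout omitted).
    Wq, Wk, Wv, Wo, W1, W2 = p
    h = x + mhsa(x, Wq, Wk, Wv, Wo)
    return h + np.maximum(h @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
d, n = 64, 5                                  # toy width; the real encoder is wider
p = [rng.normal(0, 0.05, s) for s in [(d, d)] * 4 + [(d, 4 * d), (4 * d, d)]]
x = rng.normal(size=(n, d))
y = encoder_layer(encoder_layer(x, p), p)     # two stacked layers, as stated
```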
Q4: Design Principles of Figure 1
A4: We sincerely appreciate your guidance in improving Figure 1's clarity. In light of your thoughtful suggestions, we have reorganized and optimized Figure 1 (detailed in Figure 1 of the anonymous link) to improve its clarity. It now contains 7 comparisons:
- SOTA Baselines: Methods (1)-(2) adopt strategies from C3PR [1] and FoundationModel [2].
- Cross-Combination Strategies: Methods (3)-(7) integrate various prompt styles with distinct training strategies.
For detailed explanations of Figure 1, we sincerely invite you to consult our response to Reviewer 8rPH's Question 2.
Q5: Linking Observations to Robust Feature Learning
A5: We appreciate your guidance in strengthening this connection. The experiments (lines 87-93) demonstrate that complex prompt designs yield inferior performance compared to simple prompts. This observation highlights a critical limitation: existing methods overly rely on manually crafted text semantics while failing to autonomously extract geometry-aware robust features from 3D point clouds. Consequently, models exhibit excessive sensitivity to textual variations and struggle to adapt to distribution shifts in incremental phases with limited samples.
3D-FLEG addresses this by:
- Geometric Feature Embedding: Explicitly encoding spatial structures into prompts via dynamic projection clusters, bypassing dependency on complex text engineering.
- Unified Optimization: Forcing joint alignment between geometric features and text semantics during incremental training, enabling the model to prioritize discriminative cross-modal patterns from sparse data.
By incorporating supplementary experiments, enhanced visualizations, and expanded explanations, we sincerely hope to have addressed all raised concerns. Please let us know if you feel any additional adjustments would better address your concerns.
References
[1] Canonical shape projection is all you need for 3d few-shot class incremental learning, ECCV 2024.
[2] Foundation Model-Powered 3D Few-Shot Class Incremental Learning via Training-Free Adaptor, ACCV 2024.
The paper proposes 3D-FLEG, a method to improve 3D few-shot class incremental learning by incorporating geometric features into the learning process. The authors propose two modules: a geometric feature extraction module and a geometric feature embedding module. By leveraging geometric information, 3D-FLEG achieves superior performance on four datasets: ModelNet, ShapeNet, ScanObjectNN, and CO3D.
Questions for the Authors
- For Figure 1, can the authors elaborate more on each prompt style and training strategy, and add citations where necessary? Since there are no detailed descriptions of the alignment module, how can the authors conclude that their method, embedding geometric features, is simpler than the alignment module?
- Can the authors explain how they obtain the initial features, given that they also mention the point cloud features?
Claims and Evidence
- The claim that Laplacian Eigenmaps can extract geometric structure from point cloud data is not well supported.
- The claim that AdaptiveAvgPool1d performs fine-grained feature extraction does not ring true to me. Based on my understanding, average pooling is not typically used for fine-grained feature extraction; instead, it tends to extract global, smoothed features.
- The claim that the geometric feature embedding module can ensure that data from both modalities interact at similar levels of abstraction is not fully supported by Section 3.4. How do Equations 5 and 6 ensure that the point cloud features and text features have similar levels of abstraction? They are merely in the same dimensional space. The authors only use a simple text prompt template that includes class names, which I believe yields abstract features, whereas point cloud features should contain more detailed features.
Methods and Evaluation Criteria
The proposed method and evaluation criteria are appropriate for the FSCIL problem. The evaluation on ModelNet, ShapeNet, ScanObjectNN and CO3D provides a solid benchmark, covering both real-world and synthetic 3D datasets. The metrics, including accuracy, harmonic mean accuracy, and relative accuracy drop, effectively measure both new class adaptation and forgetting mitigation.
Theoretical Claims
The paper does not provide formal theoretical proofs but makes claims about the effectiveness of Laplacian eigenmaps in preserving geometric structures and the dynamic geometric feature projection clusters in improving feature representation. While the method is conceptually plausible, the paper does not rigorously prove that the transformed features are explicitly geometry-aware.
Experimental Design and Analysis
The paper evaluates the method on four datasets and uses multiple metrics. The results indicate improved performance over baseline methods. However, there are some potential issues:
- No direct ablation study that analyzes the contribution of Laplacian eigenmaps or whether they truly enhance geometry-awareness and how.
- More detailed discussion of why the method performs better across datasets.
Supplementary Material
I reviewed the appendices, including the dataset partition, explanation of evaluation metrics, additional results and experiments on the number of basis vectors. I have no doubt about these.
Relation to Prior Literature
For the 3D FSCIL problem, previous works such as Microshapes proposed a universal description language to reduce domain discrepancies, and C3PR adapted CLIP to handle the FSCIL task. This paper solves the problem by integrating geometric features, reducing reliance on foundation models and complicated training strategies.
Missing Important References
N/A
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
- It is confusing that both the ground-truth label and the final feature representation computed from U are denoted by y.
- "Microshapes" is misspelled as "Micrpshapes" in places.
Thank you for your insightful feedback, which has helped us significantly improve the clarity and rigor of our manuscript. Below are our detailed responses:
Q1: Claims (Laplacian Eigenmaps, AdaptiveAvgPool1d, geometric feature) are not well supported
A1: (1) Laplacian Eigenmaps for Geometric Structure Extraction
Theoretically, Laplacian Eigenmaps minimizes $\sum_{i,j} \|y_i - y_j\|^2 W_{ij}$, where $y_i$ is the low-dimensional representation of data point $x_i$, and $W_{ij}$ reflects their proximity ($W_{ij} = 1$ if $x_i$ and $x_j$ are neighbors; $W_{ij} = 0$ otherwise). This ensures nearby points in the original space remain close in the reduced space.
Using the graph Laplacian $L = D - W$, where $D$ is the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$, the problem reduces to minimizing $\mathrm{tr}(Y^\top L Y)$ subject to $Y^\top D Y = I$. The eigenvectors of $L$ with the smallest non-zero eigenvalues capture the manifold's structure, preserving local geometry effectively. A similar theoretical analysis appears in [1].
Besides, we have additionally conducted an ablation study to validate the role of Laplacian Eigenmaps as given below, demonstrating a 1.8% improvement in accuracy when utilized. We have incorporated the comprehensive theoretical proof and analysis in our revised paper.
Average Accuracy (%):

| Laplacian eigenmaps | Session 0 | Session 1 | Session 2 | Session 3 |
|---|---|---|---|---|
| ✗ | 93.8 | 91.2 | 86.4 | 85.0 |
| ✓ | 93.8 | 91.9 | 87.5 | 86.8 |
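For concreteness, the embedding computation described above can be reproduced in a few lines of NumPy. This is an illustrative sketch (using a heat-kernel affinity variant of the neighbor graph), not the paper's implementation:

```python
import numpy as np

def laplacian_eigenmaps(X, sigma=2.0, n_components=1):
    # Heat-kernel affinity W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(1)) - W                 # graph Laplacian L = D - W
    vals, vecs = np.linalg.eigh(L)            # eigenvalues in ascending order
    # Skip the trivial constant eigenvector (eigenvalue 0).
    return vecs[:, 1:n_components + 1]

# Two well-separated 3D blobs: in the first non-trivial eigenvector,
# each blob stays tightly grouped while the two blobs stay apart.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (10, 3)), rng.normal(5.0, 0.1, (10, 3))])
Y = laplacian_eigenmaps(X)[:, 0]
```

The within-blob spread of `Y` is tiny relative to the gap between the blobs, which is the "nearby points remain close" property the proof above formalizes.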
A1: (2) AdaptiveAvgPool1d for Fine-Grained Features
We apologize for the confusion caused by the description of "fine-grained" features. You are right that traditional average pooling (AvgPool) typically extracts globally smoothed features. AdaptiveAvgPool1d in 3D-FLEG differs by dynamically adjusting pooling windows to capture local geometric statistics rather than fixed-window averaging. Specifically, it quantifies geometric attribute distributions across localized regions along the channel dimension, enabling fine-grained pattern extraction [2], where "fine-grained" refers to the statistical representation of local geometric details. We have revised our manuscript to clarify this distinction and articulate our design rationale.
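To make the distinction concrete, here is a minimal NumPy mimic of the AdaptiveAvgPool1d window rule (our illustration of the operator's semantics, not the paper's code): with a larger output size each entry averages only a small local window, while output size 1 collapses to the global average that traditional pooling produces.

```python
import numpy as np

def adaptive_avg_pool_1d(x, output_size):
    # Mimics torch.nn.AdaptiveAvgPool1d on a 1-D array: window boundaries
    # adapt to the input length so the output always has `output_size`
    # entries, each a local average.
    n = len(x)
    out = np.empty(output_size)
    for j in range(output_size):
        start = (j * n) // output_size
        end = -(-((j + 1) * n) // output_size)   # ceil division
        out[j] = x[start:end].mean()
    return out

x = np.arange(8.0)                        # [0, 1, ..., 7]
local = adaptive_avg_pool_1d(x, 4)        # four local windows of two elements
global_avg = adaptive_avg_pool_1d(x, 1)   # degenerates to the global mean
```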
A1: (3) Modality Abstraction Alignment
We have strengthened Section 4.3 to clarify the modality abstraction alignment:
A dual-pooling strategy (Eq. 5) extracts multi-scale geometric features, bridging the gap between text and point cloud details. The Transformer encoder (Eq. 6) then refines these features, emphasizing geometry relevant to text prompts and reducing noise.
Cross-entropy loss (Eq. 8) ensures consistency in the shared semantic space, aligning geometric and textual abstractions.
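A CLIP-style sketch of the classification step this alignment supports: cosine similarity between a fused feature and per-class text embeddings, trained with cross-entropy. The temperature `tau` and the exact form of Eq. 8 are our assumptions for illustration, not the paper's definitions.

```python
import numpy as np

def similarity_logits(feat, text_embeds, tau=0.07):
    # Cosine similarity between one fused feature (d,) and per-class text
    # embeddings (C, d), scaled by a CLIP-style temperature tau.
    f = feat / np.linalg.norm(feat)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return (t @ f) / tau

def cross_entropy(logits, label):
    # Numerically stable -log softmax(logits)[label].
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

rng = np.random.default_rng(1)
text_embeds = rng.normal(size=(4, 32))               # 4 classes, toy dim 32
feat = text_embeds[2] + 0.01 * rng.normal(size=32)   # feature near class 2
logits = similarity_logits(feat, text_embeds)
pred = int(np.argmax(logits))                        # predicted class index
```

Minimizing the cross-entropy pulls the fused geometric feature toward its class's text embedding in the shared space, which is the consistency property described above.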
Q2: Cross-Dataset Superiority
A2: Thank you for prompting this critical analysis. 3D-FLEG’s cross-dataset superiority stems from its geometry-centric design:
- Dynamic Projection Clusters capture task-invariant geometric patterns, which generalize across synthetic-to-real domains.
- Geometric Feature Embedding directly fuses these features with text semantics, bypassing domain-specific text variations.
This synergy enables 3D-FLEG to achieve 7% higher accuracy in cross-dataset tasks. We have revised Section 4.3 to clarify this mechanism.
Q3: Symbol Confusion and Misspellings
A3: Thank you for highlighting these inconsistencies. We have reviewed the manuscript and made corrections to the symbols and typographical errors to improve clarity and rigor.
Q4: Detailed Descriptions of Prompt and Training Strategies in Figure 1
A4: We appreciate your suggestion to improve Figure 1. We have detailed the experimental setup principles with relevant citations. For further details, please see our response to Reviewer 8rPH's Q2. The "Alignment Module and Dual Cache System" requires caching five samples per class, causing significant computational and memory overhead. In contrast, our geometric embedding module integrates geometric features directly with text prompts, eliminating the need for complex alignment training and caching.
Q5: Clarification on the Initial Features and Point Cloud Features
A5: To address this ambiguity, we have revised Section 3 to explicitly define both sets of features:
- The initial features: extracted from base-class data using the frozen Uni3D encoder; these are used to construct the dynamic geometric projection clusters.
- The point cloud features: extracted during incremental training by the same Uni3D encoder; these are dynamically reprojected through DGPC using learnable attention weights to obtain geometric features.
Your feedback has guided us in refining both the theoretical foundations and presentation of our work. All revisions have been incorporated into the manuscript.
References
[1] Laplacian eigenmaps for dimensionality reduction and data representation, Neural computation 2003.
[2] Point cloud segmentation of overhead contact systems with deep learning in high-speed rails, Journal of Network and Computer Applications 2023.
The paper proposes a model called 3D-FLEG for the 3D few-shot class incremental learning task. The model has a geometric feature extraction module that obtains geometric features through clustering and Laplacian eigenmaps, and it includes a geometric feature embedding module to fuse these geometric features with text features, considering modality heterogeneity.
Questions for the Authors
Please see the Experimental Designs part about the missing ablations. Providing the analysis on the mentioned ablation studies will further enhance the quality of the paper.
Claims and Evidence
The claim that the reliance on text prompts and training strategies limits the robustness and performance of few-shot class incremental learning is reasonable and well supported.
Methods and Evaluation Criteria
The proposed method is well explained, and the evaluation criteria and datasets used are appropriate for the task.
Theoretical Claims
There are no theoretical claims.
Experimental Design and Analysis
The main experiments are comprehensive and demonstrate the effectiveness of the proposed method. Nonetheless, some ablation studies are missing. For instance, the Geometric Feature Extraction Module uses spectral clustering as its first step, but the paper does not specify the number of clusters used in this step or analyze how varying this parameter affects performance. Additionally, it would be useful to know how the model's performance varies if the prompt style is changed, for example, to GPT-generated prompts.
Supplementary Material
I reviewed the additional results provided in the supplementary material.
Relation to Prior Literature
The proposed designs complement the broader literature and introduce new designs.
Missing Important References
The following papers are related to the paper's context and should be cited—including computing geometric features via clustering, improving generalization on novel classes, and fusing multimodal knowledge for novel class learning:
- ICCV 2023, Generalized Few-Shot Point Cloud Segmentation Via Geometric Words
- CVPR 2024, Rethinking Few-shot 3D Point Cloud Semantic Segmentation
- ICLR 2025, Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation
Other Strengths and Weaknesses
The paper is clearly written and easy to follow. The motivations for the design choices are clear and reasonable.
Other Comments or Suggestions
N/A
Thank you for your thoughtful and constructive feedback. We deeply appreciate your insights, which have guided us in refining our manuscript. Below, we outline the specific revisions made in response to your concerns:
Q1: Ablation Studies on the Number of Clusters
A1: We deeply appreciate your guidance on this critical aspect. We have added a detailed ablation analysis of the number of clusters in Table 6 of Appendix D (also given below), demonstrating how it affects model performance. Note that the cluster number equals the Basis Vectors Count in our experiments, as each cluster center is adaptively mapped to a corresponding basis vector within the projection space during dynamic projection cluster construction.
As shown in Table 6, aligning cluster numbers with the base model's feature dimension (1024 in our case) balances information retention for accuracy and the redundancy of basis vectors. We have also included additional sensitivity analyses on cluster update rates as given in Fig. 6 of Appendix D.
We have revised our manuscript to emphasize these findings more clearly in the main paper.
Average Accuracy (%) for Sessions 0–3 and Harmonic Accuracy (%) for Sessions 1–3:

| Basis Vectors Count | Avg. S0 | Avg. S1 | Avg. S2 | Avg. S3 | Harm. S1 | Harm. S2 | Harm. S3 |
|---|---|---|---|---|---|---|---|
| 256 | 93.4 | 91.5 | 86.4 | 85.0 | 86.8 | 76.2 | 76.8 |
| 512 | 93.2 | 91.5 | 86.3 | 85.9 | 87.0 | 76.2 | 77.1 |
| 1024 | 93.8 | 91.9 | 87.5 | 86.8 | 87.0 | 77.4 | 77.5 |
| 2048 | 93.6 | 91.6 | 87.5 | 86.7 | 87.3 | 77.7 | 77.7 |
| 4096 | 93.8 | 91.9 | 87.8 | 86.4 | 87.0 | 77.7 | 78.3 |
Q2: Impact of Prompt Style on Model Performance
A2: Thank you for highlighting the importance of prompt-style analysis.
We implemented a more comprehensive comparison of the 7 experimental configurations.
They include:
- SOTA Baselines: Methods (1)-(2) adopt strategies from C3PR [1] and FoundationModel [2].
- Cross-Combination Strategies: Methods (3)-(7) integrate various prompt styles with distinct training strategies.
Based on these results, we provide a new Figure 1 (available via the anonymous link) in our revised version.
As shown in Figure 1, the performance of our geometric embedding (Method 7) remains stable across prompt variations, indicating reduced dependency on prompt quality. This shows that geometric feature embedding alleviates reliance on prompt quality by encoding structural priors, ensuring stable performance even under variations in prompt style. Compared with the "Alignment Module + Dual Cache System" (Method 2), which caches 5 samples per class, our strategy replays only a single sample and achieves 7% higher accuracy with lower memory overhead.
Q3: Citing the Suggested Related Papers
A3: We sincerely appreciate the suggestion. The following works have been incorporated to strengthen our related works review:
- Geometric Word-Based Segmentation [3]: Validates cluster-driven feature learning, aligning with our dynamic projection cluster design. (We have added it to Section 3.3 in our revised paper.)
- 3D Few-Shot Generalization [4]: Highlights domain adaptation challenges, motivating our geometry-centric approach for cross-dataset robustness. (We have added it to Section 3.4 in our revised paper.)
- Multimodal Fusion [5]: Supports our cross-modal alignment strategy via joint geometric-textual optimization. (We have added it to Section 3.4 in our revised paper.)
These references are now cited in relevant sections as indicated in the corresponding bracket.
We hope these revisions have addressed your concerns. Should further clarifications or adjustments be needed, we are fully committed to incorporating your guidance. Thank you for your feedback, which is invaluable in refining our work.
References
[1] Canonical shape projection is all you need for 3d few-shot class incremental learning, ECCV 2024.
[2] Foundation Model-Powered 3D Few-Shot Class Incremental Learning via Training-free Adaptor, ACCV 2024.
[3] Generalized Few-Shot Point Cloud Segmentation Via Geometric Words, ICCV 2023.
[4] Rethinking Few-shot 3D Point Cloud Semantic Segmentation, CVPR 2024.
[5] Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation, ICLR 2025.
Thank you for the detailed rebuttal. Now my concerns have been addressed and I would update my recommendation to accept.
Dear reviewer,
We sincerely appreciate your time and constructive feedback throughout the review process. We are delighted to hear that our rebuttal has addressed your concerns and that you now recommend acceptance. Your insightful comments have significantly strengthened our paper, and we are grateful for your valuable contribution to improving our work.
Best wishes,
All authors
There is a shared favorable opinion that the authors present solid, comprehensive, and well-executed evaluations. The authors were enthusiastic about providing the requested ablation studies, and a couple of rounds of author-reviewer discussion led to a unanimously positive consensus. Note that multiple reviewers suggested that a theoretical analysis of the proposed method or modules would be beneficial for grasping the gist of the work. Since no significant weaknesses were found, the AC is glad to recommend acceptance of this paper.