On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation
Abstract
Reviews and Discussion
This paper explores parameter-efficient fine-tuning of 3D point cloud models for large-scale point clouds. It introduces a Geometric Encoding Mixer (GEM), which includes a Spatial Adapter for local fine-grained information refinement and a Context Adapter for global context modeling. Extensive experiments are conducted on various datasets, with comparisons against several fine-tuning methods from the 2D domain.
Strengths and Weaknesses
Strengths:
- The experiments are conducted on various datasets, and more experiments and visualizations are provided in the supplementary material.
- The idea to combine local spatial refinement and global context modeling is meaningful.
- The paper is well-organized.
Weaknesses:
- The core question is: compared with 2D large models, current point cloud models are actually not large, which challenges the necessity of exploring parameter-efficient tuning on point cloud models. As a result, more discussions and experiments are expected.
- In all tables, the comparison with the point cloud parameter-efficient fine-tuning methods discussed in the related work is neither extensive nor inclusive. This should be carefully discussed and compared, especially against the "tokenization" methods mentioned in the motivation of the paper (Line 7), considering both effectiveness and efficiency.
- In the proposed Context Adapter, the idea of using latent tokens in cross-attention to reduce computational cost is not novel, which has been explored in [1, 2].
- The efficiency comparison and the results on outdoor segmentation should be presented in the main paper, not only in the supplementary material. More importantly, as the main target of the proposed Context Adapter is to improve efficiency, an efficiency comparison specifically designed around whether this module is used or not should be provided.
- No qualitative results are provided in either the main paper or the supplementary material.
Refs:
[1] Perceiver: General Perception with Iterative Attention. ICML 2021.
[2] Perceiver IO: A General Architecture for Structured Inputs & Outputs. ICLR 2022.
Minors:
- In all tables, what do "Pct." and "Learn" mean? Any abbreviation should be explained in the table captions.
- L149 typo: Challenges.
- Empty citation at L274.
Questions
The authors are expected to carefully respond to Weaknesses 1~4 as mentioned above, especially focusing on providing more comparisons and discussing the novelty.
More importantly, a discussion of the motivation/necessity of exploring parameter-efficient learning on point cloud models is welcome.
Limitations
Yes.
Justification for Final Rating
I have carefully read the reviewer's rebuttal and other reviews.
Although this paper lacks some experiments and discussions, it deserves to be accepted.
However, the discussions and experiments should be carefully added, and the many minor issues should also be carefully revised in the final version.
Formatting Issues
N/A.
We thank the reviewer for the constructive feedback and address the main concerns in turn.
Q1: Necessity of PEFT for 3D point clouds
A1: We would like to discuss the necessity of 3D PEFT methods for scene understanding from the following perspectives.
(1) Rapid scaling of 3D models. Recent large point cloud transformers already exceed 100M parameters and operate on scenes with millions of points, leading to significant GPU memory and storage demands, especially when adapting to multiple downstream tasks.
Importantly, modern 3D backbones are rapidly approaching the scale of 2D vision foundation models such as ViT and SAM/SAM2, many of which fall in the 100M-300M parameter range. However, 3D inputs often involve far more tokens than their 2D counterparts (e.g., millions of unordered points vs. a 14×14 grid of 16×16 patches from a 224×224 image), making PEFT not only relevant but necessary for 3D scene understanding.
(2) Necessity of PEFT for large 3D scene models. While early PEFT methods like BitFit (on BERT-base, 110M) and Adapters (on ResNet, <50M) targeted moderate-sized models, recent works extend PEFT to ViT and SAM (~100M), such as VPT [1], AdaptFormer [2], RLRR [3], and DA-VPT [4], achieving strong performance with minimal trainable parameters. These developments highlight that PEFT is not limited to billion-scale models; rather, it is motivated by the trade-off between efficiency and performance, even at the 100M scale. In 3D vision, however, 3D PEFT [5, 6] has so far focused primarily on shape-level tasks (e.g., object classification and shape part segmentation) with models of around ~20M parameters. In contrast, scene-level understanding demands far higher input density, token complexity, and model capacity, making novel PEFT solutions for 3D scenes both necessary and timely.
(3) Growing demand for scalable 3D fine-tuning. 3D scene understanding has long been central to 3D vision but historically constrained by limited data due to the high cost of 3D data collection and annotation. Recently, however, the rise of spatial intelligence and embodied AI has spurred progress in generating 3D data at scale, including ego-view scenes, scene-level reconstruction, and 3D asset synthesis. Notably, feed-forward reconstruction methods like VGGT [7] now produce dense scene-level point clouds directly from video. As 3D data becomes increasingly accessible, model capacity is scaling rapidly. Yet fine-tuning 3D models with 100M+ parameters, especially in resource-constrained settings (e.g., online adaptation on edge devices), remains computationally expensive. Our results show that GEM matches full fine-tuning while updating <2% of parameters and converging faster. This makes GEM a compelling approach towards efficient and scalable deployment of existing and future large 3D models.
In summary, 3D scene understanding involves far more tokens, current 3D backbones already exceed 100M parameters and match the size of ViT/SAM models, and more data is expected to scale 3D models further, so scalable deployment is in demand. We therefore believe that PEFT methods for 3D scene understanding are both suitable and necessary at this stage; this also advocates broader interest in not only parameter-efficient but also generally efficient fine-tuning of large 3D models.
Q2: Further comparisons against tokenization methods
A2: We would like to extend the discussion as follows.
(1) As briefly discussed in L127-138, tokenization methods are mainly employed for object-level understanding (3D object classification, shape-part segmentation). The underlying assumptions of these methods do NOT transfer to scenes with millions of points.
Unlike 2D images, which can be split and resized into regular patches, tokenization heuristics such as farthest point sampling (FPS) and superpoints perform poorly on large-scale point clouds. Earlier scene-level approaches like PointNet++ [8] and SegCloud [9] explored such strategies and found that they sacrifice fine spatial detail. Modern 3D segmentation models instead rely on U-Net-like or transformer backbones with direct point processing and augmented attention variants to retain these details, but they are distinct from PEFT in this context.
(2) To address your concern specifically and test cross-domain applicability, we further apply GEM to the 3D shape dataset ShapeNetPart, and also extend leading 3D shape PEFT methods to 3D scenes, so that tokenization-based PEFT methods become comparable.
| Methods | Params. | Cls. / Inst. (mIoU in ShapeNet) |
|---|---|---|
| ReCon (ft.) | 27.06M | 84.52 / 86.1 |
| + PointLoRA [5] | 5.63M | 83.98 / 85.4 |
| + PointGST [6] | 5.59M | 83.98 / 85.8 |
| + GEM | 5.58M | 84.02 / 85.8 |

| Methods | Params. | Pct. | mIoU in ScanNet |
|---|---|---|---|
| Sonata (lin.) | 0.02M | 0.02% | 72.5 |
| + PointGST [6] | 2.2M | 2.0% | 74.8 |
We show that GEM still leads in shape analysis, where Cls. denotes per-class mIoU and Inst. denotes per-shape mIoU. Nonetheless, the extension to scenes obtains performance only on par with BitFit (74.7 mIoU) and lags behind GEM (78.3 mIoU). We also observe unstable training and encounter NaN losses when training the tokenization-based method, largely due to unstable FPS results across diverse 3D scenes, indicating that extending tokenization-based methods to 3D scenes is non-trivial.
Q3: Novelty of the Context Adapter (CA)
A3: We would like to clarify that the novelty of CA lies in its combination with SA (a voxel-based convolutional bottleneck) for sparse 3D point clouds and its seamless integration into a PEFT framework. Unlike Perceiver and Perceiver IO, which apply cross-attention to progressively augment latent token queries, our CA introduces a lightweight low-rank structure that remains robust across varying ranks (Tab. 4b) and is specifically designed to fuse with pretrained backbones without retraining them.
Moreover, Perceiver and Perceiver IO do not explicitly mix contextual information back into the original input; they only repeatedly use the latent tokens to query the input (as keys and values). In contrast, CA embeds global context directly into the backbone by mixing the original inputs and latent tokens bidirectionally.
As shown in Fig. 3, coupled with SA, which captures local structure, CA thus enhances the backbone representation specifically with geometric cues, even under strict tuning budgets.
This results in a distinct motivation (broadcasting geometric cues), architectural design (a lightweight side network as PEFT), and implementation tailored to the challenges of 3D PEFT (low-rank, bidirectional attention via latent tokens shared across layers), making our approach a clear departure from earlier works.
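For illustration, a simplified PyTorch sketch of the gather-and-broadcast latent cross-attention described above is shown below. The names, shapes, and single-head attention are illustrative simplifications rather than our released implementation, and in the full method the latent tokens are shared across layers.

```python
import torch
import torch.nn as nn

class ContextAdapterSketch(nn.Module):
    """Illustrative sketch: latent tokens exchange information with point
    features bidirectionally via low-rank cross-attention."""
    def __init__(self, dim, rank=48, num_latents=4, num_heads=1):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, rank) * 0.02)
        self.down = nn.Linear(dim, rank)   # low-rank projection of point features
        self.up = nn.Linear(rank, dim)     # back to the backbone width
        # Stage 1: latents query the point features -> gather scene context.
        self.gather = nn.MultiheadAttention(rank, num_heads, batch_first=True)
        # Stage 2: point features query the latents -> broadcast context back.
        self.broadcast = nn.MultiheadAttention(rank, num_heads, batch_first=True)

    def forward(self, x):                  # x: (B, N, dim) point features
        z = self.down(x)                   # (B, N, rank)
        lat = self.latents.unsqueeze(0).expand(x.size(0), -1, -1)  # (B, K, rank)
        lat, _ = self.gather(lat, z, z)    # latents summarize global context
        ctx, _ = self.broadcast(z, lat, lat)  # context mixed back into every point
        return x + self.up(ctx)            # residual injection into the backbone stream
```

Because every point interacts with only K latent tokens, the per-layer mixing cost scales linearly in the number of points rather than quadratically.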
Q4: Efficiency comparison and outdoor evaluation.
A4: Thank you for raising this. Due to strict space constraints, we prioritized the core method and ablation details in the main paper. These results will be integrated and discussed more prominently in the final version for improved clarity.
We further measure SA/CA overhead.
| Components | Memory | Latency |
|---|---|---|
| +SA | 4.4G | 50.5ms |
| +CA | 6.4G | 51.7ms |
The results show that, when employed separately, CA incurs more overhead than SA; we would suggest enabling CA when processing large scenes, as it provides global context beyond the local attention of the 3D backbone.
Q5: Qualitative results and visualizations.
A5: We appreciate the suggestion. While qualitative comparisons were explored during development, we found the initial visualizations to be affected by biases inherited from the pretrained backbone, which could be misleading without further context. To maintain clarity and focus in the main submission, we instead included attention map visualizations for demonstration. We will include additional qualitative comparisons in the camera-ready version to better highlight the benefits of local and global cues.
Q6: Minor issues
A6: Thank you for pointing them out; we will revise accordingly.
- "Pct." denotes the percentage of total model parameters that are learnable; "Learn." denotes the absolute learnable parameter count, in millions. We will revise the table captions to explain these abbreviations.
- L149 typo: "Chanlleges" -> "Challenges".
- L274: the citations should be [10,11] as listed in the references below.
References
[1] Visual Prompt Tuning. ECCV'22.
[2] AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition. NIPS'22.
[3] Low-Rank Rescaled Vision Transformer Fine-Tuning: A Residual Design Approach. CVPR'24.
[4] DA-VPT: Semantic-Guided Visual Prompt Tuning for Vision Transformers. CVPR'25.
[5] PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning. CVPR'25.
[6] Parameter-Efficient Fine-Tuning in Spectral Domain for Point Cloud Learning. TPAMI'25.
[7] VGGT: Visual Geometry Grounded Transformer. CVPR'25.
[8] PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. NIPS'17.
[9] Large-scale Point Cloud Semantic Segmentation with Superpoint Graphs. CVPR'18.
[10] How transferable are features in deep neural networks? NIPS'14.
[11] Characterizing and Avoiding Negative Transfer. CVPR'19.
We thank you again for your thoughtful review; we welcome any further questions or suggestions.
We sincerely thank you, Reviewer UJDM, for your recognition and for taking the time to review our paper. We will carefully revise the manuscript based on your insightful and constructive advice.
The current rebuttal basically addresses my concern.
I would like to raise my rating to Borderline Accept, as the discussions and experiments provided in the rebuttal are non-trivial, across nearly all the responses to the reviewers.
This paper introduces Geometry Encoding Mixer (GEM), a parameter-efficient fine-tuning (PEFT) method tailored for large-scale 3D point cloud transformers. GEM addresses the limitations of existing PEFT methods (e.g., LoRA, adapters, prompt tuning) by explicitly modeling local spatial patterns (via a lightweight 3D convolutional bottleneck) and global geometric context (via learned latent tokens). Experiments on ScanNet, ScanNet200, and S3DIS show that GEM matches or surpasses full fine-tuning performance while updating only 1.6% of parameters, with significant reductions in training time and memory.
Strengths and Weaknesses
Strengths:
- GEM achieves near-full fine-tuning performance with 1.6% parameter updates, reducing computational costs.
- GEM is robust to scarce data. With limited labeled data, GEM outperforms full fine-tuning strategy, demonstrating strong generalization ability.
Weaknesses:
- GEM is only feasible on transformer-based architectures, which limits the compatibility of the proposed PEFT method with other architectures.
- CA uses a fixed number of latent tokens. The paper notes that even one token works well, but dynamic token allocation (e.g., based on scene complexity) might further improve adaptability.
- The experiments are only conducted on indoor-scene datasets. The performance of GEM on outdoor-scene large-scale datasets (e.g., nuScenes) is not evaluated.
Questions
- The paper notes that CA avoids "noisy" global context but lacks analysis of failure cases. Under what conditions (e.g., heavy occlusion, sparse scans) does GEM underperform full fine-tuning?
- The CA uses a fixed number of latent tokens (default: 4). Have you experimented with adaptive token allocation (e.g., based on scene complexity or point density)? If not, could this further improve performance?
Limitations
yes
Justification for Final Rating
Thanks for the authors' rebuttal. The authors have addressed my concerns and I will keep the initial rating as Borderline accept
Formatting Issues
- There are some typos, e.g., in Line 149
- Line 274 lacks a citation.
- It would be better to remove the borders of Figure 1 (3).
- Table 4 lacks the bottom line.
Thank you for the helpful suggestions. We address each concern below.
Q1: Applicability beyond transformers.
A1: GEM is composed of two modular components: the Spatial Adapter (SA) and the Context Adapter (CA). SA is implemented as a lightweight convolutional bottleneck operating on local neighborhoods, making it agnostic to the backbone architecture. CA uses cross-attention with latent tokens and does not depend on the attention layers of the backbone.
Therefore, although our primary focus is transformer-based models, due to their dominance in 3D pre-training, GEM functions as a lightweight side network and is compatible with other architectures.
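As a schematic of this side-network view (illustrative wiring only, not our released code; any two `nn.Module` branches can stand in for SA and CA):

```python
import torch.nn as nn

def attach_gem(stage: nn.Module, spatial_adapter: nn.Module, context_adapter: nn.Module) -> nn.Module:
    """Wrap a frozen backbone stage with two residual adapter branches."""
    for p in stage.parameters():            # backbone weights stay frozen
        p.requires_grad = False

    class Wrapped(nn.Module):
        def __init__(self):
            super().__init__()
            self.stage, self.sa, self.ca = stage, spatial_adapter, context_adapter

        def forward(self, feats):
            feats = self.stage(feats)        # frozen backbone computation (any architecture)
            feats = feats + self.sa(feats)   # local geometric refinement
            feats = feats + self.ca(feats)   # global context injection
            return feats

    return Wrapped()
```

Only the adapter branches receive gradients, which is why the same wiring applies to convolutional backbones such as SparseUNet below.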
To demonstrate this, we integrate GEM into SparseUNet [1], a 3D convolutional model. We follow MSC [2] that pretrains on ScanNet and fine-tunes on ScanNet200.
| Methods | Params. | Pct. | mIoU |
|---|---|---|---|
| SparseUnet (ft.) | 39.2M | 100% | 32.0 |
| SparseUnet (lin.) | 0.02M | 0.05% | 1.5 |
| + BitFit | 0.03M | 0.07% | 4.8 |
| + Adapter | 0.6M | 1.5% | 9.5 |
| + LoRA | 0.8M | 2.0% | 13.2 |
| + GEM | 0.6M | 1.5% | 15.2 |
GEM improves over linear probing by +13.7 mIoU and also leads the other PEFT methods, demonstrating GEM's generalizability beyond transformer models. We also notice that the gains are narrower than those for transformer models, largely due to the limited capacity of convolutions, as indicated by the near-collapse of linear probing.
Q2: Adaptive latent tokens.
A2: Thanks for the suggestion.
(1) As a preliminary result, our ablation in Tab. 4b shows that GEM is stable across different numbers of latent tokens: the mIoU varies only marginally when ranging from 1 to 8 tokens. The default setting of 4 tokens offers an effective trade-off between accuracy and efficiency.
(2) We also implement a straightforward latent token allocation scheme: we apply spatial pooling to the input point cloud and treat the pooled features as latent tokens, thereby allocating more tokens to denser regions. This variant obtains 77.6 mIoU with a similar parameter budget (1.7M, 1.6% of the Sonata backbone). While it also improves over existing PEFT methods, it remains inferior to GEM. We believe the pooling operation injects local information into the tokens, steering them toward local aggregation, similar to the local attention of the 3D backbone, and thereby compromising their capability to complement the backbone with global scene context.
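A minimal sketch of such pooling-based allocation (grid-average pooling is one plausible instantiation; the cell size and exact pooling operator here are illustrative, not our implemented variant):

```python
import torch

def pooled_latents(coords, feats, cell=1.0):
    """Average point features over occupied grid cells and use the resulting
    per-cell features as latent tokens.
    coords: (N, 3) point coordinates; feats: (N, C) point features."""
    cell_idx = torch.floor(coords / cell).long()                  # assign points to cells
    _, inverse = torch.unique(cell_idx, dim=0, return_inverse=True)
    num_cells = int(inverse.max().item()) + 1
    latents = torch.zeros(num_cells, feats.size(1), device=feats.device)
    counts = torch.zeros(num_cells, 1, device=feats.device)
    latents.index_add_(0, inverse, feats)                         # sum features per cell
    counts.index_add_(0, inverse, torch.ones_like(feats[:, :1]))
    return latents / counts.clamp_min(1.0)                        # mean-pooled latent tokens
```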
Q3: Evaluation on outdoor datasets.
A3: We thank the reviewer for bringing this up. In Tab. 8 (supp), we evaluate transfer from an indoor-pretrained backbone to the outdoor scenes of SemanticKITTI. GEM improves over linear probing by +15 mIoU, narrowing the gap to a model trained directly on expensive outdoor scenes. This confirms GEM's ability to generalize beyond indoor scenes without retraining from scratch. We will include this experiment, as well as the additional settings from Sec. C (supp), in the revision, as more space will be available.
Q4: Failure‑case analysis.
A4: We appreciate the request for deeper analysis. We find GEM underperforms full fine-tuning in several challenging scenarios.
(1) When given a tight parameter budget, such as rank = 1, the performance of PEFT methods, including GEM, degrades. As shown in Fig. 1 (c) and Tab. 7 (supp), GEM underperforms full fine-tuning when the budget falls below a certain threshold.
(2) When the domain gap is very large, PEFT methods, including GEM, cannot match full fine-tuning performance, e.g., when transferring to outdoor scenes from an indoor-pretrained backbone, as shown in Tab. 8 (supp) and discussed in A3.
Q5: Typos, citations, and formatting.
A5: Thank you for pointing these out. We will correct them in the revision.
- L149: "Chanlleges" -> "Challenges"
- L274: citations should be [3, 4], as listed in the references below
- Fig.1 (c): We will crop out the border for a cleaner presentation.
- Tab. 4: This is by design, but a bottom rule will be added for better consistency with the other tables.
References
[1] 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks. CVPR'19.
[2] Masked Scene Contrast: A Scalable Framework for Unsupervised 3D Representation Learning. CVPR'23.
[3] How transferable are features in deep neural networks? NIPS'14.
[4] Characterizing and Avoiding Negative Transfer. CVPR'19.
Thank you again for your thoughtful review; we welcome any further questions or suggestions.
This paper presents GEM (Geometry Encoding Mixer), a parameter-efficient fine-tuning (PEFT) method for 3D scene segmentation using pretrained point cloud transformers. Recognizing the limitations of existing PEFT techniques (e.g., LoRA, adapters, prompt tuning) in modeling spatial and geometric structures inherent in 3D point clouds, the authors propose two key components: a Spatial Adapter (SA) leveraging 3D convolution to capture local geometry and a Context Adapter (CA) using latent tokens to inject global scene context. Evaluated on ScanNet, S3DIS, and ScanNet200, GEM achieves performance comparable to full fine-tuning while updating only 1.6% of parameters.
Strengths and Weaknesses
Strengths
- Clear motivation: Highlights limitations of existing PEFT approaches in the 3D domain due to lack of spatial awareness.
- Novel adaptation design: Combines 3D convolution (for local refinement) and latent token attention (for global context) in a lightweight framework.
- Strong empirical results: Demonstrates competitive or superior performance on multiple datasets and settings.
Weaknesses
- Missing direct comparisons with other 3D PEFT methods. While classic PEFT methods (e.g., LoRA, prompt tuning) are included, the paper does not compare against recent 3D-specific PEFT works.
- Method description is imbalanced. A large portion (Lines 147–185) is devoted to reviewing prior PEFT works, while the core GEM method (Lines 186–223) is described relatively briefly. It lacks an intuitive explanation or architectural figures for how/why SA and CA work in practice.
- GEM is evaluated primarily with Sonata and PTv3-PPT. No results on other architectures, making it unclear how transferable GEM is across different 3D backbones.
- SA and CA mechanisms lack theoretical or empirical justification. The paper claims that SA captures local geometry and CA captures global context—but these claims are only supported by performance gain, without deeper analytical or theoretical backing.
Questions
- The experiments only compare classic fine-tuning methods such as prompt tuning and LoRA. Can they be compared with 3D PEFT related work, and what is the effect?
- GEM seems to be a fine-tuning method, but the experiment is only based on the Sonata baseline. If we switch to other baselines, can it still be effective?
- Is there any experimental or theoretical support for the author's claim that the SA and CA modules respectively capture local geometric structures and global scene context?
- Will SA/CA introduce memory bottlenecks and significant computational overhead for dense scenarios?
Limitations
The authors discussed some limitations of their work. The following limitations also need to be discussed.
- Dependence on pretrained Point Transformer-style architectures may limit general applicability.
- While parameter count is reduced, training still requires backpropagation into early layers, which may impact training efficiency on dense point clouds (as acknowledged by the authors).
- Ablation studies lack theoretical insights into why SA and CA are effective.
- Experiments use only indoor datasets; performance in unstructured outdoor environments (e.g., forests, construction sites) is unknown.
Justification for Final Rating
The authors have addressed my first two key concerns.
Formatting Issues
No
Thank you for the constructive comments. We address your concerns as follows.
Main Comments
Q1: Comparison to other 3D‑specific PEFT methods.
A1: Thanks for the suggestion. (1) We would like to clarify that, to the best of our knowledge, we are among the first to systematically investigate parameter-efficient fine-tuning (PEFT) for large-scale 3D scene understanding. As a result, there are no comparable existing 3D PEFT baselines specifically designed for scene-level tasks, e.g., 3D scene segmentation, making our research a novel and underexplored setting.
Existing 3D PEFT methods focus almost exclusively on object-level understanding (3D object classification, shape-part segmentation). The underlying assumptions of these methods do NOT transfer to scenes with millions of points:
- Point count and complexity. Object-level methods operate on a few thousand points, allowing costly operations like spectral transforms, graph construction, or full attention. Scene-level datasets involve millions of points, making such operations intractable.
- Regular point distribution. Shape-level inputs are often surface-regular and allow aggressive downsampling, e.g., farthest-point sampling. This leads to compact token sets, typically ~128 tokens, which contrasts with the irregular, unstructured distributions in real-world scenes that require preserving fine-grained geometry.
While earlier scene-level approaches like PointNet++ [1] or SegCloud [2] used tokenization heuristics (e.g., FPS, superpoints), such strategies sacrifice fine spatial detail. Modern 3D segmentation models now rely on U-Net-like or transformer backbones with direct point processing and augmented attention variants to retain these details, but they are distinct from PEFT in this context.
(2) To address your concern specifically and test cross-domain applicability, we further apply GEM to the 3D shape dataset ShapeNetPart, where it still leads in performance.
| Methods | Params. | Cls. / Inst. (mIoU) |
|---|---|---|
| ReCon (ft.) | 27.06M | 84.52 / 86.1 |
| + PointLoRA | 5.63M | 83.98 / 85.4 |
| + PointGST | 5.59M | 83.98 / 85.8 |
| + GEM | 5.58M | 84.02 / 85.8 |
(3) Furthermore, we also implement a direct extension of a recent 3D PEFT method from object tasks to scene understanding. Specifically, we extend PointGST [4] and evaluate it on ScanNet based on the released code.
| Methods | Params. | Pct. | mIoU |
|---|---|---|---|
| Sonata (lin.) | 0.02M | 0.02% | 72.5 |
| + PointGST | 2.2M | 2.0% | 74.8 |
Although PointGST achieves the state-of-the-art PEFT performance on 3D shapes, its application to 3D scenes with millions of points performs only on par with BitFit, which trains only the biases (74.7 mIoU), and lags behind other PEFT methods such as LoRA (76.7 mIoU), as well as the proposed GEM (78.3 mIoU).
In addition, we notice unstable training of PointGST: on average, we had to resume from a NaN loss twice per training run, likely due to the unstable FPS results across diverse 3D scenes. As FPS is shared across 3D shape methods, this observation indicates a significant gap between PEFT for 3D shapes and PEFT for large-scale scenes.
Q2: Lacks explanation on SA and CA.
A2: We would like to extend the discussion as follows.
(1) In fact, we illustrate CA and SA in Fig. 2, with their overall placement indicated in Fig. 1, alongside all other common PEFT methods.
(2) Although we strove to provide a clear motivation for the design of the proposed GEM, we acknowledge that the presentation of existing PEFT methods may introduce too many formulas, which could be overwhelming. In the revision, we will follow your suggestion to shorten the formulas and provide a more concise presentation, e.g., using compact notation instead of the expanded forms.
Q3: Extending to other baselines.
A3: We have implemented our method on Sonata and PointTransformerV3 (PPT), both of which are transformer-based models.
Since our method can be regarded as a lightweight side network, it can be seamlessly incorporated into other baselines. We therefore further extend our method to SparseUNet [3], a 3D convolutional network with a drastically different design.
Specifically, we follow MSC [5], which pretrains SparseUNet on ScanNet and fine-tunes on ScanNet200.
| Methods | Params. | Pct. | mIoU |
|---|---|---|---|
| SparseUnet (ft.) | 39.2M | 100% | 32.0 |
| SparseUnet (lin.) | 0.02M | 0.05% | 1.5 |
| + BitFit | 0.03M | 0.07% | 4.8 |
| + Adapter | 0.6M | 1.5% | 9.5 |
| + LoRA | 0.8M | 2.0% | 13.2 |
| + GEM | 0.6M | 1.5% | 15.2 |
GEM leads the other PEFT methods. This result, together with GEM's performance on 3D shapes (A1), demonstrates the applicability of GEM to various baselines. We also notice that the gains are narrower than those for transformer models, potentially due to the limited capacity of convolutions, as indicated by the near-collapse of linear probing.
Q4: Theoretical and empirical justification for SA and CA.
A4: We agree that deeper theoretical analysis would be beneficial and will present more details in the revision. Meanwhile, we would like to highlight the following.
(1) For the theoretical aspects, L200-204 and L212-215 lay out how the design of SA and CA reflects our motivation and explain their functional roles. Below, we elaborate further.
- Convolution captures local spatial features. Convolution applies the same learnable kernel to every neighborhood, enforcing weight sharing and translation equivariance [6] and introducing a locality inductive bias [7, 8]. In 3D processing, a typical 3D kernel (3x3x3) therefore learns filters that respond to local geometric primitives [1, 9] and can serve as a form of positional encoding [10].
- Attention models global dependencies. Attention computes a weighted sum over all tokens, giving each query a direct path to any key in a single layer [11], thus mixing information across arbitrarily distant positions and approximating long-range correlations [12, 13]. Further theory shows that attention admits a low-rank factorization that retains the dominant global modes [14], which has also been validated in vision tasks [15, 16].
In the proposed method, SA implements a lightweight convolutional bottleneck, and CA is essentially attention with latent tokens as a low-rank approximation. Therefore, we are confident that SA and CA reflect our motivation to capture local spatial features and provide global context.
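As a back-of-the-envelope note (complexity only, not a formal guarantee): with N points, channel width C, and K latent tokens, the gather-and-broadcast form of CA keeps context mixing linear in N, and the induced point-to-point interaction factors through the K tokens, i.e., has rank at most K:

```latex
\underbrace{\mathcal{O}\!\left(N^{2}C\right)}_{\text{full self-attention over points}}
\;\longrightarrow\;
\underbrace{\mathcal{O}\!\left(NKC\right)}_{\text{latent gather + broadcast}},
\qquad K \ll N .
```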
(2) For empirical justification, we provide a qualitative analysis in Fig. 3 and the discussion around L311-L321, demonstrating that GEM captures explicit geometric cues.
In addition, Tab. 4(a) reports isolated SA and CA variants; each surpasses the linear baseline, and their combination yields the full gain, confirming their complementary roles.
Q5: Memory/computation overhead and training efficiency in dense scenes.
A5: The paper discusses these concerns and also provides preliminary analysis.
(1) For memory and computational overhead, L200-204 and L212-215 provide a theoretical analysis of the complexity of SA and CA, which matches that of other PEFT methods such as Adapter and Prompt Tuning.
We also provide an empirical study of the overhead in Tab. 5 (supp). Although CA and SA introduce cost on top of the backbone, the runtime and memory footprint stay largely on par with other PEFT methods, as the complexity analysis indicates.
In dense scenes, we operate at the same resolution as the backbone network, which keeps the overhead manageable.
(2) For training efficiency, as discussed in L326–330 and Sec. B (supp), GEM remains lightweight and converges faster than full fine-tuning, using far fewer epochs (100 instead of 800).
In addition, while it is tempting to avoid back-propagation into early layers, we find that fine-tuning the early encoder layers is more performant. Specifically, full fine-tuning (Tab. 1) gives 79.4 mIoU and fine-tuning a newly attached decoder gives 79.1 mIoU, whereas GEM reaches 79.5 mIoU, as shown in Tab. 6 (supp). However, revealing the underlying mechanism is beyond the scope of this work.
In general, we acknowledge the difficulty of handling dense point clouds: it is a challenge shared across the field of 3D vision and requires joint effort.
Q6: Performance in outdoor environment.
A6: We would like to point out that we explored transferring to outdoor scenes in Tab. 8 (supp). We evaluate under a challenging setting: transferring an indoor-pretrained backbone to an outdoor benchmark, SemanticKITTI. GEM effectively narrows the gap to a model fully trained from scratch on outdoor data.
References
[1] PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. NIPS'17.
[2] Large-scale Point Cloud Semantic Segmentation with Superpoint Graphs. CVPR'18.
[3] 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks. CVPR'19.
[4] Parameter-Efficient Fine-Tuning in Spectral Domain for Point Cloud Learning. TPAMI'25.
[5] Masked Scene Contrast: A Scalable Framework for Unsupervised 3D Representation Learning. CVPR'23.
[6] Gradient-based learning applied to document recognition. Proc of the IEEE'98.
[7] On Translation Invariance in CNNs: Convolutional Layers can Exploit Absolute Spatial Location. CVPR'20.
[8] Visualizing and Understanding Convolutional Networks. ECCV'14.
[9] KPConv: Flexible and Deformable Convolution for Point Clouds. ICCV'19.
[10] Conditional Positional Encodings for Vision Transformers. ICLR'23.
[11] Attention Is All You Need. NIPS'17.
[12] Limits to Depth Efficiencies of Self‑Attention. NIPS'20.
[13] Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling. NIPS'24.
[14] Unveiling the Hidden Structure of Self‑Attention via Kernel PCA. NIPS'24.
[15] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR'20.
[16] Focal Self-attention for Local-Global Interactions in Vision Transformers. NIPS'21.
Thank you again for your thoughtful review; we welcome any further questions or suggestions.
Thanks for the authors' rebuttal. The authors have addressed my first two key concerns, and I would like to raise my rating to Borderline accept.
We appreciate your (Reviewer A47U’s) thoughtful comments and encouraging feedback on our work. We will incorporate your advice with great care in our revision.
The authors introduce the Geometric Encoding Mixer (GEM), a novel geometry-aware PEFT module specifically designed for 3D point cloud transformers. GEM explicitly integrates fine-grained local positional encodings with a lightweight latent attention mechanism to capture comprehensive global context, thereby effectively addressing the spatial and geometric distribution mismatch. Extensive experiments demonstrate that GEM achieves performance comparable to or sometimes even exceeding full fine-tuning, while only updating 1.6% of the model’s parameters, fewer than other PEFT methods.
Strengths and Weaknesses
Strengths
- The writing is logical and easy to follow
- The experimental part of the article shows competitive results
- While achieving a significant performance improvement, it does not increase the computational burden.
Weakness
- Please add experiments on other datasets, similar to Tab. 3
- Regarding the design of the Spatial adapter, I am curious whether using a simple 3D convolution instead can achieve the same effect?
Questions
Please refer to the weakness
Limitations
Yes
Formatting Issues
No
Thank you for your valuable suggestions. We provide our feedback as follows.
Q1: More experiments on other datasets similar to Tab. 3.
A1: We appreciate the suggestion. As you may have noticed, due to the space limit, we include the extended experiments, with additional datasets and settings, in the supplementary material (Tab. 6–8). We will incorporate both these experiments and their discussion into the main paper in revision.
Q2: Simple 3D convolution instead of Spatial Adapter (SA).
A2: We follow the setting of our ablations and conduct a controlled experiment on the ScanNet dataset to explore your suggestion. With our Spatial Adapter (SA) replaced by a simple 3D convolution (denoted SA-conv), we summarize the results below:
| Methods | Params. | Pct. | mIoU |
|---|---|---|---|
| Sonata (lin.) | 0.02M | 0.02% | 72.5 |
| + SA | 1.2M | 1.0% | 77.2 |
| + SA-conv | 72.9M | 40.2% | 71.1 |
| + GEM | 1.8M | 1.6% | 78.3 |
| + GEM (SA-conv) | 73.6M | 40.4% | 71.6 |
As shown in the table, using a simple 3D convolution drastically increases the number of trainable parameters (~40% vs. 1%) and degrades performance (71.1 vs. 77.2 mIoU for SA; 71.6 vs. 78.3 for GEM). This supports our design choice.
We analyze that the core issue is that 3D kernels (e.g., a common 3x3x3 kernel) introduce $27C^2$ parameters, where $C$ denotes the backbone channel width. This is significantly more than a transformer block with qkv-projection, out-projection, and the feed-forward network, which requires only about $12C^2$ parameters in total (assuming the standard 4x FFN expansion). At the fine-tuning stage, especially with limited data, the plain 3D convolution thus triggers overfitting and optimization difficulties, negatively impacting the adaptation of the backbone model.
Instead, SA circumvents this issue with its lightweight bottleneck design. It thus captures local geometry with a much better balance between expressiveness and parameter efficiency. This allows GEM to perform well within a ~1.6% parameter budget and to converge in fewer training epochs.
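To make the parameter argument concrete, here is a hedged sketch of a bottleneck-style spatial adapter (dense Conv3d for readability, although the actual SA operates on sparse voxel features; the channel width C = 512 and rank r = 32 used in the comments are illustrative values, not our exact configuration):

```python
import torch
import torch.nn as nn

class SpatialAdapterSketch(nn.Module):
    """Illustrative bottleneck: down-project, mix 3x3x3 neighborhoods at low
    rank, up-project, and add back residually. With C = 512 and r = 32:
      full-width 3x3x3 conv : 27 * C^2             ~ 7.1M parameters
      bottleneck (below)    : 2 * C * r + 27 * r^2 ~ 60K parameters
    (biases omitted from the count)."""
    def __init__(self, dim: int = 512, rank: int = 32):
        super().__init__()
        self.down = nn.Conv3d(dim, rank, kernel_size=1)               # C -> r
        self.local = nn.Conv3d(rank, rank, kernel_size=3, padding=1)  # local geometry mixing
        self.up = nn.Conv3d(rank, dim, kernel_size=1)                 # r -> C
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, D, H, W) voxelized features (dense stand-in for sparse voxels)
        return x + self.up(self.act(self.local(self.act(self.down(x)))))
```

The same reasoning explains the gap between SA (1.2M) and SA-conv (72.9M) in the table above.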
Thank you again for your thoughtful review; we welcome any further questions or suggestions.
Dear Reviewers,
As the author-reviewer discussion period will end soon (Aug 6, 11:59 PM AoE), please take a moment to read the authors’ responses and post a reply - either to acknowledge their clarifications or to raise any remaining concerns.
Thank you for your time and contributions to the review process.
Best regards,
AC
All four reviewers ultimately recommended borderline accept or accept. This paper proposes the Geometric Encoding Mixer (GEM), a geometry-aware PEFT method for large-scale 3D scene segmentation, showing strong empirical results while updating only 1.6% of parameters. Reviewers initially raised concerns mainly about limited comparisons and the justification of design choices, but the authors provided a thorough rebuttal. While some limitations remain, the paper addresses most concerns. Given the overall consensus and the paper's contribution, the AC recommends acceptance.