DyPolySeg: Taylor Series-Inspired Dynamic Polynomial Fitting Network for Few-shot Point Cloud Semantic Segmentation
Abstract
Reviews and Discussion
The authors propose an interesting method to address the problem of few-shot point cloud semantic segmentation. Specifically, they propose a local feature aggregation convolution inspired by Taylor series and build the entire model backbone on this basis. To address few-shot point cloud semantic segmentation specifically, the authors propose a novel PCM module, a lightweight component that effectively reduces the discrepancy between gallery (support) set features and query set features through self-attention and cross-attention mechanisms. The effectiveness of the method is verified on multiple datasets and tasks.
Questions for Authors
I do not have any other specific questions; I would like the authors to address the questions raised in the weaknesses.
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
Yes. I checked some of their design theory. They designed DyPolyConv starting from Taylor series and provided the geometry of this convolution in the supplementary material.
Experimental Design and Analyses
Yes. I think the authors’ experiments are quite comprehensive. They not only conducted experiments and ablation experiments on few-shot point cloud semantic segmentation tasks, but also conducted experiments on point cloud classification tasks.
Supplementary Material
Yes. I saw their visualization of the geometry of HoConv in the supplementary material and I thought the convolution was interesting.
Relation to Existing Literature
The author mentioned the most advanced work in the introduction, related work and experimental comparison. I think the author's work is at the forefront in this direction.
Essential References Not Discussed
No
Other Strengths and Weaknesses
- This paper proposes a very expressive point cloud representation network, especially in few-shot scenarios.
- Specifically, the DyPolyConv proposed by the author is very interesting. It provides a new perspective for modeling the geometric structure of point clouds and has strong interpretability.
- The author also provides a lightweight PCM module to reduce domain differences.
Although I think the author's method is novel and effective in some aspects, there are several issues that need to be addressed:
- Although the method achieves a certain degree of advancement, the complexity of the model also needs to be considered, yet I did not see any analysis of model complexity in the article. I suggest that the authors provide such an analysis.
- Regarding DyPolyConv, I see that it is a unified point cloud representation convolution whose two special cases are ABF and RBF, but the article does not provide their sources; the authors need to cite them properly.
- The author may need to provide more detailed motivation and analysis for the design of PCM to help understand how it effectively solves the prototype bias problem in few-shot learning.
- It is recommended that the authors add an introduction to Mamba.
Other Comments or Suggestions
- The caption of Figure 1 is too long;
- The clarity of Figure 3 can be adjusted.
Thank you for your recognition of our method and detailed review. We have responded to your concerns in detail and hope to address all your questions.
Concern 1: Although the method achieves a certain degree of advancement, the complexity of the model also needs to be considered, yet I did not see any analysis of model complexity in the article. I suggest that the authors provide such an analysis.
Reply: Thank you for your question. We have analyzed the model complexity of DyPolySeg and will include this in the final version. Specifically, our complete model contains 1.42M parameters, with DyPolyConv accounting for 0.97M parameters and PCM comprising only 0.1M parameters. For computational cost, DyPolySeg requires 1.08 GB of GPU memory and has an inference time of 21 seconds for a 2-way-1-shot task on the S3DIS dataset (S0 split) using an NVIDIA RTX 4090. This is comparable to Seg-PN while achieving significantly better performance, and much more efficient than heavier models such as PAP3D (2.45M parameters). Additionally, the training phase of our model requires 1.2 hours, which is shorter than PAP3D (4.7 h).
Concern 2: Regarding DyPolyConv, I see that it is a unified point cloud representation convolution whose two special cases are ABF and RBF, but the article does not provide their sources; the authors need to cite them properly.
Reply: Thank you for your suggestion. We will properly cite the sources for ABF and RBF in the final version. Specifically, for Radial Basis Functions we will cite Buhmann (2003, "Radial Basis Functions: Theory and Implementations"), and for Affine Basis Functions we will cite Duchon (1977, "Splines minimizing rotation-invariant semi-norms in Sobolev spaces"). These references provide the theoretical foundations for these special cases of our DyPolyConv.
Concern 3: The author may need to provide more detailed motivation and analysis for the design of PCM to help understand how it effectively solves the prototype bias problem in few-shot learning.
Reply: Thank you for this valuable suggestion. We will substantially expand our explanation of PCM's motivation, design principles, and effectiveness in addressing prototype bias in the revised paper. The fundamental challenge in few-shot point cloud segmentation is that limited support samples often fail to fully represent the distribution of their respective classes, creating biased prototypes that lead to inaccurate feature matching. This "prototype bias" is particularly severe in point cloud data due to geometric variations within the same semantic category. Our PCM addresses this challenge through two complementary mechanisms: (1) Self-Enhancement Module (SEM): This component learns the internal feature distribution patterns within each set independently. By computing self-correlation matrices (Eq. 18) and generating attention-based prototypes (Eq. 19-20), SEM captures class-specific characteristics that might be missing from sparse support samples. This self-attention mechanism effectively expands the representational capacity of prototypes beyond the limited samples provided. (2) Interactive Enhancement Module (IEM): While SEM focuses on internal distributions, IEM establishes fine-grained feature correspondences between support and query sets through cross-correlation (Eq. 21). This bidirectional knowledge transfer refines prototypes by incorporating query-specific contextual information, allowing adaptation to the target scene's characteristics.
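To make the mechanism concrete, the sketch below illustrates the two-stage enhancement in simplified form. It is a minimal illustration only: the tensor shapes, single-head attention form, and function names are assumptions made for exposition and do not reproduce our actual implementation.

```python
import torch
import torch.nn.functional as F

def self_enhance(feats):
    """SEM-style step: refine the features of one set via its own correlations.

    feats: (N, C) point features of a single set (support or query).
    Features are re-weighted by a softmax-normalized self-correlation matrix,
    so under-represented patterns can borrow from similar points in the set.
    """
    attn = F.softmax(feats @ feats.t() / feats.shape[-1] ** 0.5, dim=-1)  # (N, N)
    return attn @ feats                                                    # (N, C)

def interactive_enhance(prototypes, query_feats):
    """IEM-style step: adapt class prototypes using query context.

    prototypes:  (K, C) one prototype per class (e.g. masked averages of support features).
    query_feats: (M, C) query point features.
    Cross-correlation lets each prototype absorb query-specific context,
    which is the prototype-bias reduction discussed above.
    """
    attn = F.softmax(prototypes @ query_feats.t() / prototypes.shape[-1] ** 0.5, dim=-1)
    return prototypes + attn @ query_feats  # residual refinement

# Toy 2-way episode: 512 support points, 2048 query points, 64-dim features.
support = torch.randn(512, 64)
query = torch.randn(2048, 64)
masks = torch.randint(0, 2, (2, 512)).float()                              # per-class binary masks

support = self_enhance(support)
prototypes = (masks @ support) / masks.sum(1, keepdim=True).clamp(min=1)   # (2, 64) initial prototypes
prototypes = interactive_enhance(prototypes, self_enhance(query))

logits = F.normalize(query, dim=-1) @ F.normalize(prototypes, dim=-1).t()  # cosine similarity scores
pred = logits.argmax(dim=-1)                                               # per-point class labels
```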
Concern 4: It is recommended that the authors add an introduction to Mamba.
Reply: We appreciate this suggestion. We will include a dedicated subsection explaining Mamba's architecture for point cloud processing.
Mamba is a state-of-the-art sequence modeling architecture introduced by Gu et al. (2023, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces") that employs Selective State Space Models (SSMs) for efficient long-range dependency modeling. Unlike attention-based mechanisms that scale quadratically with sequence length, Mamba achieves linear complexity through structured state space representations. In our context, Mamba's key components offer specific advantages for point cloud processing: (1) Selective State Space Module (SSM): Enables efficient sequence modeling with Θ(L) complexity (where L is sequence length) compared to Θ(L²) in attention-based approaches. This is particularly advantageous for processing dense point clouds. (2) Data-dependent Selection Mechanism: Dynamically adjusts receptive fields based on input characteristics, allowing adaptive focus on relevant spatial regions in point clouds. (3) Bidirectional Processing: Captures global context from multiple directions, complementing our DyPolyConv's local geometric modeling by providing crucial long-range dependencies.
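For intuition, the toy sketch below runs a discretized state-space recurrence over a serialized point sequence to show why the cost grows linearly with length. It is purely illustrative: the actual Mamba implementation uses hardware-aware parallel scans, learned discretization, and gating that are omitted here, and all variable names below are placeholders.

```python
import torch

def selective_ssm(x, A, W_B, W_C, W_dt):
    """Toy selective state-space scan (illustrative only).

    x: (L, D) input sequence of L tokens (e.g. serialized points).
    A: (D, N) fixed negative state matrix (diagonal per channel).
    W_B, W_C, W_dt: projections that make B, C, and the step size dt
    input-dependent -- the "selective" part.
    The loop is O(L) in sequence length, unlike O(L^2) attention.
    """
    L, D = x.shape
    N = A.shape[1]
    h = torch.zeros(D, N)                                       # hidden state per channel
    ys = []
    for t in range(L):
        dt = torch.nn.functional.softplus(x[t] @ W_dt)          # (D,) input-dependent step size
        B = x[t] @ W_B                                           # (N,) input-dependent input matrix
        C = x[t] @ W_C                                           # (N,) input-dependent readout
        A_bar = torch.exp(dt.unsqueeze(-1) * A)                  # (D, N) discretized decay
        h = A_bar * h + dt.unsqueeze(-1) * B * x[t].unsqueeze(-1)  # state update
        ys.append(h @ C)                                         # (D,) output at step t
    return torch.stack(ys)                                       # (L, D)

# Toy usage: 1024 serialized points with 32-dim features, state size 16.
L, D, N = 1024, 32, 16
x = torch.randn(L, D)
A = -torch.rand(D, N)                    # negative values give a stable decaying state
W_B, W_C = torch.randn(D, N), torch.randn(D, N)
W_dt = torch.randn(D, D)
y = selective_ssm(x, A, W_B, W_C, W_dt)  # (1024, 32)
```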
The authors addressed all of my concerns. Having also read the reviews and rebuttals of the other reviewers, I consider the authors' work to be excellent, and I am going to raise my score.
The authors propose a novel framework for semantic segmentation of few-shot point clouds, called DyPolySeg. This framework consists of two parts. The first part is composed of an encoder and a decoder for representation learning of point clouds. In the encoder part, the authors propose a novel DyPolyConv. The second part is a PCM module for improving the performance of semantic segmentation of few-shot point clouds. The overall framework does not require pre-training. The authors conducted a large number of experiments in various few-shot settings and verified that this method achieved the best performance.
Update after rebuttal
I thank the authors for their comprehensive and detailed responses to my questions. After reading their rebuttal, I believe the authors have addressed all the issues I raised. I consider the manuscript to have met the acceptance standards, and I will maintain my score.
Questions for Authors
See Weaknesses.
Claims and Evidence
Yes. The claims made by the authors are well supported. In particular, the experimental results and ablation studies on the S3DIS and ScanNet datasets well support the authors' claims.
Methods and Evaluation Criteria
Yes
Theoretical Claims
Yes. The authors explain the relationship between DyPolyConv and Taylor series very well, providing a clear theoretical basis for their method.
Experimental Design and Analyses
Yes. The authors evaluate the superiority of their method compared with the SOTA methods on the S3DIS and ScanNet datasets in Tables 1 and 2, and verify the effectiveness of each component through ablation experiments.
Supplementary Material
Yes. I reviewed the supplementary materials, and the introduction to the dataset, the PCM, and the introduction to HoConv helped me better understand the method.
Relation to Existing Literature
The author's method has made some progress and has promoted the field of point cloud few-shot semantic segmentation. In particular, the author has solved the pre-training constraints of the current method and has also shown advantages in the new setting (COSeg).
Essential References Not Discussed
No.
Other Strengths and Weaknesses
Strengths:
- The authors proposed a novel framework for semantic segmentation of few-shot point clouds, called DyPolySeg.
- Inspired by Taylor series, the authors proposed a novel convolution to effectively solve the problem of local aggregation of point clouds.
- The authors proposed a lightweight PCM to effectively solve the problem of domain differences between query and gallery.
Weaknesses:
- Although the method proposed by the authors achieves the best performance, the authors claim to have proposed a lightweight PCM but do not explain in the article how lightweight the module is. Please provide this information.
- I would like to know whether the PCM proposed by the author is universal and can be used in other methods?
- The content of Section 3.3.3 is too thin. The authors should explain why the exponential operation is replaced by the logarithmic operation and what the benefits of doing so are.
Other Comments or Suggestions
- The output of the cosine similarity calculation between the query feature and PCM in Figure 2(a) should point to the prediction, right? Rather than the prediction pointing to the cosine similarity?
- The schematic diagram of polynomial fitting in Figure 2 (b) is not clear enough. It is recommended that the author improve the clarity of this sub-graph.
Thank you for the reviewers' positive comments on our manuscript and their valuable suggestions. We have responded to all your questions in detail, as follows:
Concern 1: Although the method proposed by the authors achieves the best performance, the authors claim to have proposed a lightweight PCM but do not explain in the article how lightweight the module is. Please provide this information.
Reply: Thank you for your question about PCM's efficiency. Our PCM module contains only 0.1M parameters, which accounts for just 7% of the entire model's parameters (1.42M total), making it significantly more lightweight than comparable modules in other architectures. For context, prototype enhancement modules in competing methods such as PAP3D require approximately 0.5-0.6M parameters, 5-6 times larger than our PCM. Despite its compact design, PCM substantially improves performance: when added to the baseline model, it increases mIoU from 53.28% to 71.58% (an 18.3% absolute improvement), as shown in Table 3. This demonstrates PCM's exceptional parameter efficiency. Additionally, we tested DyPolySeg on the S0 subset of the S3DIS dataset and found that the 2-way-1-shot inference time was 21 seconds with 1.08 GB of GPU memory, with PCM adding only 3 seconds to the inference time compared to the backbone alone. We will add these detailed efficiency metrics in the final version to better quantify PCM's lightweight nature.
Concern 2: I would like to know whether the PCM proposed by the author is universal and can be used in other methods?
Reply: Yes, PCM is designed as a universal module that can be integrated into other few-shot point cloud segmentation frameworks. We conducted additional experiments applying PCM to baseline methods like AttMPTI and Seg-PN, resulting in performance improvements of 3.8% and 2.1% respectively on S3DIS. PCM's design is agnostic to the specific feature extraction backbone, requiring only query and support features as input. We will add these cross-method integration results to the paper to demonstrate PCM's universality.
Concern 3: The content of Section 3.3.3 is too thin. The authors should explain why the exponential operation is replaced by the logarithmic operation and what the benefits of doing so are.
Reply: Thank you for this insightful suggestion. We will significantly expand Section 3.3.3 to provide a more comprehensive explanation of our logarithmic transformation approach. The replacement of exponential operations with logarithmic transformations was motivated by multiple critical considerations: (1) Computational efficiency: Logarithmic operations reduce the computational complexity from O(n²) to O(n log n), resulting in a 35% reduction in forward pass computation time during training. (2) Numerical stability: Exponential operations with high-power values can quickly lead to overflow (extremely large values) or underflow (extremely small values approaching zero) issues, particularly when processing point clouds with large spatial variations. Our logarithmic transformations maintain stable gradients during backpropagation, reducing gradient vanishing/explosion issues and improving convergence speed by approximately 20%. (3) Memory efficiency: Our logarithmic approach reduces GPU memory consumption by approximately 15% during training compared to direct exponential implementation. (4) Precision preservation: For high-order polynomial terms (n>2), logarithmic space calculations preserve numerical precision better, resulting in more accurate geometric modeling.
We conducted ablation experiments comparing direct exponential implementation versus our logarithmic approach, finding that the logarithmic version not only trains faster (1.2 hours vs. 1.8 hours) but also achieves better final performance (71.58% vs. 69.27% mIoU). We will incorporate these explanations and quantitative benefits in the revised paper.
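To illustrate the kind of transformation involved, the sketch below contrasts direct powers of relative coordinates with a log-space evaluation whose magnitude grows only linearly in the order n. It is a simplified illustration under our own notation and variable names, not a reproduction of Eq. 16 or of our actual implementation.

```python
import torch

def high_order_terms_direct(delta, max_order):
    """Direct powers |dp|^n -- prone to overflow/underflow for large |dp| or n."""
    return torch.stack([delta.abs() ** n for n in range(1, max_order + 1)], dim=-1)

def high_order_terms_log(delta, max_order, eps=1e-6):
    """Log-space variant: n * log(|dp| + eps) grows linearly with the order n,
    so magnitudes and gradients stay bounded even for high-order terms."""
    log_mag = torch.log(delta.abs() + eps)                      # (..., 3)
    orders = torch.arange(1, max_order + 1, dtype=delta.dtype)  # (n,)
    return log_mag.unsqueeze(-1) * orders                       # (..., 3, n)

# Relative coordinates of 16 neighbors around a center point.
delta = torch.randn(16, 3) * 5.0
print(high_order_terms_direct(delta, max_order=6).abs().max())  # can blow up quickly
print(high_order_terms_log(delta, max_order=6).abs().max())     # bounded, linear growth
```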
Concern 4: The output of the cosine similarity calculation between the query feature and PCM in Figure 2(a) should point to the prediction, right? Rather than the prediction pointing to the cosine similarity?
Reply: Thank you for pointing out the arrow error. There is indeed a directional error in Figure 2(a). The arrow should point from the cosine similarity calculation to the prediction image, not the other way around. We will correct this in the final version to accurately reflect the information flow in our architecture.
Concern 5: The schematic diagram of polynomial fitting in Figure 2 (b) is not clear enough. It is recommended that the author improve the clarity of this sub-graph.
Reply: Thank you for your feedback on Figure 2(b). In the revised version, we will redesign this figure to improve its resolution.
This paper points out three main limitations of existing methods. First, pre-training-based methods suffer from domain transfer and increase training time. Second, current methods mainly rely on DGCNN as the backbone, which limits the modeling of the local geometric structures of the point cloud. Third, current methods do not fully resolve the domain differences between the query set and the support set. In response to these three problems, this paper makes three main contributions: to address the first two problems, it designs the DyPolySeg model, which requires no pre-training, and the proposed DyPolyConv has strong representation capabilities; to address the third problem, the authors propose a lightweight PCM to handle the domain differences. The authors verify the effectiveness of the proposed method through a large number of experiments.
Questions for Authors
The author can respond with reference to the weaknesses and other suggestions I have raised.
Claims and Evidence
The method proposed in the paper is supported by some mathematical theory. Specifically, the proposed DyPolyConv establishes a connection with Taylor series, points out the relationship between them, and provides a mathematical theoretical basis for the design of the convolution. To verify the effectiveness of the proposed method, the authors not only conducted experiments on the S3DIS and ScanNet datasets but also conducted a large number of experiments on each module. In addition, the versatility of the proposed DyPolyConv was verified on the ScanObjectNN dataset.
Methods and Evaluation Criteria
Yes. The proposed methods and evaluation criteria make sense for the problem or application.
Theoretical Claims
I have checked the correctness of the theoretical claims. The authors clearly introduce the relationship between the proposed DyPolyConv and Taylor series, as well as the behavior of the convolution under some special settings (ABF and RBF), verifying the universality of the convolution with supporting mathematical formulas.
Experimental Design and Analyses
I checked the validity of the experimental designs and analyses, and I think they are reasonable. For example, the authors used standard evaluation metrics and conducted experiments on standard datasets (S3DIS and ScanNet). To verify the universality of DyPolyConv, experiments were also conducted on ScanObjectNN.
Supplementary Material
Yes. The authors added a description of the data set division and provided a schematic diagram of the PCM structure. They also provided the results of their method on COSeg. I think their work is quite sufficient.
Relation to Existing Literature
The author discussed previous point cloud basic models, such as PointNet, PointNet++, DGCNN, PointTransformer, and PointMamba series of work, and recognized their contributions. Based on them, the author proposed a novel DyPolyConv, which showed certain effects on the 3D point cloud classification dataset. The author also discussed the limitations of the current few-shot point cloud semantic segmentation work (Seg-NN, COSeg), and proposed some improvement measures.
Essential References Not Discussed
No
Other Strengths and Weaknesses
Strengths:
- Inspired by Taylor series, this paper proposes a novel DyPolyConv for local structure representation of point clouds and verifies the versatility of the convolution.
- This paper designs a lightweight PCM module to bring the features of the Gallery set and the features of the Query set closer, greatly improving the performance of semantic segmentation of few-shot point clouds.
Weaknesses:
- Although this article mentions the improvement of computing efficiency, the author does not seem to provide the time required for inference and the size of memory required, which is also an important indicator. It is recommended that the author provide these data.
- I found that the datasets used by the author are all indoor point cloud datasets. I want to know whether the proposed method, as well as other existing methods, can be applied to outdoor scene point cloud datasets?
- In Table 3, the author should also provide experimental results without LoConv?
- Does the model in Table 7 use PCM? If not, it needs to be explained in the text.
Other Comments or Suggestions
It is recommended that the author provide some examples of segmentation failure, which may be of great help in understanding the article.
First of all, we would like to thank the reviewers for their time in reviewing our manuscript and providing constructive comments. We have carefully addressed your questions below.
Concern 1: Although this article mentions the improvement of computing efficiency, the author does not seem to provide the time required for inference and the size of memory required, which is also an important indicator. It is recommended that the author provide these data.
Reply: Thank you for your valuable suggestion. We have conducted a comprehensive efficiency analysis of DyPolySeg on the S0 subset of the S3DIS dataset. Our experiments show that for the 2-way-1-shot setting, the model requires 21 seconds for inference and occupies 1.08 GB of GPU memory on an NVIDIA RTX 4090 GPU. In comparison with state-of-the-art methods, DyPolySeg demonstrates competitive efficiency while achieving superior performance. For instance, PAP3D requires approximately 35 seconds and 1.56 GB of memory under the same conditions. Additionally, during the training phase, our model converges in about 1.2 hours, significantly faster than the 4.7 hours required by PAP3D. We will include these detailed efficiency metrics in the final version of our paper, providing a more complete evaluation of our method's practical advantages.
Concern 2: I found that the datasets used by the author are all indoor point cloud datasets. I want to know whether the proposed method, as well as other existing methods, can be applied to outdoor scene point cloud datasets?
Reply: Thank you for raising this important question about generalizability. As demonstrated by our experimental results on ScanObjectNN (which contains diverse object classes), our method shows strong generalization capabilities. The fundamental principles behind DyPolySeg—dynamic polynomial fitting for local geometric modeling and prototype completion for feature enhancement—are designed to capture essential geometric patterns that exist in both indoor and outdoor environments. While our current research focuses on indoor point cloud datasets due to their prevalence in few-shot segmentation benchmarks, we believe DyPolySeg can be effectively extended to outdoor scenes such as autonomous driving scenarios with appropriate adaptation to handle the increased scale and sparsity characteristics of outdoor point clouds. In future work, we plan to explicitly evaluate our method on outdoor datasets like SemanticKITTI to further validate its versatility across different domains.
Concern 3: In Table 3, the author should also provide experimental results without LoConv?
Reply: We appreciate your suggestion for this additional ablation study. We have conducted the requested experiment, and without LoConv, DyPolySeg achieves 46.95% mIoU on S0 and 49.87% on S1, with an average result of 48.41%. This represents a significant drop of 23.17% compared to our full model (71.58% mIoU), highlighting the crucial role of LoConv in capturing essential flat geometric features. This result aligns with our theoretical foundation in Taylor series approximation, where lower-order terms provide fundamental structural information that higher-order terms build upon. The substantial performance degradation without LoConv empirically validates our design choice of combining low-order and high-order convolutions for comprehensive geometric modeling. We will incorporate this informative ablation study in the final version of our paper to provide a more complete analysis of our model components.
Concern 4: Does the model in Table 7 use PCM? If not, it needs to be explained in the text.
Reply: Thank you for pointing out this ambiguity. The model used for experiments in Table 7 (ScanObjectNN classification) does not incorporate the PCM module. For these experiments, we utilized only the DyPolySeg encoder as the backbone, followed by a classifier consisting of a fully connected layer, global pooling, and other standard classification components. This is because PCM is specifically designed to address prototype bias in few-shot segmentation scenarios through feature enhancement between support and query sets, which is not applicable to the standard classification task on ScanObjectNN. The strong performance (92.8% accuracy) achieved with just our backbone architecture demonstrates the effectiveness of our DyPolyConv and Mamba Block combination for general point cloud representation learning. We will explicitly clarify this architectural difference in the final version to avoid confusion and provide a more precise description of our experimental setup.
The authors have addressed all my concerns with quantitative evidence and clear explanations. The method achieves the best performance, and I believe the DyPolyConv approach and PCM module proposed in the paper demonstrate significant innovation while being theoretically grounded with Taylor series as mathematical support. Therefore, I have increased my score and strongly recommend acceptance of this paper.
The paper argues that few-shot point cloud semantic segmentation models are constrained by their pretraining models and introduces a pre-training-free Dynamic Polynomial Fitting network. The network comprises DyPolyConv for local feature extraction and the Mamba Block for global feature extraction. Additionally, a PCM module is incorporated to reduce discrepancies between query and support sets.
Questions for Authors
Ablation studies are missing that show how the Dynamic Polynomial Convolution specifically contributes to performance improvements over other 3D convolution operations. In addition, the paper lacks clarity and is not well structured; clearer writing is needed.
Claims and Evidence
The paper makes a claim that pretraining models limit few-shot performance and the claim is supported by the experiments in the paper.
Methods and Evaluation Criteria
The evaluation criteria and datasets used make sense for the problem. However, the Method section lacks clarity, and the motivation behind certain design choices is not well explained. For instance, it is unclear how the Dynamic Polynomial Convolution module is enhanced and what the underlying motivation for these choices is. The formulas in Section 3.3 are also confusing—for example, the role of s in Eq. 15 is not clearly defined in the context of the Dynamic Polynomial Convolution.
Theoretical Claims
Not related.
Experimental Design and Analyses
The experiments are conducted on two standard datasets and prove the effectiveness of the proposed method. However, the paper misses some important ablation studies to verify the designed components, such as specifically showing how Dynamic Polynomial Convolution improves performance compared to other 3D convolutions.
Supplementary Material
I have looked through the supplementary material.
Relation to Existing Literature
The proposed designs do not overlap with other papers.
Essential References Not Discussed
Not related.
Other Strengths and Weaknesses
No.
Other Comments or Suggestions
No.
Thank you to the reviewer for your time and suggestions, and to the other reviewers for their recognition of our paper. We have carefully addressed your concerns. The specific responses are shown below, and we hope we have resolved your questions.
Concern 1: The Method section lacks clarity, and the motivation behind certain design choices is not well explained. For instance, it is unclear how the Dynamic Polynomial Convolution module is enhanced and what the underlying motivation for these choices is.
Reply: Our Method section is organized systematically. Section 3 begins with the problem definition (Section 3.1), introduces our novel DyPolyConv (Section 3.2), describes its enhancements (Section 3.3), and then presents PCM (Section 3.4) before tying everything together (Section 3.5). In Section 3.2, we explicitly establish the connection between Taylor series and our Dynamic Polynomial Convolution, explaining how this mathematical foundation motivated our design choices for capturing complex geometric structures.
Regarding the specific enhancements in Section 3.3, each component has a clear motivation:
- The Enhanced Low-order Convolution (Section 3.3.1) enriches the description of local point cloud structures by moving beyond simple mapping of center point features, which alone cannot effectively capture the overall geometric context.
- The Explicit Structure Integration (Section 3.3.2) deliberately incorporates geometric relationships to enable DyPolyConv to effectively capture spatial arrangements within local structures, providing crucial spatial context.
- The learnable parameter s increases DyPolyConv's flexibility.
- The Computational Efficiency improvements (Section 3.3.3) replace exponential operations with logarithmic transformations for three key reasons: (1) reducing computational complexity from O(n²) to O(n log n), (2) enhancing numerical stability by avoiding overflow/underflow issues during backpropagation, and (3) reducing GPU memory consumption by approximately 15% during training.
Due to page limitations, some explanations might be concise and we will expand these sections in the revised version.
Concern 2: And the formulas in Section 3.3 are confusing—for example, the role of s in Eq. 15 is not clearly defined in the context of the Dynamic Polynomial Convolution.
Reply: Thank you for highlighting this issue. The parameter s in Equation 15 is a binary switch that controls whether sign information is preserved (s=1) or discarded (s=0) during feature transformation. This key parameter allows DyPolyConv to adapt between directional sensitivity (when s=1, similar to Affine Basis Functions) and magnitude-only processing (when s=0, similar to Radial Basis Functions). The value of parameter s is typically set manually, which presents a significant challenge in determining flexible settings. Therefore, we adopted the approach in Equation 15 to learn parameter s in a flexible, learnable manner. We will further revise Section 3.3 in the final version to explicitly define the role of s and explain its geometric interpretation.
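The sketch below shows one simple way such a sign-controlling parameter can be made learnable: a sigmoid-gated s blends a signed (ABF-like) response with a magnitude-only (RBF-like) one, so the choice is learned end-to-end instead of being set manually. This is an illustrative simplification in our own notation, not the exact form of Eq. 15.

```python
import torch
import torch.nn as nn

class SignGate(nn.Module):
    """Learnable blend between sign-preserving (ABF-like) and
    magnitude-only (RBF-like) treatment of relative coordinates."""

    def __init__(self):
        super().__init__()
        self.s_logit = nn.Parameter(torch.zeros(1))  # s = sigmoid(s_logit) lies in (0, 1)

    def forward(self, delta):
        s = torch.sigmoid(self.s_logit)
        # s -> 1: directional, signed features; s -> 0: radial, magnitude-only features.
        return s * delta + (1.0 - s) * delta.abs()

gate = SignGate()
delta = torch.randn(16, 3)     # relative coordinates of 16 neighbors
feat = gate(delta)             # output is differentiable w.r.t. s_logit
feat.sum().backward()
print(gate.s_logit.grad)       # s is optimized jointly with the rest of the network
```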
Concern 3: There are missing ablation studies to show how the Dynamic Polynomial Convolution specifically contributes to performance improvements over other 3D convolution operations.
Reply: Thank you for this valuable suggestion. As recommended, we conducted a comprehensive comparison between our DyPolyConv and four state-of-the-art 3D convolution operations for local point cloud aggregation. The experimental results are shown in the following table:
| 3D Convolution Operations | S₀ | S₁ | Avg |
|---|---|---|---|
| PointNet++[1] | 68.57 | 69.83 | 69.20 |
| PointMLP[2] | 68.76 | 69.92 | 69.34 |
| DGCNN[3] | 70.32 | 71.34 | 70.83 |
| RepSurf[4] | 70.45 | 71.26 | 70.86 |
| DyPolyConv(our) | 71.21 | 71.94 | 71.58 |
[1] Pointnet++: Deep hierarchical feature learning on point sets in a metric space. [2] Rethinking network design and local geometry in point cloud: A simple residual MLP framework. [3] Dynamic graph cnn for learning on point clouds. [4] Surface representation for point clouds.
Results show DyPolyConv outperforms all compared methods on both S3DIS dataset splits in the 2-way-1-shot setting. With an average mIoU of 71.58%, it surpasses RepSurf by 0.72%, DGCNN by 0.75%, PointNet++ by 2.38%, and PointMLP by 2.24%. This confirms the effectiveness of our polynomial fitting approach for geometric feature extraction in few-shot point cloud segmentation. We will add this analysis to better highlight DyPolyConv's contributions in our revised paper.
Concern 4: And the paper lacks clarity and is not well structured. Clearer writing is needed.
Reply: Our paper follows standard technical structure with clear Introduction, comprehensive Related Works, hierarchically organized Method section, and standard Experiments format. We will review our manuscript to further improve clarity.
Thanks for the rebuttal. While it partially addresses some concerns (e.g., the role of s in Eq. 15), my main concerns remain. The paper still appears poorly organized and requires improvements on the writing to meet publication standards. For example, at line 133 if DyPolyConv includes DyHoConv, they should not be listed in parallel as separate modules. Additionally, the connection between the prior works in Section 3.1.3 and the proposed method is not clear, and the symbols in that section (e.g., Fout) are not clearly defined and seem isolated from the rest of the paper. In Section 3.3.1, the meaning of the formula is unclear—where is gL used, and what aggregation function is applied? And it is also unclear how efficiency is improved based on Eq. 16. The PCM module also remains confusing, such as the definition of V in Eq. 20, among other issues. Based on these concerns, I think the paper needs to be carefully reorganized and clarified to meet publication standards, and therefore I choose to set my rating as reject.
Dear Reviewer XJbF:
Thank you for providing additional feedback. We greatly appreciate your careful reading of our paper and the opportunity to address your remaining concerns. We take all your comments seriously, and since updating the submitted PDF is not allowed during the rebuttal period, we will make all necessary modifications in the final version upon acceptance of the manuscript. Below are our responses to your specific questions:
Additional Concern 1: In line 133, if DyPolyConv includes DyHoConv, they should not be listed separately as individual modules.
Reply 1: Thank you for your question. DyPolyConv does include DyHoConv as a component, which indeed caused confusion. Following your suggestion, we will change line 133 in the revised manuscript to "(FPS, Grouping, DyPolyConv, and Mamba Block)".
Additional Concern 2: The connection between previous work and the proposed method in Section 3.1.3 is unclear, symbols like Fout in that section are not clearly defined, and it seems disconnected from the rest of the paper.
Reply 2: Thank you for your question. Fout represents the dynamic weight output in PAConv. Section 3.1.3 aims to introduce two key previous works: RepSurf (representation based on Taylor series) and PAConv (representation based on dynamic convolution). Our DyPolyConv combines the strengths of both approaches. To make this connection more explicit, we can change the title of Section 3.1.3 in the final version to "Prior Works based on Taylor Series and Dynamic Convolution".
Additional Concern 3: In Section 3.3.1, the meaning of the formula is unclear—where is gL used, and what aggregation function is applied?
Reply 3: Thank you for your question. As stated in the text immediately preceding this formula in Section 3.3.1, gL is the output of the low-order convolution (LoConv). We apply max pooling as the aggregation function; in fact, max pooling is the aggregation function used throughout our method.
Additional Concern 4: The efficiency improvement based on Equation 16 is unclear.
Reply 4: Thank you for your question. As we responded to Reviewer wrBV's Concern 3, our motivation for applying logarithmic transformation to exponential operations is based on multiple considerations: (1) Computational Efficiency: Logarithmic operations reduce computational complexity from O(n²) to O(n log n), decreasing the forward propagation time during training; (2) Numerical Stability: Exponential operations with high power values can easily lead to numerical overflow or underflow problems, especially when processing point clouds with large spatial variations. Our logarithmic transformation maintains gradient stability during backpropagation, reducing gradient vanishing/explosion problems and improving convergence speed; (3) Memory Efficiency: Compared to direct exponential implementation, our method reduces GPU memory consumption; (4) Precision Preservation: For high-order polynomial terms (n>2), calculations in logarithmic space better preserve numerical precision, enabling more accurate geometric modeling. Specific numerical results (for a 2-way-1-shot task on the S3DIS dataset (S0 split) using an NVIDIA RTX 4090) are shown below:
| | Training Time (h) | Memory Usage (GB) |
|---|---|---|
| Exponential Calculation | 1.8 | 1.31 |
| Logarithmic Calculation | 1.2 | 1.08 |
Additional Concern 5: The definition of V in Equation 20 in the PCM module is confusing.
Reply 5: In Equation 20, V represents the initial prototype features for each class generated by the support set features and support set masks (i.e., the prototype before enhancement). We will add this clarification in the final version.
Best regards,
Authors of Paper #5054
This paper presents a compelling and well-executed framework, DyPolySeg, for few-shot point cloud semantic segmentation, introducing a pretraining-free approach that combines a novel Taylor series-inspired DyPolyConv for expressive local feature learning and a lightweight PCM module for mitigating domain differences between support and query sets. Across four reviewers, the method is consistently recognized for its theoretical grounding, experimental rigor on multiple benchmarks, and practical contribution to advancing 3D few-shot learning. While some reviewers request additional clarity on specific design motivations and model complexity, these concerns are outweighed by the paper's novelty, strong empirical results, and detailed supplementary material. I recommend acceptance and encourage the authors to further strengthen the camera-ready version by including computational efficiency analysis, clarifying PCM’s design choices, and improving figure clarity.