PaperHub
Score: 7.8/10
Poster · 4 reviewers
Ratings: 5, 4, 5, 5 (min 4, max 5, std 0.4)
Confidence: 3.3
Novelty: 2.8 · Quality: 3.0 · Clarity: 3.3 · Significance: 3.0
NeurIPS 2025

Coarse-to-Fine 3D Part Assembly via Semantic Super-Parts and Symmetry-Aware Pose Estimation

OpenReview · PDF
Submitted: 2025-04-05 · Updated: 2025-10-29
TL;DR

We propose CFPA for 3D part assembly that integrates semantic super-parts, dual-range feature propagation, and symmetry-aware supervision, achieving SOTA accuracy, consistency, and diversity on standard benchmarks.

Abstract

Keywords
3D part assembly; semantic super-parts; dual-range feature propagation; symmetry-aware supervision

Reviews & Discussion

Official Review
Rating: 5

This paper proposes CFPA (Coarse-to-Fine Part Assembly), a novel two-stage framework for 3D part assembly that jointly addresses semantic structuring and geometric symmetry challenges. The method begins with a coarse pose estimation stage, where semantic super-parts are constructed using entropy-regularized optimal transport, providing high-level guidance via dual-range feature propagation. It then applies a refinement stage that adjusts part poses through cross-stage attention and instance-level geometric encoding. In addition, the paper introduces a symmetry-aware loss that models both self-symmetry and inter-part geometric similarity. Extensive experiments on the PartNet benchmark show its effectiveness.

Strengths And Weaknesses

Strengths:

  1. This paper presents a novel Coarse-to-Fine Part Assembly framework that integrates global structure and local geometric reasoning. Specifically, it combines a coarse pose estimation stage with a refinement stage to produce accurate and diverse part assemblies.

  2. This paper accounts for pose ambiguity caused by symmetry, enabling accurate and diverse configurations.

  3. Extensive experiments on PartNet demonstrate its effectiveness.

  4. This paper is written and organized well.

Weaknesses:

  1. Since the proposed method has the ability to learn the base part-level shapes based on the semantic guidance, it will be interesting to evaluate its performance on unseen categories.

  2. The framework involves multiple attention layers, message passing, and OT computation. It will be interesting to report on the computational cost impact of these components.

  3. It will be interesting to add visualization or analysis of learned super-parts or their semantic consistency across shapes.

  4. It will be interesting to evaluate the generalization of the proposed symmetry-aware loss on other baselines, since this problem is often overlooked or only heuristically addressed in prior work.

  5. A comparison with Imagine [1] is missing in Tab. 1.

[1] Imagine: Image-Guided 3D Part Assembly with Structure Knowledge Graph. AAAI 2025.

Questions

Refer to the Weaknesses above; I will raise my rating once they are addressed.

Limitations

The authors analyze the limitations and societal impacts adequately.

Final Justification

The authors addressed my concerns and questions. I have updated my ratings! Good luck!

Formatting Issues

No

Author Response

Q1: Evaluate the model's performance on unseen categories.

In the manuscript, we adopt the experimental settings of DGL [1], RGL [2], Score-PA [3], IET [4], and 3DHPA [5] to evaluate models on the three largest categories of PartNet, where our model achieves superior performance compared to the baselines.

To evaluate the performance of our method on unseen categories, we use the model trained on the Chair category to predict assemblies for shapes in the Lamp category. As shown in Table R4-1, our model (Ours Chair⟹Lamp) outperforms the baseline methods (* Chair⟹Lamp, where * denotes the compared models) on 4 out of 5 metrics (SCD, PA, CA, QDS), while achieving second-best performance on WQDS.

These results demonstrate the generalization ability of our approach, which leverages semantic guidance to learn part-level structures and produce coherent, accurate assemblies even for unseen categories.

Table R4-1. Performance of models trained on Chair for assembly prediction on unseen Lamp category

| Methods | SCD ($10^{-2}$)↓ | PA (%)↑ | CA (%)↑ | QDS ($10^{-5}$)↑ | WQDS ($10^{-5}$)↑ |
| --- | --- | --- | --- | --- | --- |
| LSTM [1] Chair⟹Lamp | 4.15 | 2.61 | 15.43 | 1.22 | 0.21 |
| Global [1] Chair⟹Lamp | 3.99 | 1.32 | 12.81 | 1.28 | 0.15 |
| DGL [1] Chair⟹Lamp | 3.63 | 2.38 | 10.79 | 0.98 | 0.11 |
| RGL [2] Chair⟹Lamp | 3.61 | 3.20 | 16.22 | 0.94 | 0.14 |
| Score-PA [3] Chair⟹Lamp | 2.41 | 5.55 | 13.95 | 1.4 | 1.30 |
| IET [4] Chair⟹Lamp | 4.79 | 1.93 | 11.70 | 1.26 | 0.41 |
| 3DHPA [5] Chair⟹Lamp | 2.98 | 4.15 | 26.87 | 1.16 | 0.93 |
| Ours Chair⟹Lamp | 2.84 | 5.61 | 28.85 | 1.43 | 1.13 |

Q2: Report the computational cost of the proposed components.

To evaluate the computational cost of our proposed components, we report key metrics for a single forward pass with a batch size of 64: the number of parameters (in millions, M), GPU memory usage (GB), forward time (ms), and GFLOPS. We analyze the CFPA model and its variants, each excluding a specific component: 1)-2) CFPA-w/o-refine/coarse MHA, which exclude the multi-head attention in the pose refinement / coarse pose estimation stage; 3) CFPA-w/o-OT, which removes the OT-based super-part construction; 4)-5) CFPA-w/o-SRFP/LRFP, which remove the short-/long-range feature propagation; 6) CFPA-w/o-MP, which removes the message passing in long-range feature propagation; 7) CFPA-w/o-CA, which removes the cross-stage attention; and 8) CFPA-w/o-SL, which removes the symmetry-aware loss. We also report the computational cost of the baseline, which removes all of the above components.

As shown in Table R4-2, removing these components reduces computational cost relative to the full CFPA model. Notably, some of these components are designed to operate on the relationships between parts and super-parts rather than directly on point clouds, with the maximum number of parts limited to 20 and super-parts to 16, so they inherently incur only a small computational cost. Additionally, certain components, such as OT, short-range feature propagation, and message passing, introduce no learnable parameters, further contributing to their relatively low cost.
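For context, a rough sketch of how such numbers can be collected in PyTorch is shown below; this is illustrative, not the authors' actual measurement script, and the GFLOPS column would additionally require a FLOP-counting profiler.

```python
import time
import torch

def profile_forward(model, batch, device="cuda"):
    """Rough profiling sketch (not the authors' measurement script):
    parameter count, peak GPU memory, and wall-clock forward time for
    one batch."""
    model = model.to(device).eval()
    batch = batch.to(device)
    n_params_m = sum(p.numel() for p in model.parameters()) / 1e6
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        model(batch)                            # warm-up pass
        torch.cuda.synchronize(device)
        t0 = time.perf_counter()
        model(batch)                            # timed pass
        torch.cuda.synchronize(device)
    fwd_ms = (time.perf_counter() - t0) * 1e3
    mem_gb = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    return n_params_m, mem_gb, fwd_ms
```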

Table R4-2. Computational cost comparisons

| Method | #Para. (M) | GPU Memory (GB) | Forward time (ms) | GFLOPS |
| --- | --- | --- | --- | --- |
| baseline | 21.33 | 24.05 | 641.36 | 86.22 |
| 1) CFPA-w/o-refine MHA | 22.90 | 24.28 | 674.80 | 86.26 |
| 2) CFPA-w/o-coarse MHA | 25.14 | 26.72 | 772.09 | 86.30 |
| 3) CFPA-w/o-OT | 25.89 | 26.80 | 657.87 | 86.34 |
| 4) CFPA-w/o-SRFP | 25.89 | 26.36 | 804.91 | 86.34 |
| 5) CFPA-w/o-LRFP | 25.76 | 24.68 | 778.68 | 86.33 |
| 6) CFPA-w/o-MP | 25.89 | 24.60 | 791.31 | 86.34 |
| 7) CFPA-w/o-CA | 25.63 | 25.71 | 698.05 | 86.33 |
| 8) CFPA-w/o-SL | 25.89 | 26.60 | 802.72 | 86.34 |
| CFPA | 25.89 | 27.04 | 806.80 | 86.34 |

Q3: It will be interesting to add visualization or analysis of learned super-parts or their semantic consistency across shapes.

While figures are not permitted in this rebuttal, we have visualized the correspondence between basic parts and their nearest super-parts, identified by the minimal cost in the optimal transport map $T$. These visualizations reveal several key tendencies:

A. Parts with similar functions are mapped to the same super-part, showcasing strong semantic coherence. For instance, chair legs are typically grouped into one super-part, while chair backs and seats are often mapped into another super-part, reflecting their interchangeable functional roles.

B. Unique or uncommon parts, such as decorative elements or isolated structural pieces, tend to be assigned to a specific super-part.

C. The learned super-parts demonstrate semantic consistency across shapes. For instance, despite variations in chair designs (e.g., chairs with armrests versus those without), parts with similar functions, such as chair legs and seat backs, are consistently grouped into their respective super-parts, while armrests are mapped into separate super-parts.

These findings highlight the model's ability to capture semantic and functional relationships across shapes, while effectively handling structural variations and unique components.

Q4: It will be interesting to evaluate the generalization of the proposed symmetry-aware loss on other baselines.

To evaluate the generalization ability of the proposed symmetry-aware loss, we conducted experiments by integrating it into various baseline methods. As shown in Table R4-3, the models equipped with our symmetry-aware loss (* -w/-SL, where * represents the baselines) achieve improved performance.

These results demonstrate that the symmetry-aware loss is not only effective within our framework but also generalizable across diverse baseline methods. This further validates its robustness and applicability in 3D part assembly tasks.

Table R4-3. Comparison of different baselines with our symmetry-aware loss

| Method | SCD↓ | PA (%)↑ | CA (%)↑ | QDS ($10^{-5}$)↑ | WQDS ($10^{-5}$)↑ |
| --- | --- | --- | --- | --- | --- |
| Global [1] | 1.46 | 15.70 | 9.90 | 0.15 | 1.25 |
| Global-w/-SL | 1.21 | 18.10 | 12.92 | 0.43 | 1.21 |
| LSTM [1] | 2.35 | 8.08 | 10.05 | 3.92 | 1.26 |
| LSTM-w/-SL | 2.35 | 9.34 | 11.25 | 4.03 | 1.33 |
| DGL [1] | 0.91 | 39.00 | 23.87 | 1.69 | 1.35 |
| DGL-w/-SL | 0.87 | 41.77 | 29.51 | 1.66 | 1.73 |
| RGL [2] | 0.98 | 48.85 | 30.68 | 5.85 | 2.09 |
| RGL-w/-SL | 0.92 | 49.15 | 33.27 | 5.91 | 1.55 |
| Score-PA [3] | 0.71 | 44.51 | 30.32 | 3.36 | 1.70 |
| Score-PA-w/-SL | 0.65 | 43.77 | 31.32 | 5.11 | 1.41 |
| IET [4] | 1.34 | 37.60 | 25.44 | 3.33 | 1.85 |
| IET-w/-SL | 1.21 | 38.53 | 27.21 | 3.33 | 1.93 |
| HPA [5] | 0.51 | 63.01 | 48.28 | 4.42 | 1.90 |
| HPA-w/-SL | 0.51 | 66.34 | 47.21 | 4.82 | 2.01 |

Q5: Missing comparison with Imagine in Tab. 1.

We will include a comparison with Imagine [6] in the revised manuscript and clarify the distinctions between the two methods. While both tackle 3D part assembly tasks, Imagine utilizes structural guidance from 2D images, making it a conditional approach. In contrast, our method addresses the task in an unconditional setting. Despite these differences, our method outperforms Imagine in terms of the PA metric on the Chair (69.24% vs 65.88%) and Table (68.48% vs 65.13%) categories.

We sincerely thank you for your thoughtful feedback and valuable suggestions. We will incorporate the additional analysis and experiments into our revised manuscript.

[1] Zhan, Guanqi, et al. Generative 3d part assembly via dynamic graph learning. NeurIPS 2020.

[2] Narayan, et al. Rgl-net: a recurrent graph learning framework for progressive part assembly. WACV 2022.

[3] Cheng, Junfeng, et al. Score-pa: score-based 3d part assembly. BMVC 2023

[4] Zhang, Rufeng, et al. 3d part assembly generation with instance encoded transformer. IEEE Rob. Autom. Lett., 2022, 7(4): 9051-9058.

[5] Du, Bi'an, et al. Generative 3d part assembly via part-whole-hierarchy message passing. CVPR 2024.

[6] Wang, Weihao, et al. Imagine: image-guided 3d part assembly with structure knowledge graph. AAAI 2025.

Comment

Dear Reviewer JKQa,

Thank you sincerely for your insightful questions regarding the generalization, visualization analysis, and computational aspects of our designs. In our responses, we have provided detailed quantitative and qualitative analyses, including evaluations on unseen categories (Table R4-1), computational cost assessments (Table R4-2), analysis of learned super-parts (response to Q3), and an examination of the generalization ability of the proposed symmetry-aware loss (Table R4-3). These results collectively demonstrate the effectiveness of our design and its robust generalization capabilities.

We would greatly appreciate your feedback to confirm whether our responses have sufficiently addressed your concerns. If you have any further questions or require additional clarification on any aspect, please let us know. We remain fully committed to addressing any remaining concerns you may have.

Thank you again for your time and dedication.

Best regards,

The authors of Paper 385

Comment

The authors addressed my concerns and questions. I have updated my rating to Accept! Good luck!

Comment

Dear Reviewer JKQa,

We deeply appreciate the time and effort you dedicated to reviewing our response and providing thoughtful feedback. It is very encouraging to hear that our responses have successfully addressed your concerns and questions.

Your suggestions are truly valuable and insightful, and we will carefully incorporate the additional analysis and experiments into the revised version to further enhance its rigor and overall quality.

We are sincerely grateful for your updated rating and kind support. Thank you once again for your constructive comments and encouragement.

Best regards,

The authors of Paper 385

Official Review
Rating: 4

This paper is about 3D part assembly. The pipeline is clear, comprising coarse pose estimation with super-parts, pose refinement, and the loss function. The paper introduces a dual-range feature propagation strategy to integrate both local and global structural cues. The pose refinement stage applies cross-stage attention and instance encoding. The loss function integrates structural priors, geometric reasoning, and symmetry-aware supervision.

Strengths And Weaknesses

Strengths:

  1. The method is easy to understand, and the reported results are SOTA.
  2. This task is interesting, and the comparison with other methods also proves the effectiveness of this method.

Weaknesses:

  1. Some parts lack a clear explanation, such as how the semantic super-parts $\{h_j\}_{j=1}^M$ are derived from $\{h_j\}_{j=1}^N$, and what defines the geometrically similar parts in line 159.
  2. The ablation study should contain the baseline method; simply removing any single part is not convincing enough.
  3. The method part needs more explanation about how this module solves the issues.

Questions

  1. I would like to see more explanation of the approach, including how the designed method addresses the needs.
  2. More ablation experiments are encouraged.

Limitations

Yes

Final Justification

The authors addressed my concerns and questions. I have lifted my ratings.

Formatting Issues

There are no formatting issues in this paper.

Author Response

Q1: How are the semantic super-parts $\{h_j\}_{j=1}^M$ computed from $\{f_i\}_{i=1}^N$? What is the definition of geometrically similar parts in line 159?

(1) Semantic super-parts.

The semantic super-parts $\{h_j\}_{j=1}^M$ are computed as weighted aggregations of the basic part features $\{f_i\}_{i=1}^N$, where the weights are given by the transport map $T$, as formulated in Eq. (1) of the manuscript.

The transport map (the weights) $T$ is obtained through an entropy-regularized optimal transport (OT) process that minimizes the objective in Eq. (2) of the manuscript. This optimization is efficiently solved with Sinkhorn's algorithm [1], which iteratively updates the transport map and super-parts until convergence.
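As an illustration of this procedure, below is a minimal NumPy sketch of entropy-regularized OT with Sinkhorn iterations; the uniform marginals, the negative inner-product cost, and all variable names are our assumptions, not the paper's exact Eq. (2).

```python
import numpy as np

def build_super_parts(F, M=16, eps=0.1, outer_iters=10, sinkhorn_iters=100):
    """Minimal sketch of entropy-regularized OT super-part construction.

    F: (N, d) array of basic part features.
    Returns the transport map T (N, M) and super-part features H (M, d).
    Uniform marginals and a negative inner-product cost are assumptions.
    """
    N, d = F.shape
    rng = np.random.default_rng(0)
    H = rng.standard_normal((M, d))             # super-part prototypes (init)
    a = np.full(N, 1.0 / N)                     # marginal over parts
    b = np.full(M, 1.0 / M)                     # marginal over super-parts
    for _ in range(outer_iters):
        C = -F @ H.T                            # cost: negative inner product
        K = np.exp(-C / eps)                    # Gibbs kernel
        u = np.ones(N)
        for _ in range(sinkhorn_iters):         # Sinkhorn fixed-point updates
            v = b / (K.T @ u)
            u = a / (K @ v)
        T = u[:, None] * K * v[None, :]         # transport map
        H = (T / T.sum(axis=0, keepdims=True)).T @ F  # re-aggregate prototypes
    return T, H
```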

These semantic super-parts act as middle-level representations and serve as feature prototypes, offering meaningful structural abstractions that facilitate subsequent feature propagation and pose prediction.

(2) Geometrically similar parts.

We follow DGL [2] and compare the axis-aligned bounding-box sizes of each part. Two parts are considered geometrically similar if the absolute difference in their bounding-box sizes is below a predefined threshold (set to 0.1). For example, under this criterion, parts such as the legs of a table, or the seat and backrest of a chair, can be identified as geometrically similar if their bounding-box size differences satisfy the threshold. This definition of geometrically similar parts is also provided in Appendix A. A minimal sketch of one plausible reading of this criterion is given below.
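The sketch compares per-axis axis-aligned bounding-box extents against the stated 0.1 threshold; the per-axis reading and the function names are our assumptions.

```python
import numpy as np

def bbox_size(points):
    """Axis-aligned bounding-box extents of a part point cloud (P, 3)."""
    return points.max(axis=0) - points.min(axis=0)

def geometrically_similar(points_a, points_b, thresh=0.1):
    """Two parts are treated as geometrically similar if their AABB
    extents differ by less than `thresh` (0.1 per the rebuttal) on
    every axis; whether the comparison is per-axis is our reading."""
    return bool(np.all(np.abs(bbox_size(points_a) - bbox_size(points_b)) < thresh))
```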

[1] Cuturi, Marco. Sinkhorn distances: lightspeed computation of optimal transport. NeurIPS 2013.

[2] Zhan, Guanqi, et al. Generative 3d part assembly via dynamic graph learning. NeurIPS 2020.

Q2: The ablation study should contain the baseline method. More ablation experiments are encouraged.

We present the results of our 1) baseline method and the full CFPA model in Table R3-1. For the baseline, we systematically remove key components of our approach, including semantic super-parts, dual-range feature propagation, cross-stage attention, and the symmetry-aware loss.

To further evaluate the designs in our model, we conduct additional ablation studies as follows.

A. To demonstrate the effectiveness of entropy-regularized OT for semantic super-part construction, we conduct experiments with: 2) CFPA-w/-Att.map, which replaces our transport map computed by entropy-regularized OT with an attention-based transport map as used in [3]; 3) CFPA-w/-$L_2$-dis, which replaces the cost function in OT with the $L_2$ distance in feature space. The results in Table R3-1 show that these variants perform worse than the full CFPA model, demonstrating the effectiveness of entropy-regularized OT and of the inner product as the cost function for semantic super-part construction.

B. To evaluate the effectiveness of our symmetry-aware loss for handling self-symmetric and geometrically similar parts, we conduct experiments with: 4)-6) CFPA-w/-SL in [4]/[5]/[6], which replace our symmetry-aware loss $\mathcal{L}_{sym}$ in Eq. (15) with the symmetry-handling losses proposed in [4]/[5]/[6]. The results in Table R3-1 show that CFPA with our symmetry-aware loss outperforms these alternatives, demonstrating its ability to effectively handle self-symmetric and geometrically similar parts.

C. To compare our symmetry-aware loss with methods that explicitly encourage distinct placements for geometrically similar parts, we conduct ablation studies with: 7) CFPA-w/-SL in [7], which replaces our symmetry-aware loss $\mathcal{L}_{sym}$ in Eq. (15) with the loss proposed in [7], which encourages distinct placements for geometrically similar parts; 8) CFPA-w/-SL-$L_2$, which replaces $\mathcal{L}_{sym}$ with an $L_2$ distance applied within groups of geometrically similar parts. The results in Table R3-1 show that CFPA with our symmetry-aware loss outperforms these variants, highlighting its advantage over explicitly enforcing distinct placement.

We hope that these comparisons, along with the variants analyzed in the ablation study of the manuscript, provide a thorough and convincing evaluation of our method.

Table R3-1. Ablation studies for CFPA

| Method | SCD ($10^{-2}$)↓ | PA (%)↑ | CA (%)↑ | QDS ($10^{-5}$)↑ | WQDS ($10^{-5}$)↑ |
| --- | --- | --- | --- | --- | --- |
| 1) CFPA-baseline | 1.02 | 44.37 | 38.20 | 0.99 | 0.46 |
| 2) CFPA-w/-Att.map | 0.50 | 69.87 | 49.07 | 5.87 | 2.60 |
| 3) CFPA-w/-$L_2$-dis. | 0.51 | 69.23 | 48.85 | 5.08 | 2.14 |
| 4) CFPA-w/-SL in [4] | 0.51 | 68.72 | 44.87 | 4.80 | 2.02 |
| 5) CFPA-w/-SL in [5] | 0.52 | 68.74 | 48.42 | 4.86 | 2.06 |
| 6) CFPA-w/-SL in [6] | 0.50 | 69.17 | 48.35 | 5.23 | 2.22 |
| 7) CFPA-w/-SL in [7] | 0.51 | 68.82 | 45.81 | 4.82 | 2.04 |
| 8) CFPA-w/-SL-$L_2$ | 0.55 | 67.67 | 42.59 | 3.27 | 1.45 |
| CFPA | 0.49 | 69.24 | 49.20 | 6.71 | 2.75 |

[3] Xin Wei, et al. Learning generalizable part-based feature representation for 3d point clouds. NeurIPS 2022.

[4] Zhang, Jiahao, et al. Manual-PA: Learning 3D Part Assembly from Instruction Diagrams. arXiv:2411.18011

[5] Zhang, Rufeng, et al. 3d part assembly generation with instance encoded transformer. IEEE Rob. Autom. Lett., 2022, 7(4): 9051-9058.

[6] Li, Yichen, et al. Learning 3d part assembly from a single image. ECCV 2020.

[7] Zhang, Ruiyuan, et al. Scalable geometric fracture assembly via co-creation space among assemblers. AAAI 2024.

Q3: The method part needs more explanation about how this module solves the issues.

We adopt a coarse-to-fine strategy for 3D part assembly, which faces several challenges: the lack of explicit semantic structure, difficulty integrating semantic information, inconsistencies between prediction stages, and ambiguities caused by self-symmetry or geometric similarity. Below, we clarify how each module resolves these issues:

A. To address the lack of explicit semantic structure, we propose semantic super-parts as high-level structural abstractions. These are dynamically constructed via optimal transport (OT) (Eqs.(1-2)), avoiding reliance on rigid hierarchies. By grouping semantically similar parts, this module provides a semantic abstraction for guiding coarse pose estimation. This mechanism equips the framework with explicit, data-driven semantic structure, thereby improving global structural understanding and the overall quality of assembly.

B. To guide localized feature learning with the learned semantic abstraction, we introduce a dual-range feature propagation mechanism. Locally, it captures fine-grained semantic relations between parts and their nearest super-parts (Eq.(3)); globally, it aggregates holistic context via attention (Eq.(4)). This design ensures that local feature learning is effectively guided by multi-scale high-level semantic structure, enabling precise part placement while maintaining global structural coherence.

C. To promote consistency between different stages and transfer global information from the coarse to the refinement stage, we introduce a cross-stage attention module. By fusing complementary cues across stages, this module enables the refinement process to be guided by global structural context, thereby reducing inconsistencies and improving overall assembly coherence.

D. To resolve ambiguities from self-symmetry and geometric similarity, we propose a symmetry-aware loss that supervises pose prediction with explicit self-symmetry and geometric similarity modeling (Eq.(13), Eq.(14)). This allows the network to tolerate multiple valid configurations for symmetric or interchangeable parts. By doing so, our method accommodates symmetry-induced ambiguities and supports diverse yet valid assembly solutions.

E. Beyond the above designs, to enhance feature interaction and discrimination, we employ multi-head attention in both the coarse and refinement stages (Eq.(7), Eq.(9)), facilitating effective information exchange among parts. Furthermore, message passing among geometrically similar parts (Eq.(5)) encourages feature consistency for interchangeable components, while an instance encoding strategy preserves unique part representations and prevents prediction collapse.

We sincerely thank you for your feedback and suggestions. We will incorporate the additional analysis and experiments into our revised manuscript.

Comment

Dear Reviewer XKs3,

We sincerely appreciate your questions regarding the explanations of our designs and the experiments related to the baseline and additional ablation studies. In our responses, we have provided a comprehensive explanation of our approach and how the designs meet the specified needs (response to Q3). Additionally, we have included more experiments, such as the baseline and ablation studies (results in Table R3-1), which demonstrate the effectiveness of our designs and their superiority over alternative designs.

We would be grateful to hear from you to confirm whether these responses have addressed your concerns. We remain fully open to addressing any remaining questions you may have.

Thank you again for your time and dedication.

Best regards,

The authors of Paper 385

Comment

Reviewer XKs3, please check the rebuttal and share your thoughts. We have less than 24 hours left for the discussion.

Official Review
Rating: 5

This paper introduces a method to assemble parts together. It uses optimal transport to produce a coarse assembly before refining it in a second stage. The model is trained and tested on PartNet.

Strengths And Weaknesses

  • Strengths: Thorough comparisons with prior works, with substantial improvements demonstrated quantitatively. The method seems non-trivial and takes into account many nuanced problems of the part assembly task.

    Additionally, substantial ablations have been done to show the effectiveness of different design decisions.

  • Weaknesses:

    The method was only demonstrated on 3 categories of PartNet, which limits how convincing the results are. Moreover, the qualitative results in Figure 4 feel a bit underwhelming, especially given the complexity of the method.

    As far as I can tell (please correct me if wrong), the models introduced are not conditioned on an input image/point cloud/mesh of the desired final assembly at test time. This strikes me as a little weird, since it is possible to assemble the parts into many different, though equally valid, final assemblies. For instance, in the first column of tables in Figure 4, I would say that "Score", "Ours" and "Groundtruth" all correspond to equally likely assemblies.

Questions

Why are only 3 categories of shapes shown in Table 1?

How do you account for the various possibilities that the same set of parts can assemble to? Is there a way to communicate the desired final assembly through some form of conditioning?

Limitations

Yes

Final Justification

Hi, thanks for your comprehensive rebuttal; it has addressed the bulk of my concerns. I've updated the rating accordingly.

Formatting Issues

None

Author Response

Q1: Why are only 3 categories of shapes shown in Table 1?

We followed the experimental settings of DGL [1], RGL [2], Score-PA [3], IET [4], and 3DHPA [5], which evaluate models on the three largest categories of PartNet. As shown in Table 1 of the manuscript, our CFPA consistently achieves higher overall performance compared to the baseline methods.

In addition, we conducted experiments on the Storage category, with comparative results copied from SPAFormer [6]. As shown in Table R2-1, our method outperforms the baselines on the SCD and PA metrics while achieving the second-highest performance on the CA metric. Together with the results in Table 1, these results highlight the effectiveness of our model for accurate part assembly.

Furthermore, we performed few-shot transfer learning experiments by training our model on the Chair category and fine-tuning it using only 10 shapes from the Table category. As shown in Table R2-2, our fine-tuned model (Ours Chair→Table) outperforms both the other fine-tuned models (* Chair→Table, where * denotes competing methods) and our model trained from scratch (Ours trained from scratch). These results underscore the effectiveness of our approach under limited training data and demonstrate its superior generalization capabilities.

Table R2-1. Performance comparison of different methods on Storage

| Methods | SCD ($10^{-2}$)↓ | PA (%)↑ | CA (%)↑ |
| --- | --- | --- | --- |
| DGL [1] | 1.21 | 12.39 | 17.23 |
| LSTM [1] | 1.19 | 8.50 | 27.14 |
| RGL [2] | 0.64 | 48.02 | 38.45 |
| Score-PA [3] | 0.69 | 27.68 | 21.82 |
| IET [4] | 0.68 | 29.31 | 22.51 |
| SPAFormer [6] | 0.45 | 56.11 | 49.98 |
| 3DHPA [5] | 0.39 | 67.09 | 57.62 |
| Ours | 0.33 | 68.11 | 56.12 |

Table R2-2. Performances of different methods for few-shot transfer learning

| Methods | SCD ($10^{-2}$)↓ | PA (%)↑ | CA (%)↑ |
| --- | --- | --- | --- |
| Global [1] Chair→Table | 2.97 | 1.87 | 2.11 |
| DGL [1] Chair→Table | 1.12 | 27.69 | 33.62 |
| LSTM [1] Chair→Table | 3.31 | 2.56 | 3.45 |
| RGL [2] Chair→Table | 0.84 | 17.44 | 23.12 |
| Score-PA [3] Chair→Table | 1.32 | 19.23 | 29.45 |
| IET [4] Chair→Table | 0.83 | 48.23 | 20.34 |
| 3DHPA [5] Chair→Table | 0.86 | 36.77 | 34.68 |
| Ours Chair→Table | 0.78 | 40.14 | 36.62 |
| Ours trained from scratch | 1.43 | 15.06 | 22.90 |

[1] Zhan, Guanqi, et al. Generative 3d part assembly via dynamic graph learning. NeurIPS 2020.

[2] Narayan, et al. Rgl-net: A recurrent graph learning framework for progressive part assembly. WACV 2022.

[3] Cheng, Junfeng, et al. Score-pa: Score-based 3d part assembly. BMVC 2023

[4] Zhang, Rufeng, et al. 3d part assembly generation with instance encoded transformer. IEEE Rob. Autom. Lett., 2022, 7(4): 9051-9058.

[5] Du, Bi'an, et al. Generative 3d part assembly via part-whole-hierarchy message passing. CVPR 2024.

[6] Xu, Boshen, et al. SPAFormer: Sequential 3D Part Assembly with Transformers. 3DV 2025.

Q2: The qualitative results demonstrated in Figure 4 feel a bit underwhelming.

Unconditional part assembly is inherently challenging due to the complexity of understanding assembly patterns and accurately placing symmetric or similar parts. Unlike conditional methods, this task operates without predefined guidance, requiring the model to learn intricate relationships between parts.

From a qualitative perspective, our model demonstrates better connectivity and more reasonable handling of symmetric or similar parts. For example, in the Chair category, it predicts more accurately aligned legs with stronger connectivity. In the Table category, wheels are placed more reasonably, and for Lamps, lampshades are positioned more effectively. These results highlight the robustness of our approach in capturing meaningful part relationships and producing plausible assemblies.

Quantitatively, our model consistently outperforms baseline methods across multiple metrics as detailed in experiment results of the manuscript. These numerical results further validate the effectiveness of our approach in achieving accurate and robust part assemblies.

Q3: How do you account for multiple possible assemblies of the same set of parts? How can the desired final assembly be specified through conditioning?

Our method explicitly accounts for multiple possible assemblies. As in Eq. (14), we consider diverse configurations of similar parts to be valid and constrain the prediction of each part to be consistent with the ground truth of its similar counterparts. This ensures structural plausibility while accommodating the inherent ambiguity of symmetric or geometrically similar parts.

To further address potential collapse caused by identical predictions for similar parts, we incorporate part instance encoding during the refinement stage. This mechanism distinguishes individual parts and avoids over-simplification. Additionally, we employ a shape loss (Eq.(12)) to enforce global structural consistency, ensuring that the overall assembly remains coherent while preserving diversity.

3D part assembly with conditions, such as images or textual descriptions, offers a promising approach to achieve the desired final configuration. For example, establishing 2D-3D correspondences through projection methods can effectively constrain the uniqueness of 3D assemblies. Additionally, leveraging deterministic relational links between parts in images to guide the modeling of inter-part relationships in 3D space is a worthwhile direction for further investigation.

Currently, we focus on unconditional part assembly, where predictions are made without external guidance. We greatly appreciate your suggestion, as integrating conditional information to enable unique and deterministic predictions represents a highly meaningful and valuable avenue for future exploration.

We sincerely thank you for your thoughtful feedback and valuable suggestions. We will incorporate the additional analysis and experiments into our revised manuscript.

Comment

Thank you for your responses -- they addressed my concerns well. Would love to see some of these new results in the final version of the paper. I will be updating the score accordingly.

Comment

Dear Reviewer 7yWx,

We sincerely appreciate your thoughtful feedback and the time you invested in reviewing our responses. We are delighted to hear that our responses have addressed your concerns.

Your suggestion to include some of the new results in the final version of the paper is highly insightful, and we will incorporate these results and analyses into the revised manuscript to enhance its overall quality.

Thank you once again for your constructive review and for updating the score.

Best regards,

The authors of Paper 385

Official Review
Rating: 5

This paper introduces CFPA, a two-stage deep learning framework for 3D shape assembly that predicts poses for individual parts to reconstruct coherent 3D shapes. The key innovation lies in constructing semantic super-parts through optimal transport to capture high-level object structure, followed by dual-range feature propagation that integrates both local and global structural information. The method addresses critical challenges in part assembly including complex geometric dependencies, symmetry handling, and the need for both precise local reasoning and global structural understanding. Experimental results on the PartNet benchmark demonstrate state-of-the-art performance across Chair, Table, and Lamp categories, achieving superior assembly accuracy, structural consistency, and assembly diversity compared to existing methods.

Strengths And Weaknesses


Strengths

  • Novel Two-Stage Framework with Semantic Abstraction. The paper introduces a principled coarse-to-fine architecture that first constructs semantic super-parts via an entropy-regularized optimal transport formulation and then refines part poses through cross-stage attention and instance-level encoding. This dual-stage design effectively balances global structural understanding with precise local geometric reasoning.
  • Comprehensive Experimental Validation. Extensive experiments on the PartNet benchmark across Chair, Table, and Lamp categories demonstrate SOTA performance in assembly accuracy (SCD, PA, CA), structural consistency, and assembly diversity (QDS, WQDS). The ablation studies systematically quantify the contribution of each component, highlighting the authors' reasons for choosing each module.

Weaknesses

  • Insufficient Theoretical Foundations for Design Choices While the authors provide comprehensive empirical validation demonstrating the effectiveness of each proposed module, the paper would benefit significantly from theoretical analysis or intuitive explanations underlying the design principles and mechanisms. For instance, the optimal transport formulation for semantic super-part construction, the dual-range feature propagation mechanism, and the cross-stage attention design would all benefit from theoretical justification or principled intuition beyond empirical effectiveness. This theoretical gap potentially constrains the generalizability and interpretability of the proposed framework to new scenarios or domains.

  • Limited Novelty Claims Regarding Symmetry-Aware Loss Terms The symmetry-aware loss formulation presented in this work appears to have been explored in prior literature. For example, works such as Manual-PA[1] and other contemporary studies have previously addressed the handling of part self-symmetry and inter-part geometric similarity through specialized loss terms. The reflective symmetry loss and geometric similarity constraints have been incorporated into part assembly frameworks to address pose ambiguities arising from symmetric components. To substantiate the claimed contributions in symmetry-aware modeling, the authors should provide more rigorous comparative analysis or experimental validation demonstrating the specific advantages and innovations of their proposed symmetry-aware loss formulation compared to these existing symmetry-handling approaches. This could include detailed ablation studies comparing different symmetry loss formulations, theoretical analysis of the proposed approach's superiority, or empirical evidence showing improved performance over prior symmetry-aware methods.

  • Computational and Memory Overhead. The construction of semantic super-parts via Sinkhorn's algorithm, dual-range propagation, and cross-stage attention introduces multiple heavyweight modules and hyperparameters (e.g., number of super-parts, entropy regularization), potentially limiting scalability to larger part sets or high-resolution meshes. Additionally, for the self-symmetry part, the proposed method needs to compute eight forward passes, which seems quite heavy.

[1] Zhang, Jiahao, et al. "Manual-PA: Learning 3D Part Assembly from Instruction Diagrams." arXiv preprint arXiv:2411.18011 (2024).

Questions

  • What is the physical meaning and intuitive interpretation of the super-part features constructed via the optimal transport formulation? How do these semantic super-parts conceptually differ from simpler global feature aggregation approaches [1] or from graph neural network-based feature aggregation and message passing between parts [2]? What are the fundamental theoretical advantages and principled design motivations behind constructing these intermediate semantic super-parts?

  • While the empirical results demonstrate the effectiveness of optimal transport-based super-part construction, could the authors provide theoretical analysis or intuitive explanations for why the entropy-regularized optimal transport formulation is particularly well-suited for this hierarchical part grouping task? And why can the inner product serve as a good cost function? A deeper theoretical understanding of the underlying mechanisms and design principles would significantly strengthen the paper's contribution beyond purely empirical validation.

  • The proposed symmetry-aware loss enumerates all eight axis-aligned reflection transformations and selects the one with minimal pose error. However, since reflections are not elements of the SE(3) group, this exhaustive search may introduce unintended chirality issues, potentially producing assemblies with inverted handedness. Moreover, real-world part point clouds can assume arbitrary orientations and need not align with the canonical axes; it is unclear whether the fixed set of eight reflections suffices to capture all symmetry variants under arbitrary poses. Finally, this approach incurs additional computational overhead due to repeated network forward passes for each candidate reflection. Could the authors provide:

    1. A theoretical discussion on how reflecting parts outside SE(3) affects the preservation of chirality and whether any mechanisms ensure consistency of handedness across the assembled shape?

    2. Empirical or analytical justification that the eight axis-aligned reflections are sufficient to cover the full space of self-symmetries for arbitrarily oriented part point clouds?

    3. An evaluation of the added computational cost from multiple forward passes, and potential strategies (e.g., grouping symmetric variants or embedding symmetry into the architecture) to mitigate this overhead while preserving symmetry-aware supervision?

  • In Eq. 14, there may be a typo: the subscript $j$ should be $g$.

  • In Eq. 14, could the authors clarify why the training objective minimizes pose differences between geometrically similar parts rather than encouraging them to occupy distinct positions? Intuitively, one would expect interchangeable components, such as the four legs of a chair, to be driven toward different spatial locations instead of collapsing into the same pose. What is the motivation for favoring minimal pose discrepancy among symmetric parts, and have you considered an alternative formulation that explicitly promotes diversity in their placements?

[1] Wu, Ruihai, et al. "Leveraging se (3) equivariance for learning 3d geometric shape assembly." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[2] Scarpellini, Gianluca, et al. "Diffassemble: A unified graph-diffusion model for 2d and 3d reassembly." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Limitations

Yes

Final Justification

This paper addresses an important problem in 3D vision and robotics, namely multipart assembly. It introduces CFPA, a two-stage deep learning framework for 3D shape assembly that predicts poses for individual parts to reconstruct coherent 3D shapes. The authors design several modules tailored to the characteristics of this problem and provide extensive experimental evidence demonstrating their effectiveness. The rebuttal is thorough and addresses most of my concerns. Therefore, I recommend that this paper be accepted.

Formatting Issues

No

Author Response

Q1: Physical meaning & intuitive interpretation of super-parts from OT.

Physically, as in Eq. (1), each super-part is a weighted aggregation of basic part features learned via optimal transport, forming compact prototypes that each encode multiple related parts. Intuitively, these super-parts group functionally or geometrically similar parts into semantic clusters, providing mid-level abstractions without predefined hierarchies.

Visualization: figures are not allowed in the rebuttal, but we have visualized the correspondence between parts and their nearest super-parts identified by the transport map $T$. The figures show: 1) parts with similar functions (e.g., chair legs, backs, and seats) tend to cluster within the same super-part; 2) unique or uncommon parts (e.g., decorative elements) tend to be assigned to one specific super-part; 3) the super-parts exhibit semantic consistency across shapes, consistently grouping functionally similar parts.

Q2: Difference of super-parts from global [1] or GNN-based aggregation [2]. Motivation & advantage of intermediate super-parts.

Difference: unlike global feature aggregation [1], which compresses part information into a single vector and loses intermediate structure, our super-parts preserve mid-level semantics by clustering parts into coherent groups. Compared to the GNN-based method [2], which models inter-part relations via message passing but may suffer from over-smoothing, we build hierarchical representations that encode structural relationships, bridging local details and global organization.

Motivation & advantage: object parts often exhibit similarities in appearance or functionality, naturally forming semantic groups. These relationships provide a foundation for a coarse-to-fine approach to part assembly, where intermediate representations help progressively refine local details into a coherent structure. This hierarchical design enhances semantic consistency across parts and guides precise part assembly, as validated by ablations 1)-3) in the manuscript.

[1] Wu, R., et al. Leveraging se(3) equivariance for learning 3d geometric shape assembly.

[2] Scarpellini, G., et al. Diffassemble: a unified graph-diffusion model for 2d and 3d reassembly.

Q3: Why use entropy-regularized OT & inner product.

OT offers a flexible, probabilistic alignment of parts to super-parts, enabling soft associations compared to rigid clustering or geometry-based grouping. Entropy regularization ensures smooth assignments, avoids sparsity, and enables efficient Sinkhorn-based computation.
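For reference, the standard entropy-regularized OT objective that this description corresponds to can be written as follows (notation ours; the paper's Eq. (2) may differ in details):

$$
\min_{T \ge 0}\ \langle T, C \rangle - \varepsilon H(T)
\quad \text{s.t.}\quad T\mathbf{1}_M = a,\ \ T^{\top}\mathbf{1}_N = b,
\qquad C_{ij} = -\langle f_i, h_j \rangle,
$$

where $H(T) = -\sum_{i,j} T_{ij}(\log T_{ij} - 1)$ is the entropy term, $a$ and $b$ are the part and super-part marginals, and $\varepsilon$ controls the smoothness of the assignment.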

Replacing OT with k-means or geometric grouping (ablations 2)-3) in the manuscript), or replacing the entropy-regularized transport map with an attention-based map as in [3] (CFPA-w/-Att.-map in Table R1-1), results in worse performance, highlighting the effectiveness of entropy-regularized OT.

The inner product measures compatibility between a part and a super-part by capturing directional similarity in feature space. Its linearity ensures computational efficiency and stability during training.

Replacing the inner product with an $L_2$ distance-based cost (CFPA-w/-$L_2$-dis. in Table R1-1) results in worse performance, validating its suitability for this task.

Table R1-1. Results on Chair

| Method | SCD ($10^{-2}$)↓ | PA (%)↑ | CA (%)↑ | QDS ($10^{-5}$)↑ | WQDS ($10^{-5}$)↑ |
| --- | --- | --- | --- | --- | --- |
| CFPA-w/-Att.-map | 0.50 | 69.87 | 49.07 | 5.87 | 2.60 |
| CFPA-w/-$L_2$-dis. | 0.51 | 69.23 | 48.85 | 5.08 | 2.14 |
| CFPA | 0.49 | 69.24 | 49.20 | 6.71 | 2.75 |

[3] Xin W., et al. Learning generalizable part-based feature representation for 3d point clouds.

Q4: Physical meaning & intuitive interpretation of dual-range feature propagation & cross-stage attention.

Physically, dual-range feature propagation allows each part to capture both global semantic structure (Eqs. (3-4)) and local geometric details (Eq. (5)), bridging global and local contexts. Cross-stage attention leverages coarse-stage semantic cues to refine part features, enabling more accurate pose estimation and consistent assembly.

Intuitively, dual-range feature propagation enables each part to integrate local details with global structure, while cross-stage attention ensures that feature refinement maintains semantic consistency across stages.
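As a schematic illustration only (not the paper's exact Eqs. (3)-(5)), the two ranges can be pictured as a nearest-super-part lookup plus an attention pass over all super-parts; all names below are illustrative.

```python
import torch
import torch.nn.functional as F

def dual_range_propagate(parts, supers, T):
    """Schematic dual-range update (illustrative, not the paper's
    exact Eqs. (3)-(5)).

    parts:  (N, d) basic part features.
    supers: (M, d) super-part features.
    T:      (N, M) transport map from the OT step.
    """
    d = parts.shape[1]
    # Short range: each part pulls features from its nearest super-part,
    # where "nearest" is read off the transport map.
    nearest = T.argmax(dim=1)                              # (N,)
    local = supers[nearest]                                # (N, d)
    # Long range: scaled dot-product attention over all super-parts
    # gathers holistic, shape-level context for every part.
    attn = F.softmax(parts @ supers.T / d ** 0.5, dim=1)   # (N, M)
    global_ctx = attn @ supers                             # (N, d)
    return parts + local + global_ctx                      # fused features
```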

Q5: Analysis/validation on symmetry-aware loss vs other symmetry-handling methods.

Analysis: IET [5] uses Chamfer Distance (CD) and instance encoding to address self-symmetry and distinguish similar parts; Li et al. [6] and Manual-PA [4] use the Hungarian algorithm for one-to-one matching, with Manual-PA further introducing an InfoNCE loss to reduce matching uncertainty. In contrast, we minimize CD over all flipped variants (Eq. (13)) to explicitly model multiple symmetric solutions and avoid bias from uneven point sampling. For geometrically similar parts, we relax strict matching (Eq. (14)) to allow interchangeable roles, enabling structurally equivalent assemblies.

Validation: replacing the symmetry-handling loss in Eq. (15) with alternatives from [4-6] (CFPA-w/-SL in [4]/[5]/[6] in Table R1-2) improves performance over the baseline without symmetry handling (CFPA-w/o-SL). CFPA with our symmetry-aware loss achieves the best results, demonstrating the benefit of explicitly modeling symmetric configurations.

Table R1-2. Results with symmetry-handling losses

| Method | SCD ($10^{-2}$)↓ | PA (%)↑ | CA (%)↑ | QDS ($10^{-5}$)↑ | WQDS ($10^{-5}$)↑ |
| --- | --- | --- | --- | --- | --- |
| CFPA-w/o-SL | 0.52 | 67.51 | 47.17 | 4.60 | 1.99 |
| CFPA-w/-SL in [4] | 0.51 | 68.72 | 44.87 | 4.80 | 2.02 |
| CFPA-w/-SL in [5] | 0.52 | 68.74 | 48.42 | 4.86 | 2.06 |
| CFPA-w/-SL in [6] | 0.50 | 69.17 | 48.35 | 5.23 | 2.22 |
| CFPA | 0.49 | 69.24 | 49.20 | 6.71 | 2.75 |

[4] Zhang, J., et al. Manual-PA: Learning 3D Part Assembly from Instruction Diagrams.

[5] Zhang, R., et al. 3d part assembly generation with instance encoded transformer.

[6] Li, Y., et al. Learning 3d part assembly from a single image.

Q6: Are 8 reflections sufficient? How do reflections outside SE(3) affect chirality preservation? What is the computational cost of multiple forward passes, and what are potential mitigation strategies?

Sufficiency of the eight axis-aligned reflections is guaranteed because all input parts are pre-aligned using a chirality-preserving PCA method as in DGL [7], before being fed into the model.

Chirality is preserved during both training and inference: 1) for inference, input parts are pre-aligned using a chirality-preserving PCA as in DGL [7], and the model predicts rotations as quaternions, ensuring no chirality inversion occurs; 2) for training, although Eq. (13) formulates the reflection-based symmetry constraint as applying reflections to the input, in practice, for computational efficiency, we apply these reflections only to the ground-truth poses of self-symmetric parts.
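A minimal sketch of this reflection-based supervision, under our reading of Eq. (13): the eight sign matrices diag(s), s ∈ {±1}³, are applied to the ground-truth points, and the minimum Chamfer distance is kept. The `chamfer` helper is illustrative, not a specific library call.

```python
import itertools
import torch

def chamfer(x, y):
    """Symmetric Chamfer distance between point sets x (P, 3) and y (Q, 3)."""
    d = torch.cdist(x, y)                       # (P, Q) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def symmetry_aware_cd(pred_pts, gt_pts):
    """Minimum Chamfer distance over the eight axis-aligned reflections
    applied to the ground-truth points. Assumes parts are pre-aligned to
    canonical axes, as stated in the rebuttal."""
    losses = []
    for signs in itertools.product((1.0, -1.0), repeat=3):
        R = torch.diag(torch.tensor(signs))     # reflection (or identity) matrix
        losses.append(chamfer(pred_pts, gt_pts @ R))
    return torch.stack(losses).min()
```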

Computational cost: in our implementation, the potential overhead of symmetry-aware supervision is significantly reduced by applying the axis-aligned reflections only to the ground-truth poses of self-symmetric parts rather than reflecting the input parts. As shown in Table R1-4 (last two rows), the symmetry-aware loss does not increase the parameter count but slightly increases GPU memory (26.60 GB → 27.04 GB) and forward time (802.72 ms → 806.80 ms).

Other strategies: 1) designing symmetry-equivariant architectures to ensure consistent outputs under input reflections; 2) grouping symmetric variants by aggregating losses or sampling representatives; 3) employing hierarchical processing to first capture global symmetries before refining local variations.

[7] Zhan, G., et al. Generative 3d part assembly via dynamic graph learning.

Q7: Why not encourage distinct placement for collapse?

By minimizing the difference between each prediction and its most similar ground-truth counterpart (Eq. (14)), we accommodate multiple plausible part configurations induced by geometric similarity. The part-wise instance encoding and shape loss (Eq. (12)) further maintain coherent global arrangements without collapsing similar parts into identical poses.
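A sketch of this relaxed matching as we read Eq. (14); `pose_dist` and all other names are hypothetical placeholders.

```python
import torch

def similar_part_loss(pred_poses, gt_poses, groups, pose_dist):
    """Relaxed matching for geometrically similar parts: each part's
    prediction is scored against the ground-truth poses of every part in
    its similarity group, and the minimum distance is kept.

    pred_poses, gt_poses: (N, 7) tensors (e.g. quaternion + translation).
    groups:    list of index lists, one per geometric-similarity group.
    pose_dist: callable returning a scalar tensor for a (pred, gt) pair.
    """
    total = pred_poses.new_zeros(())
    count = 0
    for group in groups:
        for i in group:
            # nearest ground-truth counterpart within the group
            total = total + min(pose_dist(pred_poses[i], gt_poses[g]) for g in group)
            count += 1
    return total / max(count, 1)
```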

As shown in Table R1-3, replacing the symmetry-aware loss $\mathcal{L}_{sym}$ in Eq. (15) with the distinct-placement constraint of [8] (CFPA-w/-SL in [8]) or with an $L_2$ loss on the predictions of geometrically similar parts (CFPA-w/-SL-$L_2$) results in worse performance.

Table R1-3. Results with constraints for geometrically similar parts

| Method | SCD ($10^{-2}$)↓ | PA (%)↑ | CA (%)↑ | QDS ($10^{-5}$)↑ | WQDS ($10^{-5}$)↑ |
| --- | --- | --- | --- | --- | --- |
| CFPA-w/-SL in [8] | 0.51 | 68.82 | 45.81 | 4.82 | 2.04 |
| CFPA-w/-SL-$L_2$ | 0.55 | 67.67 | 42.59 | 3.27 | 1.45 |
| CFPA | 0.49 | 69.24 | 49.20 | 6.71 | 2.75 |

[8] Zhang, R., et al. Scalable geometric fracture assembly via co-creation space among assemblers.

Q8: Computational cost of modules.

Table R1-4 presents the computational cost of CFPA and its variants without key modules: semantic super-parts via OT (CFPA-w/o-OT), short-/long-range feature propagation (CFPA-w/o-SRFP/LRFP), message passing (CFPA-w/o-MP), cross-stage attention (CFPA-w/o-CA), and the symmetry-aware loss (CFPA-w/o-SL). The baseline model refers to CFPA without any of these modules. '#Para.(M)' is the number of parameters in millions. All variants are tested on Chair with batch size 64, and the forward time is measured for one iteration.

These modules operate at the part or super-part level rather than on the point cloud, so the overhead introduced by these components in the full CFPA remains manageable and does not significantly impact scalability.

Table R1-4. Computational cost

| Method | #Para. (M) | GPU Memory (GB) | Forward time (ms) | GFLOPS |
| --- | --- | --- | --- | --- |
| CFPA-w/o-OT | 25.89 | 26.80 | 657.87 | 86.34 |
| CFPA-w/o-SRFP | 25.89 | 26.36 | 804.91 | 86.34 |
| CFPA-w/o-LRFP | 25.76 | 24.68 | 778.68 | 86.33 |
| CFPA-w/o-MP | 25.89 | 24.60 | 791.31 | 86.34 |
| CFPA-w/o-CA | 25.63 | 25.71 | 698.05 | 86.33 |
| CFPA-w/o-SL | 25.89 | 26.60 | 802.72 | 86.34 |
| CFPA | 25.89 | 27.04 | 806.80 | 86.34 |

We will incorporate the above analyses and experiments into the manuscript and correct the typo.

Comment

Reviewer jLDf, please check the rebuttal and share your thoughts. We have less than 24 hours left for the discussion.

Final Decision

This paper introduces a coarse-to-fine framework for 3D part assembly. The key innovation is the two-stage framework incorporating super-part construction followed by part pose refinement. Most reviewers appreciated the novelty and the comprehensive evaluations, especially the comparisons with baselines, although there was a concern about the lack of theoretical foundations for certain design choices. All reviewers gave positive scores, so this submission is a clear accept. However, the technical innovation is not at the level to be recommended for a spotlight or oral presentation. During the author-reviewer discussion, the authors provided additional experiments, such as computational cost analysis, comparisons between the proposed symmetry-aware loss and previous losses, cross-category few-shot transfer results, and ablation studies. These additional results led the reviewers to raise their scores.