BiAssemble: Learning Collaborative Affordance for Bimanual Geometric Assembly

Review

Rating: 6; Confidence: 4; Date: 2024-10-21

This paper presents a dual-arm manipulation method for reattaching two parts that were separated from a single object, joining them along the separated surfaces. The method divides the manipulation procedure into three subtasks: pick-up, alignment, and assembly. The affordance network outputs the grasp poses while taking the alignment and assembly stages into account. VN-DGCNN, a cVAE, and PointNet++ generate the input to the affordance network from the observed point cloud. The proposed method outperformed the baselines in the experiments.

Strengths

The paper presents a novel manipulation learning problem using shape assembly tasks in computer vision.
The paper proposes a manipulation learning framework to solve robotic shape assembly tasks.
The paper tackles a challenging task: the dual-arm robot should accurately control the object's pose from the observed point cloud.

Weaknesses

The paper claims that geometric assembly can be used in practical applications, but it does not explain which applications. It would also be better to highlight the academic significance of robotic geometric assembly. This reviewer recommends the authors provide specific examples of potential real-world applications for robotic geometric assemblies and elaborate on how this work advances the field theoretically or methodologically, highlighting its academic contributions.
The success ratio of the proposed method is 24.10%, which looks low. Humans may be able to perform the task at 100%. The task may be extremely challenging, unlike 2D pushing or pick-and-place tasks.
Robotic parts assembly includes peg insertion, furniture assembly, and geometric assembly. The paper does not explain where robotic geometric assembly sits among robotic manipulation tasks. The reviewer recommends the paper include a brief comparison of different types of robotic assembly tasks, highlighting how geometric assembly differs from or relates to other assembly tasks like peg insertion or furniture assembly. This would help readers better understand the unique challenges and contributions of this work.

Questions

Why is the success ratio low? Is the task too challenging? Which component failed, for example grasp planning or object recognition? This reviewer recommends the authors provide a breakdown of failure modes or an ablation study showing the performance of individual components. This would help pinpoint where the main challenges lie and guide future improvements.
In which applications can robotic geometric assembly be used?
Did the affordance network output the grasp action stably? Or did it sometimes fail? Are there any metrics related to the stability of the affordance network's outputs, such as the variance in grasp predictions across multiple runs? The reviewer recommends the authors provide a more detailed error analysis for the affordance network specifically, which would give readers a better understanding of its performance and limitations.

Comment: Response to Reviewer qozA [Part 2/2]

2024-11-22

--- More Ablation Studies

We have conducted additional ablation studies, with detailed quantitative results provided in Table 4 and Table 5 in Appendix G. The ablations are as follows:

(1) w/o Affordance Network: During inference, we do not use the trained Affordance Network to highlight actionable regions. Instead, we randomly sample a contact point on the part. The results show a significant drop in success rates, which decrease to 4.60% on training categories and 2.80% on unseen categories. This demonstrates that the Affordance Network plays a crucial role in filtering out non-graspable points and points that are unsuitable for the subsequent assembly process.

(2) w/o Transformation Predictor: In this ablation, we remove the Transformation Predictor during inference. This results in success rates of 7.40% on training categories and 4.80% on unseen categories, both substantially lower than our original method. These results show that the Transformation Predictor plays an essential role in predicting alignment poses, enabling the robot to manipulate parts from their initial to alignment poses without collisions.

(3) w/ heuristic $v$: In this case, we remove the Disassembly Predictor during inference. Instead, we compute the center of each part from the imaginary assembled shape $S$ by averaging the part points, and then use the relative direction of the two parts' centers as the disassembly direction $v$. This ablation achieves success rates of 19.70% on training categories and 15.20% on unseen categories, which are lower than those of our method. The results indicate that while the calculated relative direction can approximate the relative position of the two parts, it is not sufficiently accurate to replace the assembly direction required in our task, highlighting the importance of the Disassembly Predictor for better performance.
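The heuristic in ablation (3) can be illustrated with a minimal sketch. The function names, the plain-tuple point format, and the sign convention (pointing from the first part toward the second) are assumptions made for illustration, not details taken from the paper.

```python
import math


def centroid(points):
    """Mean of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def heuristic_direction(part_a, part_b):
    """Unit vector from the centroid of part_a to the centroid of part_b.

    Assumption: both parts are given in the frame of the imaginary
    assembled shape, as lists of (x, y, z) tuples.
    """
    ca, cb = centroid(part_a), centroid(part_b)
    diff = tuple(b - a for a, b in zip(ca, cb))
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("centroids coincide; direction is undefined")
    return tuple(d / norm for d in diff)


# Toy example: two parts offset along the x axis.
left = [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
right = [(4.0, 0.0, 0.0), (4.0, 2.0, 0.0)]
print(heuristic_direction(left, right))  # (1.0, 0.0, 0.0)
```

Because the centroid direction ignores the local geometry of the seam, it is plausible that it deviates from a collision-free separation direction for curved or interlocking fractures, which is consistent with the drop reported above.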

More detailed scores including per-category accuracy can be found in Table 4 and Table 5 in Appendix G.

Q3 The reviewer recommends the paper include a brief comparison of different types of robotic assembly tasks, highlighting how geometric assembly differs from or relates to other assembly tasks like peg insertion or furniture assembly.

We have added descriptions and comparisons of the different assembly tasks to the Introduction and Related Work sections; these additions are marked in red in the revised version of our paper.

Q4 Did the affordance network output the grasp action stably?... The reviewer recommends the authors provide a more detailed error analysis for the affordance network specifically.

Thank you for this insightful suggestion. The predicted actions can vary across multiple runs due to the inherent randomness in the inference process. Specifically, in our implementation, after the Affordance Network generates the affordance map, we randomly select a point from the top 5% of points with the highest affordance scores as the contact point. Additionally, the Actor Network, implemented as a conditional variational autoencoder (cVAE), produces different actions depending on the sampled Gaussian noise $z$. As a result, even with the same initial setup, the outcomes may differ across multiple runs.
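The contact-point selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the `top_fraction` parameter, rounding the cutoff up, and uniform sampling within the top set are all hypothetical choices, since the response only states that a point is picked at random from the top 5%.

```python
import math
import random


def sample_contact_point(scores, top_fraction=0.05, rng=random):
    """Return the index of a point drawn uniformly from the top-scoring fraction.

    scores: per-point affordance scores (one float per point).
    rng: any object with a choice() method, e.g. random.Random(seed).
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    # Keep at least one candidate even for very small point clouds.
    k = max(1, math.ceil(top_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return rng.choice(ranked[:k])


# Toy example: 100 points whose score grows with the index.
scores = [i / 100 for i in range(100)]
index = sample_contact_point(scores, rng=random.Random(0))
print(index)  # one of 95..99
```

Passing a seeded `random.Random` makes a run repeatable, which is one way to separate this source of variability from the cVAE noise when analyzing stability.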

To analyze this variability, we conducted an experiment using 500 different scenario initializations. For each scenario (where the fractured parts and their poses remain identical), we ran the model three times and calculated the success rate distribution. After excluding scenarios that were nearly impossible to complete, we found the following: 8.6% of scenarios were successful in only one out of three trials, 12.6% were successful in two out of three trials, and 78.8% were successful in all three trials. These results indicate that while our method exhibits variability due to the random sampling of points from the top 5% of the affordance map and the stochastic nature of the generative model (cVAE), its overall performance is stable across multiple runs.
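As a rough reading of these figures, assume the three percentages are shares of the retained scenarios (they sum to 100%, so under this reading no retained scenario failed all three runs). The implied average per-run success rate on that retained subset then follows by arithmetic on the reported numbers alone; it is not an additional experimental result.

$$
\bar{k} = 1 \times 0.086 + 2 \times 0.126 + 3 \times 0.788 = 2.702, \qquad \frac{\bar{k}}{3} \approx 0.901
$$

Under this assumption, roughly 90% of individual runs succeed on the retained scenarios. This is a conditional figure and should not be compared directly with the overall 24.10% success rate.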

Comment: Response to Reviewer qozA [Part 1/2]

2024-11-22

Thank you for your detailed review and constructive suggestions. Your valuable feedback has helped us improve our work, and we have addressed all your questions and comments in the following responses. The changes are highlighted in red in the revised manuscript for your convenience.

W1 & Q2 Explain which application robotic geometric assemblies are used in. It is better to highlight the academic significance of the robotic geometric assemblies.

Thank you for this valuable suggestion. Referring to our BreakingBad dataset paper [Paper-1] and a recent related work [Paper-2], robotic geometric assemblies have potential applications in several practical domains: (1) reassembling archaeological artifacts such as pottery, (2) performing industrial tasks that involve assembling irregularly shaped objects, (3) aligning bone fragments to assist in bone reduction surgery, (4) restoring fragments of walls and buildings, and (5) reconstructing fossils from fragments in paleontology. These examples highlight the practical significance of robotic geometric assembly and its potential impact across multiple fields. We have revised the Introduction section of our paper to include more detailed potential applications. Thanks again for your valuable feedback.

W2 & Q1 The success ratio of the proposed method is 24.10%... Why is the success ratio low?

Thank you for these valuable questions. Below, we first provide a detailed analysis of failure cases and then present additional ablation results.

--- Analysis of Failure Cases

As you noted, this task is extremely challenging compared with 2D pushing or pick-and-place tasks. The relatively low scores across all models and baselines stem primarily from the diverse and complex nature of our geometric shape assembly task. This task involves parts with highly varied fracture patterns across multiple categories, including some fractured parts that are nearly impossible to grasp or assemble. For instance, in certain cases, the graspable regions of a part completely overlap with its seam areas, making it extremely challenging to avoid collisions during assembly.

To provide a more detailed analysis of failure cases and illustrate the inherent difficulty of the task with scenarios that are particularly challenging for robots to figure out, we have revised Appendix E (Failure Cases). Additionally, we provide insights into potential future improvements to address these complexities more effectively:

Hard to Grasp:

(1). Heavy or Smooth-Surfaced Parts. Fractured parts that are heavy or have smooth surfaces often result in grasping failures. For instance, as shown in Figure 7(a) in Appendix E, categories such as teapots and vases, which are relatively large and feature smooth curved surfaces, exhibit notably high failure rates during grasping.

(2). Flat Parts. Flat fractured parts, particularly some shapes in categories like statues and mugs, are challenging to pick up due to the limited gripping area. For example, as shown in Figure 7(b) in Appendix E, the statue part on the left is too close to the desktop and has a very small thickness, which prevents the gripper from grasping it. Similarly, in (c), the handle fragment on the right is too flat, making it impossible for the gripper to grasp it. A potential solution is incorporating pre-grasp operations, such as moving the fractured part to the table edge, allowing the shape to hang off slightly and thus become graspable.

Hard to Assemble:

(3). Graspable Regions Overlapping Seam Areas. When the graspable regions of a fractured part align with its seam areas, collisions during assembly become frequent. This issue is common in categories such as wineglasses, mugs, and bowls. For example, as shown in Figure 7(d), the left gripper avoids collision-prone regions, but the right gripper must grasp the neck of the wine bottle. Similarly, in (e), while the left gripper avoids collisions, the right gripper ends up grasping the handle of a mug. A potential solution is to perform a series of pick-and-place operations to adjust the object's initial pose. This adjustment can reduce the overlap between the object's graspable regions and seam areas, thereby minimizing collisions during the assembly process.

(4). Complex Object Shapes. Objects with intricate shapes, like those in the statues category, pose challenges due to irregular edges and complex curves. Such designs increase the difficulty of alignment and manipulation, leading to higher failure rates during assembly.

(5). Relative Displacement During Operations. Relative displacement between the gripper and fractured parts often occurs due to small contact areas and insufficient support, which can cause sliding or tipping during manipulation. For example, wine bottles with narrow necks have an unstable center of gravity, which makes the gripper prone to sliding during movement and leads to operational failures.

2024-11-30

As the discussion phase is drawing to a close, we would like to know whether our response has addressed your concerns. If you have any further questions or suggestions, we would be more than happy to continue the discussion. Thank you again for your constructive feedback, and we look forward to hearing from you.

Review

Rating: 6; Confidence: 4; Date: 2024-10-30

This paper focuses on the shape assembly task for reconstructing broken objects. This paper proposes a multi-stage BiAssembly framework to complete this task. The BiAssembly framework first gets an imaginary assembled shape using SOTA methods, then predicts the disassembly direction, alignment pose transformation, pick-up affordance, and finally the gripper alignment and assembly poses. Additionally, this paper introduces a real-world framework. The experimental results show that the BiAssembly framework surpasses previous methods.

Strengths

The paper is well-written overall, with technical points and experiments clearly articulated.
The framework is feasible for shape assembly, and its performance surpasses previous heuristic or policy-based methods, according to the outcomes in the paper.

Weaknesses

The multi-stage framework involves some assumptions, such as the object having two broken parts, the imaginary assembled shape being obtainable in advance, and the robot needing to follow the alignment and assembly process. This means that the framework may work well in this specific task, perhaps benefiting from pre-set assumptions, but it may not generalize to other scenarios, such as a cup breaking into several pieces.
I believe that the performance of this framework is affected by the quality of the imaginary assembled shape, which may be more difficult to achieve than the subsequent processes. Discussing this aspect would be helpful for this paper.
Although the results show that the performance of this framework surpasses previous methods, they are not good enough (only an average of 24). Moreover, there are no quantitative experimental results available for real-world experiments.

Questions

See Weaknesses.

Comment: Response to Reviewer qij6 [Part 3/3]

2024-11-22

[Paper-1] Silvia Sellán, Yun-Chun Chen, Ziyi Wu, Animesh Garg, and Alec Jacobson. Breaking Bad: A Dataset for Geometric Fracture and Reassembly. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.

[Paper-2] Ruihai Wu, Chenrui Tie, Yushi Du, Yan Zhao, and Hao Dong. Leveraging SE(3) Equivariance for Learning 3D Geometric Shape Assembly. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14311–14320, 2023.

[Paper-3] Jiaxin Lu, Yifan Sun, and Qixing Huang. Jigsaw: Learning to Assemble Multiple Fractured Objects. Advances in Neural Information Processing Systems, 36, 2024.

[Paper-4] Theodore Tsesmelis, Luca Palmieri, Marina Khoroshiltseva, Adeela Islam, Gur Elkin, Ofir Itzhak Shahar, et al. Re-assembling the past: The RePAIR dataset and benchmark for real world 2D and 3D puzzle solving. In Neural Information Processing Systems Datasets and Benchmarks Track, 2024.

[Paper-5] Gianluca Scarpellini, Stefano Fiorini, Francesco Giuliari, Pietro Morerio, and Alessio Del Bue. DiffAssemble: A Unified Graph-Diffusion Model for 2D and 3D Reassembly. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[Paper-6] Ruocheng Wang, Yunzhi Zhang, Jiayuan Mao, Ran Zhang, Chin-Yi Cheng, and Jiajun Wu. IKEA-Manual: Seeing Shape Assembly Step by Step. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.

[Paper-7] Issei Sera, Natsuki Yamanobe, Ixchel G. Ramirez-Alpizar, Zhenting Wang, Weiwei Wan, and Kensuke Harada. Assembly Planning by Recognizing a Graphical Instruction Manual. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021.

[Paper-8] Yuxuan Wan, Kaichen Zhou, Jinhong Chen, and Hao Dong. SCANet: Correcting LEGO Assembly Errors with Self-Correct Assembly Network. In International Conference on Intelligent Robots and Systems (IROS), 2024.

Comment: Response to Reviewer qij6 [Part 2/3]

2024-11-22

Q2 The performance of this framework is affected by the quality of the imaginary assembled shape.

As explained in Q1 (B), imaginary assembled shape prediction is relatively well studied, so we make the reasonable assumption that a good imaginary assembled shape can be acquired, which is in line with the settings of other part assembly studies [Paper 6–8]. We have explicitly added this clarification in Appendix F of our revised paper.

Q3.1 Although the results show that the performance of this framework surpasses previous methods, they are not good enough.

Thank you for this question. The relatively low scores across all models and baselines stem primarily from the diverse and complex nature of our geometric shape assembly task. This task involves parts with highly varied fracture patterns across multiple categories, including some fractured parts that are nearly impossible to grasp or assemble. For instance, in certain cases, the graspable regions of a part completely overlap with its seam areas, making it extremely challenging to avoid collisions during assembly.

To provide a more detailed analysis of failure cases and illustrate the inherent difficulty of the task with scenarios that are particularly challenging for robots to figure out, we have revised Appendix E (Failure Cases). Additionally, we provide insights into potential future improvements to address these complexities more effectively:

Hard to Grasp:

(1). Heavy or Smooth-Surfaced Parts. Fractured parts that are heavy or have smooth surfaces often result in grasping failures. For instance, as shown in Figure 7(a) in Appendix E, categories such as teapots and vases, which are relatively large and feature smooth curved surfaces, exhibit notably high failure rates during grasping.

(2). Flat Parts. Flat fractured parts, particularly some shapes in categories like statues and mugs, are challenging to pick up due to the limited gripping area. For example, as shown in Figure 7(b) in Appendix E, the statue part on the left is too close to the desktop and has a very small thickness, which prevents the gripper from grasping it. Similarly, in (c), the handle fragment on the right is too flat, making it impossible for the gripper to grasp it. A potential solution is incorporating pre-grasp operations, such as moving the fractured part to the table edge, allowing the shape to hang off slightly and thus become graspable.

Hard to Assemble:

(3). Graspable Regions Overlapping Seam Areas. When the graspable regions of a fractured part align with its seam areas, collisions during assembly become frequent. This issue is common in categories such as wineglasses, mugs, and bowls. For example, as shown in Figure 7(d), the left gripper avoids collision-prone regions, but the right gripper must grasp the neck of the wine bottle. Similarly, in (e), while the left gripper avoids collisions, the right gripper ends up grasping the handle of a mug. A potential solution is to perform a series of pick-and-place operations to adjust the object's initial pose. This adjustment can reduce the overlap between the object's graspable regions and seam areas, thereby minimizing collisions during the assembly process.

(4). Complex Object Shapes. Objects with intricate shapes, like those in the statues category, pose challenges due to irregular edges and complex curves. Such designs increase the difficulty of alignment and manipulation, leading to higher failure rates during assembly.

(5). Relative Displacement During Operations. Relative displacement between the gripper and fractured parts often occurs due to small contact areas and insufficient support, which can cause sliding or tipping during manipulation. For example, wine bottles with narrow necks have an unstable center of gravity, which makes the gripper prone to sliding during movement and leads to operational failures.

Q3.2 There are no quantitative experimental results available for real-world experiments.

Thank you for this valuable suggestion. In our real-world experiments, we tested each object category with 10 trials, varying the initial poses of the two fractured parts for each trial. Below, we report the success rates for different object categories:

| Object Category | Bowl | Mug | BeerBottle | WineGlass |
| --- | --- | --- | --- | --- |
| Success/Total | 3/10 | 2/10 | 3/10 | 2/10 |

The mug has a relatively low success rate due to its small diameter. If the mug handle faces downward and becomes ungraspable, the gripper must grasp the top edge of the mug. This leads to collisions during the assembly process when both grippers grasp the top edges of the fractured parts. The wineglass has a low success rate because its glass surface is prone to slipping. Even when the gripper successfully grasps the wineglass, it may slide or tip during manipulation, resulting in assembly failures.

Comment: Response to Reviewer qij6 [Part 1/3]

2024-11-22

We sincerely appreciate the time and effort you have dedicated to reviewing our paper. Your constructive feedback and thoughtful suggestions have been invaluable, and we have addressed all your questions below.

Q1 The multi-stage framework involves some assumptions, such as the object having two broken parts, the imaginary assembled shape being obtainable in advance, and the robot needing to follow the alignment and assembly process.

Thank you for this valuable comment. Below, we address each concern:

(A) Handling multiple broken parts.

Our method is indeed able to handle multiple fragments, and we have conducted experiments to validate this extension. Below, we provide a detailed explanation of how our method can be adapted for multi-fragment assembly, followed by the experimental results.

The multi-fragment assembly task can be achieved by iteratively applying the two-fragment assembly process. First, at each iteration, we can identify which two fragments, $p_i$ and $p_j$, should be assembled next. (If some parts have already been assembled in previous iterations, their combination is treated as a new fragment.) Specifically, based on the imaginary assembled shape $S$, we can calculate the minimum distance, $\min \| p_i - p_j \|$, between sampled points from every pair of fragments, and the pair $(p_i, p_j)$ with the minimum distance is chosen for assembly: $(p_i, p_j) = \underset{(p_i, p_j) \in \mathcal{S}_1 \times \mathcal{S}_2}{\arg\min} \ \| p_i - p_j \|$. Once $p_i$ and $p_j$ are identified on $S$, we then map these fragments to their corresponding parts in the observed point cloud $O$. This mapping is formulated as a classification task, where the similarity between parts in $S$ and $O$ is estimated.

Finally, using the imaginary assembled shape of the selected fragments $S_{p_i} \cup S_{p_j}$, and the corresponding observed point cloud $O_{p_i} \cup O_{p_j}$, our method predicts the actions to pick up and assemble the fragments. This process mirrors the steps of the standard two-fragment assembly method. By iteratively applying this two-fragment assembly process, the complete assembly of all fragments can be achieved.
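The iterative procedure described above can be sketched as a greedy merge loop. This is only an illustration under assumptions: the function names, the dictionary-of-point-lists representation, and the `+` naming of merged fragments are hypothetical, and both the mapping from $S$ to the observed point cloud $O$ and the actual two-fragment manipulation policy are replaced by a placeholder comment.

```python
import math
from itertools import combinations


def gap(frag_a, frag_b):
    """Smallest Euclidean distance between any point of frag_a and any point of frag_b."""
    return min(math.dist(p, q) for p in frag_a for q in frag_b)


def closest_pair(pool):
    """Names of the two fragments in `pool` whose sampled points come closest."""
    return min(combinations(sorted(pool), 2),
               key=lambda pair: gap(pool[pair[0]], pool[pair[1]]))


def assembly_order(fragments):
    """Greedy merge order: repeatedly join the closest pair until one piece remains."""
    pool = {name: list(points) for name, points in fragments.items()}
    order = []
    while len(pool) > 1:
        i, j = closest_pair(pool)
        order.append((i, j))
        # Placeholder: the two-fragment pick-up, alignment, and assembly step
        # would run here. The merged result is then treated as a new fragment.
        pool[f"{i}+{j}"] = pool.pop(i) + pool.pop(j)
    return order


# Toy example with one sampled point per fragment.
fragments = {
    "a": [(0.0, 0.0, 0.0)],
    "b": [(0.0, 0.0, 1.0)],
    "c": [(0.0, 0.0, 5.0)],
}
print(assembly_order(fragments))  # [('a', 'b'), ('a+b', 'c')]
```

The brute-force distance computation is quadratic in the number of sampled points per pair, which is acceptable for a sketch; a spatial index would be a natural substitute for dense point clouds.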

To validate the feasibility of this multi-fragment assembly process, we evaluated our pretrained BiAssembly model on beer bottles broken into three pieces, without any fine-tuning. We provide visualizations of the predicted affordance maps and actions in Figure 8 in Appendix F.1. For the multi-fragment assembly task, our method can still predict reasonable results in each iteration.

While the method proposed above is a practical approach for assembling multi-part fractures, another potential strategy is training the Affordance Network to identify which two fragments are easiest to assemble in each iteration. In this alternative, the Affordance Network would assign high affordance scores to the reasonable regions of these fragments, while predicting low affordance scores for the fragments that are not being assembled in the current iteration. Implementing this strategy would require additional data collection for training and modifications to the framework. We leave this exploration for future work.

(B) The imaginary assembled shape.

Predicting the imaginary assembled shape from multiple fractured parts is a well-studied vision problem [Paper 1–5]. Previous works have demonstrated the ability to predict precise fragment poses that allow for an imaginary assembled shape, making it reasonable to assume the existence of such shapes in our framework. Additionally, in traditional furniture assembly tasks, several studies [Paper 6–8] also assume the existence of an imaginary assembled shape as part of their formulation. Therefore, given the advancements in prior works, this assumption is reasonable.

(C) Alignment and assembly process.

The alignment and assembly process mirrors the natural approach humans take when assembling fragments. Humans typically align the fragments along the seams first and then gradually move them together for precise fitting. Furthermore, when decomposing the assembly process into multiple frames, there is usually a stage where the two fragments are aligned but separated by a small distance. This intermediate step is captured in our formulation as the alignment step, which generalizes well to most shape assembly scenarios.

Although our assumptions are reasonable for most shape assembly tasks, we acknowledge that, as the first work tackling the challenging task of robotic shape assembly, our method may face limitations in certain scenarios. We leave these challenges for future exploration and improvement. The above discussions are also elaborated in the Conclusion section and Appendix F of our revised paper.

Comment: Official Comment by Reviewer qij6

2024-11-25

Thank you for the response; some of my concerns have been addressed. Although I believe that this work has not fully tackled the shape assembly task, I acknowledge its contributions. Therefore, I raise my rating to "marginally above the acceptance threshold".

I still maintain my viewpoint that "the performance of this framework is affected by the quality of the imaginary assembled shape, which may be more difficult to achieve than the subsequent processes." I disagree with the statement that "the assembled shape prediction is relatively well-studied." I believe that while the framework may perform well on the testing datasets, it struggles to generalize to the real world, which consists of unseen objects or categories. I would also argue that determining the final assembly pose is more important than planning how to reach that goal.

Therefore, I think it is very necessary to discuss the performance of the proposed cascaded system under different qualities of the imaginary assembled shape, to answer the question: Is your system robust to this cumulative error? If not, it is a promising direction to consider incorporating the imaginary assembled shape error into your system.

Comment: Response to Reviewer qij6

2024-11-26

We sincerely thank you for your valuable suggestions and positive feedback. We greatly appreciate your acknowledgment of our contributions and fully agree that "it is a promising direction to consider incorporating the imaginary assembled shape error into the system."

While prior works have extensively studied how to predict the imaginary assembled shape, and our simulation results demonstrate the effectiveness and potential of our system, we acknowledge that the quantitative results from real-world experiments reveal areas for improvement. This observation suggests that incorporating the imaginary assembled shape error into our system could enhance its robustness. By addressing this, our system could leverage advancements in the upstream vision task (i.e., the imaginary assembled shape prediction) while also enhancing its ability to handle accumulated errors.

Once again, thank you for your insightful suggestions. We believe the robotic shape assembly task holds significant potential and still has considerable room for further development. We will continue to explore this direction in our future work.

2024-11-25

Dear Reviewer,

Please provide feedback to the authors before the end of the discussion period, and in case of additional concerns, give them a chance to respond.

Timeline: As a reminder, the review timeline is as follows:

November 26: Last day for reviewers to ask questions to authors.

November 27: Last day for authors to respond to reviewers.

Review

Rating: 6; Confidence: 3; Date: 2024-11-03

This paper addresses the task of geometric assembly, which is a long-horizon task requiring pick-up, alignment, and assembly. The paper tackles this task by predicting collaborative affordance and gripper actions for bimanual geometric shape assembly. A real-world benchmark for re-assembling broken parts is created. Extensive evaluations demonstrate the effectiveness of the approach and show generalizability to unseen object categories.

Strengths

The paper addresses a useful task that has been under-explored in previous robotics works, and provides an effective approach to solve this challenging task.

A real-world benchmark on geometric assembly is created, which paves the way for future research in this direction.

Thorough evaluations in both sim and real are carried out to demonstrate the effectiveness of the approach. The model is generalizable to shapes from unseen categories.

Weaknesses

For real-world experiments, only qualitative results are presented; there is a lack of quantitative results on more object shapes and of comparisons to other baselines. There is also a lack of more detailed sim2real transfer analysis, for example, comparing the results on the exact same set of shapes in simulation and the real world.

The paper only includes one ablation study, w/o SE(3); however, the approach is a combination of multiple components, and more ablations would be helpful to better understand the effect of each component.

The task setup only considers objects with two fragments, however, in reality there could be an arbitrary number of fragments, but the proposed model cannot generalize to different numbers of parts.

Questions

The provided website link seems broken?

Are evaluations in simulation carried out with floating grippers? It would be more realistic to control grippers mounted on bi-manual arms, as there could be singularity and arm-table collision issues that are not being taken into account with the floating grippers.

How would the accuracy of the pose estimator (lines 288-289) affect the performance? If the pose estimation is a bit off due to occlusions or sensor noise in the real world, would the model be robust to it and still manage to succeed?

Comment: Response to Reviewer b9vG [Part 2/3]

2024-11-22

Q3 The task setup only considers objects with two fragments, however, in reality there could be an arbitrary number of fragments.

Thank you for this insightful question. Our method can indeed be extended to handle multiple fragments, and we have conducted experiments to validate this extension. Below, we provide a detailed explanation of how our method can be adapted for multi-fragment assembly, followed by the experimental results.

The multi-fragment assembly task can be achieved by iteratively applying the two-fragment assembly process. First, at each iteration, we can identify which two fragments, $p_i$ and $p_j$, should be assembled next. (If some parts have already been assembled in previous iterations, their combination is treated as a new fragment.) Specifically, based on the imaginary assembled shape $S$, we can calculate the minimum distance, $\min \| p_i - p_j \|$, between sampled points from every pair of fragments, and the pair $(p_i, p_j)$ with the minimum distance is chosen for assembly: $(p_i, p_j) = \underset{(p_i, p_j) \in \mathcal{S}_1 \times \mathcal{S}_2}{\arg\min} \ \| p_i - p_j \|$. Once $p_i$ and $p_j$ are identified on $S$, we then map these fragments to their corresponding parts in the observed point cloud $O$. This mapping is formulated as a classification task, where the similarity between parts in $S$ and $O$ is estimated.

Finally, using the imaginary assembled shape of the selected fragments $S_{p_i} \cup S_{p_j}$, and the corresponding observed point cloud $O_{p_i} \cup O_{p_j}$, our method predicts the actions to pick up and assemble the fragments. This process mirrors the steps of the standard two-fragment assembly method. By iteratively applying this two-fragment assembly process, the complete assembly of all fragments can be achieved.
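The pair-selection and merging loop described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: fragments are assumed to be given as NumPy arrays of sampled points from the imaginary assembled shape, the function names (`min_pair_distance`, `closest_fragment_pair`, `assembly_order`) are hypothetical, and the mapping to the observed point cloud and the action prediction are omitted.

```python
import numpy as np


def min_pair_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Smallest Euclidean distance between any point of a and any point of b."""
    diff = a[:, None, :] - b[None, :, :]  # shape (Na, Nb, 3)
    return float(np.linalg.norm(diff, axis=-1).min())


def closest_fragment_pair(fragments: dict) -> tuple:
    """Return the keys of the two fragments whose sampled points are closest."""
    keys = list(fragments)
    best, best_d = None, float("inf")
    for x in range(len(keys)):
        for y in range(x + 1, len(keys)):
            d = min_pair_distance(fragments[keys[x]], fragments[keys[y]])
            if d < best_d:
                best, best_d = (keys[x], keys[y]), d
    return best


def assembly_order(fragments: dict) -> list:
    """Greedy sequence of two-fragment assemblies.

    After each step the two chosen fragments are merged and treated as a
    single new fragment, as described in the response above.
    """
    frags = dict(fragments)
    order = []
    while len(frags) > 1:
        i, j = closest_fragment_pair(frags)
        order.append((i, j))
        merged = np.concatenate([frags.pop(i), frags.pop(j)], axis=0)
        frags[f"{i}+{j}"] = merged
    return order
```

For three fragments, `assembly_order` returns two pairs: the closest pair first, then the merged result together with the remaining fragment. The brute-force distance computation is quadratic in the number of sampled points, which is acceptable for a sketch but would likely be replaced by a spatial index for dense point clouds.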

To validate the feasibility of this multi-fragment assembly process, we evaluated our pretrained BiAssemble model on broken beer bottles with three pieces, without any fine-tuning. We provide visualizations of the predicted affordance maps and actions in Figure 8 in Appendix F.1. For the multi-fragment assembly task, our method still predicts reasonable results in each iteration.

Q4 Broken website link

Sorry for this mistake. We have corrected the website link [https://sites.google.com/view/biassembly/] in the revised version of our paper.

Q5 Are evaluations in simulation carried out with floating grippers? It would be more realistic to control grippers mounted on bi-manual arms, as there could be singularity and arm-table collision issues that are not being taken into account with the floating grippers.

Thank you for this valuable suggestion. We agree that integrating control of grippers mounted on bimanual arms would make the setup more realistic. In our work, following previous works [Paper 1-4], we focus on learning the collaborative affordance for geometric shape assembly tasks, abstracting away the control of robot arms. While our real-world experiments show that the proposed actions can be applied to real robot arms in some scenarios with the help of motion planning in MoveIt! [Paper-5], we acknowledge that incorporating arm control would enhance the system's realism and improve accuracy. In future work, we plan to address these challenges, including arm singularities and collision issues, to further optimize the system. For example, we aim to integrate cuRobo [Paper-6] for collision-free motion generation for bimanual manipulators. We sincerely appreciate your suggestion and will consider these aspects in future developments.

Q6 How would the accuracy of the pose estimator (line 288-289) affect the performance?

As described in Equation (2) of our paper, $g_{i}^{asm} = g_{i}^{pick} \cdot q_{i}^{pick} \cdot {(q_{i}^{init})}^{-1} \cdot M^{-1}$, the pose estimator does not need to precisely predict the absolute object pose at each frame. Instead, it only needs to estimate the relative pose between two frames, i.e., $q_{i}^{pick} \cdot {(q_{i}^{init})}^{-1}$, which significantly simplifies the task for a pose estimation or pose tracking model. Additionally, the selected pose estimator, FoundationPose [Paper-7], is a state-of-the-art model for both pose estimation and pose tracking. It excels at predicting relative poses between consecutive frames during a continuous manipulation process. Consequently, we empirically observed that in most scenarios, even with occlusions (e.g., the gripper occluding the object after grasping) or sensor noise, the relative pose estimation remains accurate enough for our task.
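To make the role of the relative pose concrete, the sketch below transcribes Equation (2) literally using 4x4 homogeneous matrices. It rests on several assumptions that are not confirmed by the response: that all poses are SE(3) transforms composed by matrix multiplication in the order written, and that the helper names (`make_pose`, `relative_pose`, `assembly_gripper_pose`) are purely illustrative. It also shows one simple error model, a constant offset applied on the right of both pose estimates, under which the error cancels in the relative pose; this says nothing about how FoundationPose actually behaves.

```python
import numpy as np


def make_pose(rot_z_deg: float, translation) -> np.ndarray:
    """Build a 4x4 homogeneous transform: rotation about z plus a translation."""
    t = np.deg2rad(rot_z_deg)
    pose = np.eye(4)
    pose[:3, :3] = [
        [np.cos(t), -np.sin(t), 0.0],
        [np.sin(t), np.cos(t), 0.0],
        [0.0, 0.0, 1.0],
    ]
    pose[:3, 3] = translation
    return pose


def relative_pose(q_pick: np.ndarray, q_init: np.ndarray) -> np.ndarray:
    """Relative object pose between two frames: q_pick * inv(q_init)."""
    return q_pick @ np.linalg.inv(q_init)


def assembly_gripper_pose(g_pick, q_pick, q_init, M) -> np.ndarray:
    """Literal transcription of Equation (2) as quoted above (assumed convention)."""
    return g_pick @ relative_pose(q_pick, q_init) @ np.linalg.inv(M)
```

Under this convention, if both `q_pick` and `q_init` are replaced by `q_pick @ E` and `q_init @ E` for the same offset `E`, the product `q_pick @ E @ inv(E) @ inv(q_init)` reduces to the original relative pose, so the predicted assembly pose is unchanged. Errors that differ between the two frames do not cancel in this way.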

Comment - Response to Reviewer b9vG [Part1/3]

2024-11-22

Thank you for your thorough review of our paper. We greatly value your constructive feedback and insightful suggestions, which we have carefully addressed in our responses below. We have also highlighted all changes in red in the revised paper.

Q1 For real-world experiments, there is a lack of quantitative results... Lack of more detailed sim2real transfer analysis, for example, comparing the results of an exact same set of shapes in simulation and the real world.

Thank you for this valuable suggestion. In our real-world experiments, we tested each object category with 10 trials, varying the initial poses of the two fractured parts for each trial. Below, we report the success rates for different object categories:

Object Category	Bowl	Mug	BeerBottle	WineGlass
Success/Total	3/10	2/10	3/10	2/10

The mug has a relatively low success rate due to its small diameter. If the mug handle faces downward and becomes ungraspable, the gripper must grasp the top edge of the mug. This leads to collisions during the assembly process when both grippers grasp the top edges of the fractured parts. The wineglass has a low success rate because its smooth glass surface is prone to slipping. Even when the gripper successfully grasps the wineglass, it may slide or tip during manipulation, resulting in assembly failures.
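For reference, the counts in the table above can be converted to per-category and pooled success rates with a few lines of arithmetic. The snippet below is only a tally of the reported numbers; the pooled rate is derived here and is not a figure stated by the authors.

```python
# Success counts and trial counts as reported in the table above.
results = {
    "Bowl": (3, 10),
    "Mug": (2, 10),
    "BeerBottle": (3, 10),
    "WineGlass": (2, 10),
}

# Per-category success rate.
per_category = {name: s / n for name, (s, n) in results.items()}

# Pooled success rate over all real-world trials.
total_success = sum(s for s, _ in results.values())
total_trials = sum(n for _, n in results.values())
overall = total_success / total_trials

for name, rate in per_category.items():
    print(f"{name}: {rate:.0%}")
print(f"Overall: {total_success}/{total_trials} = {overall:.0%}")
```

With 10 trials per category, each individual success changes a category's rate by 10 percentage points, so differences between categories should be read with that granularity in mind.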

For the sim2real transfer analysis, we loaded the real object meshes, which were acquired with 3D scanning (provided in our real-world benchmark), into the simulation environment. We observed that the results in simulation were better than those in the real world. This discrepancy arises because, in the real world, the robot arms are more prone to reaching joint limits. For instance, when attempting to pick up a bowl lying flat on a table, the gripper in simulation can move along a path parallel and very close to the table surface. However, in the real-world setup, the robot arm often encounters joint limits that prevent it from achieving the same movement, leading to failure in such trials. This comparison highlights the importance of incorporating bimanual arm joint constraints into our simulation framework to better reflect real-world scenarios and improve transferability.

Q2 The approach is a combination of multiple components and more ablations would be helpful.

We have conducted additional ablation studies, with detailed quantitative results provided in Table 4 and Table 5 in Appendix G. The ablations are as follows:

(1) w/o Affordance Network: During inference, we do not use the trained Affordance Network to highlight actionable regions. Instead, we randomly sample a contact point on the part. The results show a significant drop in success rates, which decrease to 4.60% on training categories and 2.80% on unseen categories. This demonstrates that the Affordance Network plays a crucial role in filtering out non-graspable points and points that are unsuitable for the subsequent assembly process.

(2) w/o Transformation Predictor: In this ablation, we remove the Transformation Predictor during inference. This results in success rates of 7.40% on training categories and 4.80% on unseen categories, both substantially lower than our original method. These results show that the Transformation Predictor plays an essential role in predicting alignment poses, enabling the robot to manipulate parts from their initial to alignment poses without collisions.

(3) w/ heuristic $v$ : In this case, we remove the Disassembly Predictor during inference. Instead, we compute the center of each part from the imaginary assembled shape $S$ by averaging the part points, and then use the relative direction of the two parts' centers as the disassembly direction $v$ . This ablation achieves success rates of 19.70% on training categories and 15.20% on unseen categories, which are lower than those of our method. The results indicate that while the calculated relative direction can approximate the relative position of the two parts, it is not sufficiently accurate to replace the assembly direction required in our task, highlighting the importance of the Disassembly Predictor for better performance.

More detailed scores including per-category accuracy can be found in Table 4 and Table 5 in Appendix G.
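The heuristic used in ablation (3) can be sketched as below. This is a minimal illustration, not the authors' code: the parts are assumed to be NumPy point arrays taken from the imaginary assembled shape, the function name `heuristic_disassembly_direction` is hypothetical, and the sign convention (pointing from the first part's center toward the second's) is an assumption.

```python
import numpy as np


def heuristic_disassembly_direction(part_a: np.ndarray, part_b: np.ndarray) -> np.ndarray:
    """Unit vector from the center of part_a to the center of part_b.

    Each part is an (N, 3) array of points; the center is the mean of the points.
    """
    center_a = part_a.mean(axis=0)
    center_b = part_b.mean(axis=0)
    v = center_b - center_a
    norm = np.linalg.norm(v)
    if norm < 1e-9:
        raise ValueError("part centers coincide; direction is undefined")
    return v / norm
```

Because the centers depend only on how mass is distributed within each part, this direction ignores the local geometry of the fracture surface, which is consistent with the authors' observation that it only approximates the direction needed for assembly.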

Comment - Response to Reviewer b9vG [Part3/3]

2024-11-22

[Paper-1] Kaichun Mo, Leonidas Guibas, Mustafa Mukadam, Abhinav Gupta, and Shubham Tulsiani. Where2act: From pixels to actions for articulated 3d objects. In International Conference on Computer Vision (ICCV), 2021.

[Paper-2] Yan Zhao, Ruihai Wu, Zhehuan Chen, Yourong Zhang, Qingnan Fan, Kaichun Mo, and Hao Dong. Dualafford: Learning collaborative visual affordance for dual-gripper manipulation. In International Conference on Learning Representations (ICLR), 2023.

[Paper-3] Ben Eisner, Harry Zhang, and David Held. Flowbot3d: Learning 3d articulation flow to manipulate articulated objects. In Robotics: Science and Systems (RSS), 2022.

[Paper-4] Zhenjia Xu, Zhanpeng He, and Shuran Song. UMPNet: Universal manipulation policy network for articulated objects. In IEEE Robotics and Automation Letters (RAL), 2022.

[Paper-5] Sachin Chitta, Ioan Sucan, and Steve Cousins. MoveIt! IEEE Robotics & Automation Magazine, 19(1):18–19, 2012.

[Paper-6] Balakumar Sundaralingam, Siva Kumar Sastry Hari, Adam Fishman, Caelan Garrett, Karl Van Wyk, Valts Blukis, Alexander Millane, Helen Oleynikova, Ankur Handa, Fabio Ramos, Nathan Ratliff, and Dieter Fox. CuRobo: Parallelized collision-free minimum-jerk robot motion generation. arXiv preprint arXiv:2310.17274, 2023.

[Paper-7] Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. FoundationPose: Unified 6D pose estimation and tracking of novel objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17868–17879, 2024.

2024-11-25

Dear Reviewer,

Please provide feedback to the authors before the end of the discussion period, and in case of additional concerns, give them a chance to respond.

Timeline: As a reminder, the review timeline is as follows:

November 26: Last day for reviewers to ask questions to authors.

November 27: Last day for authors to respond to reviewers.

2024-11-30

Dear Reviewer,

Given that the discussion phase is quickly passing, we would like to know if our response has addressed your concerns. If you have any further questions or suggestions, we would be more than happy to continue the discussion. Thank you again for your constructive feedback, and we look forward to hearing from you.

Review

Rating: 8 | Confidence: 3 | 2024-11-03

This work presents BiAssemble, a framework designed for bimanual robotic manipulation of fractured geometric shapes. The framework utilizes affordance learning to tackle complex long-horizon tasks involving multiple steps, including grasping, alignment, and final assembly. A disassembly prediction determines feasible disassembly directions and a bimanual affordance prediction enhances action planning for assembly. Results suggest significant improvements over baseline methods in both simulation and real-world experiments.

Strengths

The task itself is very novel.
The paper addresses the challenging domain of geometric assembly with fractured parts, using a combination of affordance learning and collaborative action prediction. The proposed method is made more valuable by its support for bimanual coordination and multi-step processes.
The real-world benchmark offers a strong foundation for evaluating geometric assembly tasks, with a range of fractured objects and reproducible environments.
Affordance learning makes a lot of sense.

Weaknesses

Lacks a robust analysis of failure cases, which would provide insights into the system’s limitations and areas for improvement in real-world scenarios.
- Specifically, consider adding a categorization of different types of failures, quantitative analysis of failure rates in different scenarios, or discussion of specific challenging cases. One example: is there a specific type of object that your policy fails to generalize to? or if there's ambiguity, how does the failure look?

Questions

How does the method handle symmetry, for example when the fracture is a vertical cut? Maybe include an analysis or experiment specifically examining performance on symmetrical fractures, if you haven't already done so.
How does the method compare with RL-based methods? I'd suspect that reward hacking could work for this task. Maybe discuss why you chose the current approach over RL methods.
The website does not work: the link redirects to a 404.

Comment - Response to Reviewer XrtY [Part3/3]

2024-11-22

[Paper-1] Kaichun Mo, Leonidas Guibas, Mustafa Mukadam, Abhinav Gupta, and Shubham Tulsiani. Where2act: From pixels to actions for articulated 3d objects. In International Conference on Computer Vision (ICCV), 2021.

[Paper-2] Yan Zhao, Ruihai Wu, Zhehuan Chen, Yourong Zhang, Qingnan Fan, Kaichun Mo, and Hao Dong. Dualafford: Learning collaborative visual affordance for dual-gripper manipulation. In International Conference on Learning Representations (ICLR), 2023.

[Paper-3] Hao-Shu Fang, Chenxi Wang, Hongjie Fang, Minghao Guo, Jirong Liu, Hengxu Yan, Wenhai Liu, Yichen Xie, and Cewu Lu. AnyGrasp: Robust and Efficient Grasp Perception in Spatial and Temporal Domains. In IEEE Transactions on Robotics, 2023.

[Paper-4] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications. In Proceedings of the 35th International Conference on Machine Learning.

Comment - Response to Reviewer XrtY [Part2/3]

2024-11-22

Q2 How does the method handle symmetry?

Thank you for this insightful question! We will first explain how our method handles symmetry and then present the experimental results on symmetrical fractures.

Let us begin with non-symmetrical fractures. For these fractures, we assume the part mapping relationship between the imaginary assembled shape $S$ and the observed point cloud $O$ is known. This mapping is straightforward to determine, as it is a simple classification task that estimates the similarity between parts in $S$ and $O$. This mapping is illustrated in Figure 2 of our paper through the use of consistent color coding.

For symmetrical parts, such as $p_1$ and $p_2$, which are visually identical, it is correct for the classification model to predict either of the following mapping combinations: ($S_{p_1}$ <-> $O_{p_1}$, $S_{p_2}$ <-> $O_{p_2}$) or ($S_{p_1}$ <-> $O_{p_2}$, $S_{p_2}$ <-> $O_{p_1}$). Once the mapping relationship is established, our Transformation Predictor can accordingly predict the SE(3) transformation $M$ applied to the imaginary assembled shape $S$, to ensure no part collisions occur during the assembly process (e.g., avoiding scenarios where the left part is incorrectly moved to the right and vice versa).

In summary, whether the fractures are symmetrical or not, as long as the mapping relationship is established, our framework can successfully execute the assembly process.

To conduct experiments on symmetrical fractures, we generated new data, since the BreakingBad dataset [Paper-1] does not contain symmetrical parts. Specifically, we randomly selected three bowls from the ShapeNet dataset [Paper-2] and used ZBrush to create a vertical plane along the central axis of each bowl, followed by a Boolean operation to cut the bowls into two symmetrical parts. For each trial in our experiment, we randomly selected a pair of bowl fractures and initialized their poses randomly. After 100 trials, the accuracy for symmetrical fractures is 10%, which is consistent with the accuracy reported for bowls in our paper. The low accuracy for bowls is primarily due to the challenges in grasping. When the bowl fracture is initialized in an overturned or rotated position with the seam facing upward, it becomes nearly impossible for the grippers to find grasp points that are not on the seam, leading to collisions during the assembly process. We also provide visualizations of the predicted affordances and actions for the symmetrical experiments on our website [https://sites.google.com/view/biassembly/].

Q3 How does the method compare with RL-based methods?

Thank you for this suggestion. The main reason we chose the current approach over RL-based methods is the diverse and complex nature of our geometric shape assembly task, which involves parts with varying fracture patterns across multiple categories. Previous affordance-based works [Paper 1–3] have demonstrated strong effectiveness and generalization capabilities of visual affordances in such scenarios. In contrast, RL-based methods are typically trained in a per-category manner and require category-specific reward engineering, making it challenging for them to scale across the wide variety of shapes and categories in our task.

We trained an RL baseline using the SAC algorithm [Paper 4]. The state representation included the grippers' poses, the shapes' poses, and features encoded by a PointNet++ encoder. The reward structure was designed to provide positive rewards for object contact, successful pick-up, alignment of the two shapes, and successful assembly. However, we observed very few successful attempts. One reason for this is the low sample efficiency of RL, which makes it difficult to sample positive manipulations during exploration. Additionally, even when the RL agent successfully picks up a shape, the learned experience is not easily transferable to subsequent trials, as the shape geometry changes in new episodes. These challenges highlight the limitations of RL-based methods for this task, reinforcing the suitability of our affordance-based approach.
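A staged reward of the kind described above might look like the following sketch. The response does not give the actual reward values or the exact success checks, so the weights in `REWARD_WEIGHTS`, the `StepInfo` fields, and the function name `shaped_reward` are all hypothetical placeholders chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    """Per-step task flags; how each flag is computed is left unspecified here."""
    in_contact: bool
    picked_up: bool
    aligned: bool
    assembled: bool


# Hypothetical weights: later stages are rewarded more than earlier ones.
REWARD_WEIGHTS = {
    "contact": 0.1,
    "pick_up": 0.5,
    "aligned": 1.0,
    "assembled": 5.0,
}


def shaped_reward(info: StepInfo) -> float:
    """Sum of positive rewards for each stage reached in this step."""
    reward = 0.0
    if info.in_contact:
        reward += REWARD_WEIGHTS["contact"]
    if info.picked_up:
        reward += REWARD_WEIGHTS["pick_up"]
    if info.aligned:
        reward += REWARD_WEIGHTS["aligned"]
    if info.assembled:
        reward += REWARD_WEIGHTS["assembled"]
    return reward
```

Since each stage is only reachable after the previous one succeeds, an agent exploring at random rarely receives anything beyond the contact term, which is one way to read the low sample efficiency the authors report.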

Q4 Website link redirects to 404.

Sorry for this mistake. We have corrected the website link [https://sites.google.com/view/biassembly/] in the revised version of our paper.

Comment - Response to Reviewer XrtY [Part1/3]

2024-11-22

We sincerely appreciate your positive feedback and valuable suggestions for enhancing our work. We have carefully addressed all your questions and comments in the following responses, with all changes marked in Red in the revised paper.

Q1 A robust analysis of failure cases.

Thank you for this valuable suggestion. We have revised Appendix E (Failure Cases) to include more detailed categorizations of failure types and in-depth analysis. This revision highlights the system's limitations and provides insights for future improvements. Below, we summarize the key failure modes observed:

Hard to Grasp:

(1). Heavy or Smooth-Surfaced Parts. Fractured parts that are heavy or have smooth surfaces often result in grasping failures. For instance, as shown in Figure 7(a) in Appendix E, categories such as teapots and vases, which are relatively large and feature smooth curved surfaces, exhibit notably high failure rates during grasping.

(2). Flat Parts. Flat fractured parts, particularly some shapes in categories like statues and mugs, are challenging to pick up due to the limited gripping area. For example, as shown in Figure 7(b) in Appendix E, the statue part on the left is too close to the desktop and has a very small thickness, which prevents the gripper from grasping it. Similarly, in (c), the handle fragment on the right is too flat, making it impossible for the gripper to grasp it. A potential solution is incorporating pre-grasp operations, such as moving the fractured part to the table edge, allowing the shape to hang off slightly and thus become graspable.

Hard to Assemble:

(3). Graspable Regions Overlapping Seam Areas. When the graspable regions of a fractured part align with its seam areas, collisions during assembly become frequent. This issue is common in categories such as wineglasses, mugs, and bowls. For example, as shown in Figure 7(d), the left gripper avoids collision-prone regions, but the right gripper must grasp the neck of the wine bottle. Similarly, in (e), while the left gripper avoids collisions, the right gripper ends up grasping the handle of a mug. A potential solution is to perform a series of pick-and-place operations to adjust the object's initial pose. This adjustment can reduce the overlap between the object's graspable regions and seam areas, thereby minimizing collisions during the assembly process.

(4). Complex Object Shapes. Objects with intricate shapes, like those in the statues category, pose challenges due to irregular edges and complex curves. Such designs increase the difficulty of alignment and manipulation, leading to higher failure rates during assembly.

(5). Relative Displacement During Operations. Relative displacement between the gripper and fractured parts often occurs due to small contact areas and insufficient support, which can cause sliding or tipping during manipulation. For example, wine bottles with narrow necks have an unstable center of gravity, which makes the gripper prone to sliding during movement and leads to operational failures.

2024-11-25

Dear Reviewer,

Please provide feedback to the authors before the end of the discussion period, and in case of additional concerns, give them a chance to respond.

Timeline: As a reminder, the review timeline is as follows:

November 26: Last day for reviewers to ask questions to authors.

November 27: Last day for authors to respond to reviewers.

Comment - Kindly Seeking Feedback from the Reviewer

2024-11-26

Given that the discussion phase is quickly passing, we would like to know if our response has addressed your concerns. If you have any further questions or suggestions, we would be more than happy to continue the discussion. Thank you again for your constructive feedback, and we look forward to hearing from you.

AC Meta-Review

2024-12-23

This paper proposes a novel framework for learning collaborative affordance in bimanual geometric assembly. The task involves assembling fractured parts into complete objects, which requires precise coordination, geometric reasoning, and long-horizon planning. The authors present a multi-component pipeline integrating disassembly prediction, transformation prediction for alignment poses, and a collaborative affordance predictor. They further introduce a real-world benchmark for evaluating fractured object assembly and validate their approach across diverse object categories in both simulated and real-world environments.

Strengths:

The paper addresses an underexplored but important problem in robotics and manipulation. The integration of collaborative affordance prediction with geometric reasoning demonstrates potential for advancing bimanual assembly tasks. The method is validated in simulated environments with diverse object geometries, showing promising results in controlled settings. The real-world benchmark for fractured object assembly, although preliminary, provides a starting point for evaluating approaches in this domain. The proposed ablations highlight the role of individual components, such as disassembly prediction and SE(3)-equivariant representations in the obtained performance.

Weaknesses:

Despite its strengths, the paper has significant limitations. The reported success rates in real-world experiments are notably low (20-30%), raising concerns about the robustness and reliability of the approach in practical applications. The method relies heavily on specific assumptions, such as the availability of an ideal "imaginary assembled shape" and the restriction to two-part assemblies, which limit its generalizability to more complex or real-world scenarios. Moreover, the limited scope of the two-part assembly tasks makes it difficult to realize extensions of the methodological framework to a broader set of tasks where bimanual collaboration and geometric reasoning are necessary, e.g., long-horizon rearrangement tasks with multimodal contacts.

Furthermore, the failure analysis provided in the rebuttal remains superficial and does not offer actionable insights into addressing core limitations, such as gripper precision or challenging object geometries. The scalability of the approach to multi-fragment assembly, while proposed as a conceptual extension, is not validated through experiments. Additionally, the baseline comparisons are limited, as the paper does not engage deeply with reinforcement learning-based methods or explore alternatives that might address symmetry and robustness issues.

Reasons for Rejection:

While the paper introduces a novel approach for bimanual assembly and demonstrates potential, the limitations in robustness, generalizability, and scalability of the method outweigh its contributions. The low real-world success rates and reliance on restrictive assumptions hinder the practical applicability of the method, and the rebuttal failed to adequately address these core concerns. Although the reviewers recognized the paper's ambition and novelty, the AC finds that the paper requires significant revisions, needs to address a broader set of tasks, and must demonstrate statistically significant results in real-world applications before being ready for publication at a high-impact venue like ICLR.

Additional Comments on Reviewer Discussion

During the discussion phase, reviewers acknowledged the paper's novelty and ambition but expressed consistent concerns about its limitations in generalizability, robustness, and scalability. Reviewer b9vG emphasized that the framework's reliance on strong assumptions—such as the availability of an ideal "imaginary assembled shape" and the restriction to two-part assemblies—significantly limited its applicability to more complex, real-world scenarios. Reviewer XrtY highlighted the lack of robust failure analysis and the absence of meaningful comparisons with reinforcement learning (RL)-based baselines, both of which are critical for a comprehensive evaluation of the method's contributions.

In their rebuttal, the authors provided additional ablations, categorized failure cases, and proposed a conceptual extension for multi-fragment assembly. While these efforts demonstrated an understanding of the concerns, they did not adequately address the core issues. The proposed extension for multi-fragment tasks remained theoretical and lacked experimental validation, leaving scalability concerns unresolved. Similarly, the failure analysis, while helpful in categorizing errors, did not provide actionable insights or detailed solutions to address the low real-world success rates (20-30%).

Weighing the reviewers' assessments and carefully evaluating the rebuttal, the Area Chair decided to recommend rejecting the paper. While the reviewers recognized the potential impact of the work, the unresolved issues (particularly the heavy reliance on assumptions, low robustness in real-world settings, and limited validation for scalability) indicate that the paper is not yet ready for acceptance at ICLR. This decision reflects the need for substantial revisions and broader validation to elevate the paper to the standards of a high-impact venue.

Final Decision: Reject

2025-01-22

Reject
