PaperHub

NeurIPS 2024 · Poster · 4 reviewers
Overall rating: 4.8/10 (individual scores 5, 5, 4, 5; min 4, max 5, std 0.4)
Confidence: 3.5 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 2.5

EfficientCAPER: An End-to-End Framework for Fast and Robust Category-Level Articulated Object Pose Estimation

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-12-26

Abstract

Keywords
Articulated Object; Pose Estimation; Joint-centric Pose Modeling

Reviews and Discussion

Review (Rating: 5)

This paper proposes EfficientCAPER, a novel method for category-level articulated object pose estimation from an input point cloud. The proposed method first estimates the 6D pose of the free part (or static part), then uses this estimated pose to transform the input point cloud into canonical space, and finally segments the parts and regresses the pose of each one. The authors experiment on both synthetic and real-world datasets, showing that the proposed method outperforms existing methods.
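For concreteness, here is a minimal sketch of the described two-stage data flow. The callables `stage1_net` and `stage2_net` are hypothetical placeholders for the paper's networks, not the authors' actual implementation:

```python
import numpy as np

def estimate_articulated_pose(points, stage1_net, stage2_net):
    """Two-stage flow: free part pose, canonicalization, then per-part
    segmentation and joint state regression (stage*_net are placeholders)."""
    # Stage 1: regress the free part's 6D pose from the observed cloud.
    R_free, t_free = stage1_net(points)          # (3, 3) and (3,)

    # Canonicalize: apply the inverse free-part pose. In row-vector form,
    # (p - t) @ R equals R^T (p - t) for each point p.
    points_canonical = (points - t_free) @ R_free

    # Stage 2: segment parts and regress joint parameters and states
    # for the constrained parts, all in canonical space.
    segmentation, joint_params, joint_states = stage2_net(points_canonical)
    return (R_free, t_free), segmentation, joint_params, joint_states
```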

Strengths

  • The proposed method is simple but efficient as shown in Table 2.

  • The idea of transforming the input point cloud into its canonical space is straightforward, yet it appears not to have been done in previous works.

  • The proposed method outperforms existing works on both synthetic and real-world datasets.

Weaknesses

  • The paper misses many important related works, such as CAPTRA and Ditto. While these methods do not work in exactly the same setting, they are relevant, and it is worth discussing the differences from the proposed method.

  • I found the paper not well-written, which makes it difficult to understand its true contribution.

  • In Table 3, the authors show results to prove the robustness of the proposed method against self-occlusion. The self-occlusion rates are sometimes very high, such as 80-100%. It would be helpful to know in which cases an object can be occluded at such a high rate, along with qualitative samples.

Questions

  • Please show some qualitative samples of objects being self-occluded at 80-100%.

Limitations

No, the authors did not explicitly mention the limitations of the proposed method.

Author Response

Thank you for taking the time to read this paper and for asking these meaningful questions. Below we respond in detail to your questions.

W1: The paper misses many important related works, such as CAPTRA and Ditto. While these methods do not work in exactly the same setting, they are relevant, and it is worth discussing the differences from the proposed method.

Our work addresses pose estimation for articulated objects, while CAPTRA [1] focuses on the tracking task. For translation, CAPTRA utilizes CoordinateNet to predict normalized coordinates, whereas our approach regresses it directly, which enhances efficiency. Ditto [2] is a model based on implicit neural representations that learns and reconstructs the 3D geometry of articulated objects from visual observations through interactive perception, but it does not perform pose estimation. For the joint parameter estimation module, Ditto implicitly decodes axis parameters and states from camera-space point clouds, whereas our work implicitly decodes the axis parameters but explicitly decodes the state in canonical space. We will incorporate a discussion of these and other significant related works in the revised version.

W2: I found the paper not well-written, which makes it difficult to understand its true contribution.

We summarized our contributions in the introduction, lines 71 to 80. In short, our key contributions are the pose canonicalization and the joint-centric articulation pose modeling method. Additionally, in the revised version we will add a concluding statement at the end of the method section to highlight the contributions of our approach.

W3&Q1: In Table 3, the authors show results to prove the robustness of the proposed method against self-occlusion. The self-occlusion rates are sometimes very high, such as 80-100%. It would be helpful to know in which cases an object can be occluded at such a high rate, along with qualitative samples.

We present cases with severe self-occlusion and qualitative results in Figure 3 of the attached PDF. The drawer category suffers from severe self-occlusion due to its complex structure and its small constrained parts. Thanks to our joint-centric pose modeling, our method effectively handles self-occlusion and remains robust across varying levels of it. Both qualitative and quantitative results therefore indicate that our method has the potential to achieve superior performance on objects with more complex structures.

[1] CAPTRA: CAtegory-level Pose Tracking for Rigid and Articulated Objects from Point Clouds. ICCV 2021.
[2] Ditto: Building Digital Twins of Articulated Objects from Interaction. CVPR 2022.

Review (Rating: 5)

The paper investigates category-level articulated object pose estimation with a new framework. The framework consists of two stages: the first estimates the pose of the free part using a decoupled rotation representation, and the second estimates the poses of the constrained parts by predicting the joint parameters and states as replacements. Experimental results on ArtImage, ReArtMix, and RobotArm demonstrate outstanding pose estimation performance for articulated objects.

Strengths

1. An end-to-end Category-level Articulated object Pose EstimatoR composed of two stages, for free parts and constrained parts.

2. Free part pose estimation with decoupled rotation and constrained part pose estimation with joint state replacement.

3. Experimental results demonstrate that the proposed method achieves lower object pose estimation error than other methods on ArtImage, ReArtMix, and RobotArm.

Weaknesses

1. The proposed method is called EfficientCAPER, which indicates that running efficiency is a key factor. However, this aspect is under-explained. For one thing, the reason why the proposed method achieves higher FPS than other methods should be explained more clearly; for another, a detailed discussion of running time should be added.

2. In Table 2, experiments on the effect of articulation pose canonicalization are conducted, but with different metrics (Joint State Error and 3D Scale Score). It is suggested to use the same metrics as Table 1 (Rotation Error, Translation Error (m), and 3D IoU), which would demonstrate the effect of articulation pose canonicalization more clearly.

3. The experimental results demonstrate obvious improvements in Tables 1, 3, and 4. What is the key contributor to these improvements? It is suggested to add more ablation experiments to show this.

Questions

1. What are the main differences and challenges of category-level articulated object pose estimation compared with category-level object pose estimation?

2. It would be interesting to know the performance of category-level object pose estimation methods (like GPV-Pose) on ArtImage, ReArtMix, and RobotArm.

Limitations

The authors have adequately addressed the limitations.

Author Response

Thank you for your review, and for your thoughtful comments and questions. They will certainly help improve the revised paper. Below we respond in detail to your questions.

W1: The proposed method is called EfficientCAPER, which indicates that running efficiency is a key factor. However, this aspect is under-explained. For one thing, the reason why the proposed method achieves higher FPS than other methods should be explained more clearly; for another, a detailed discussion of running time should be added.

The analysis of inference time is reported in Section 5.2. Further discussion: (1) compared to dense prediction methods like NOCS, we do not perform point-by-point predictions, which lowers computational cost and increases running speed; (2) our method is end-to-end, without any time-consuming post-processing such as in OMAD (which estimates a coarse pose and then uses a retrieval-based method to obtain refined results).

W2: In Table 2, experiments on the effect of articulation pose canonicalization are conducted, but with different metrics (Joint State Error and 3D Scale Score). It is suggested to use the same metrics as Table 1 (Rotation Error, Translation Error (m), and 3D IoU), which would demonstrate the effect of articulation pose canonicalization more clearly.

The joint state is the key target after canonicalization and is a more intuitive metric from our proposed joint-centric perspective. We additionally provide results for the constrained part using the same metrics as Table 1 of the main paper; they show that the canonicalization module employed in the second stage is beneficial for estimating the constrained part's pose.

| Category   | APC | Rotation Error (°) | Translation Error (m) | 3D IoU (%)       |
|------------|-----|--------------------|-----------------------|------------------|
| Laptop     | w/o | 5.1                | 0.073                 | 84.8             |
|            | w/  | 3.3                | 0.071                 | 87.9             |
| Eyeglasses | w/o | 8.8, 9.7           | 0.113, 0.094          | 78.4, 79.2       |
|            | w/  | 7.2, 5.8           | 0.109, 0.089          | 82.3, 84.3       |
| Dishwasher | w/o | 3.3                | 0.096                 | 79.3             |
|            | w/  | 2.6                | 0.085                 | 83.9             |
| Scissors   | w/o | 5.9                | 0.112                 | 69.5             |
|            | w/  | 5.1                | 0.099                 | 69.2             |
| Drawer     | w/o | 1.2, 1.2, 1.2      | 0.102, 0.124, 0.100   | 81.8, 80.5, 83.6 |
|            | w/  | 1.2, 1.2, 1.2      | 0.086, 0.090, 0.080   | 87.1, 87.1, 88.4 |

W3: The experimental results demonstrate obvious improvements in Tables 1, 3, and 4. What is the key contributor to these improvements? It is suggested to add more ablation experiments to show this.

As shown in Table 2 of the main paper, canonicalization contributes greatly to the framework, which is why we conducted the ablation study on this module; our pose canonicalization operation facilitates better learning of joint parameters and states. Additionally, we conducted an ablation study on articulation pose modeling: we use GPV-Pose to directly output the poses of all parts of the laptop category in a part-centric manner (i.e., treating each part of the articulated object as an independent rigid object). The results show that our joint-centric modeling is beneficial for estimating the constrained part.

| Category | Pose Modeling Method | Rotation Error (°) | Translation Error (m) | 3D IoU (%) |
|----------|----------------------|--------------------|-----------------------|------------|
| Laptop   | Part-centric         | 10.5, 12.7         | 0.205, 0.275          | 49.1, 46.3 |
| Laptop   | Joint-centric        | 2.3, 3.3           | 0.065, 0.071          | 88.2, 87.9 |

Furthermore, we explore the effect of the backbone. We conducted an ablation study on the laptop category, estimating the pose of the free part with MLP, 3D-GC, and HS-Encoder backbones respectively. The results in the table below show that HS-Encoder is the effective choice.

| Category | Backbone   | Rotation Error (°) | Translation Error (m) | 3D IoU (%) |
|----------|------------|--------------------|-----------------------|------------|
| Laptop   | MLP        | 20.5               | 0.203                 | 23.8       |
| Laptop   | 3D-GC      | 7.5                | 0.099                 | 76.3       |
| Laptop   | HS-Encoder | 4.8                | 0.073                 | 85.4       |

Q1: What are the main differences and challenges of category-level articulated object pose estimation compared with category-level object pose estimation?

Estimation complexity: a rigid object can be described by a single pose, giving its position and orientation in 3D space. Category-level articulated object pose estimation, in contrast, involves objects whose articulated parts move relative to each other while obeying kinematic constraints. This adds complexity: the task requires not only the overall object pose but also the poses of the articulated parts and the joint parameters.
Severe self-occlusion: articulation increases the dimensionality and variability of object configurations, which may cause self-occlusion. Severe self-occlusion leads to poor performance, especially when smaller parts are obscured by larger parts from certain camera views. As discussed in Section 4.2 of the main paper, the joint-centric pose modeling method can effectively alleviate this issue.
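To make the added complexity concrete, the following is a minimal illustrative sketch (names such as `rodrigues`, `pivot`, `axis`, and `theta` are generic, not the paper's notation) of how a constrained part's camera-space pose is composed from the free part's pose plus one revolute joint's parameters and state:

```python
import numpy as np

def rodrigues(axis, theta):
    """Rotation matrix for angle theta (rad) about a unit vector axis."""
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def constrained_part_pose(R_free, t_free, pivot, axis, theta):
    """Compose a constrained part's camera-space pose from the free part's
    pose (R_free, t_free) and a revolute joint (pivot, axis) at state theta."""
    R_joint = rodrigues(axis, theta)
    # In the free part's frame: rotate about the joint pivot by theta,
    # then map the result into camera space with the free part's pose.
    R = R_free @ R_joint
    t = R_free @ (pivot - R_joint @ pivot) + t_free
    return R, t
```

A single scalar theta (plus the shared joint parameters) thus replaces an independent 6D pose for each constrained part, which is exactly the extra structure a rigid-object estimator does not have to model.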

Q2: It would be interesting to know the performance of category-level object pose estimation methods (like GPV-Pose) on ArtImage, ReArtMix, and RobotArm.

We conducted this experiment (see the table under W3). Directly regressing poses with the part-centric modeling method yields inferior performance, since it does not take kinematic constraints into account.

Review (Rating: 4)

This paper introduces a new approach for estimating the pose of category-level articulated objects. The proposed framework, EfficientCAPER, aims to address the challenges of kinematic constraints, self-occlusion, and optimization requirements. The method eliminates the need for optimization post-processing by utilizing a joint-centric pose modeling mechanism. The approach is evaluated on three diverse datasets: ArtImage, ReArtMix, and RobotArm, demonstrating its effectiveness and generalization ability.

Strengths

  1. EfficientCAPER introduces a joint-centric articulation pose modeling mechanism, which differs from traditional part-centric approaches. This mechanism allows the learning of each part’s pose as a joint state representation, improving the network's ability to handle self-occlusion and kinematic constraints.
  2. The method splits the pose estimation process into two stages. First, it estimates the pose of the free part using a decoupled rotation representation. Then, it canonicalizes the input point cloud to estimate the poses of constrained parts by predicting joint parameters and states.
  3. EfficientCAPER eliminates the need for intermediate variable estimation or optimization procedures, making the pose estimation process more efficient and straightforward.

Weaknesses

  1. The method heavily relies on the accuracy of the free part pose estimation. Any error in this stage can propagate and affect the estimation of the constrained parts.
  2. Although the method is tested on three datasets, the datasets might not cover all possible variations and complexities of real-world articulated objects. This might limit the generalizability of the method to new, unseen categories.
  3. There is no clear definition of what constitutes a free part. The concept appears to be relative, and each part could potentially be considered a free part depending on the context (like the two parts of scissors). This ambiguity might complicate the learning process.
  4. The difference between a free part and a root part is not well-defined, and they seem to be essentially the same.
  5. It is not clear in the method section whether the free part is pre-defined, since in the ablations there are inputs with the free part only.

Questions

Please see the weaknesses regarding the free part.

Limitations

NA

Author Response

We appreciate that the reviewer understands and recognizes the contributions of this work. We address the main concerns as follows.

W1: The method heavily relies on the accuracy of the free part pose estimation. Any error in this stage can propagate and affect the estimation of the constrained parts.

As shown in Table 1, our method achieves satisfactory results in estimating the poses of both the free part and the constrained parts. While our method does exhibit cumulative error, even on datasets of high complexity such as RobotArm we achieve notably better results than the baseline approaches.

W2: Although the method is tested on three datasets, the datasets might not cover all possible variations and complexities of real-world articulated objects. This might limit the generalizability of the method to new, unseen categories.

We follow the assumption from A-NCSH that the number of rigid parts and the kinematic structure are constant for all objects in the same category, and the goal is to estimate the pose of unseen objects within the category. We consider objects in the physical world to generally fall into two categories: rigid objects and articulated objects. Articulated objects can be further classified into two types, those with rotational joints and those with prismatic joints. Our datasets already include both types of articulated objects, including highly complex ones like robot arms. Therefore, our method can generalize to both kinds of articulated objects within the same category.

W3: There is no clear definition of what constitutes a free part. The concept appears to be relative, and each part could potentially be considered a free part depending on the context (like the two parts of scissors). This ambiguity might complicate the learning process.

For the pose estimation of the free part in the first stage, we input all parts of the articulated object together, treating them as one rigid object without distinguishing between free and constrained parts. Since all the parts can rotate around each other interchangeably, for simplicity we pre-define the free part according to the joint location (this configuration is interchangeable); all objects within a category share the same setting.

For the pose estimation of the constrained parts in the second stage, we first canonicalize the point cloud via the prediction from stage 1 to align the free part. The network then learns the joint state, which is measured in canonical space relative to the rest state (please see Figure 1 in the attached PDF). Parts with obvious movement are considered constrained parts, whose poses are recovered through our joint-centric modeling.

W4: The difference between a free part and a root part is not well-defined, and they seem to be essentially the same.

Please see Figure 2 in the attached PDF, where we show the main difference between a free part and a root part.

There are two prevailing ways of defining the root part of an articulation structure:

Definition I: the root part is defined as the world, and a fixed joint connects the world to the base part (this definition is used in the PartNet-Mobility dataset from SAPIEN [1]). Under this definition, the root part is a virtual part that contains no physical geometry.

Definition II: the root part is directly defined as the base part, as in the AKB-48 dataset [2].

In this work, the free part is closer to Definition II from AKB-48, i.e., the physical base part of the articulation structure, with two differences: (1) we name each part type after its kinematic motion type, i.e., we assume a free joint accompanies the free part, so estimating every part's pose in the articulated object reduces to estimating joint parameters and a joint state; (2) "free" means the part can take an arbitrary position and rotation in 3D space, while "constrained" means the part's motion is limited relative to the free part's motion.

W5: It is not clear in the method section whether the free part is pre-defined, since in the ablations there are inputs with the free part only.

The free part and constrained parts constitute our proposed articulation modeling scheme, where the free part is pre-defined for each category. As stated in our work, there are two stages in solving articulation pose estimation: the first canonicalizes the object by predicting the free part's pose, and the second predicts the joint states of the constrained parts (it is clearly easier to predict the joint state in canonical space). We found that feeding only the free part to the first stage can cause symmetry problems that degrade canonicalization. We therefore provide an ablation study in A.5.1 showing that when the whole object is input, the free part pose is estimated more accurately and the canonicalization improves accordingly. Note that this ablation only investigates the effect of the input on the first stage.

[1] SAPIEN: A SimulAted Part-based Interactive ENvironment. CVPR 2020.

[2] AKB-48: A Large-scale Real-world Articulated Object Knowledge Base. CVPR 2022.

Comment

Thank you for the rebuttal and the detailed explanations. While I appreciate the clarifications, I still have concerns regarding the definition of the free part and how it is chosen for a category. Given these concerns, I tend to maintain my original rating. That said, I am OK with acceptance if the other reviewers strongly advocate for this paper.

Comment

Thanks for your reply. Here we give a clearer definition of the free part and how it is chosen for each category.

Definition: the free part in this work is a relative concept. We divide the K rigid parts of an articulated object into 1 free part and K-1 constrained parts. The free part undergoes an unconstrained rigid transformation in 3D space, while each constrained part can only move relative to the free part with a single degree of freedom.

For each category, we choose the part that is semantically the base as the free part, and this choice is uniform across all objects within the category.
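A minimal sketch of this decomposition as a data structure (field names are illustrative, not taken from the paper):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Joint:
    joint_type: str    # "revolute" or "prismatic"
    axis: np.ndarray   # unit direction in the free part's frame
    pivot: np.ndarray  # a point on the axis (used by revolute joints)
    state: float       # the constrained part's single degree of freedom

@dataclass
class ArticulatedPose:
    R_free: np.ndarray   # (3, 3) rotation of the free part in camera space
    t_free: np.ndarray   # (3,) translation of the free part
    joints: list[Joint]  # K - 1 entries, one per constrained part
```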

Thank you again for your time and effort. If our rebuttal does not fully address your concerns, we would be happy to discuss further. Your feedback is an important reference for improving the quality of our paper, and we value it greatly. We look forward to your reply.

Review (Rating: 5)

This paper introduces an end-to-end framework designed for category-level articulated object pose estimation. Specifically, it addresses the complexity of articulated objects by using a joint-centric approach that divides the task into two stages: estimating the pose of free parts and then the constrained parts. The method is evaluated on three datasets, demonstrating its effectiveness.

Strengths

1. The paper is well-organized and easy to follow.

2. It is interesting to estimate the category-level articulated object pose from the joint-centric perspective.

3. The proposed method is shown to be effective on different datasets.

Weaknesses

1. It is not clear to me how the free part is defined. Actually, “free” and “constrained” are defined relatively and can be swapped in some sense. Will this cause ambiguity in the estimation?

2. In the paper, it is claimed that the joint-centric modeling can fix the self-occlusion issues, but I didn’t fully understand how the proposed “Joint-Centric Articulation Pose Modeling“ can handle this effectively.

3.The framework takes a cascaded architecture, and I am wondering how the free part pose estimation in the first stage will affect the following constrained part estimation.

4. For the ArtImage dataset, the translation estimation of the proposed method is not stable, and in the “Scissors” case its performance is much worse than in other categories in terms of rotation and translation errors. More explanation is expected.

Questions

Please refer to the weakness part.

Limitations

The limitations are discussed in the supplementary material.

Author Response

Thank you for your valuable comments and your evaluation of our submission. Please find our responses below.

W1: Unclear definition of the free part, which is defined relative to the constrained parts and is interchangeable.

The "free" and "constrained" parts are indeed interchangeable, however, this does not cause ambiguity in the estimation process. Concretely, for the pose estimation of the free part, all parts of the articulated object will be input together, treating them as a rigid object without distinguishing between the free and constrained parts. Since all the parts can rotate around each other interchangeably, for simplicity, we pre-define the free part according to the joint location(this configuration is interchangeable).

In the second stage, we first perform pose canonicalization via the prediction from stage 1 to align the free part. The network then learns the joint state, which is measured in canonical space relative to the rest state. Parts with obvious movement are considered constrained parts, whose poses are recovered through our joint-centric modeling (please see Figure 1 in the attached PDF).

W2: It is claimed that the joint-centric modeling can fix the self-occlusion issues, but I didn’t fully understand how the proposed “Joint-Centric Articulation Pose Modeling“ can handle this effectively.

In the proposed joint-centric modeling, the degrees of freedom of a constrained part are reduced from 3D to 1D (due to the kinematic constraint imposed by the joint), and this 1D constraint is captured by the joint state. Therefore, even under severe self-occlusion, the network can still learn the joint state from few points; regressing 1D state information naturally performs better than regressing 3D directional information, achieving superior performance.
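As a hedged illustration of this 1D parameterization (function and argument names are ours, not the paper's), the sketch below shows that once the joint parameters are known, a single scalar fully determines the constrained part's motion from its rest configuration:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def apply_joint_state(points, pivot, axis, state, joint_type):
    """Move a constrained part's points from the rest configuration by a
    single scalar joint state, given known joint parameters."""
    axis = axis / np.linalg.norm(axis)
    if joint_type == "revolute":               # state: angle in radians
        R = Rotation.from_rotvec(state * axis).as_matrix()
        return (points - pivot) @ R.T + pivot  # rotate about the pivot
    if joint_type == "prismatic":              # state: offset along the axis
        return points + state * axis
    raise ValueError(f"unknown joint type: {joint_type}")
```

Under occlusion, the network therefore only has to recover `state`, one scalar per constrained part, rather than a full per-part rotation and translation.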

W3: The framework takes a cascaded architecture, and I am wondering how the free part pose estimation in the first stage will affect the following constrained part estimation.

As mentioned in Section 4.4 of the main paper, this strategy is beneficial in the following respects: (1) we use the predicted free part pose to align the free part, transforming the joint state estimation task from camera space to canonical space, so that the network can effectively learn the joint state in canonical space; this is a simple and efficient approach, yet still under-studied in existing works (see reviewer Gf45). (2) Canonicalizing the point cloud eliminates the effects of varying joint configurations of the movable rigid parts; this reduces uncertainty, enhances stability and accuracy, and provides both shape and kinematic priors, significantly aiding the regression of joint states. The ablation study in Table 2 validates these advantages.

W4: For the ArtImage dataset, the translation estimation of the proposed method is not stable, and in the “Scissors” case its performance is much worse than in other categories in terms of rotation and translation errors. More explanation is expected.

Estimating translation in category-level pose estimation is challenging. As shown in Table 1, the translation error of the constrained part is higher than that of the free part, since it relies on the estimated free part pose and joint parameters. However, whereas the baselines achieve state-of-the-art performance only for specific categories, our method achieves consistent performance across all categories.
The scissors category typically exhibits complex shapes and partially symmetric features. This complexity makes it challenging to accurately capture the geometry of scissors, while the symmetry may lead to multiple plausible poses.

Author Response

Dear Reviewers,

Please see the attached one-page PDF for a summary of the added experimental figures.

We appreciate the reviewers' positive comments on the novelty and significance of our method (Reviewers UxV6, UsR2, cXQd, Gf45), its readability (Reviewer UxV6), its effectiveness (Reviewers UxV6, UsR2, Gf45), and its notable improvements over state-of-the-art methods (Reviewers cXQd, Gf45).

We provide point-by-point responses to each reviewer. We look forward to further discussion with the reviewers, answering questions, and making improvements.

Final Decision

This paper is about articulated object pose estimation. After considering all reviews and author responses, the paper received three ratings of "borderline accept" and one of "borderline reject". However, the reviewer recommending rejection states that they are comfortable with acceptance if the other reviewers recommend it. The authors provided detailed responses to the reviewers' queries; unfortunately, except for one reviewer, the others did not update their ratings. In the AC's view, the responses are specific and address the questions raised. The AC recommends accepting this paper.