PaperHub
Overall rating: 5.8 / 10 (Poster; 4 reviewers; min 5, max 7, std 0.8)
Individual ratings: 7, 6, 5, 5
Confidence: 4.3 · Correctness: 2.5 · Contribution: 2.8 · Presentation: 3.3
NeurIPS 2024

Grasp as You Say: Language-guided Dexterous Grasp Generation

Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords
Dexterous Grasp Generation, Robotics

Reviews and Discussion

Review (Rating: 7)

This paper proposes a novel robotic grasping task that utilizes natural language as guidance to generate grasping poses for dexterous hands that satisfy corresponding intentions. To accomplish this task, the paper first creates a large-scale language-guided dexterous grasping dataset. Moreover, the authors propose a progressive solution composed of a diffusion-based generative model and a regressive model to generate intention-consistent, diverse, and high-quality grasp poses for a wide range of objects.

Strengths

  1. Compared to existing work, this work advances further by posing a more general and challenging problem and offering a more comprehensive solution. The introduction of this problem promotes the direction of general robotic grasping.
  2. The paper contributes a grasping dataset that includes 18,000 objects with a total of 50,000 text-grasp pose pairs. This dataset fully utilizes the information provided by the Oakink dataset and transfers human hand poses to robotic hands in three steps. During the language annotation stage, the paper generates corresponding language descriptors by measuring the distance from the fingers to different regions of the object, then uses GPT to generate natural language (a contact-labeling sketch follows after this list). The entire process is technically sound.
  3. The paper adopts a progressive framework to address this issue, utilizing a diffusion-based generative model and a regressive model to ensure that the generated poses are both diverse and high-quality.
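For illustration, a minimal sketch of the contact-based language annotation process described in point 2 is given below; the region names, threshold, and helper structure are hypothetical and not the authors' actual implementation.

```python
import numpy as np

# Hypothetical inputs: fingertip positions (5 x 3) and labeled object regions,
# each an (N_i x 3) point set. All names and the threshold are illustrative.
FINGERS = ["thumb", "forefinger", "middle", "ring", "pinky"]

def contact_pairs(fingertips, regions, contact_thresh=0.01):
    """Return (finger, region) pairs whose fingertip-to-region distance is
    below a contact threshold (in meters)."""
    pairs = []
    for finger, tip in zip(FINGERS, fingertips):
        # nearest region by minimum point-to-point distance
        dists = {name: np.linalg.norm(pts - tip, axis=1).min()
                 for name, pts in regions.items()}
        region, d = min(dists.items(), key=lambda kv: kv[1])
        if d < contact_thresh:
            pairs.append((finger, region))
    return pairs

# The structured pairs would then be filled into a template and paraphrased by
# an LLM, e.g. [("forefinger", "pump head")] ->
# "Place your forefinger on the pump head and other fingers on the bottle body."
```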

Weaknesses

  1. When verifying intention consistency, the paper uses chamfer distance and contact distance to compare the error between the estimated poses and the target poses. However, since the intention is derived from natural language, which is inherently imprecise and non-unique, using a collected target pose as the sole standard is inappropriate. For instance, for "to use a lotion pump, place your forefinger on the pump head and other fingers on the bottle body," there might be infinitely many ways to hold the lotion pump with the forefinger on the pump head, some of which might differ from the target one. Therefore, chamfer distance and contact distance may not be the best metrics.
  2. The dataset collected in the paper includes only 50,000 grasping poses. Compared to existing datasets for parallel-jaw grippers, which contain millions or even billions of poses, the diversity is limited. For a dexterous hand, which has a higher degree of freedom, collecting more data is necessary to achieve more varied grasping. I believe the performance on unseen categories would degrade significantly.
  3. The authors mention that they use 20% of the data as a test set to evaluate the model's performance. However, they do not provide more specific information about the training and test sets: for example, whether the target poses and description texts for objects of the same category are consistent between the training and test sets, and what the results would be given inconsistent descriptions, e.g., in Figure 6, if "To grasp a wineglass, use your fingers to hold the stem securely" were changed to "To grasp a wineglass, use your fingers to hold the GLASSES securely" (I think this description would not exist in the dataset).
  4. Although the authors use the Q1 metric to measure grasp stability, this analytical approach still has limitations. The authors could try using a simulator (e.g., Isaac Sim) to test the grasp success rate, which would be more indicative.

Other Comments

  1. The title of A.1.4 is "inference time," but the content is unrelated.
  2. The real-world experiments in the video are not very promising. In the given demos, non-target areas of the objects are fixed by fixtures, leaving only the target areas available for grasping.

Questions

See weaknesses

Limitations

See weaknesses

Author Response

Thanks for your reviews. Below are our responses to your questions.

Q1: Intention consistency metric.

A1: To further evaluate, we employ the Fréchet Inception Distance (FID) [1], which is commonly used in generative tasks [3]. We use sampled point cloud features extracted from [2] to calculate P-FID and rendered image features extracted from [1] to calculate FID. The results in Table 1 of the global rebuttal PDF show that our method surpasses all previous methods. More details can be found in global rebuttal (a), and Figure 1 (a) and Table 1 in the rebuttal PDF.
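As a reference, the standard FID computation on two sets of extracted features (Heusel et al. [1]) can be sketched as below; feature extraction with the Inception or Point-E encoders is assumed to happen elsewhere, and this is not the authors' exact code.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_gen, feats_gt):
    """Fréchet distance between Gaussians fitted to two feature sets
    (rows are samples, columns are feature dimensions)."""
    mu1, mu2 = feats_gen.mean(axis=0), feats_gt.mean(axis=0)
    cov1 = np.cov(feats_gen, rowvar=False)
    cov2 = np.cov(feats_gt, rowvar=False)
    covmean, _ = linalg.sqrtm(cov1 @ cov2, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerics
    diff = mu1 - mu2
    return diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean)
```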

Q2: Generalization on unseen categories.

A2: We conduct experiments on unseen categories by removing 15 categories from the training set. The results show good generalization of our method. Benefiting from our framework design and our large-scale dataset, the model can learn from object geometric structure and language semantics rather than overfitting to specific categories (e.g., it learns how to use a mug by grasping the handle and generalizes to how to use a teapot). The details and results can be found in global rebuttal (b), Table 2 and Figure 2 (a) in the rebuttal PDF.

Q3: Dataset splitting and description inconsistency.

A3:

  1. The description texts for objects of the same category in the training and test sets are NOT the same. The description texts are generated independently for each sample by the LLM. Our dataset consists of 34 categories, 1,500 objects, and 50,000 grasp annotations. We split it at the object level, with 80% for training and 20% for testing in each category (see the split sketch after this list). The training set contains about 1,200 objects and 40,000 grasps, and the test set contains about 300 objects and 10,000 grasps.
  2. Our framework is robust to description noise, which we attribute to the pre-trained CLIP language encoder [4]. For your example, if 'stem' is replaced with 'GLASSES,' the model tends to grasp either the stem or the body across different samples. If 'stem' is replaced with 'body,' the generated dexterous hand will hold the body of the wineglass. Since the rebuttal PDF is limited to one page, we will add more details to the supplementary materials.
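A minimal sketch of the object-level split described in point 1, assuming each annotation carries hypothetical 'category' and 'object_id' fields; this is an illustration, not the released split script.

```python
import random
from collections import defaultdict

def split_by_object(samples, train_ratio=0.8, seed=0):
    """Split annotations so every object falls entirely into either train or
    test, keeping roughly the same ratio within each category."""
    by_cat = defaultdict(set)
    for s in samples:
        by_cat[s["category"]].add(s["object_id"])
    rng = random.Random(seed)
    train_objs = set()
    for cat, objs in by_cat.items():
        objs = sorted(objs)
        rng.shuffle(objs)
        train_objs.update(objs[: int(len(objs) * train_ratio)])
    train = [s for s in samples if s["object_id"] in train_objs]
    test = [s for s in samples if s["object_id"] not in train_objs]
    return train, test
```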

Q4: Grasp success rate.

A4: We evaluate the grasp success rate in Isaac Gym following [5], and the results show that our framework surpasses all previous works. The results can be found in Table 1 in the rebuttal PDF.

Q5: The title of A.1.4 is "inference time".

A5: This section describes the pipeline at inference time. We will change the title to "Inference Time Pipeline".

Q6: The fixtures in real-world experiments.

A6: The objects are fixed with fixtures for the convenience of demonstration. If the model generates an inaccurate grasp, it will collide with the fixture. We also provide a new visualization showing that all areas are available for grasping without using fixtures in Figure 2 (c) in the rebuttal PDF.

[1] Heusel, et al. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. NIPS 2017.

[2] Nichol, et al. Point-E: A System for Generating 3D Point Clouds from Complex Prompts. OpenAI 2022.

[3] Guo et al. Generating Diverse and Natural 3D Human Motions from Text. CVPR 2022.

[4] Radford, et al. Learning Transferable Visual Models from Natural Language Supervision. ICML 2021.

[5] Wang, et al. DexGraspNet: A Large-Scale Robotic Dexterous Grasp Dataset for General Objects Based on Simulation, ICRA 2023.

Comment

I appreciate the authors' responses and additional experiments, which addressed most of my concerns and questions. Although this is a minor comment, it would be beneficial to include more analysis of failure cases. This could provide valuable insights and help guide future improvements. Overall, this paper studies the novel task of open-vocabulary dexterous grasping by contributing a new dataset and proposing a novel pipeline to tackle this problem. Therefore, I would like to raise my score to support acceptance.

Comment

We appreciate the reviewer's feedback and are pleased that our response has addressed the concerns. We will conduct more analysis and visualization of failure cases, and add these to the supplementary materials in our final version.

Review (Rating: 6)

This paper tackles language-guided dexterous grasping by creating a large-scale dataset pairing grasps with language guidance and presenting a grasp generation pipeline based on language instructions. The dataset, DexGYSNet, leverages HOIR and LLMs to scale up the size of grasps and annotations while keeping the generation costs manageable. The DexGYSGrasp framework introduces a two-stage progressive generation to balance diversity, quality, and grasp-language alignment. The method is evaluated on intention consistency, grasp quality, and diversity against SOTA methods. Experiments demonstrate the necessity of the progressive components, the effectiveness of HOIR, and the potential to transfer to real-world scenarios. Overall, the paper advances the field of dexterous grasp generation with natural language guidance, providing a valuable dataset and proposing a robust framework for generating high-quality robotic grasps based on human instructions.

Strengths

This problem of language-guided dexterous grasping is significant. This paper presents both a large-scale dataset and a method to generate diverse grasps. The dataset is a valuable resource for the community and the method can be integrated into a grasping policy and applied to real-world scenarios. The design of experiments justifies the choice to break the generation process into two stages and demonstrates the effectiveness of the method in improving grasp quality, diversity, and alignment with language instructions.

Weaknesses

  1. The method requires a full point cloud and is designed for static pose generation. In real-world applications, access to a complete point cloud is often unavailable, although this could be mitigated by using a point cloud completion network. Moreover, grasping is typically a precursor to a subsequent action, requiring an end-to-end pipeline that can handle both grasping and the intended use of the object. An integrated approach would likely be more effective than segregating these steps into different modules. For instance, the subsequent policy must not only use the generated pose but also comprehend the instructions (again, since the generated pose alone is not enough) and the object geometry to perform the correct actions, which an end-to-end system could potentially handle more seamlessly. In addition, there can be slight adjustments to the hand configuration after grasping a tool from a table but before using it. In this sense, real-world experiments on this synthesis method are of limited validity in the first place.

  2. The paper does not sufficiently address how the method generalizes to unseen objects or categories. Given the diversity of real-world objects, it is crucial for the network to perform well on unseen items to be practically useful. Including more detailed experiments and analysis on in-category and out-of-category generalization would significantly strengthen the work. This could involve testing the method on a wider variety of objects and categories not present in the training dataset and assessing its adaptability given that OakInk has the category information.

Questions

  1. The paper states that intention consistency is calculated by comparing the prediction to the most similar grasp target. While this approach addresses the issue of multiple valid grasp poses, could you provide more details on how the diversity of valid grasp poses is represented in your dataset and how it influences the evaluation metrics?
  2. The paper does not seem to explicitly address how the proposed method generalizes to different objects, unseen objects, or even unseen categories. The dataset is derived from OakInk, which has category information. Could you provide more details on the performance and adaptability of your method when applied to objects and categories not present in the training dataset? This generalizability is important for the network to understand the correspondence between the geometry and functional semantics. In-category and out-of-category generalization are both essential for a capable network.

Limitations

  1. The method is not yet ready for real-world applications due to several factors. It requires a full point cloud, which needs several add-on modules such as segmentation and completion for deployment. Additionally, there is a lack of discussion on how this method would benefit the subsequent grasping execution and how the grasped objects can be manipulated, which is crucial for practical use.
  2. The paper does not sufficiently address how the method generalizes to unseen objects or categories. Ensuring that the method can handle a wide range of objects, including those not present in the training dataset, is essential for practical applicability.

Author Response

Thanks for your reviews. Below are our responses to your questions.

Q1: The full point cloud and end-to-end pipeline.

A1: Our work focuses on the language-guided dexterous grasp generation task and serves as a basis for future work on partial observations and end-to-end pipelines. Moreover, our cascading pipeline works well in the real-robot experiments, and its impact on the performance of our framework is acceptable.

Q2: Generalization on unseen objects.

A2:

  1. All objects in the test set of DexGYSNet DO NOT exist in the training set. Therefore, the generation performance reported in this paper is evaluated entirely on unseen objects. Our dataset consists of 34 categories, 1,500 objects, and 50,000 grasp annotations. We split it at the object level, with 80% for training and 20% for testing in each category. The training set contains about 1,200 objects and 40,000 grasps, and the test set contains about 300 objects and 10,000 grasps.
  2. In addition, we conduct experiments on cross-domain unseen objects collected from PartNet [3] (a 3D object understanding dataset). The results show good generalization of our framework. The details and results of the cross-domain unseen-object experiment can be found in global rebuttal (b), Table 3 and Figure 2 (b) in the rebuttal PDF.

Q3: Generalization on unseen categories.

A3: We conduct experiments on unseen categories by removing 15 categories from the training set. The results show good generalization of our method. Benefiting from our framework design and our large-scale dataset, the model can learn from object geometric structure and language semantics rather than overfitting to specific categories (e.g., it learns how to use a mug by grasping the handle and generalizes to how to use a teapot). The details and results can be found in global rebuttal (b), Table 2 and Figure 2 (a) in the rebuttal PDF.

Q4: Intention consistency and diversity calculation.

A4:

  1. The ground truth grasps of an object-guidance pair are all grasp poses sharing the same intention. Grasps with the same simple intention and contact parts are considered to share the same intention (e.g., all grasps holding the bottle from different directions).
  2. At inference time, the object and language are fed into our framework as conditions, and multiple samplings (specifically 8 in our setting) are performed to generate 8 grasps.
  3. Each generated grasp is matched with the nearest ground truth grasp by the hand chamfer distance. The intention consistency metric is calculated between the generated grasps and the matched ground truth. The diversity metrics compute the standard deviation of these eight sampled grasps (see the sketch after this list).
  4. We believe that a good model should be able to generate grasps that are both diverse and aligned with intention (e.g. hold the body of a bottle from different directions).
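A minimal sketch of the matching-based consistency and diversity computation described above; the brute-force chamfer distance and the pose representation are placeholders, not the authors' implementation.

```python
import numpy as np

def chamfer_distance(pts_a, pts_b):
    """Symmetric chamfer distance between two point sets (M x 3, N x 3)."""
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def consistency_and_diversity(gen_hand_points, gt_hand_points, gen_poses):
    """gen_hand_points: hand point clouds of the sampled grasps (e.g., 8);
    gt_hand_points: hand point clouds of ground-truth grasps with the same intention;
    gen_poses: flattened pose parameters of the sampled grasps (K x D array)."""
    # intention consistency: each sample vs. its nearest ground-truth grasp
    consistency = np.mean([
        min(chamfer_distance(g, gt) for gt in gt_hand_points)
        for g in gen_hand_points
    ])
    # diversity: standard deviation across the sampled poses
    diversity = np.asarray(gen_poses).std(axis=0).mean()
    return consistency, diversity
```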

Comment

The author's responses mostly address my concerns and questions. While some doubts remain on the foundational settings of this work, e.g., full point cloud and an end-to-end execution pipeline, this paper still provides a strong basis for future exploration in such topics. Therefore, I retain my original recommendation for acceptance.

Comment

Thanks for the reviewer's feedback. We are pleased that our response has addressed the concerns.

Review (Rating: 5)

The paper introduces "Dexterous Grasp as You Say" (DexGYS), a novel framework that integrates natural language instructions with robotic dexterous grasping. To support this integration, the authors present DexGYSNet, a dataset featuring 50,000 pairs of annotated dexterous grasps with corresponding human language instructions across 1,800 household objects, developed using innovative methods such as Hand-Object Interaction Retargeting and LLM-assisted annotation.

Building on this dataset, the DexGYSGrasp framework is introduced, designed to generate robot grasps that are aligned with human intentions, ensuring diversity and quality. This two-phase framework initially learns a diverse grasp distribution aligned with intended commands and subsequently refines these grasps to enhance quality.

Extensive testing both on DexGYSNet and in real environments confirms the framework's ability to produce practical, diverse, and high-quality grasps based on natural language commands, significantly advancing human-robot interaction capabilities for both industrial and domestic applications.

Strengths

  1. The motivation of the authors is reasonable: leveraging current large language models to provide fine-grained descriptions of grasping states. The DexGYSNet dataset contributes to the community.

  2. The authors validated their algorithm on a real robotic gripper, which is crucial.

  3. The authors showcased parts of the DexGYSNet data in the supplementary material. Providing visualization code would further enhance this contribution.

Weaknesses

  1. The evaluation metrics require further discussion. DexGYSGrasp is a generative framework, and the effectiveness of the proposed method based on Chamfer distance and Contact distance is not entirely convincing. Although the authors mention in Supp. (Line 521) that the target is "the most similar grasp", it is not always possible to find a closely matching grasp pose in the dataset. For instance, if the generated grasp is on the right temple of a pair of eyeglasses, but all dataset grasps are on the left temple, the Chamfer distance metric becomes ineffective.
  2. DexGYSNet only annotates which fingers contact the object and which parts of the object are touched. It lacks more complex intents, which might limit its applicability.
  3. The authors need to verify the quality of DexGYSNet. For example, what about the quality of the text annotations (e.g., could hallucinations from large models cause errors)? Additionally, the authors obtain dexterous hand parameters from MANO parameters through retargeting. They need to validate the grasp quality in physical simulations (e.g., Isaac Sim) to ensure that the objects are tightly grasped without dropping.
  4. In Line 170, the authors train the Quality Grasp Component based on the most similar ground truth grasp with the same intentions. Perhaps, adding noise directly to the generated grasp pose to create a large amount of training data would be more effective?
  5. The paper contains several typos. For example, Line 117 incorrectly uses $L_{spen}$. Lines 173 and 168 are repetitive.

Questions

My questions are primarily based on the Weaknesses:

  1. The authors need to further discuss their evaluation metrics. CD and Con. are too stringent for a generative task. Metrics based on collisions or the standard deviation of poses seem insufficient for evaluating language-based generative tasks.
  2. The authors might consider how to incorporate richer intents into DexGYSNet.
  3. The authors should include an evaluation of the quality of the DexGYSNet dataset.

Limitations

The authors discuss the societal impacts and limitations of their method.

Author Response

Thanks for your reviews. Below are our responses to your questions.

Q1: Evaluation Metrics.

A1: To further evaluate, we employ the Fréchet Inception Distance (FID) [1], which is commonly used in generative tasks [6]. We use sampled point cloud features extracted from [2] to calculate P-FID and rendered image features extracted from [1] to calculate FID. The results in Table 1 of the global rebuttal PDF show that our method surpasses all previous methods. More details can be found in global rebuttal (a), and Figure 1 (a) and Table 1 in the rebuttal PDF.

Q2: Richer Intents.

A2:

  1. The guidance texts about contacting in DexGYSNet are better aligned with the grasp action, thereby facilitating the generation of higher quality grasps.
  2. Our method can be naturally extended to enrich intentions. For the dataset construction, the high-level intention can be annotated by the LLM with proper prompts (e.g., “Please generate possible human intentions for grasping the mug handle.”). For methods, the high-level intention can be understood by the LLM and output as low-level guidance through chain-of-thought reasoning [3] (e.g., from "I am thirsty" to "Grasp the handle of the mug"). Then, the low-level guidance can be fed into our framework to generate grasp.

Q3: Grasp Quality of DexGYSNet.

A3: For grasp quality, we validate the grasps in Isaac Gym following [4], and the success rate is 85.1%. The overall grasp quality of DexGYSNet is relatively high compared with previous datasets [7]. We find that simulation errors and the small contact areas in fine-grained operations (e.g., grasping the cap of a bottle) cause the failure cases. We will perform retargeting and verification again, assisted by manual adjustment, to ensure the high quality of DexGYSNet.

Q4: Text Quality of DexGYSNet.

A4: For text annotation quality, we use LLM-assisted evaluation [5] (leveraging the powerful vision understanding ability of GPT-4o) to evaluate the consistency with the rendered grasp image, as shown in Figure 1 (b) in the global rebuttal PDF. The average consistency score is 4.38 (on a scale of 1-5), and the overall text quality of DexGYSNet is relatively high according to [5]. Some inconsistent samples are caused by fine-grained contact and specific modifiers (e.g., "tightly grasp the handle of the mug with all your fingers," while the LLM thinks the grasp is not tight enough or some fingers do not contact the handle). We will improve these with the assistance of the LLM and manual checks.

Q5: Adding noise to train Quality Grasp Component.

A5: We add Gaussian noise to the normalized grasp pose to augment the training set. There is no significant improvement, as shown in the following table (a sketch of the augmentation follows after it). We think the current training data is sufficient to train our model.

Method      P-FID ↓   Q1 ↑
Ours        5.796     0.083
Add Noise   5.764     0.085
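For concreteness, the noise-augmentation baseline in the table above could look like the following sketch; the noise scale, normalization, and function name are assumptions rather than the authors' setup.

```python
import numpy as np

def augment_with_noise(normed_poses, sigma=0.01, copies=1, seed=0):
    """Append jittered copies of normalized grasp poses (N x D array);
    sigma is the Gaussian standard deviation in normalized pose units."""
    rng = np.random.default_rng(seed)
    jittered = [normed_poses + rng.normal(0.0, sigma, normed_poses.shape)
                for _ in range(copies)]
    return np.concatenate([normed_poses, *jittered], axis=0)
```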

[1] Heusel, et al. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. NIPS 2017.

[2] Nichol, et al. Point-E: A System for Generating 3D Point Clouds from Complex Prompts. OpenAI 2022.

[3] Brohan, et al. Vision-language-action models transfer web knowledge to robotic control. CoRL 2023.

[4] Wang, et al. DexGraspNet: A Large-Scale Robotic Dexterous Grasp Dataset for General Objects Based on Simulation, ICRA 2023.

[5] Zheng, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NIPS 2023.

[6] Guo et al. Generating Diverse and Natural 3D Human Motions from Text. CVPR 2022.

[7] Wei et al. DVGG: Deep Variational Grasp Generation for Dextrous Manipulation. RAL 2022.

Comment

I appreciate the author’s professional and meticulous responses. Currently, most of my concerns have been addressed. I recommend that the author include a discussion on the evaluation metrics in the camera-ready version of the paper, as this is crucial and will impact all subsequent follow-up works. I also look forward to the prompt publication of the high-quality dataset. Therefore, I am inclined to maintain my review score.

Comment

We appreciate the reviewer's constructive suggestions and are pleased that our response has addressed the concerns. We will add the discussion on evaluation metrics in the final version.

Review (Rating: 5)

This work proposes a language-guided dexterous grasp dataset based on OakInk, named DexGYSNet, built with the help of an LLM, and a corresponding two-component framework, DexGYSGrasp, which generates dexterous grasps based on human language instructions. Real-world experiments are conducted to validate DexGYSGrasp's performance.

Strengths

  • The construction of the DexGYSNet dataset is cost-effective with the help of a large language model-assisted annotation system. The use of an LLM-based approach allows for efficient and scalable data labeling, which is crucial for developing robust models in this domain.
  • Extensive experiments are conducted to verify the performance advantages of the DexGYSGrasp method, particularly in terms of grasping intention and diversity, as well as the design of the evaluation metrics. The thorough experimental evaluation provides a comprehensive assessment of the approach and its capabilities.

Weaknesses

  • While the DexGYSGrasp framework presents several notable contributions, such as the use of penetration loss and the Quality Grasp Component, the primary strength of the approach seems to be in the training methods rather than the overall framework design.
  • The Quality Grasp Component appears to function more as a post-processing step rather than a core component of the framework. This suggests that the key innovations of the work are centered around the training techniques employed, rather than the architectural design of the system.

Questions

  • The language descriptions generated via the large language model, even with the inclusion of rich contact information as shown in Figure 6, still appear to lack precision. In this case, it is questionable whether directly applying metrics like Chamfer distance and contact distance is the most appropriate approach to reflect the consistency of the grasping intention. The suitability of these particular evaluation metrics for the language-based grasping task deserves further examination.
  • A minor question: the naming of the dataset as "DexGYSNet" seems somewhat confusing, as it suggests a naming convention more suited for a pipeline or system, rather than a dataset. "DexGYSGrasp" would be a more intuitive name for the dataset itself, as it aligns better with the specific task of grasping being addressed.

Limitations

See Weaknesses

Author Response

Thanks for your reviews. Below are our responses to your questions.

Q1: The main contribution.

A1: The main contribution of our paper is NOT just a dataset and a baseline. We believe our idea can contribute to the robotics community. Our main contributions can be summarized as follows:

  1. This paper explores a meaningful task, "Dexterous Grasp as You Say" (DexGYS), enabling robots to perform dexterous grasping based on human language commands.
  2. This paper proposes the DexGYSNet dataset and a cost-effective dataset construction pipeline for language-guided dexterous grasp generation, based on the idea of retargeting from human grasps with LLM assistance. This pipeline can inspire more language-guided and dexterous robotics tasks.
  3. This paper proposes a progressive two-stage framework that decouples the challenging problem of dexterous grasp generation into two sub-problems. One-stage methods struggle to simultaneously achieve high quality, diversity, and intention consistency due to optimization challenges, while our framework achieves all three.

The framework surpasses all previous methods, and only our framework achieves generation with high quality, diversity, and intention consistency.

Q2: The framework is a baseline.

A2: Our framework is NOT a baseline, but a novel framework that solves the identified optimization problems and achieves capabilities that previous works can’t.

  1. We benchmark several SOTA methods as baselines, as shown in Table 1 in both the main paper and the rebuttal PDF. The results show that only our framework achieves intention consistency, diversity, and high quality in generation.
  2. We identify the key optimization challenge for dexterous grasp generation caused by penetration loss and propose to decouple this challenge into two easily solvable problems. And our framework effectively addresses both problems.
  3. Moreover, the Quality Grasp Component is not a post-processing method like Test Time Adaptation [1], but a part of the progressive two-stage framework. The Quality Grasp Component needs to keep the intention consistency and improve the quality, which cannot be achieved by [1], as shown in Table 2, Line 6 (IDGC($\lambda^2_{pen}=0$) + TTA) in the main paper.

Q3: Evaluation metric.

A3: To further evaluate, we employ the Fréchet Inception Distance (FID) [2], which is commonly used in generative tasks [4]. We use sampled point cloud features extracted from [3] to calculate P-FID and rendered image features extracted from [2] to calculate FID. The results in Table 1 of the global rebuttal PDF show that our method surpasses all previous methods. More details can be found in global rebuttal (a), and Figure 1 (a) and Table 1 in the rebuttal PDF.

Q4: Why call the dataset DexGYSNet.

A4: We follow ImageNet [5] in naming our dataset; other examples include DexGraspNet [6].

[1] Jiang et al. Hand-Object Contact Consistency Reasoning for Human Grasps Generation. ICCV 2021

[2] Heusel, et al. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. NIPS 2017.

[3] Nichol, et al. Point-E: A System for Generating 3D Point Clouds from Complex Prompts. OpenAI 2022.

[4] Guo et al. Generating Diverse and Natural 3D Human Motions from Text. CVPR 2022

[5] Deng et al. ImageNet: A Large-Scale Hierarchical Image Database. CVPR2009.

[6] Wang, et al. DexGraspNet: A Large-Scale Robotic Dexterous Grasp Dataset for General Objects Based on Simulation, ICRA 2023.

Comment

Thank you for clarifying your core contributions and the value of your design. The introduction of the new metric, FID, and additional experiments help to evaluate diverse but correct grasping positions. I have accordingly raised my score.

Comment

Thanks for the reviewer's feedback. We are pleased that our response has addressed the concerns.

Author Response

We thank all reviewers for their constructive comments. We conduct additional experiments and analysis on the evaluation metrics and generalization. The figures and tables can be found in the rebuttal PDF.

(a) Evaluation metrics:

  1. To further evaluate, we employ the Fréchet Inception Distance (FID) [1], which is commonly used in generative tasks [6], measuring the distance between the generated distribution and the ground truth distribution. We use sampled point cloud features extracted from [2] to calculate P-FID and rendered image features extracted from [1] to calculate FID. The details are shown in Figure 1 (a).
  2. We also evaluate the grasp success rate in Isaac Gym following [4].
  3. The results in Table 1 show that our method surpasses all previous methods in FID, P-FID and success rate.
  4. The Chamfer Distance and Contact Distance can also evaluate model performance. These metrics are calculated between the generated grasp and the most similar ground-truth grasp sharing the same intention. Grasps with the same simple intention and contact parts are considered to share the same intention (e.g., all grasps holding the bottle from different directions). In general, each generated grasp can be matched to a proper ground truth with the same intention, benefiting from the richness of DexGYSNet. Contact consistency is also used in previous work [7]. Therefore, these metrics can evaluate the model's overall performance on the whole dataset.

(b) Generalization on unseen objects and categories.

  1. All objects in the test set of DexGYSNet DO NOT exist in the training set. Therefore, the experiment results of model performance are all evaluated on unseen objects in this paper.
  2. To evaluate generalization on unseen categories, we randomly sample 15 categories [8] and exclude them during the training phase. The results in Table 2 and the visualization in Figure 2 (a) show good generalization of our method. Benefiting from our framework design and our large-scale dataset, the model can learn from object geometric structures and language semantic information (e.g., it learns how to use a mug by grasping the handle and generalizes to how to use a teapot), rather than overfitting to specific categories.
  3. To further evaluate, we collect cross-domain unseen objects from PartNet [3] (a 3D object understanding dataset), containing 6 categories and 316 objects [9]. We collect language guidance generated by GPT-4o, where the prompt contains examples from DexGYSNet and the category of the target object. We use GPT-4o to evaluate the consistency between the rendered image and the language guidance, as shown in Figure 1 (b), due to the lack of ground truth (a sketch of this scoring follows below). The results in Table 3 and the visualization in Figure 2 (b) demonstrate that models trained on DexGYSNet exhibit good generalization on cross-domain unseen objects and that our framework surpasses the SOTA.
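A hypothetical sketch of the LLM-as-judge consistency scoring mentioned in point 3, using the OpenAI chat completions API with a rendered grasp image; the prompt wording and scoring scale are assumptions, not the authors' exact protocol.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def consistency_score(image_path, guidance):
    """Ask GPT-4o to rate, on a 1-5 scale, how well a rendered grasp matches
    the language guidance. Prompt wording is illustrative."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f'Guidance: "{guidance}". On a scale of 1-5, how '
                         "consistent is the rendered grasp with the guidance? "
                         "Reply with a single number."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return float(resp.choices[0].message.content.strip())
```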

[1] Heusel, et al. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, NIPS 2017.

[2] Nichol, et al. Point-E: A System for Generating 3D Point Clouds from Complex Prompts. OpenAI 2022.

[3] Mo, et al. PartNet: A Large-scale Benchmark for Fine-grained and Hierarchical Part-level 3D Object Understanding, CVPR 2019.

[4] Wang, et al. DexGraspNet: A Large-Scale Robotic Dexterous Grasp Dataset for General Objects Based on Simulation, ICRA 2023.

[6] Guo et al. Generating Diverse and Natural 3D Human Motions from Text. CVPR 2022

[7] Agarwal, et al. Dexterous Functional Grasping. CoRL 2023

[8] "Unseen Categories": "cup", "cylinder_bottle", "teapot", "hammer", "screwdriver", "pincer", "fryingpan", "toothbrush", "wrench", "bowl", "lightbulb", "flashlight", "binoculars", "apple", "banana".

[9] "Cross-domain Unseen objects": "bottle", "mug", "headphones", "jar", "knife", "scissors".

Final Decision

The authors addressed all concerns raised by the reviewers with additional experiments and clear explanations. The paper's contributions, including the DexGYSNet dataset and the DexGYSGrasp framework, are significant for advancing the field of language-guided robotic grasping. The final evaluations from the reviewers were positive, recognizing the paper's technical merits and potential impact, leading to a recommendation for acceptance.