InterDreamer: Zero-Shot Text to 3D Dynamic Human-Object Interaction
Abstract
Reviews and Discussion
The paper addresses the challenge of text-conditioned 3D dynamic human-object interaction (HOI) generation, which has lagged behind advancements in text-conditioned human motion generation due to the scarcity of large-scale interaction data and detailed annotations. The authors propose InterDreamer, a framework that decouples interaction semantics and dynamics to generate realistic HOI sequences without relying on text-interaction pair data. By leveraging pre-trained large models for high-level semantic control and introducing a world model to understand low-level interaction dynamics, InterDreamer moves beyond the limitations of existing motion capture data. Applied to the BEHAVE and CHAIRS datasets, InterDreamer demonstrates the ability to produce coherent and text-aligned 3D HOI sequences, showcasing its effectiveness through comprehensive experimental analysis.
Strengths
- The authors introduce a novel task of synthesizing whole-body interactions with dynamic objects guided by textual commands, without relying on text-interaction pair data.
- The types of HOI interactions are limited compared to the diversity of movement of the human body itself, so the idea of decomposing semantics and dynamics and then integrating them makes sense to me. The main challenge lies in effectively identifying the positions where the human and object interact. The authors propose leveraging the power of LLMs to simplify this challenge effectively.
- The technical details are sound, and the experimental results are good and demonstrate its zero-shot capability.
- The paper is well written, with clear figures and typography.
Weaknesses
- This work doesn't seem to be able to handle more complex long interactions, such as a person walking to a chair and then sitting down, or a person lifting a box on the floor and carrying it for a while before putting it on the ground.
Questions
- I'm curious whether the Interaction Retrieval can predict accurate contact areas for an object that has never been seen before.
- Is this post-processing optimization expensive, and how long does it take to optimize a sequence?
Limitations
- The method is still unable to handle fine-grained human-object interactions (HOIs), such as dexterous manipulation using hands. HOI generation is still a very challenging task, so I wouldn't consider this limitation a weakness of the manuscript, and I expect that subsequent work will refine this point.
We thank the reviewer for thoughtful and insightful comments. We address your concerns below:
W1: handle more complex long interactions:
- As shown in Figure 4 and in the examples after 00:43 in demo_1.mp4 of the supplementary material, our approach is capable of handling complex and extended interactions, such as “A person holds a medium box up with their right hand, lowers their right arm, and pulls the box with left hand towards them.”
- However, we recognize the model's limitation on very long and highly complex sequences, particularly those involving multiple distinct phases, which we found to be difficult for existing work even with supervised training [21,50,71 in the paper]. To address this, future work could explore temporal compositionality in HOI, as well as incorporate more complex multi-phase interactions into the training data, to enhance the model's capability in these scenarios.
Q1: Accuracy of contact reasoning for novel objects:
- The handcrafted interaction retrieval works well with predefined objects but lacks generalizability, as it requires building a specific database. In contrast, the learning-based interaction retrieval (L671-681 in the supplementary material) can handle novel objects but may struggle in complex scenarios due to two main issues:
- Stable Diffusion sometimes generates low-quality images in complex human-object interactions, leading to unnatural humans, incorrect object states, or unreasonable contact patterns.
- These lower-quality images fall outside the distribution that LEMON (an off-the-shelf model that we used for estimating object affordance and human contact from images) was trained on, so it cannot predict accurate contact areas (affordance) for them.
Improving the text-to-image model for human-object interaction could enhance image quality, which we see as a promising direction for future work.
Q2: Efficiency of optimization:
- The optimization is expensive, and each step often takes a few seconds. Thus, for efficiency, as mentioned in the supplementary material (L693), optimization is only performed if the loss exceeds a certain threshold. This strategy prevents unnecessary computations, thus maintaining overall computational efficiency.
Limitation
- Thanks for your feedback. We agree with the reviewer that generating fine-grained human-object interactions (HOIs), such as dexterous hand manipulations, remains a challenging task. One potential solution would be to incorporate a dedicated hand model, trained specifically on the nuances of fine-grained hand interactions (e.g., OakInk2 [1]), and then integrate it with the body model.
[1] Zhan, et al. OAKINK2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion. CVPR 2024
Thank you to the authors for their detailed response. Most of my concerns have been addressed. Text-driven 3D HOI generation is a complex and novel task involving full-body human motion generation, affordance learning, spatial awareness of human-object interactions, and more. There are still many aspects of this task to explore. Ultimately, I am inclined to accept this work and hope it can offer valuable insights and approaches for advancing this direction. Therefore, I maintained my original rating.
Thank you for your positive feedback and constructive input throughout the review!
This work aims at human-object interaction generation under less supervision. Since existing methods rely on large-scale interaction data, this paper attempts to propose a method that does not require paired data. The proposed method includes three stages: high-level planning by LLMs, low-level control by existing text-to-motion methods, and interaction retrieval, together with a world model based on existing human-object interaction generation methods. The proposed method is evaluated on the public BEHAVE and CHAIRS datasets.
Strengths
- The issue focused on in this paper is a worthy research topic. If training does not require pair-wise data, it would greatly reduce the dependence of human-object interaction generation on training data.
- Given texts, the figures intuitively demonstrate visual representations of the generated human-object motions.
Weaknesses
- About novelty. While the issue focused on in this paper is valuable, the proposed method is a combination of existing methods. I appreciate the technical effort of the authors, but please highlight how each stage differs from existing approaches.
- Missing some details:
  - In Line 165, how is the database built? If the data used to build the database includes both the training set and the test set, it is not standard.
  - Does the low-level control need to be trained? As mentioned in Line 269, the text-to-motion model is an existing method pre-trained on HumanML3D. If no training is required, how does the model generalize across action types?
  - As mentioned in Line 180, the authors claimed "this model trained on the 3D HOI dataset", while the main claim is that the proposed method does not require training on paired HOI data. This is inconsistent and confusing.
  - What is the size of the object's signed distance field? How many vertices is the object represented by? What network is used to encode and decode object shapes?
- The motivation for the quantitative experiment (section 4.2) is unclear. Why compare different control conditions?
- In Line 316, what is the vertex-based control? Does it mean human vertex controls in Line 210? The definition should be given.
- In the caption of Figure 4, it is claimed that the proposed method can handle complex and long sequences. Is there a strategy in the proposed method for complex and long sequences? What is the longest?
Questions
As mentioned in weaknesses.
Limitations
The authors have discussed the limitation of their method.
We thank the reviewer for thoughtful and insightful comments. We address your concerns below:
W1: Novelty
We appreciate the reviewer's thoughtful suggestions. While our method incorporates elements adapted from existing approaches, we would like to clarify the novelty of these components, particularly in the context of human-object interaction (HOI) generation.
- High-Level Planning: Our approach introduces a simple yet effective method to bridge the distributional gap (Line 138) between pre-trained text-to-human generation and free-form text-to-interaction generation. This is a key distinction from most existing work [102], which primarily utilizes LLMs only for understanding contact parts. We evaluate the significance of our design in Figure 6 and L300-309.
- Low-Level Control: While we employ existing methods and pre-trained models for low-level control, which we do not claim as novel, our contribution lies in seamlessly integrating four established text-to-motion models into our framework and evaluating their effectiveness, which shows that our framework is general.
- World Model: Unlike existing work that typically encodes the full state of the interaction (as shown in references [43, 83, 108] in the paper), our method introduces a novel approach to modeling contact vertices. As discussed in the paper (Lines 64-68, 184-191), this vertex-based representation ensures that the world model focuses on the critical contact regions, preventing overfitting to specific details, such as particular object shapes. This leads to improved performance, as demonstrated in Table 1, Figure 8, and Figure 5 (generalization to unseen objects from the CHAIRS dataset).
Overall, HOI generation, as a challenging task with inherent data scarcity issues, requires a novel paradigm distinct from existing approaches. Our perspective focuses on how to effectively reuse and harness large models, which is an emerging and crucial research problem. While naive reuse can lead to poor performance (e.g., Figure 6), our approach fosters synergy among models, resulting in contributions tailored to this task. We welcome the reviewer's feedback on any specific related papers and would be happy to discuss further or clarify any similarities with these works.
W2.1: Database build for interaction retrieval:
- We apologize for the omission of details. To clarify, we used only the training set to build the database. We will clarify this in the revision.
W2.2: Generalization of low-level control:
The low-level control does not require additional training. We would like to clarify how the model generalizes across action types from the following perspectives:
- Generalization to Diverse Actions: The text-to-motion model inherently possesses a degree of generalization across actions, similar to CLIP, where the compositionality of natural language allows for generalization. Semantics in the language space can be composed to represent different actions. By aligning language space [74] or discrete tokens [128, 35] with motion space/tokens, the motion model benefits from the generalization capabilities of the text compositionality.
- Generalization to HOI Descriptions: While these models are not directly generalizable to free-form HOI descriptions, we bridge the distributional gap between free-form interaction descriptions and the text-to-motion model's distribution using chain-of-thought prompting, as demonstrated in Figure 6 and Lines 300-309, which constitutes one of our contributions.
W2.3: No need for pair data:
- To clarify, the proposed method does not require training on paired HOI data. However, it does require training on HOI data without paired text (Lines 62-63), and only the world model needs to be trained with this data (Lines 76-77). We will revise.
W2.4: size of SDF:
- If the size refers to the resolution of the SDF, our approach directly calculates the distance from points to the object’s surface, theoretically providing infinite precision. If a different aspect of size was intended, we would appreciate further clarification.
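For concreteness, a minimal sketch of how such a point-to-surface query could be performed with an off-the-shelf mesh library (trimesh here; the file name, point sampling, and variable names are illustrative, not our actual code):

```python
import numpy as np
import trimesh

# Load an object mesh (e.g., one of the BEHAVE object scans); the path is illustrative.
obj_mesh = trimesh.load("box.obj", force="mesh")

# Query points, e.g., sampled human (SMPL) vertices near the object.
query_points = np.random.uniform(-0.5, 0.5, size=(128, 3))

# Signed distance from each query point to the mesh surface; computed directly from the
# mesh rather than from a voxelized SDF grid, so there is no fixed resolution.
sdf_values = trimesh.proximity.signed_distance(obj_mesh, query_points)

# The closest surface points also yield the vertex-to-object surface vectors used as features.
closest_pts, dists, _ = trimesh.proximity.closest_point(obj_mesh, query_points)
vertex_to_surface = closest_pts - query_points
```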
W2.5: number of vertices:
- It depends on the specific objects used. For example, for the BEHAVE dataset, we use their provided fine-grained objects, each containing over 10,000 vertices, to calculate the SDF.
W3: Motivation for control based on contact vertices
- The motivation for our quantitative experiment in Section 4.2 stems from the need to validate a key contribution of our work: as outlined in the paper (L64-68, L184-191), the vertex-based representation ensures that the world model focuses on critical contact regions. This enables the network to predict how object motion is influenced by interactions at these vertices, reducing the risk of overfitting to specific object shapes or detailed body part movements. By emphasizing high-level concepts (the fundamental principle that human-applied force on an object leads to acceleration, akin to Newton's law), the network learns more generalizable patterns, enhancing its performance in HOI modeling.
- To substantiate this claim, we compared our method against various control conditions, including a raw control where all available information is fed into the network. The results clearly demonstrate that our approach is superior.
W4: vertex-based control
- Yes, we appreciate the reviewer for pointing this out. The vertex-based control refers to our method where only the contacting vertices are used for control. We will clarify this definition in the revision.
W2.6: object shape encoding:
- The object’s shape is inherently encoded using the human vertex-to-object surface distance, as described in L204-206. This information is then processed through dynamics blocks (MLPs) as outlined in L208 and L210, and further integrated with attention layers (Line 212) for interaction modeling.
We put the response to W5 in the global response of rebuttal because of the character limit.
Thanks to the authors for their responses, but there are still some concerns that remain unresolved.
W1: novelty
- About High-Level Planning. Firstly, the description "[102], which primarily utilizes LLMs only for understanding contact parts" is incomplete; [102] also predicts object size. Secondly, Figure 6 is confusing. For Figure 6 (a), a fair comparison would show the generations of different methods under the same input, and the current comparison is confusing. For Figure 6 (b), the difference between the red and green dots is not significant; the claimed reduction of the distributional gap (L306) cannot be observed.
- About the world model. What is the full state of the interaction?
W2.3: no need for pair data: L180 is still confusing. Which dataset is used to train the world model, if it does not need a paired dataset?
W2.6: object shape encoding: Even with the new description, it is still hard to understand how object shapes are encoded. The writing should be improved.
W5: handling complex and long sequences: For long sequences, why interpolate the sequences to 30 fps? How is the interpolation done?
In short, the writing of the paper is poor, resulting in some confusing descriptions. In the rebuttal, the authors keep pointing to descriptions that are already in the paper, but those descriptions are themselves confusing. I hope the authors will improve the writing to clarify the details of the method. Also, the novelty of the method is limited, so I keep my score, borderline reject.
Thank you to the reviewer for the additional feedback and suggestions. We will incorporate these suggestions and revise our manuscript accordingly. Below, we address the remaining concerns:
W1: [102]
In [102], the input to the LLM includes both HOI interaction descriptions (represented as action and object category labels) and human and object sizes measured in the image space; the output includes both contact reasoning and the real scale of the human and object. In contrast, we only provide the LLM with textual interaction descriptions (free-form text instead of category labels) as input – thus, in the rebuttal, we focused solely on the LLM’s reasoning capability based on the HOI textual descriptions. In comparing the LLM’s reasoning on interaction descriptions, we address not only contact reasoning (L136) but also object categorization (L134) and, importantly, distributional shift, where the free-form HOI description falls outside the distribution of the text-to-motion model (L137). We will provide a more detailed discussion of our differences with [102] in the revision, following the reviewer’s suggestion.
W1: Figure 6(a)
- We would like to clarify that Figure 6(a) ablates the effectiveness of high-level planning. Therefore, in this setup, we use a single model – the text-to-motion model, MotionGPT. We compare two types of inputs: the raw text description ("w/o planning") and the rephrased text generated by the LLM ("w/ planning"), which is stated in L300-302.
- Figure 6(a) provides an example for comparison. As noted in the figure caption, the text on the left side, “Someone can be seen sitting on a yogaball,” represents the raw description. Below this text is the motion sequence synthesized by MotionGPT based on this raw description. On the right side, the text “A person is seated on an object” represents the text rephrased by high-level planning from the original description (“Someone can be seen sitting on a yogaball”). Below this rephrased text is the motion sequence synthesized by MotionGPT using the rephrased input. For both mesh sequences, the transition in color from gray to blue indicates the progression of the time series.
- The comparison was fair: the rephrased text, generated through LLM high-level planning, does not need to be exactly the same as the raw description. Instead, it aligns more closely with the style of text descriptions used to train MotionGPT. Consequently, the motion sequence synthesized by MotionGPT from the rephrased text tends to better match the intent of the raw description (L302-304).
W1: Figure 6(b) & L306
- As stated in the rebuttal, “we evaluate the significance of our design in Figure 6 and L300-309.” Therefore, the claim that our method “reduces the distributional gap (L306)” should be interpreted as being supported by the collective evaluations in Figure 6(a), Figure 6(b), and L309, which include both visualizations and quantitative measurements.
- Specifically, Figure 6(b) qualitatively visualizes the CLIP features, while L309 provides quantitative evidence by comparing the CLIP differences between the raw descriptions and in-distribution text, as well as between the rephrased text and in-distribution text. As stated in L309: “The text processed by the planning shows greater similarity to the in-distribution text from HumanML3D, with an average cosine similarity of 0.932 compared to 0.913 from the raw annotation.” This evidence supports the claim that our approach “reduces the distributional gap (L306).”
- To provide further evidence, we tested our method on out-of-distribution text. We selected examples with an average cosine similarity to in-distribution text of less than 0.85, resulting in an overall average of 0.838. Our high-level planning successfully rephrased these texts, increasing their average similarity to 0.927. For instance, in Figure 6(a), the text 'Someone can be seen sitting on a yoga ball' has a cosine similarity of 0.874 to the closest in-distribution text, whereas the rephrased text by high-level planning, “A person is seated on an object,” achieves a similarity of 0.958 to the closest in-distribution text. We will incorporate this additional evidence to update L309 and the caption of Figure 6.
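For reference, the similarity numbers above are CLIP text-feature cosine similarities. A minimal sketch of this computation (using the public CLIP ViT-B/32 text encoder as an illustrative choice; the in-distribution caption below is also illustrative, whereas in practice we compare against the closest caption in HumanML3D):

```python
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

raw = "Someone can be seen sitting on a yogaball."
rephrased = "A person is seated on an object."   # output of high-level planning
in_dist = "A person sits down on an object."     # illustrative HumanML3D-style caption

with torch.no_grad():
    feats = model.encode_text(clip.tokenize([raw, rephrased, in_dist]).to(device))
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize the text features

sim_raw = (feats[0] @ feats[2]).item()      # raw description vs. in-distribution caption
sim_planned = (feats[1] @ feats[2]).item()  # rephrased text vs. in-distribution caption
print(f"w/o planning: {sim_raw:.3f}, w/ planning: {sim_planned:.3f}")
```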
Thanks for the response about Figure 6. My concern about Figure 6(a) is addressed. There should be a caption in Figure 6 for the text, in order to improve readability. However, Figure 6(b) is still confusing. Despite the quantitative experiments (L309), Figure 6(b) still does not show the gap reduction qualitatively. If not, what is its purpose?
W1: Full state
- In the rebuttal, we clarified that “unlike existing work that typically encodes the full state of the interaction (as shown in references [43, 83, 108] in the paper)...” The “full state” in existing work refers to the encoding of human joint and object motion with object geometry encoding.
- In contrast, our method introduces a novel approach by focusing on modeling contact vertices. As mentioned in our rebuttal to Reviewer Jz5F, our inputs to the world model include vertex-based control signals provided to the conditional block (L208), with the object's past motion directly input to the unconditional dynamics block (L206). We believe this strategy is novel in the field. More specifically, the control signals include the human vertex motion (including past and future, in L193) and its features: 1) vertex coordinates in T-pose; 2) vertex-to-object surface vectors; and 3) the vertex's velocity relative to its nearest object vertex, as described in L203-L206. Importantly, we only use contacting vertices instead of all vertices.
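To make the construction of the control signal concrete, a simplified sketch of how these per-vertex features could be assembled (function and variable names are illustrative placeholders, not our actual code):

```python
import numpy as np

def build_vertex_control(human_verts, human_verts_tpose, human_vel,
                         obj_closest_pts, obj_closest_vel, contact_mask):
    """Assemble per-vertex control features, keeping only the contacting vertices.

    human_verts       (V, 3): posed human vertex positions at the current frame
    human_verts_tpose (V, 3): the same vertices in the canonical T-pose
    human_vel         (V, 3): human vertex velocities
    obj_closest_pts   (V, 3): closest object-surface point for each human vertex
    obj_closest_vel   (V, 3): velocity of that closest object point
    contact_mask      (V,)  : True for vertices considered to be in contact
    """
    idx = np.where(contact_mask)[0]
    feats = np.concatenate([
        human_verts_tpose[idx],                   # 1) vertex coordinates in T-pose
        obj_closest_pts[idx] - human_verts[idx],  # 2) vertex-to-object surface vectors
        human_vel[idx] - obj_closest_vel[idx],    # 3) velocity relative to the nearest object point
    ], axis=-1)                                   # -> (N_contact, 9)
    return idx, feats
```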
W2.3: which dataset is used
We mentioned in L68 that the BEHAVE dataset is used to train the world model. We will revise L180 to make it clearer. We would like to clarify that no text-HOI paired data is used; however, BEHAVE, as a pure HOI dataset without text annotations, is used.
W2.6: object shape encoding
- We would like to clarify that the object shape is encoded implicitly in our approach. As detailed earlier in “W1: Full state”, the input to the world model includes the trajectories of human vertices (depicted as small red spheres in the top-right of Figure 2) along with vertex-to-object surface vectors. By adding the vertex-to-object surface vectors to human vertices, the object vertices (shown as small blue spheres in the top-right of Figure 2) can be inferred. This is why we describe the object geometry information as being implicitly encoded. The network of the world model does not receive this information directly, but it can learn to combine these features to derive the object geometry as needed.
- In addition, our insight is that providing partial information with locality and sparsity is more effective than using the complete object geometry encoding. By focusing on critical contact regions through our vertex-based representation, the world model can more accurately predict how object motion is influenced by interactions at these key vertices. This approach minimizes the risk of overfitting to specific object geometries. We welcome further discussion with the reviewer if there are any additional questions or if further clarification is needed.
W5: handling complex and long sequences
As mentioned in L236, the BEHAVE dataset operates at 30 Hz, and our world model, trained on this dataset, is designed to handle this framerate. However, our text-to-motion model runs at 20 Hz as it is trained on the HumanML3D dataset. To ensure compatibility, we use spherical linear interpolation (Slerp) after converting HumanML3D representation to SMPL and then calculate the vertices, allowing the vertex motion to align with the speed that the world model can handle.
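As an illustration of this resampling step, a minimal sketch of Slerp upsampling for one joint's SMPL rotations from 20 Hz to 30 Hz (using SciPy, assuming axis-angle inputs; names are illustrative):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R, Slerp

def upsample_joint_rotations(axis_angle_20hz, src_fps=20, dst_fps=30):
    """Upsample one joint's axis-angle rotations (T, 3) from src_fps to dst_fps via Slerp."""
    T = axis_angle_20hz.shape[0]
    t_src = np.arange(T) / src_fps
    n_out = int(round(T * dst_fps / src_fps))        # e.g., 196 frames -> 294 frames
    t_dst = np.linspace(t_src[0], t_src[-1], n_out)  # stays inside the source time range

    slerp = Slerp(t_src, R.from_rotvec(axis_angle_20hz))
    return slerp(t_dst).as_rotvec()                  # (n_out, 3) rotations at ~dst_fps

# The root translation (a Euclidean quantity) can be upsampled with plain linear
# interpolation, e.g., np.interp applied per coordinate over the same time stamps.
```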
Writing and Novelty
We appreciate the reviewer’s thoughtful engagement and will revise unclear descriptions based on the suggestions.
However, we respectfully disagree with the reviewer's assessment of the novelty of our work. We have clearly outlined the novelty of our proposed components: the locality and sparsity design in the world model, which we believe has not been presented in existing work, and the decoupling of semantics from dynamics, which enables the entire pipeline to operate without requiring paired text-HOI data for training. Beyond these components, our overall motivation is to effectively reuse and leverage knowledge from large models without extensive fine-tuning. These contributions have been recognized positively by other reviewers, who acknowledge our approach as insightful (Reviewer Jz5F), promising (Reviewer R7nv), and sound (Reviewer Kjsy). In addition, our evaluations demonstrate that using existing methods naively (e.g., the naive application of text-to-motion models as seen in Figure 6) is ineffective, whereas our synergistic approach provides a robust solution. We welcome the reviewer to give more insight or explanation regarding the perceived limited novelty and are willing to address any specific concerns or questions.
Thanks for the response about other concerns. There are still some concerns that have not been addressed.
W1: the definition of "full state" should be added to the revision.
W5: handling complex and long sequences. I'm afraid I have to disagree that increasing from 20 Hz to 30 Hz using interpolation is long-sequence generation. Long-sequence motion generation should directly generate meaningful long motion, rather than relying on simple interpolation.
About writing, good writing should be coherent and natural, but as mentioned in the rebuttal, the explanations for the questionable sentences are often placed far away in another section. This is one of my biggest concerns about writing in this paper.
Thanks for the response of novelty. I think using the LLM to reparse and using existing methods for text-to-motion can be seen as preprocessing. Only the world model is novel. So I think the novelty is limited.
We thank the reviewer for the reply. We are glad that our response has clarified some of the reviewer’s concerns. Below, we address the remaining concerns:
Figure 6(b): We would like to clarify that Figure 6(b) is informative to show the distribution differences between raw text descriptions (“w/o planning”, denoted as green dots) and annotations processed through our high-level planning framework (“w/ planning”, denoted as red dots). Our aim is for the distribution of the red dots to be closer to the blue dots (which represent the in-distribution descriptions from the HumanML3D dataset) compared to the green dots. Upon zooming in on Figure 6(b) and examining the cluster in the middle, the reviewer will observe that the red dots largely overlap with the blue dots, while the green dots show minimal overlap with the blue cluster.
W5: handling complex and long sequences: We believe there is a misunderstanding from the reviewer. As clarified in the previous general rebuttal, "the capability for managing longer sequences arises from autoregressive generation, where the length of the sequence depends on the capacity of the text-to-motion model." Also, as clarified in the previous rebuttal, the interpolation is introduced in this process only to ensure compatibility between the world model (operating at 30 Hz) and the text-to-motion model (operating at 20 Hz).
Novelty:
- Our innovations, including the locality and sparsity design in our world model and the decoupling of semantics from dynamics, enable our framework to operate without paired text-HOI data. Additionally, our overarching motivation is to effectively reuse and leverage knowledge from large models without extensive fine-tuning. We believe this approach represents an important paradigm for future research, extending beyond the HOI synthesis task. Other reviewers have recognized these contributions as insightful (Reviewer Jz5F), promising (Reviewer R7nv), and sound (Reviewer Kjsy).
- While using LLMs to reparse for text-to-motion can be seen as a form of preprocessing, no existing work leverages LLMs to address distribution shifts in text-to-motion as we do. We have discussed the detailed difference with the most relevant work [102] in the previous response.
Writing: We sincerely thank the reviewer for the invaluable comments on improving our writing and presentation. We will incorporate all the comments into the revision. Meanwhile, we also found that Reviewer Jz5F rates the presentation as 3 (good) and Reviewer Kjsy rates the presentation as 4 (excellent). We are fully committed to refining our submission to ensure the highest quality in both content and presentation, and we are confident that any writing issues can be effectively resolved during the revision process.
Again, we thank the reviewer for the constructive feedback throughout the review.
InterDreamer is a framework for synthesizing Human-Object Interactions (HOI) from textual queries. The key feature of InterDreamer is the ability to train without paired text and HOI motion data. To achieve this, the work employs a multi-stage pipeline with an LLM operating as a high-level planner that defines the parameters used to infer the starting object pose and human motion sequences, which are brought together with the help of optimization in the next stage. The method is evaluated on the BEHAVE dataset, which is additionally labeled with text as part of this work.
Strengths
- The key feature of InterDreamer is undoubtedly its biggest strength: the lack of requirement for paired text-to-HOI data. This is a promising feature that in theory allows scaling HOI modeling without tedious data labeling by leveraging the advancements in LLMs.
- Another notable aspect is that the proposed planning can be adopted on top of other existing motion models improving their performance (as demonstrated in Table 2).
Weaknesses
- One of the key features of the proposed framework is a High-level Planning module that queries the LLM to extract necessary features for downstream modules. However, the presented description of the query protocol is sparse; similar works (e.g., SINC by Athanasiou, Petrovich, et al., ICCV'23) also employ an LLM within the framework and provide the full query template in the supplementary material (Section B) to ensure reproducibility.
- Contribution formulation in the Introduction (Lines 71-72): the considered task is text-to-HOI modeling, while training without paired data is a feature of the method rather than a task itself. Further in the text, in the Conclusions section, the work claims to introduce the novel task of 3D HOI generation from text (Line 319). Neither formulation reflects the actual contribution of the work.
- Evaluation is performed only on the self-labeled BEHAVE dataset; however, there exists at least one more dataset with text annotations, OMOMO [51], which is not used for evaluation.
Questions
- Appendix B.1 describes two approaches to Interaction retrieval. Which of the described techniques is used? How do they compare in terms of performance?
- Are the BEHAVE text annotations planned to be released upon acceptance?
- What is the reasoning behind omitting the OMOMO dataset from the evaluation?
- Lines 157-158: should it be as it is the final result after optimization?
Limitations
The work presents a substantial discussion of limitations.
We thank the reviewer for thoughtful and insightful comments. We address your concerns below:
W1: full template of the query
- We appreciate the reviewer’s suggestion and have included our detailed query log in Fig. 1 of the rebuttal PDF file. We will discuss related work on this and add the log to the revision.
W2: Contribution formulation
- We believe that learning text-guided HOI generation from data without direct text supervision is a contribution of our work. We would like to mention that Reviewer Kjsy recognized the novelty of this task, noting that "The authors introduce a novel task of synthesizing whole-body interactions with dynamic objects guided by textual commands, without relying on text-interaction pair data." We appreciate the reviewer's suggestions and apologize for any confusion. To clarify our statements, we will revise them as follows:
- For Lines 71-72: We address the task of synthesizing whole-body interactions with dynamic objects guided by textual commands, achieving this without the need for paired text-interaction data—a novel approach to the best of our knowledge.
- For Line 319: We focus on the task of text-guided 3D human-object interaction generation, aiming to accomplish this without relying on paired text-interaction data.
W3&Q3: evaluation on OMOMO
- Our method is designed to be dataset-agnostic. We chose the BEHAVE dataset for training the world model because it provides sufficient HOI dynamics to develop an effective dynamics model. We then tested our method on both the BEHAVE and CHAIRS datasets, the latter providing novel objects, which we believe is a sufficient evaluation.
- We appreciate the reviewer's suggestion. In response, we expanded our interaction planning and retrieval database to include the OMOMO dataset, similar to our approach with CHAIRS.
- Table 1 in the rebuttal PDF demonstrates that our interaction planning effectively bridges the distributional gap between the text-to-motion model and OMOMO text. Additionally, qualitative results of the full pipeline on OMOMO, provided in Fig. 2 of the rebuttal PDF, further illustrate the effectiveness of our entire pipeline on novel objects. We will incorporate more experiments into the revision.
Q1: two approaches to interaction retrieval
- For our primary experiments, we primarily utilized the handcrafted approach to implement the full pipeline, as it is straightforward and does not require training. To explore the feasibility of retrieval without relying on handcrafted rules, we also investigated a learning-based method for interaction retrieval. As shown in Figure C of the supplementary material, our qualitative evaluation demonstrates that the learning-based retrieval effectively captures diverse and realistic interactions, producing results comparable to the handcrafted method. Notably, the learning-based approach does not require constructing a database during inference, offering a more flexible solution.
Q2: text annotation release
- Yes, we plan to release the text annotations for the BEHAVE dataset upon acceptance.
Q4: typo revision
- We appreciate the reviewer for catching this typo. Yes, the final result should indeed be denoted as . We will correct this in the revision.
Dear Reviewer R7nv,
Thank you again for your time to review this paper. Could you please check if the authors' rebuttal has addressed your concerns at your earliest convenience? The deadline of the discussion period will end in about 24 hours. Thank you!
Best regards,
AC
Thanks to the authors for providing the detailed response. Additional experiments on the OMOMO dataset are helpful to evaluate the work comprehensively.
I acknowledge that HOI synthesis from text is a challenging and relatively unexplored task, and since the rebuttal addressed some of my concerns, I am inclined to revise the rating. However, I still have a couple of concerns. Namely, the visual quality of the results in the supplementary video (cases of floating objects and severe interpenetration between objects and humans) and the manuscript's clarity (aligned with the feedback from the Reviewer aGg2). I will discuss this further with the other reviewers and AC to make a final evaluation.
We are grateful for the reviewer’s acknowledgment of our response and the task we’ve addressed. Regarding the new concern about visual quality, we agree that our generated results contain minor levels of artifacts. However, as noted by Reviewer Jz5F, “the proposed framework can generate realistic HOI motions that align with the input text conditions.”
It is important to highlight that many of the observed artifacts in fact originate from the dataset itself rather than our method. Specifically, the BEHAVE dataset lacks hand motion, and the text-to-motion models we employed do not account for hand dexterity. As a result, the hand from the average MANO pose may penetrate the object or appear to be floating above it.
Because of this inherent dataset issue, such artifacts are also present in existing works [1,2,3,4,5]. Note that our setting handles free-form text input without paired text-HOI data, which is inherently more challenging than the supervised setting of these works [1,2,3,4,5], which rely on paired text-HOI data. Given this difference in difficulty, the fact that our results exhibit comparable or even fewer artifacts is particularly notable.
We provide examples below that demonstrate how similar artifacts are frequently observed in existing works, even in relatively simpler settings based on supervised training:
- [1] Floating and jittering are visible in the right example from 0:00-0:02, and penetration is noticeable at 2:18-2:23 in this video.
- [2] Penetration can be seen in the middle right of Figure 1 and the top right of Figure 6 in this paper, as well as in Figure 2 of the supplemental material.
- [3] Penetration can be found at 3:04-3:06 and 2:47-2:51 in this video.
- [4] Floating is evident in the first example, and penetration is present in the second example on this website.
- [5] Penetration can be seen in the top right of Figure 4, and floating is visible in the middle left of Figure 6 in this paper.
Note that the works referenced in [1,2,3] are published after the NeurIPS submission deadline (May 22, with CVPR being in June). References [4,5] are preprints available on arXiv.
Clarity: We are grateful to the reviewer for the suggestion on improving the manuscript’s clarity, as well as pointing out the confusion of contribution formulation in the review. We are confident that the writing improvements can be effectively managed in the revision process. We are fully committed to revising unclear descriptions and including any missing details based on all the reviewers’ suggestions to ensure the highest quality in writing and presentation.
We hope our clarifications have addressed the reviewer’s remaining concerns. Once again, we appreciate the reviewer’s engagement and thoughtful discussion.
[1] Diller et al. "Cg-hoi: Contact-guided 3d human-object interaction generation." CVPR 2024.
[2] Song et al. "HOIAnimator: Generating Text-prompt Human-object Animations using Novel Perceptive Diffusion Models." CVPR 2024.
[3] Li et al. "Controllable human-object interaction synthesis." ECCV 2024.
[4] Peng et al. "Hoi-diff: Text-driven synthesis of 3d human-object interactions using diffusion models." arXiv 2023.
[5] Wu et al. "Thor: Text to human-object interaction diffusion via relation intervention." arXiv 2024.
This work aims to address the text-conditioned human-object interaction (HOI) motion generation task. Unlike previous HOI generation approaches that rely on limited existing HOI datasets with text annotations for supervised learning, this work proposes a framework that decouples interaction semantics learning from interaction dynamics learning. This decoupling eliminates the need for large-scale HOI datasets with text annotations for training. The interaction semantics learning leverages an existing pretrained text-conditioned human motion generation model, while the interaction dynamics are optimized using a learned world model that predicts object states induced by human motions. The proposed framework can generate realistic HOI motions that align with the input text conditions.
Strengths
- HOI modeling remains a challenging and understudied task compared to recent developments in human motion modeling. One major bottleneck is the lack of large-scale, high-quality interaction data. This paper offers an insightful solution to overcome the scarcity of large-scale interaction datasets.
- Several key designs in this framework are reasonable, including (1) LLM-based high-level planning to reduce the distribution gap between input text instructions and the language used in the text-to-motion model's training dataset; (2) the world model trained on an interaction dataset for human pose and object pose optimisation.
- Comprehensive quantitative experimental results show the effectiveness of the proposed modules.
Weaknesses
- Without the object geometry information encoded as input, how well does the world model generalize to different objects? How does it perform when the object is outside the predefined list?
- Efficiency of the autoregressive optimisation: as presented in Figure 2, the optimisation of action and state is performed per step, which might result in inefficient inference, while the world model is trained to predict longer-horizon states. Could the authors elaborate more on this design choice?
- Regarding the world model training with N vertices sampled from the contact area: what if the human is not in contact with the object? For example, for the text instruction "the person throws away the ball", in most frames the ball is flying in the air without contacting the human. How is the object state forecast and optimised with the world model?
- From the visualisation results, there are no significant improvements over the baselines HOI-Diff and CG-HOI in terms of the realism of human-object interaction. Does this mainly result from imperfect world model training and optimisation, or from the lack of hand pose? Could the authors give some insights on the major challenges and bottlenecks in current HOI understanding?
Questions
See the weakness section, and I am happy to discuss with authors during the rebuttal phase and adjust the score accordingly.
Limitations
Yes, the authors have addressed the limitations.
We thank the reviewer for thoughtful and insightful comments. We address the concerns below:
W1: How the world model acts in novel objects
- The world model employs "contact vertices" as input, which include features derived from the object distance field. These features encompass the human vertex-to-object surface distance and the human vertex velocity relative to the nearest object vertex (L205-206), inherently including information related to the object's shape. This encoding is consistently applied to both training objects from the BEHAVE dataset and novel objects from the CHAIRS dataset.
- As discussed in (L64-68, L184-191), this vertex-based representation ensures that the world model concentrates on modeling the critical contact regions, and based on such contact modeling, the network learns to predict how object motion will be affected by these interactions. This approach prevents the model from overfitting to specific details, such as particular object shapes or body part motions. By focusing on high-level concepts (the principle that human-applied force on an object results in object acceleration, akin to Newton's law), the network can learn more generalizable patterns that apply across various contexts, e.g., different human actions and objects.
- Empirically, our approach is more effective than encoding the entire object shape into the world model. The improved performance is demonstrated by the results in Table 1 and Figure 8, comparing vertex control (our proposal) vs. raw control (full geometry and full human motion). Figure 5 further highlights the model's ability to generalize to unseen objects from the CHAIRS dataset.
W3: World model for non-contacting objects
- How the network accepts input without a contact condition: Our network can process inputs without contact conditions by adopting an approach similar to ControlNet (Zhang et al., Adding conditional control to text-to-image diffusion models, ICCV 2023). The network comprises two components: one (L208) that operates without contact vertex conditions, applicable in scenarios where no contact occurs, and another (L210) that, akin to the control components in ControlNet, incorporates contact vertex conditions into the object trajectory when contact is present. When there is no contact, only the unconditional network is utilized (a structural sketch is provided after the next point).
- Why the network can learn to respond to input without contact conditions: The model is aware of past object motion and thus needs to learn how human interaction affects the object's state. This includes understanding how objects follow contact positions or normals when contact is present, and how they move when there is no contact. With the no-contact object motion data provided by BEHAVE, the world model (more specifically, its unconditional component) learns to infer whether the object should free-fall based on its previous velocity or remain on the ground based on its height, as illustrated by the example at 02:27 in demo_2.mp4 ("a person throws a yoga ball towards the ground") in the supplementary material.
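To make this two-branch structure concrete, below is a simplified PyTorch sketch reflecting our interpretation of the description above; the layer sizes, feature dimensions, and names are illustrative rather than the exact architecture:

```python
import torch
import torch.nn as nn

class WorldModelSketch(nn.Module):
    """Predict future object motion from past object motion, optionally modulated by
    contact-vertex control signals (ControlNet-style conditional branch)."""

    def __init__(self, obj_dim=12, ctrl_dim=9, hidden=256, horizon=16):
        super().__init__()
        # Unconditional dynamics block: past object motion -> future object motion.
        self.uncond = nn.Sequential(nn.Linear(obj_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        # Conditional control block: per-contact-vertex features -> a residual signal.
        self.cond = nn.Sequential(nn.Linear(ctrl_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.head = nn.Linear(hidden, obj_dim * horizon)
        self.horizon, self.obj_dim = horizon, obj_dim

    def forward(self, obj_past, ctrl=None):
        # obj_past: (B, T_past, obj_dim); ctrl: (B, N_contact, ctrl_dim) or None.
        h = self.uncond(obj_past)                     # always runs, even without contact
        if ctrl is not None and ctrl.shape[1] > 0:    # contact present: inject its influence
            c = self.cond(ctrl)
            h = h + self.attn(h, c, c)[0]
        out = self.head(h[:, -1])                     # predict from the most recent state
        return out.view(-1, self.horizon, self.obj_dim)
```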
W2: Efficiency of the autoregressive optimization and reason for long-term prediction
- As mentioned in the supplementary material (L693), optimization during autoregressive generation is performed selectively and only when the loss exceeds a certain threshold. In the Fig. 2 caption, we will clarify that while the overall framework is autoregressive, optimization is applied sparingly. This approach minimizes unnecessary computations, thereby maintaining computational efficiency.
- The rationale behind training the world model on longer sequences is its effectiveness in capturing temporal dependencies. As noted in L197 of the main paper, we reference "Chi et al., Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, RSS 2023," which similarly found that predicting over a longer horizon promotes smoother and more coherent sequences for autoregressive generation, by planning actions ahead of time. However, longer predictions can increase the risk of error accumulation. To address this, we use only the early stage of long-horizon predictions, optimize them (while leaving some later parts unused), and feed these optimized predictions into the next generation round. This strategy balances the advantages of long-term prediction with the need for accuracy and computational efficiency.
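For reference, a simplified sketch of this receding-horizon loop with threshold-gated optimization (the helper callables, execution length, and threshold value are illustrative placeholders, not our actual implementation):

```python
from typing import Callable, List, Sequence

def rollout_hoi(
    predict_horizon: Callable[[Sequence, object], List],  # world model: (past states, control) -> future states
    make_control: Callable[[int, object], object],        # contact-vertex features at step t
    step_loss: Callable[[List], float],                    # contact / penetration loss on executed steps
    refine: Callable[[List], List],                        # optimization-based refinement
    init_state: object,
    n_steps: int,
    exec_len: int = 4,
    loss_threshold: float = 0.05,
) -> List:
    """Predict a long horizon each round, execute only its first `exec_len` steps,
    and run the (expensive) optimization only when the loss is high."""
    states: List = [init_state]
    t = 0
    while t < n_steps:
        ctrl = make_control(t, states[-1])
        horizon = predict_horizon(states, ctrl)   # long-horizon prediction
        executed = horizon[:exec_len]             # later steps are discarded
        if step_loss(executed) > loss_threshold:  # optimize sparingly
            executed = refine(executed)
        states.extend(executed)
        t += exec_len
    return states
```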
W4: Comparison with the baselines
- We would like to emphasize that our model is trained without direct text supervision, unlike the baselines that rely on it. Despite this, our model outperforms the baselines in text-interaction alignment to some extent, generating interactions that more accurately reflect the text instructions, even for long and complex descriptions, as demonstrated in the demo videos. Furthermore, our method achieves better realism than the baselines, with less penetration and jittering due to the optimization.
- To further enhance realism, we acknowledge that integrating a hand model trained on hand-specific datasets (as full-body datasets like BEHAVE do not include hand poses) could address some of the limitations observed in our results. We agree with the reviewer that this addition could improve the fidelity of human-object interactions, which we leave as interesting future work.
W5: Major bottlenecks presented in the current HOI generation
- As discussed in the limitations section of our paper, one of the major challenges in human-object interaction (HOI) understanding is achieving physics realism, which requires a more advanced dynamics model or incorporating simulation. Another significant bottleneck is the lack of detailed hand pose representation in conjunction with full-body interaction in many datasets. This limitation hinders the accuracy of modeling interactions that involve fine motor skills and detailed object manipulation. We welcome the opportunity to discuss these challenges further with the reviewer.
I greatly appreciate the detailed responses and clarifications from the authors, and most of my questions have been addressed. It is an advantage of this work that no paired text-motion dataset is needed to supervise the training, although the realism of the HOI motions is not much improved compared with previous work. The proposed world-model-based optimization sounds reasonable and should be helpful for optimizing interaction realism, but it is probably also challenging to learn an accurate, generalizable dynamics model for human-object interaction.
Ultimately, though the qualitative results presented in this work do not show significant improvement over previous works, I still appreciate the efforts made by the authors in exploring a human-object interaction dynamics model to improve the realism of generated HOI motions. I would like to keep my original score of borderline accept, and I will wait for the discussion period with the AC and other reviewers to make a final evaluation.
Dear Reviewer Jz5F,
Thank you again for your time to review this paper. Could you please check if the authors' rebuttal has addressed your concerns at your earliest convenience? The deadline of the discussion period will end in about 24 hours. Thank you!
Best regards,
AC
We greatly appreciate the reviewer’s thoughtful comments and acknowledgment of our work. We hope that our contributions, particularly the decoupling strategy that eliminates the need for paired text-HOI datasets in training and the integration of dynamics models with optimization, will inspire future research, potentially extending to tasks beyond HOI synthesis.
We are also encouraged that the reviewer recognizes our efforts to enhance the realism of HOI, especially given the challenges posed by object motion not being directly correlated with text, which complicates maintaining realistic interactions in the dynamics model. We believe that integrating a stronger dynamics model, such as a physics simulator, in future research would further improve realism.
Moreover, we would like to emphasize several aspects related to the synthesis realism:
- It is important to highlight that many of the observed artifacts in fact originate from the dataset itself rather than our method. Specifically, the BEHAVE dataset lacks hand motion, and the text-to-motion models we employed do not account for hand dexterity. As a result, the hand from the average MANO pose may penetrate the object or appear to be floating above it.
- Because of this inherent dataset issue, such artifacts are also present in existing works [1,2,3,4,5]. Note that our setting handles free-form text input without paired text-HOI data, which is inherently more challenging than the supervised setting of these works [1,2,3,4,5], which rely on paired text-HOI data. Given this difference in difficulty, the fact that our results exhibit comparable or even fewer artifacts is particularly notable. Note that the works referenced in [1,2,3] were published after the NeurIPS submission deadline (May 22, with CVPR being in June). References [4,5] are preprints available on arXiv.
Once again, we appreciate the reviewer’s engagement and thoughtful discussion.
[1] Diller et al. "Cg-hoi: Contact-guided 3d human-object interaction generation." CVPR 2024.
[2] Song et al. "HOIAnimator: Generating Text-prompt Human-object Animations using Novel Perceptive Diffusion Models." CVPR 2024.
[3] Li et al. "Controllable human-object interaction synthesis." ECCV 2024.
[4] Peng et al. "Hoi-diff: Text-driven synthesis of 3d human-object interactions using diffusion models." arXiv 2023.
[5] Wu et al. "Thor: Text to human-object interaction diffusion via relation intervention." arXiv 2024.
We thank the reviewers for their constructive comments. We appreciate the recognition that our task is challenging (Reviewer Jz5F) and novel (Reviewer Kjsy), and that our solution is insightful (Reviewer Jz5F), sound (Reviewer Kjsy), and promising, particularly in its potential to scale HOI modeling without tedious data labeling (Reviewer R7nv) and to reduce the dependency on training data (Reviewer aGg2). Additionally, we are excited that our experiments are regarded as comprehensive (Reviewer Jz5F) and good (Reviewer Kjsy), with figures that intuitively demonstrate the generated human-object motions (Reviewer aGg2) and clear visuals (Reviewer Kjsy).
We will carefully revise and incorporate all suggestions in the revision. We address specific concerns in separate, individual responses. We primarily clarify how our model operates, particularly the details and novelty of the world model, and add additional experiments on the OMOMO dataset. The complete log for high-level planning and experiments on the OMOMO dataset is detailed in the PDF.
Due to the 6000-character limit, we include our response to Reviewer aGg2 on the fifth weakness, regarding the handling of complex and long sequences, in the global response.
W5: handling complex and long sequences:
- The capability for managing longer sequences arises from autoregressive generation, where the length of the sequence depends on the capacity of the text-to-motion model. For instance, MotionGPT, one of the text-to-motion models that we evaluate in the paper, can generate sequences of up to 196 frames at 20 fps. We further interpolate these sequences to 30 fps, resulting in the longest sequence in Figure 4 reaching 294 frames.
- In terms of handling complex interactions, our method excels due to two key factors:
- The LLM’s reasoning improves the text-to-motion model's ability to handle complex descriptions, leading to more intricate motion sequences.
- Our dynamics model adeptly handles complex scenarios, such as human motions that differ significantly from the BEHAVE training set. This capability arises because, unlike human motion representations, the localized and canonicalized contact vertex representation remains consistent even for out-of-distribution human motion, allowing the network to effectively handle and generalize across complex conditions.
Dear Reviewers,
Thank you very much again for your valuable service to the NeurIPS community.
As the authors have provided detailed responses, it would be great if you could check them and see if your concerns have been addressed if you haven't done so. Your prompt feedback would provide an opportunity for the authors to offer additional clarifications if needed.
Best regards,
AC
There were extensive discussions between the authors and reviewers. After the rebuttal, most of the major concerns have been addressed. The remaining issues are mainly about the presentation of the paper and the quality of the visual results. On the other hand, the proposed approach has its own novelty, recognized by the reviewers, in that it does not require paired text-human-motion data for training, which may inspire future work. As text-to-human-motion generation is a relatively under-explored topic, expecting very high-quality visual results may overshadow and undervalue the emergence of new ideas in this direction.
Overall, the AC found that the positive points slightly outweigh the negative side. The response to the ethics reviews is also satisfactory. The AC thus recommends accepting the paper as a Poster.
The authors are highly encouraged to incorporate the new results and revise the presentation of the paper following reviewers' comments.