PaperHub
6.6 / 10 · Spotlight · 5 reviewers
Scores: 6, 8, 3, 8, 8 (min 3, max 8, std 2.0)
Confidence: 4.2 · Correctness: 3.2 · Contribution: 3.2 · Presentation: 3.4
ICLR 2025

DartControl: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control

OpenReview · PDF
Submitted: 2024-09-18 · Updated: 2025-04-28
TL;DR

We present DartControl, a method for high-quality, real-time motion generation from streaming text inputs. By incorporating latent space control, our method further enables diverse motion generation tasks requiring spatial control.

Abstract

Keywords
Human Motion Generation

Reviews and Discussion

Review
Rating: 6

The authors present a model that can generate long motions controlled by sequences of text prompts, in an online and real-time setting. This model can additionally follow spatial control such as goal reaching. The model is based on learning a compact latent space that encapsulates motion primitives. A diffusion model is trained using this latent space, conditioned on motion-segment prefix and a text prompt. The paper presents two ways to achieve spatial control: latent noise optimization and reinforcement learning.

The paper is well-written and achieves high-quality results. However, its novelty is weak, as it combines existing techniques. I tend to accept it as it combines these existing techniques in a novel and non-trivial way, to achieve SOTA results.

Strengths

  • Good exposition
  • Qualitative results outperform prior art
  • Quantitative results are either comparable to prior art (t2m) or outperform it (spatial tasks)
  • Novel composition of known methods

Weaknesses

  • Weak novelty: all the techniques mentioned are already known: autoregressive generation, latent motion space usage, RL predicting z as an action space, optimization-based control, ...
  • Qualitative results contain many foot sliding and floating artifacts.
  • RL-based control:
    • Should be compared to more than one method. In particular, there should be comparisons on tasks such as running and hopping on the left leg. It could be compared to data-driven methods such as the ones used for optimization-based control, GMD (Karunratanakul et al. 2023), or the recent CLoSD (Tevet et al. 2024).
    • Skate and floor distance are used for both reward calculation and metric calculation. Testing a model on criteria that it was optimized for raises concerns.
  • Some points still need clarification. See the "Questions" section.
  • Several typos:
    • L291: "teacle"
    • L917: "w=5"

Questions

Questions and Requests:

  • In your model description, please cite works relevant to your techniques (e.g., cite MLD when you describe your latent motion space). Although you cited such works in the RW section, mentioning them again adjacent to each technique would give a better context.
  • L251: By diffusion sampler, do you mean DDPM/DDIM, etc.? This is not clearly stated until line 295, where it is only vaguely mentioned. It should be explained earlier and more explicitly.
  • Please explain why you selected the deterministic DDIM.
  • Can the two methods for achieving spatial goals be used for the same tasks? E.g., textual control is limited in the RL method and is more feasible in the noise optimization one. Please discuss.
  • Compare optimization time vs. baseline methods. In particular, I expect optimization using DART to be faster compared to DNO due to the usage of latent space. Is that correct? Please show relevant results.
  • RL-based control:
    • Are you using the same policy for any prompt, or do you train three different policies, one for each action? (walk/run/hop). If the former is correct, do you support any prompt, or just walk/run/hop?
    • scene observation (L368) - "the relative floor height to the human pelvis": relative floor height: are you considering a flat surface only? if not, are you holding the floor height of the whole scene or just at a specific temporal step? pelvis height: at which frame?
  • Appendix A:
    • L761: Are the locations global or relative to the root?
    • Cite more works that use representation redundancy, e.g., HML3D.
  • L808: On the one hand, you argue that the SMPL format is a must for your representation. On the other hand, the representation learned by the network (L758) is agnostic to betas (body definition). Please explain.
  • L849: According to the Representation paragraph (L758), dt, dJ, and dR are already part of the representation, hence part of the reconstruction loss (L833). Then why use L_aux ?
  • Appendix G rewards: It seems as if for some rewards (e.g., r_dist) "smaller is better", while for others (e.g., r_succ) "higher is better". Please explain.

Details of Ethics Concerns

NA

Comment

Questions

  1. In-context citation of relevant works. Thanks for the suggestion of in-context citation. We cited the example of MLD (L195, Chen et al., 2023) when discussing the latent motion primitive space. We will explicitly add the method name MLD in the text and check other in-context citations.

  2. Diffusion sampler in L251: We use the DDPM and DDIM diffusion samplers. In L251 of the main paper, we referred to them as Ho et al. (2020) and Song et al. (2022). We will explicitly add the names DDPM and DDIM.

  3. Why the deterministic DDIM is needed. The gradient-descent optimization control requires calculating gradients with respect to the latent diffusion noise $z_T$. To ensure the gradient calculation is stable and reliable, a deterministic mapping is necessary. We implement this deterministic mapping using DDIM. This also provides deterministic action transitions for training reinforcement learning control policies.
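For intuition, here is a minimal PyTorch-style sketch of deterministic DDIM sampling and gradient-based latent-noise optimization; the denoiser, decoder, goal loss, and hyperparameters are hypothetical placeholders rather than the actual DartControl implementation.

```python
import torch

def ddim_denoise(denoiser, z_T, cond, alphas_cumprod):
    """Deterministic DDIM sampling (eta = 0): a fixed z_T always maps to the same
    clean latent, so gradients can flow from any loss on the output back to z_T."""
    z = z_T
    T = alphas_cumprod.shape[0]
    for t in reversed(range(1, T)):
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        eps = denoiser(z, t, cond)                                  # predicted noise
        x0_pred = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()         # predicted clean latent
        z = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps     # no stochastic term
    return z

def optimize_latent_noise(denoiser, decode, goal_loss, cond, alphas_cumprod,
                          latent_dim, steps=100, lr=0.05):
    """Gradient-based control: optimize z_T so the decoded motion satisfies a goal."""
    z_T = torch.randn(latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z_T], lr=lr)
    for _ in range(steps):
        motion = decode(ddim_denoise(denoiser, z_T, cond, alphas_cumprod))
        loss = goal_loss(motion)                                    # e.g., keyframe or goal distance
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z_T
```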

  4. Comparison of optimization and RL-based control. Both optimization and RL control can be applied to general text-conditioned control tasks.

    a. Optimization control is more general and plug-and-play. However, optimization can be slow.

    b. On the other hand, RL-based control is more efficient and capable of responding to dynamically changing control signals in real time, making it ideal for tasks requiring adaptive behavior. However, RL requires additional training for specialized tasks.

    c. The choice of optimization or RL control can be application-dependent.

  5. Optimization time comparison.

    a. Although using the latent space provides a lower-dimensional variable for optimization, the online rollout generation of our method requires autoregressively calling the diffusion model to generate motion primitives in temporal order. This leads to a linear time cost increase with the sequence length.

    b. For a 90-frame sequence, we report the following time costs: roughly 79 seconds for our method, 105 seconds for DNO, and 35 seconds for OmniControl.
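To illustrate the linear scaling, a hypothetical sketch of the autoregressive rollout (function and variable names are illustrative, not the released code): each primitive costs one full diffusion-sampler call, so generation time grows with the number of primitives.

```python
def autoregressive_rollout(sample_primitive, history, text_prompts):
    """Generate a long sequence primitive by primitive. Every iteration makes one
    full diffusion-sampler call, so wall-clock time grows linearly with the number
    of primitives, i.e., with the length of the generated sequence."""
    frames = list(history)
    for text in text_prompts:                          # one (possibly repeated) prompt per primitive
        primitive = sample_primitive(history, text)    # expensive: full diffusion sampling
        frames.extend(primitive)
        history = primitive[-len(history):]            # last frames seed the next primitive
    return frames
```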

  6. RL-based control policy. We use a single unified policy to handle multiple actions, including walking, running, and hopping. The RL control framework supports free-text prompts, provided they are seen in the training data of the diffusion model. In our implementation, we selected walk, run, and hop to train the controller as a proof of concept.

  7. Scene observation in RL policy.

    a. We assume a flat floor scene in this simple goal-reaching task. The relative floor height is calculated relative to the first frame pelvis in each motion primitive.

    b. The reinforcement learning-based control can incorporate adaptable scene observations tailored to specific tasks. For instance, it can be extended to height maps for terrain locomotion, while complex object interactions, such as grasping a bottle, can use more advanced spatial representations like point clouds and depth maps.

  8. Appendix A.

    a. The joint locations are relative to a local coordinate frame determined by the first frame of the body in each motion primitive, as described in Appendix A, Algorithm 3.

    b. Redundant motion representations are commonly used in human motion modeling, including but not limited to PFNN [1], HuMor [2], MotionVAE [3], and HML3D. We will add references to these mentioned works.

    [1] PFNN: Holden, Daniel, et al. "Phase-functioned neural networks for character control." ACM Transactions on Graphics (TOG) 36.4 (2017): 1-13.

    [2] HuMoR: Rempe, Davis, et al. "Humor: 3d human motion model for robust pose estimation." Proceedings of the IEEE/CVF international conference on computer vision. 2021.

    [3] MotionVAE: Ling, Hung Yu, et al. "Character controllers using motion VAEs." ACM Transactions on Graphics (TOG) 39.4 (2020): 40-1.

Comment
  1. Why SMPL data is needed and why we use the mean-shape body.

    a. We require the SMPL data to derive joint rotations that are animation-ready. In contrast, the HML3D data processing starts with joint locations only, and its joint rotations are computed using naive inverse kinematics, which is not directly suitable for animation. As a result, all works using the HML3D motion representation must undergo a time-consuming joint-to-body conversion process. In contrast, our motion representation integrates seamlessly with animation pipelines, eliminating the need for joint-to-body conversion.

    b. We set the body shape parameters, betas, to zero to obtain mean-shape bodies, ensuring consistency with prior works like HML3D and FlowMDM for a fair comparison. Thus, we do not explicitly include body shape in the representation. Our method is theoretically capable of encoding body shape into the history of motion frames, either implicitly through the joint skeleton or explicitly through the beta shape parameter. We also trained shape-dependent motion generation models in early experiments. However, the AMASS dataset lacks sufficient shape-diverse motion data to effectively learn shape-dependent motion generation, particularly for rare behaviors with motion capture data from only a single subject. For frequent actions with diverse shape data, such as walking, our trained shape-dependent model performed reasonably well. Conversely, for rarer motions with motion data from only one or two subjects, the shape-dependent generation quality significantly degraded.

  2. Appendix C, auxiliary loss $L_{aux}$.

    a. The auxiliary loss $L_{aux}$ encourages consistency between the predicted translation and rotation features (referred to as zero-order features here) and their corresponding predicted first-order features $(d\mathbf{t}, d\mathbf{R}, d\mathbf{J})$, which is not guaranteed in the model predictions.

    b. For instance, consider the translation features. The model output includes the predicted translation $\hat{\mathbf{t}}$ and the predicted first-order difference feature $\hat{d\mathbf{t}}$. Taking the first two frames as an example, it is not guaranteed that $\hat{d\mathbf{t}}[0] = \hat{\mathbf{t}}[1] - \hat{\mathbf{t}}[0]$. We use the auxiliary loss to encourage such zero-first order consistency in the prediction results. The term $\bar{d}$ denotes the first-order features recomputed from the predicted zero-order features.

    c. This consistency is distinct from the reconstruction loss, which compares the predicted motion features including the first-order features with the ground truth motion features.
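For concreteness, a minimal sketch of this consistency term for the translation features, assuming the model outputs per-frame translations together with their predicted frame-to-frame differences; tensor names are illustrative, and the same idea applies to the rotation and joint features.

```python
import torch.nn.functional as F

def aux_translation_consistency(t_hat, dt_hat):
    """t_hat:  predicted translations, shape (frames, 3)            (zero-order features)
    dt_hat: predicted translation differences, shape (frames-1, 3)  (first-order features)
    Penalize disagreement between the predicted differences and the differences
    recomputed from the predicted translations (the bar-d operation in the text)."""
    dt_recomputed = t_hat[1:] - t_hat[:-1]
    return F.l1_loss(dt_hat, dt_recomputed)
```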

  3. Appendix G distance reward. There is a typo in the definition of $r_{dist}$: it should be $D^{i-1} - D^{i}$ instead of $D^{i} - D^{i-1}$. The distance reward is positive when the current goal distance is smaller than the previous step's goal distance, indicating progress toward the goal. All rewards are designed so that higher is better. Thanks for spotting this typo; we will update it in the PDF.
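A tiny sketch of the corrected reward, with dist_prev and dist_curr standing in for the pelvis-to-goal distances $D^{i-1}$ and $D^{i}$ (names are illustrative):

```python
def distance_reward(dist_prev, dist_curr):
    """r_dist = D^{i-1} - D^{i}: positive when the agent moved closer to the goal,
    so, consistent with the other rewards, higher is better."""
    return dist_prev - dist_curr

# distance_reward(2.0, 1.7) -> 0.3 (progress); distance_reward(1.7, 2.0) -> -0.3 (regress)
```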

Comment

We sincerely thank the reviewer for acknowledging our work and providing detailed feedback. Below, we offer clarifications and discussions to address the reviewer’s questions and concerns.

  1. Foot sliding and floating artifacts. Physically implausible artifacts are an inherent limitation of kinematics-based methods. We add an experiment to show that combining our method with physics-based human motion tracking techniques can effectively achieve physically accurate motion without such artifacts, as shown in this anonymous 🔗link. We use the physics tracking method PHC [1] to track and refine a crawling sequence generated by our method. This integration results in more physically plausible outcomes, with improved joint-floor contact and the elimination of hand-floor penetration. We will add this experiment to the paper.

    [1]: Luo, Zhengyi, et al. "Perpetual humanoid control for real-time simulated avatars." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

  2. More comparisons for the text-conditioned goal-reaching task.

    a. GMD:

    i. Firstly, GMD is an offline trajectory control method that cannot perform online dynamic goal-reaching as our method and the included baselines can. In addition, GMD can only generate motion sequences with a maximum duration of 6 seconds and cannot reach locations beyond a threshold distance (approximately 5 meters) from the origin. As a result, GMD cannot be evaluated on our original test paths.

    ii. Nevertheless, we compare ours with GMD on a shortened path where all the waypoints are close enough to the origin, and the quantitative results are presented in the table below. GMD fails to consistently reach the final waypoint within the threshold distance of 0.3 meters. GMD also exhibits more foot skating than our method.

    | | Time (s) ↓ | Success rate ↑ | Skate (cm/s) ↓ | Floor distance (cm) ↓ |
    |---|---|---|---|---|
    | GMD walk | 6.00±0.00 | 0.72±0.11 | 3.87±0.74 | 2.59±0.20 |
    | Our walk | 9.03±0.09 | 1.00±0.00 | 2.21±0.31 | 2.34±0.10 |
    | GMD run | 6.00±0.00 | 0.56±0.12 | 5.58±1.05 | 2.67±0.14 |
    | Our run | 6.18±0.11 | 1.00±0.00 | 4.12±0.53 | 2.71±0.11 |
    | GMD hop | 6.00±0.00 | 0.55±0.12 | 3.90±1.05 | 4.29±0.20 |
    | Our hop | 7.65±0.60 | 1.00±0.00 | 2.47±0.53 | 5.05±0.22 |

    b. CLoSD is a concurrent work that went online in October. Its code release is still in progress and does not support sequential goal-reaching at this time. We are therefore unable to compare with CLoSD on waypoint reaching. We will add a discussion of CLoSD after their full code release.

  3. Use rewards as evaluation metrics. The rewards are intentionally designed by humans to evaluate the quality of trajectories. In our motion generation task, we use skate and floor distance rewards to promote realistic human-floor contact, with both rewards designed to align with human assessments of motion quality. It is common for reinforcement learning algorithms to directly utilize these rewards to assess the performance of a trained policy model [1, 2]. Using the reward function for evaluation does not inherently imply overfitting to this specific reward function. In reinforcement learning, policies are learned through interaction with the environment and must generalize to new episodes, rather than memorizing specific trajectories. To ensure our learned policy is robust, we evaluate our policy across multiple episodes and diverse environments different from training. This approach is consistent with standard benchmarks like OpenAI Gym, Atari, and MuJoCo, which commonly use cumulative rewards or environment-defined objectives for both optimization and evaluation.

    [1]: Schulman, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).

    [2]: Schulman, et al. "Trust Region Policy Optimization." arXiv preprint arXiv:1502.05477 (2015).
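To make the skate and floor-distance metrics above concrete, here is a hypothetical sketch of how such metrics can be computed directly from generated motion, independent of any particular policy; the up-axis convention, contact threshold, and joint layout are assumptions, not the paper's exact protocol.

```python
import numpy as np

def skate_metric(foot_pos, fps=30, contact_height=0.05):
    """foot_pos: (frames, 2, 3) foot joint positions in meters, z-up.
    Mean horizontal foot speed (cm/s) over frames where the foot is near the floor,
    i.e., frames where sliding should not occur."""
    vel = np.diff(foot_pos, axis=0) * fps                  # per-frame velocity, m/s
    horiz_speed = np.linalg.norm(vel[..., :2], axis=-1)    # ignore the vertical component
    in_contact = foot_pos[1:, :, 2] < contact_height       # near-floor frames
    return 100.0 * horiz_speed[in_contact].mean() if in_contact.any() else 0.0

def floor_distance_metric(foot_pos):
    """Mean absolute distance (cm) between the lower foot and the floor plane z = 0."""
    lowest = foot_pos[..., 2].min(axis=1)                  # height of the lower foot per frame
    return 100.0 * np.abs(lowest).mean()
```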

  4. Typos. "Teacle" should be "tackle," and the weight is 5. We will update the typos accordingly in the PDF.

Comment

We have updated the PDF to address the reviewer’s questions, with changes highlighted in orange for clarity. To ensure accuracy, we are providing updated line number references:

  • The typo in Line 291 of the initial version is now corrected at Line 294 in the updated version.
  • Citation of MLD, with the explicitly added method name, can be found at Lines 196 and 207.
  • Citation of diffusion samplers DDPM and DDIM, with explicitly added method names, is located at Line 254 in the updated version.

We would be happy to provide further clarifications if the reviewer has any additional questions.

Comment

I thank the authors for their thorough answers. My concerns were answered and I am keeping my score as is.

Review
Rating: 8

The paper presents a human motion diffusion model that allows for real-time generation with dynamic text conditioning, using an autoregressive latent-space model. The technical contributions relate to the specifics of the latent + autoregressive architecture and showing all that it is capable of. Much like for VAEs, it is further shown that the latent can also be used as an action, and therefore also used (a) for gradient-based motion sequence optimization, and (b) to develop RL policies for specific tasks. The time-precise text conditioning is enabled by (trained on) the BABEL dataset, which has frame-aligned text annotation, unlike the more generic motion-clip captioning of other common human motion datasets. The results are demonstrated in a supplementary video. These demonstrate high-quality real-time motions with dynamically changing text prompts, results for the various modes of generation (e.g., including RL), as well as various comparisons.

Strengths

Strengths:

  • The work demonstrates impressive real-time motion generation, with responsive, on-the-fly text conditioning; it provides the architectural design for these to work, as well as demonstrating the utility of the BABEL dataset in this context.
  • The architecture is demonstrated to be effective across three different types of usage: (1) autoregressive conditioning; (2) motion sequence optimization; (3) reinforcement learning
  • The work provides what are, to the best of my knowledge, fairly thorough ablations and comparisons. The Appendices and video provide a variety of thorough motion-quality evaluation and ablation studies.

Weaknesses

Weaknesses:

  • video: foot-ground contact cannot be observed because of the faint ambient-occlusion-based ground shadows, and thus the motion quality is difficult to judge qualitatively. Caveat: other papers on the same topic often suffer from a similar problem.
  • Relevant related work is missing, e.g., "AMD: Autoregressive Motion Diffusion" (AAAI 2024); "Interactive character control with auto-regressive motion diffusion models" (SIGGRAPH 2024).
  • References are often missing publication venue & other details
  • It would be interesting to understand the impact of different "motion primitive" durations, i.e., H+F, on the results. A wider window provides more context and therefore perhaps better quality, while a shorter window may allow for more flexible transitions between different types of motions. I acknowledge the discussion of the H=1, F=1 case in Appendix E.3.

Questions

The following is a mix of minor questions and feedback.

  • (a) the paper discusses the number of "frames" and "seconds" at various points, but to the best of my knowledge it never mentions the actual frame rates (fps or Hz) of the motions. Please do mention these numbers; they are vital to understanding the speeds. L429 mentions "frame rate exceeding 300", but this presumably refers to 300 generated frames per wall-clock second for the synthesis; the reader does not know whether the actual motion frames being generated should be played at 30 Hz, 60 Hz, etc. In fairness to the authors, other papers also fail to document animation frame rates, but perhaps as a community we can reverse this trend.
  • (b) work using RL, for either kinematic or physics-based motion generation, is by default autoregressive, and so may be worth further noting. The biggest difference is perhaps that these are usually not developed with text conditioning in mind.
  • (c) "DART represents human motion as a collection of motion primitives" There are a great many definitions of "motion primitive" that have been introduced over the past 30 years, and this paper introduces another. It might be more clear to simply refer to the "motion representation" used for DART, e.g., a short motion segment.
  • (d) High frequencies in the raw motion capture are described as being problematic. However, in practice, high frequencies are necessary for high-quality representations of motions, due to the rapid accelerations that are part of many motions.
  • (e) Figure 3: Can the character perceive the environment?
  • (f) The paper would be stronger if it documented examples of some of the limitations discussed in Section 5.
  • (g) Would the output be amenable to physics-based tracking, to produce higher-quality motions, e.g., effectively a physics-based "projection"?
Comment

Questions:

  1. High-frequency noise in our paper refers to the jitter and glitch artifacts in motion capture data, as shown by this example sequence (anonymous 🔗link). Despite AMASS being based on marker-based motion capture, such artifacts can still occur. Our motion primitive VAE helps address issues caused by such outlier artifacts in the data. We agree that high-frequency details are crucial for capturing the subtle nuances of human motion, particularly in scenarios involving intricate movements, such as gymnastics. We will refine the text to make this distinction clearer.

  2. Environment awareness.

    a. In optimization-based scene interaction synthesis in Figure 3, the 3D scenes play a crucial role in evaluating human-scene contact and collision. Represented as signed distance fields (SDF), these scenes are used to compute body-scene distances, which serve as the basis for deriving human-scene contact and collision metrics during the optimization process, as detailed in Appendix F. These contact and collision metrics encourage motion generation that adheres to the scene's physical constraints.

    b. Additionally, our proposed reinforcement learning-based control framework can incorporate adaptable scene observations tailored to specific tasks. For instance, a simple locomotion task on a flat plane may only require the floor height as an observation, while complex object interactions, such as grasping a bottle, can use more advanced spatial representations like point clouds and depth maps.

  3. Examples of limitations.

    a. We demonstrate an example of random action execution due to semantic ambiguity in this anonymous 🔗video. Given text prompts describing all actions in one sentence, our generated result shows random transitions among actions, disregarding the specified action orders. This issue arises from the semantic ambiguity of using global sequence-level text prompts to guide the generation of short, local motion primitives. The primitive model may generate any one of the actions in the sentence label that is locally feasible, leading to unpredictable transitions.

    b. However, we also show in the 🔗video that our method can generate motion results adhering to the intended action order by explicitly decomposing the sentence prompt into a sequence of individual action prompts. This explicit prompt decomposition could potentially be achieved through a hierarchical model design or by leveraging common-sense reasoning in language models, such as GPT-4.

  4. Name of primitive. In our work, we use the term segment to refer to the duration of one action in the text-conditioned temporal motion composition task (Section 4.1) and primitive to describe the more basic units that constitute these action segments. We aim to differentiate these two concepts with distinct terminology. We acknowledge that the term motion primitive has been used in various contexts with differing meanings, and we welcome suggestions for more precise and distinctive naming.

  5. References: Thank you for pointing out the omission of AMD. We will add discussions of AMD to the related works. Interactive Character Control with Auto-regressive Motion Diffusion Models is already cited as Shi et al. (L668). Additionally, we will fix the missing venue information and other details in the references in the updated PDF.

Comment

Thank you for the detailed and thoughtful replies. The new video rendering style is a considerable improvement for observing the motion quality.

re: primitive duration Thanks, these results make sense. I'm still genuinely curious as to what the authors think of the following qualitative interpretation of the impact of window size -- longer duration primitives provide higher quality, while shorter durations allow for faster & more flexible motion transitions. To be clear, I'm also fine if the authors believe that this speculative interpretation on my part is simply flawed.

I think that the work advances the state of the art in a meaningful way, and will have impact.

Comment

We sincerely thank the reviewer for acknowledging our work and providing insightful feedback. Below, we address the reviewer’s questions with detailed discussions and video visualizations.

  1. Combination with physics-based tracking to produce higher-quality motions. We add the experiment of combining our method with physics-based tracking to achieve more physically accurate motions, as shown in this anonymous 🔗link. We use the physics-based tracking method PHC [1] to refine a crawling sequence generated by our method. This integration results in more physically plausible outcomes, such as more accurate joint-floor contact and elimination of hand-floor penetration. We believe that combining kinematic methods with physics-based approaches is a promising direction. We will add this experiment to the paper.

    [1]: Luo, Zhengyi, et al. "Perpetual humanoid control for real-time simulated avatars." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

  2. Actual frame rate of the motions. The actual frame rate of the motions depends on the dataset, as detailed in Appendix B. Specifically, we use 30 fps for the BABEL dataset and 20 fps for the HumanML3D dataset, following conventions established in prior works. In the supplementary video, the in-between and joint trajectory control results are at 20 fps while other results are at 30 fps. We agree the motion frame rate is crucial for understanding the motions and will add fps labels in the video results.

  3. Visualization improvement. Thanks for the feedback about the limitations of our previous video visualization. We adjust the floor texture and lighting to better demonstrate foot-floor contacts. We will also add the motion frame rate labels in video results to help better understand the video results.

  4. Impacts of longer motion primitive duration. The primitive duration is a hyperparameter that determines whether the model learns the distribution of shorter and more localized motions, which can be composed into globally diverse sequences, or the distribution of longer and more global, sequence-level motions. The impact of a longer primitive duration is task-dependent.

    a. For the in-between task discussed in Section 4.2, the baseline method DNO can be considered as using a long primitive duration of 196 frames which results in inferior performance. This results in a more complex distribution compared to the 10-frame primitive distribution, which makes the optimization landscape more challenging and leads to motion results deviating from the intended text semantics. In this case, long primitive duration hinders the performance under novel test goal bodies.

    b. On the other hand, when the goal is to generate motion sequences that closely match the sequence-level distribution of the motion dataset, as evaluated in the HumanML3D benchmark, using a longer primitive duration of 60 frames achieves a better sequence similarity FID score of 0.336 compared to a shorter primitive duration of 10 frames with an FID score of 0.767 in our experiments. The optimal primitive duration depends on the application. We selected the presented primitive duration as a balance between motion quality, computational flexibility, and versatile control.

    c. For the text-conditioned motion in-between task, we add one ablation study using a longer primitive duration with a history length H of 2 and future length F of 16. This longer primitive duration results in a slightly worse segment similarity FID score (3.79->4.04) but better semantic alignment R-Prec score (0.62->0.66) compared to using primitive duration of H=2, F=8. We will add this ablation to Table 5.

Comment

Thank you for your prompt reply. We agree with your interpretation regarding the impact of window size on motion generation. Longer primitive durations enable the model to learn distributions of more comprehensive motions, resulting in smoother, higher-quality motions. Conversely, shorter durations provide less context but also fewer constraints, offering greater flexibility for transitioning between diverse actions.

We sincerely value your acknowledgment of our research.

Review
Rating: 3

This paper introduces DART, a novel method for real-time, text-conditioned human motion generation. While previous methods are limited to short and offline motion generation, DART uses a diffusion-based framework to produce continuous motion by learning a motion primitive space conditioned on the text and motion history. DART can also use latent space optimization or reinforcement learning to control the motion generation under different spatial constraints.

Strengths

  • DART has indeed achieved real-time continuous motion generation. Continuous motion generation is an important issue and deserves attention.
  • DART enables effective motion control, which is user-centric.
  • DART can achieve real-time motion generation, which is user-friendly.

Weaknesses

  • Due to the limitations of the methodology, training can only use frame-aligned text annotations, resulting in the model's performance on motion detail generation and generalization being significantly inferior to that of models trained with sequence-level annotated data, such as the HumanML3D dataset.

  • The experiments are insufficient, and the results are not solid enough. The MotionX dataset also provides frame-level data. Compared to the small-scale BABEL dataset, the authors lack demonstrations of training effectiveness on larger-scale datasets.

  • The motivation is unclear. Simply using a short motion generator combined with a motion prediction network can achieve high-quality motion generation while ensuring the quality of motion completion and spatial control over motion points. However, the authors did not conduct a comparison.

Questions

Given that current short motion generation methods (such as MoMask) can achieve excellent generalization, and motion prediction methods (such as HumanMAC) are also quite effective, a straightforward idea is to concatenate two separately generated motions. This approach ensures that the generalization and details of individual motions are outstanding, as single motion generation is trained on large-scale sequence-level annotated data (like MotionX and HumanML3D), while motion prediction is trained on long-duration motion datasets (like Human3.6M). If DART only supports frame-level annotated data, it means that the quality of individual motions will fall significantly behind current state-of-the-art methods, indicating that the motivation of the paper is unclear. Therefore, I suggest the authors:

  • Provide results from training with sequence-level annotated data, even though the paper mentions that the performance is not good.
  • Present performance results on the MotionX dataset, which also has frame-level annotated data but is more challenging than BABEL.
  • Combine advanced motion generation methods (such as MoMask) with motion prediction methods (such as HumanMAC), and then compare DART's performance in terms of individual motion generation quality and motion continuity. I anticipate that if DART does not address the issue of not being able to use sequence-level training data, the quality and generalization of the motions will be poorer than those of other methods.

[1] Chen, Ling-Hao, et al. "Humanmac: Masked motion completion for human motion prediction." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[2] Guo, Chuan, et al. "Generating diverse and natural 3d human motions from text." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[3] Guo, Chuan, et al. "Momask: Generative masked modeling of 3d human motions." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[4] Lin, Jing, et al. "Motion-x: A large-scale 3d expressive whole-body human motion dataset." Advances in Neural Information Processing Systems 36 (2024).

[5] Ionescu, Catalin, et al. "Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments." IEEE Transactions on Pattern Analysis and Machine Intelligence 36.7 (2013): 1325-1339.

[6] Punnakkal, Abhinanda R., et al. "BABEL: Bodies, action and behavior with english labels." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

Comment

We sincerely thank the reviewer for acknowledging our method's real-time generation capability and effective motion control. However, we respectfully disagree with some opinions regarding the datasets and works mentioned. We kindly ask the reviewer to review the facts we present about the mentioned datasets (BABEL, HumanML3D, and MotionX) and methods (HumanMAC), and consider updating their perspectives on the references and our work.

  1. Generate-and-concatenate solution. The reviewer suggested a solution of generating action segments separately and then concatenating them, specifically combining MoMask and HumanMAC. However, we provide experiment results demonstrating that this generate-and-concatenate approach fails to produce continuous motion sequences with smooth transitions between actions.

    a. Firstly, the motion prediction work HumanMAC has a fundamental limitation of ignoring the global translation of human motion (🔗video), which is a critical component of human motion and an indispensable part of existing quantitative evaluation frameworks for text-to-motion generation.

    b. Moreover, even without considering global translation, HumanMAC fails to generate smooth local action transitions, as shown in the 🔗video result. We use MoMask to generate two action segments: walking in a circle and waving hands, and then use HumanMAC to complete the intermediate transition frames between the end of the first segment and the start of the second. However, the video results reveal that the transition frames generated by HumanMAC are discontinuous with the initial frame in the second segment (waving hands), resulting in abrupt pose changes.

    c. Furthermore, the reviewer-proposed MoMask + HumanMAC combination cannot generate motions in real time in response to online user text prompts, unlike our method.

    d. Additionally, the idea of generating action segments separately and then concatenating them with transition frames is implemented in one of our baseline methods, DoubleTake, which shows inferior performance compared to our method, as shown in Table 1 and our supplementary video.

  To conclude, the reviewer’s claim that motion sequences with smooth action transitions can be achieved by simply generating action segments separately and then concatenating them (using HumanMAC) is not supported by our experimental results. This highlights the motivation of our paper: continuous motion generation with smooth action transitions is a critical and unsolved problem. Our paper presents a high-quality, real-time generation method that addresses this issue, with additional support for versatile spatial control.

 

Comment
  1. Results on HumanML3D. We present the generated motion results (🔗link) by our method trained on the HumanML3D dataset.

    a. Given text prompts describing a single action, our model generates high-quality motions, such as salsa dancing or squatting exercises.

    b. For text prompts comprising multiple actions occurring in order, we demonstrate that we can leverage GPT-4 to explicitly decompose the sentence into a list of action segment prompts to obtain motion results with the intended action order, mitigating the semantic ambiguity limitation discussed in section 5.

    c. We also quantitatively evaluate our model, trained on the HumanML3D dataset, using the HumanML3D test set, which includes both single-action and complex multi-action sentences. Notably, we do not leverage GPT-4 to explicitly decompose sentences into action prompts in this experiment. As mentioned in section 5, our autoregressive model built on short motion primitives faces challenges with semantic ambiguity when handling long text prompts describing multiple human actions in one sentence, leading to a higher FID score (0.767). Interestingly, by adjusting the history motion length (from 2 to 20 frames) and future prediction length (from 8 to 40 frames) and retraining the model to better accommodate long text sentences, our model achieves performance comparable to MDM and MLD. Specifically, the FID scores are: Ours: 0.336, MDM: 0.545, MLD: 0.431. For R-Precision, Ours: 0.728, MDM: 0.708, MLD: 0.762. These results suggest that our autoregressive model has the potential to handle long text inputs effectively. In the future, an interesting direction would be to explore hierarchical models capable of capturing both high-level human behavior semantics over extended temporal horizons and temporally precise human actions.

Comment
  1. MotionX dataset. The reviewer’s suggestion to train on the MotionX dataset for evaluating text-conditioned temporal motion composition is infeasible due to several issues with the MotionX dataset.

    a. Firstly, MotionX contains rule-based per-frame pose labels that lack action semantics and temporal continuity, making them unsuitable for evaluating motion generation with smooth action transitions. Below we show examples of two consecutive per-frame pose labels from the MotionX dataset:

    "The torso is vertical and the right forearm is parallel to the ground. Both hands are shoulder width apart and the feet are spread apart and the right leg is lying over the other while the right knee is bent, the right foot is to the front with the left leg behind the right."
    "Their right foot is spread far apart from their left foot and is to the front. Their right hand is close to their right knee and their right elbow is slightly bent with their knees rather bent and their left leg beneath their right leg. It is behind their right leg. Their torso is vertical while their hands are spread.”
    

    These per-frame pose labels are automatically generated based on heuristic rules proposed by PoseScript [1] 🔗(link), e.g., whether one joint angle is in a predefined range, and are then rephrased into sentences that are typically different for each frame. Therefore, these pose labels differ significantly from how humans describe actions in natural language. They capture only static pose semantics without dynamic action semantics, making them suitable for pose generation but not for motion generation. As a result, these pose labels are unsuitable for evaluating either individual action segment generation or the transitions between action segments.

    [1] Ginger, et al. "PoseScript: Linking 3D Human Poses and Natural Language." ECCV 2022

    b. Moreover, the motion data quality in MotionX falls significantly below the standards required for text-to-motion applications. See visualizations here 🔗link. The motion data in MotionX often exhibits significant artifacts such as jitter, glitches, foot skating, incorrect body orientation, and infeasible poses. Training models on such low-quality data would lead to learning a manifold of implausible human motions, resulting in poor-quality generations. We appreciate MotionX’s efforts to expand motion data using monocular motion estimation from videos; however, the motion data estimated by MotionX falls well below the quality standard required for text-conditioned motion generation.

 

  1. BABEL, HumanML3D, and Human3.6M datasets discussion.

    a. Both BABEL and HumanML3D use motion sequence data from the AMASS dataset. While HumanML3D only provides coarse sentence-level descriptions for sequences, BABEL offers fine-grained, frame-aligned action annotations for sequences containing multiple actions. Moreover, in terms of motion temporal complexity, BABEL contains minutes-long motion sequences with a maximum duration of 263 seconds, while HumanML3D has a maximum motion duration of 10 seconds.

    b. Furthermore, BABEL has a larger dataset scale than both HumanML3D and Human3.6M.

    | | Hours | Annotation granularity | Maximum motion duration |
    |---|---|---|---|
    | BABEL | 37.5 | Frame-aligned actions | 263 s |
    | HumanML3D | 28.6 | Global sequence | 10 s |
    | Human3.6M | 5.0 | - | - |

 

Comment

As the rebuttal discussion period approaches its end, we would like to kindly remind Reviewer WRyH to review our responses to all the questions raised. Additionally, we have updated the PDF to incorporate feedback from the reviewers. The updates include:

  1. Correct typos, add additional references (AMD, Motion-X, HumanMAC), include the method names of cited works to explicitly clarify the citations, and provide missing publication venue details.
  2. Refine texts to clarify better the role of 3D scenes in the scene-interaction synthesis task in Section 3.3.
  3. Add more detailed explanation for the auxiliary loss in Appendix C.
  4. Add discussion and demonstration of integrating our method with physics-based motion tracking to refine physically implausible artifacts like skating and penetration in Appendix H.
  5. Add discussions on open-vocabulary motion generation in Appendix I.
  6. Add additional comparison with baseline GMD for the text-conditioned goal-reaching task in Appendix J.

We kindly remind Reviewer WRyH to review the rebuttal responses addressing all the questions and update the score evaluation for this submission. If the reviewer has any further questions, we would be delighted to provide additional clarification during the remaining discussion period.

Comment

As today marks the final day of the discussion period, we kindly remind Reviewer WRyH to review the rebuttal responses addressing all the questions and update the score evaluation for this submission. If the reviewer has any further questions, we would be delighted to provide additional clarification during the remaining discussion period.

Review
Rating: 8

This paper proposes DART, a real-time, diffusion-based text-to-motion framework that enables interactive text-driven motion control. DART leverages a motion primitive as the action space and learns a diffusion model on top of the latent space to learn continuous human motion. The learned diffusion model conditions on both text and history motion input and can be autoregressively run in real-time. Results on a number of text-conditioned generative motion tasks show the superiority of the proposed method in motion generation quality, real-time interactive control, and alignment with text or scene.

Strengths

  • I find DART a well-designed system that leverages existing techniques (motion primitives, autoregressive diffusion models) in an intuitive way. Using a pre-trained motion primitive during the diffusion motion generation process is well-thought-out and, as can be seen from the result, very performant.
  • The real-time interactive control part shown in the demo videos is very impressive and useful for downstream tasks. One can ask the character to perform actions in real time, and the character can switch naturally to the next text prompt.
  • DART is versatile in terms of the control input and optimization techniques it can interface with (direct optimization, RL, etc.) and the resulting motion is of high quality while conforming to the text prompt.
  • Both quantitatively and qualitatively, I find the proposed method state-of-the-art.

Weaknesses

  • Since BABEL has a limited vocabulary, DART's vocabulary is limited.
  • Since motion generation in DART is autoregressive and real-time, I am wondering what the model would do if the text does not change but the model is continuously rolled out. Would the motion be repeated, or would the model get stuck in a weird space? Providing a few single-text-prompt experiments for an extended period (e.g., several minutes) and analyzing the resulting motion would be helpful.

Questions

See weakness: how would DART perform if one text condition is given for a long time?

Comment

We would like to thank the reviewer for acknowledging our work and providing constructive feedback. Below, we address the reviewer’s concerns and questions with discussions and accompanying video results.

  1. Limited vocabulary.

    a. Limited vocabulary is a critical limitation and challenge shared by existing text-conditioned motion generation methods. Existing methods, including our approach DART, struggle to generalize to open-vocabulary text prompts due to the scarcity of 3D human motion data with text annotations. The scale of motion data available is several orders of magnitude smaller than that for text-conditioned image and video generation, primarily due to the reliance on marker-based motion capture systems, which are challenging to scale.

    b. To expand the dataset and enable open-vocabulary generation, extracting human motion data from in-the-wild internet videos and generative image/video models [1, 2] is a promising direction. Additionally, the rapid advancement of vision-language models (VLMs) holds promise for automatically providing detailed, frame-aligned motion text labels to facilitate text-to-motion generation [3].

    [1] Goel, Shubham, et al. "Humans in 4D: Reconstructing and tracking humans with transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

    [2] Kapon, Roy, et al. "MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

    [3] Shan, Mengyi, et al. "Towards open domain text-driven synthesis of multi-person motions." European Conference on Computer Vision. 2024

 

  1. Long rollout results given a single text prompt depend on the inherent nature of the actions represented in the dataset.

    a. Actions that are inherently repeatable can be extended perpetually using rollout generation, like the minutes-long rollout examples of a person jogging in circles, performing cartwheels, and dancing in anonymous videos: 🔗jog, 🔗cartwheel, 🔗dance.

    b. Some other actions, however, have inherent boundary states that mark the completion of the action. For instance, "kneel down" reaches a boundary state where the knees achieve contact with the floor, as illustrated in this anonymous 🔗link. Further extrapolation of “kneel down” beyond this boundary state is not represented in the dataset and is not intuitively anticipatable by humans, as no further motion logically extends within the action semantics. Continuing rollout using the “kneel down” text prompt results in motions exhibiting fluctuations around the boundary state.

Comment

Thanks for the detailed response and the additional experiments. I appreciate the inclusion of both good and (somewhat) worse results. I think these examples should be included and added to the paper. I recommend accepting this paper.

Comment

We sincerely thank the reviewer for the constructive feedback. We will incorporate the discussion on long rollouts given a single text prompt in the paper and the video results.

Review
Rating: 8

The authors propose a text-conditioned motion generation method that is capable of generating long motion sequences, is controlled by a stream of text prompts, and has a short inference time, making it suitable for real-time settings. The proposed method autoregressively generates motion latents using latent diffusion models, conditioned on motion history and the next text input. The proposed method also supports spatial constraints such as keyframe in-betweening and goal reaching. This is done via latent noise optimization or RL by treating the latent noise as actions.

Strengths

  • Text-to-motion generative models that are real-time and allow for spatial constraints are important components in computer graphics research. Therefore, I find the problem very relevant and interesting.
  • Paper is well presented, clearly written, and easy to read.
  • I appreciate the detailed overview of controllable motion generation works in the related works section (Section 2).

Weaknesses

  • The primary weakness of the paper lies in its similarity to several concurrent works addressing the same idea. While this isn’t necessarily a flaw of this specific work, it does somewhat limit its novelty and overall significance in the broader research landscape.

Questions

  1. What is the advantage of using latent diffusion models in this work? Since the data dimensionality is low, perhaps there is not much performance or speed improvement.
  2. Related to the previous question, what are the advantages of incorporating spatial constraints in the latent space versus the real motion space?
  3. For the scene-interaction tasks, does the model have any awareness of the environment? For instance, in the walking upstairs example of Figure 3.b, is the motion first generated given the pelvis location, and the stairs are later added in the visualization?
Comment

We sincerely thank the reviewer for acknowledging our work and the constructive feedback. Below, we have provided clarifications addressing the reviewer’s questions.

  1. Importance of the latent motion primitive space.

    a. Using a latent space that compresses motion primitives into compact variables helps reduce jittering artifacts inherited from the training motion data, as illustrated in this example (anonymous 🔗link). Although the training data comes from marker-based motion capture, the sequences can still include jitters and glitches. Section 3.2 and Table 5 show that removing the motion primitive VAE (DART-VAE) results in a significantly higher jerk in the output, demonstrated by the Peak Jerk (PJ) and Area Under Jerk (AUJ) metrics.

    b. Additionally, reducing the dimension of the diffusion noise $z_T$ is advantageous for spatial control tasks, particularly for reinforcement-learning (RL) based control. Without the motion primitive VAE, the dimension of $z_T$ exceeds 2000, creating an action space that is too high-dimensional for the RL policy. Our early experiments showed that training RL policies with such a large action space hindered learning and convergence.

    c. Using this latent motion primitive space also distinguishes our work from other concurrent autoregressive motion diffusion approaches in the raw motion space.

 

  1. Advantages of latent space motion control over raw space motion control.

    a. As explained in Section 3.3, solving spatial constraints in the raw motion space often leads to noisy, low-quality motion sequences, as most raw motion space samples do not represent plausible human motions. For instance, when defining a spatial constraint for the goal location of the right wrist, solutions in the raw space may yield unrealistic sequences, such as the wrist being two meters away from the elbow or the lower arm being twisted at an infeasible angle, defying reasonable human anatomy.

    b. In contrast, latent space control incorporates knowledge of plausible human motion by leveraging a model that learns the manifold of natural human movements. Solving spatial constraints within this latent space ensures that generated sequences remain within this learned manifold of natural human movements, thus enhancing the overall motion quality.

 

  1. Scene awareness. The 3D scenes play a crucial role in optimization-based scene interaction synthesis for evaluating human-scene contact and collision. Represented as signed distance fields (SDF), these scenes are used to compute body-scene distances, which serve as the basis for deriving human-scene contact and collision metrics during the optimization process, as detailed in Appendix F. These contact and collision metrics encourage motion generation that adheres to the scene's physical constraints.
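As a rough illustration of this mechanism, a minimal sketch of SDF-based collision and contact costs; query_sdf is a hypothetical scene signed-distance query, and the joint selection and margin are assumptions rather than the exact formulation in Appendix F.

```python
import torch

def scene_interaction_costs(joint_pos, query_sdf, contact_joints, contact_margin=0.02):
    """joint_pos: (frames, joints, 3) body joint positions in the scene frame.
    query_sdf:  callable mapping (N, 3) points to signed distances (negative = inside geometry).
    Collision cost penalizes joints that penetrate the scene; contact cost pulls the
    selected joints (e.g., the feet) to within a small margin of the nearest surface."""
    d = query_sdf(joint_pos.reshape(-1, 3)).reshape(joint_pos.shape[:2])
    collision_cost = torch.relu(-d).mean()                             # mean penetration depth
    contact_cost = torch.relu(d[:, contact_joints] - contact_margin).mean()
    return collision_cost, contact_cost
```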
Comment

Thanks to the authors for answering my questions. I'm happy to announce that my questions are addressed and I'll increase my score to vote for acceptance.

Comment

We sincerely appreciate your support and are delighted that our responses addressed your concerns. Your decision to increase the score means a lot to us—thank you!

Comment

We sincerely thank all reviewers for their valuable comments and constructive suggestions. To facilitate discussion among the reviewers and the area chair, we have summarized the reviewers' feedback in the table below:

| Reviewer | Important Problem | Impressive Results and SOTA | Versatile and Effective Control | Method Well Designed and Presented | Thorough Comparisons and Ablations | Detailed Literature Overview |
|---|---|---|---|---|---|---|
| tzns | ✓ | ✓ | ✓ | | | |
| dw3E | ✓ | ✓ | | ✓ | | ✓ |
| WRyH | ✓ | | ✓ | | | |
| J1Bn | | ✓ | ✓ | ✓ | ✓ | |
| emqN | | ✓ | | ✓ | | |

We would like to thank the reviewers for acknowledging our work. The reviewers agree that we are tackling an important technical problem with wide applications (WRyH, tzns, dw3E). Our method, DART, is well-designed and clearly presented (emqN, dw3E, J1Bn), and achieves impressive results and performance in various demonstrated tasks (emqN, tzns, dw3E, J1Bn). The reviewers also appreciate the versatile and effective spatial motion control enabled by our method (WRyH, tzns, J1Bn). Reviewer J1Bn specifically appreciates the thorough video results comparison and ablation studies we included in the submission. Reviewer dw3E also appreciates the detailed literature review we included on controllable motion generation.

We have updated the paper as the reviewers suggested. We highlighted the updates with orange colors. The updates include:

  1. Correct typos, add additional references (AMD, Motion-X, HumanMAC), include the method names of cited works to explicitly clarify the citations, and provide missing publication venue details.
  2. Refine texts to clarify better the role of 3D scenes in the scene-interaction synthesis task in Section 3.3.
  3. Add more detailed explanation for the auxiliary loss in Appendix C.
  4. Add discussion and demonstration of integrating our method with physics-based motion tracking to refine physically implausible artifacts like skating and penetration in Appendix H.
  5. Add discussions on open-vocabulary motion generation in Appendix I.
  6. Add additional comparison with baseline GMD for the text-conditioned goal-reaching task in Appendix J.

We have addressed the reviewers' specific questions in separate threads and are delighted to provide further clarifications or answer additional questions.

AC Meta-Review

The submission is about real-time text-driven motion control. Reviewers in general appreciate the strong results, in particular the interactive demo in the video, the comprehensive evaluation, and the clear presentation. One reviewer was negative about the submission, but they did not engage in the discussion. The AC agreed with the majority of the reviewers and recommended acceptance. Congratulations!

Additional Comments from Reviewer Discussion

The negative reviewer did not participate in the discussion, while all other reviewers are fairly positive.

Final Decision

Accept (Spotlight)