PaperHub
6.8 / 10
Poster · 4 reviewers
Ratings: 5, 4, 4, 4 (lowest 4, highest 5, standard deviation 0.4); average 4.0
Confidence
Novelty: 3.0
Quality: 2.5
Clarity: 2.3
Significance: 2.8
NeurIPS 2025

FlashMo: Geometric Interpolants and Frequency-Aware Sparsity for Scalable Efficient Motion Generation

Submitted: 2025-05-03 · Updated: 2025-10-29
TL;DR

FlashMo introduces a geometric factorized interpolant and frequency-sparse attention, enabling scalable efficient 3D motion diffusion.

Abstract

Diffusion models have recently advanced 3D human motion generation by producing smoother and more realistic sequences from natural language. However, existing approaches face two major challenges: high computational cost during training and inference, and limited scalability due to reliance on U-Net inductive bias. To address these challenges, we propose **FlashMo**, a frequency-aware sparse motion diffusion model that prunes low-frequency tokens to enhance efficiency without custom kernel design. We further introduce *MotionSiT*, a scalable diffusion transformer based on a joint-temporal factorized interpolant with Lie group geodesics over $\mathrm{SO}(3)$ manifolds, enabling principled generation of joint rotations. Extensive experiments on the large-scale MotionHub V2 dataset and standard benchmarks including HumanML3D and KIT-ML demonstrate that our method significantly outperforms previous approaches in motion quality, efficiency, and scalability. Compared to the state-of-the-art 1-step distillation baseline, FlashMo reduces inference time by **12.9%** and FID by **34.1%**. Project website: https://steve-zeyu-zhang.github.io/FlashMo.
Keywords
Motion Generation, Diffusion Model

Reviews and Discussion

Review
Rating: 5

This paper introduces a method for generating 3D human motion. The key contributions are:

  • Frequency-aware sparsification, which prunes low-frequency tokens, improves efficiency without sacrificing performance.
  • A diffusion transformer with temporal-spatial factorized interpolant with Lie group geodesics enables scalable motion diffusion and superior performance.

The proposed method is evaluated on the HumanML3D and KIT-ML datasets, demonstrating state-of-the-art performance in generating 3D human motion with computational efficiency.

Strengths and Weaknesses

Strengths

  1. The paper is well written and easy to understand.
  2. Frequency-aware sparsification, together with the consideration of temporal-spatial structure and the manifold geometry of motion representations, is novel and intuitive for motion generation.
  3. The proposed method achieves state-of-the-art performance with the lowest inference time, training time, model size, and GFLOPs.
  4. They validate the scalability of their model design compared to other variants and the effectiveness of the interpolant design.

Weaknesses

I find no weakness in the proposed method itself, but some descriptions of the method are not clear. For specific details, please refer to the following questions.

Questions

  1. MotionSiT takes a latent representation $\mathbf{X}$ instead of a motion representation $\mathbf{M}$. The temporal-spatial factorized interpolant works well in both the latent space and the motion space. However, unlike the joint rotations, the latent representation does not lie on the manifold. Why are the Lie group geodesics necessary in the latent space?
  2. As I understand, MotionSiT leverages SMPL for motion representation due to the manifold geometry of joint rotations. However, the motion representation used in the HumanML3D dataset consists of velocities, joint positions, and contact labels. Which motion representation does MotionSiT use, and how is the gap between motion representations handled?

I will raise the rating if my concerns are solved.

Limitations

Yes.

Final Justification

The authors' rebuttal resolves my concerns, so I raise the rating.

Formatting Concerns

No.

Author Response

We appreciate your valuable feedback and support. Our responses to the questions are as follows.

Q1. Manifold geometry for latent representation.

A1: Thank you for bringing up this question. Below is a clear explanation.

In order to preserve latent representations on the $\mathrm{SO}(3)$ manifold, our paper simply adapts a VAE by replacing the standard Gaussian latent space with an $\mathrm{SO}(3)$-valued latent space and applying a generalized reparameterization trick.

Specifically, in a vanilla VAE, the latent variable

$$z \sim \mathcal{N}(\mu, \sigma^2 I)$$

lies in Euclidean space and is sampled using the standard reparameterization trick:

$$z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I).$$

In contrast, in an $\mathrm{SO}(3)$-VAE, the latent variable

$$R_z \in \mathrm{SO}(3)$$

is sampled using the following reparameterization trick:

$$v \sim r(v \mid \sigma),$$

$$R = \exp(v^\times) \quad \text{(mapped to } \mathrm{SO}(3) \text{ via the exponential map)},$$

$$R_z = R_\mu R \quad \text{(left multiplication to center at the mean rotation } R_\mu\text{)},$$

where $v^\times$ is the skew-symmetric matrix of $v$.

This ensures all latent variables $R_z$ lie exactly on the $\mathrm{SO}(3)$ manifold without changing the architecture of the VAE. Hence, it is also necessary to perform the Lie group geodesic interpolant in the latent space.
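To make this sampling step concrete, below is a minimal NumPy/SciPy sketch of the SO(3) reparameterization described above. The isotropic Gaussian used for the tangent-space noise $r(v \mid \sigma)$ and the function name are illustrative assumptions, not our exact implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def so3_reparameterize(R_mu: np.ndarray, sigma: float, rng=None) -> np.ndarray:
    """Sample a rotation centered at the mean rotation R_mu.

    R_mu  : (3, 3) mean rotation matrix.
    sigma : scale of the tangent-space noise (isotropic Gaussian assumed here
            as a stand-in for r(v | sigma)).
    """
    rng = np.random.default_rng() if rng is None else rng
    v = sigma * rng.standard_normal(3)          # tangent vector in so(3), axis-angle form
    R_noise = R.from_rotvec(v).as_matrix()      # exponential map: exp(v^x) lies in SO(3)
    return R_mu @ R_noise                       # left-multiply to center at R_mu

# The sample stays exactly on SO(3): orthonormal with determinant +1.
R_mu = R.from_euler("xyz", [0.3, -0.1, 0.7]).as_matrix()
R_z = so3_reparameterize(R_mu, sigma=0.1)
assert np.allclose(R_z @ R_z.T, np.eye(3)) and np.isclose(np.linalg.det(R_z), 1.0)
```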

Similar VAE methods have been introduced in previous works such as [1]. Since it is not the main focus of our contribution, the details have not been explicitly included in the current version of the paper. We will include it in the next version of our paper.

Q2: Motion representation.

A2: HumanML3D introduces a preprocessing step that converts motion into velocity and joint position representations, along with the corresponding text descriptions. However, the original SMPL sequences can be retrieved by locating the exact index in AMASS [2], where the SMPL representation preserves the manifold geometry of joint rotations and is used for training in FlashMo. Moreover, as mentioned in the paper, the output motion sequence needs to be converted to the pose representation defined in HumanML3D to ensure consistency across evaluations and enable fair comparisons with other methods.

Once again, thank you for the valuable feedback. Your support means a lot and has helped us improve our work. We hope you will consider raising your score if you feel our response has addressed your concerns.

Reference:

[1] Explorations in Homeomorphic Variational Auto-Encoding

[2] AMASS: Archive of Motion Capture as Surface Shapes (ICCV 2019)

Comment

My concerns are resolved, so I raise the rating to "Accept".

Comment

Many thanks for your prompt response and for raising your score!

Comment

Dear authors,

AC chiming in.

I had the same question about the VAE latent representation. Isn't it a lot more straightforward to just use expmap rotation representation instead of quaternions as the source data representation, and just map the output back to quaternions? This way, the standard VAE and SiT will work as is, while ensuring the learning stays on the SO(3) manifold. FYI, graphics papers have been doing this since [1].

  1. Grassia, F. S. 1998. Practical parameterization of rotations using the exponential map. Journal of Graphics Tools.
Comment

Dear AC,

Thank you for bringing out this important question.

There is a theoretical guarantee that the exponential map will always ensure the result lies on the SO(3) manifold if the input to the exponential map lies in the Lie algebra $\mathfrak{so}(3)$ [1]. This is the nature of the exponential map, which maps the Lie algebra of 3×3 skew-symmetric matrices (or equivalently, the angle-axis representation) to a rotation [2]. That is, as you said, if the exponential map is applied as preprocessing to the data, the inputs and outputs of a standard VAE will always lie on SO(3).

However, simply converting rotation data into exponential map form does not guarantee manifold consistency in the latent space of a (latent) generative model [3]. By manifold consistency, we mean that all operations (including latent encoding, sampling, interpolation, and decoding) must preserve or respect the geometry of the SO(3) manifold. A standard VAE treats the exp-map rotations as just a collection of real numbers ($\mathbb{R}^3$) in latent space, and if we embed a non-Euclidean manifold into a Euclidean latent space, we encounter a manifold mismatch problem [4]. There is a diagram in Falorsi's paper [4] that explains this: a nontrivial manifold (top) with holes cannot be continuously mapped to a topologically trivial blob (bottom), illustrating the concept of "manifold mismatch" between data and a Gaussian latent space.

Specifically, even if we map the rotations $\mathbf{M}$ into exp-map form $\mathbf{M}^* \in \mathrm{SO}(3)$ via exponential maps and feed $\mathbf{M}^*$ into a standard VAE, this alone does not guarantee that the learned latent representations will remain on the SO(3) manifold. This is because a standard VAE maps $\mathbf{M}^*$ into an unconstrained latent space (typically latent vectors $\mathbf{x} \in \mathbb{R}^d$), optimized to match an isotropic Gaussian prior, such that:

$$q_\phi(\mathbf{x} \mid \mathbf{M}^*) \approx \mathcal{N}(\mathbf{0}, \mathbf{I}),$$

which implies $\mathbf{x} \notin \mathrm{SO}(3)$ in general. As a result, the manifold structure of the input rotations is not preserved: the latent variables $\mathbf{x}$ no longer lie on SO(3). That is, the latent distribution in a standard VAE is trained to match a unit Gaussian, not a curved manifold, and thus fails to preserve the rotational geometry of $\mathbf{M}^*$.

Furthermore, standard stochastic interpolants used in SiT operate in Euclidean latent space:

$$\mathbf{x}_t = \alpha(t)\mathbf{x}^* + \sigma(t)\boldsymbol{\epsilon},$$

which linearly blends the latent code $\mathbf{x}^*$ with Gaussian noise $\boldsymbol{\epsilon}$, without respecting any underlying geometry (manifold structure) present in the source data. Therefore, even if the input is on SO(3), the diffusion interpolant operates off-manifold, potentially producing invalid rotations or semantically inconsistent transitions.
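To illustrate the off-manifold behaviour of linear blending versus a geodesic, here is a small SciPy sketch; the blending schedule and the geodesic form $R_t = R_0 \exp(t \log(R_0^\top R_1))$ are illustrative assumptions rather than the paper's exact interpolant.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def linear_blend(x0: np.ndarray, x1: np.ndarray, t: float) -> np.ndarray:
    """Euclidean interpolant x_t = (1 - t) x0 + t x1 (schedule assumed for illustration)."""
    return (1.0 - t) * x0 + t * x1

def geodesic_blend(R0: np.ndarray, R1: np.ndarray, t: float) -> np.ndarray:
    """Geodesic on SO(3): R_t = R0 exp(t log(R0^T R1)); a valid rotation for all t."""
    delta = R.from_matrix(R0.T @ R1).as_rotvec()      # log map into the tangent space at R0
    return R0 @ R.from_rotvec(t * delta).as_matrix()  # exp map back onto the manifold

R0 = R.from_euler("z", 0.0).as_matrix()
R1 = R.from_euler("z", np.pi / 2).as_matrix()
lin = linear_blend(R0.ravel(), R1.ravel(), 0.5).reshape(3, 3)
geo = geodesic_blend(R0, R1, 0.5)
print(np.isclose(np.linalg.det(lin), 1.0))  # False: linear blending leaves SO(3)
print(np.isclose(np.linalg.det(geo), 1.0))  # True: the geodesic stays on SO(3)
```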

It is worth noting that, as you mentioned, Grassia's paper [5] provides a solid theoretical foundation for handling rotations computationally in our work. However, since it (1998) predates modern deep generative models, it does not discuss VAEs or latent diffusion.

We hope the above clarification addresses your question, and we are happy to respond to any further questions.

Best regards,

Authors #6198

Reference:

[1] PA Absil, R Mahony, R Sepulchre, Optimization algorithms on matrix manifolds. (2008)

[2] Richard Hartley, et al. Rotation averaging. (IJCV 2013)

[3] Falorsi et al. Reparameterizing Distributions on Lie Groups (AISTATS 2019)

[4] Falorsi et al. Explorations in Homeomorphic Variational Auto-Encoding (2018)

[5] Grassia, F. S. Practical parameterization of rotations using the exponential map. Journal of Graphics Tools. (1998)

Comment

Dear authors,

Thank you for your answer. But you are not answering my question.

What I am saying is that you can do all the learning (VAE and SiT) in the tangent space of rotations. Any point in this tangent space is a valid rotation so you can use vanilla VAE and SiT. I understand that what you are proposing works. But by the principle of Occam's razor, I have to ask, can you justify the need to complicate the setup?

AC

Comment

To reiterate and clarify, in my suggested setup, VAE's latent representation itself does not have to be constrained to the SO(3) manifold, e.g., unit quaternions or orthonormal matrices. It is an abstract latent representation that is to be decoded back into a projection to the tangent space of SO(3), which can then be transformed into valid rotations by exponentiation.

One potential justification for keeping the VAE latent representation constrained to SO(3) is for the frequency filtering. But I do want to point out [1] successfully applied FFT on abstract latent embedding, not in rotations, to learn high-quality motions.

  1. Starke, Mason, Komura. 2022. Deepphase: Periodic autoencoders for learning motion phase manifolds. ACM Transactions on Graphics.
Comment

Dear AC,

We now see that one of your suggested potential justifications is for the frequency filtering. However, as you mentioned, there is already existing work that applies FFT to abstract latent representations.

We agree with the AC that this is a good point, and the suggested vanilla VAE setup with the tangent space helps make our work more comprehensive.

Meanwhile, please forgive us for the delayed response, as we needed some time to run experiments to verify our justification, and some of the responses arrived around midnight in our local time. But we will do our best to address any questions!

Best regards,

Authors #6198

Comment

Dear AC,

Thank you for your further clarification and patience. Now we understand that what you’re suggesting is that there seems to be a simpler approach: applying the logarithmic map to convert the rotations into the tangent space first, then training a standard VAE and SiT in the tangent space (valid rotation), and finally applying the exponential map on decoding.

One of the reasons we designed a more complex approach (adopting SO(3)-VAE and a geometric interpolant) is that the simpler method you suggested may encounter a non-unique mapping problem.

The exponential map from the Lie algebra $\mathfrak{so}(3)$ (tangent space) to the group $\mathrm{SO}(3)$ is surjective but not injective [1]. In other words, many different tangent vectors can correspond to the same rotation. For example, a rotation by 180° about some axis is represented in the tangent space by a vector of length $\pi$ in that direction, but also by a vector of length $\pi$ in the opposite direction: two distinct vectors map to one rotation. Hence, mapping a Euclidean latent to $\mathrm{SO}(3)$ inherently involves such non-unique correspondences. A vanilla VAE in $\mathbb{R}^3$ would have to arbitrarily choose or learn a single representative for each rotation, or risk outputting inconsistent representations (e.g., spontaneously jumping between equivalent rotations). This ambiguity can complicate learning and sampling in a vanilla VAE. By constraining the latent space to $\mathrm{SO}(3)$ with our SO(3)-VAE, our method does not face this issue (the construction is homeomorphic).
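A quick numeric check of this ambiguity, using SciPy purely for illustration:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

# Two opposite tangent vectors of length pi: a 180-degree rotation about the z-axis.
v_pos = np.array([0.0, 0.0, np.pi])
v_neg = -v_pos

R_pos = R.from_rotvec(v_pos).as_matrix()
R_neg = R.from_rotvec(v_neg).as_matrix()

# Distinct vectors in the tangent space, yet the same rotation on SO(3).
print(np.allclose(R_pos, R_neg))  # True
```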

To verify our justification, we conducted a fair experiment on HumanML3D with a consistent training setup (6k epochs). The results in the following table indicate that this learning approach introduces more difficulty in VAE training and decreases performance.

| Models | rFID |
| --- | --- |
| Vanilla VAE (tangent space) | 0.013 |
| SO(3)-VAE | 0.006 |

We hope this additional explanation answers your question. We are happy to address any further questions.

Best regards,

Authors #6198

Reference:

[1] Luca Falorsi et al. Reparameterizing Distributions on Lie Groups (AISTATS 2019)

Review
Rating: 4
  • Introduces a frequency-aware sparsification mechanism that dynamically prunes low-frequency motion tokens.
  • Using a temporal-spatial factorized interpolant based on Lie group geodesics over quaternion manifolds enables geometrically consistent and scalable joint rotation modeling.
  • Combines temporal-spatial factorization with Lie group interpolation to preserve motion structure and consistency during denoising.
  • Avoids sparsity granularity mismatch, improving training–inference consistency.
  • Seamlessly integrates with hardware-efficient exact attention mechanisms.
  • Achieves high performance on HumanML3D and KIT-ML. Scales well with large-scale data (e.g., MotionHub V2).

Strengths and Weaknesses

  • Efficiency and Speed: FlashMo achieves 2.25× faster inference than full attention without performance loss.
  • Frequency-Aware Sparsification: The model dynamically prunes low-frequency motion tokens, preserving high-frequency components at the attention head level. This enables faster computation while balancing with quality.
  • Integrates seamlessly with hardware-efficient attention mechanisms without custom kernel design.
  • Lie group geodesic interpolation on quaternion manifold, more suitable for joint rotations than Euclidean interpolants.
  • Scalability: No U-Net inductive bias + Geometric factorized interpolant

Questions

I'm a little bit confused about the high-low frequencies in Figure 1: the attention operates over the tokens in the temporal dimension, so each token represents the entire body at a specific time frame, not individual body parts. However, Figure 1 seems to imply that high and low frequencies are based on specific body parts, which suggests that frequency is defined per joint or region rather than over time.

Limitations

yes

Final Justification

The proposed method introduces Frequency-Aware Sparsification, leading to faster inference; however, this improvement comes with a trade-off in quality. The geometrically factorized interpolant appears to improve motion quality to some extent, but the method still falls short of the performance achieved by spatio-temporal masked motion models (MoGenTS) and the phased consistency approach (MotionPCM).

The results on MotionHub V2 pretraining may not provide a fair basis for comparison; therefore, my evaluation is primarily based on results from HumanML3D.

Based on this, I maintain my initial score, Borderline Accept.

Formatting Concerns

N/A

Author Response

We appreciate your valuable feedback and support. Our responses to the questions are as follows.

Q1: High-low frequencies in Figure 1.

A1: In Figure 1, the frequency of the motion features is mapped onto the motion sequence to better illustrate motion frequency. This visualization shows that motion frequency varies across both the temporal (frame-wise) and spatial (body-part-wise) dimensions. Dynamic frames appear in lighter colors compared to others, and similarly, dynamic body parts are shown in lighter colors compared to static ones. In our paper, frequency-aware sparsification is applied to the temporal attention, i.e., token pruning is performed along the temporal dimension.

To extend sparsification to both temporal (frame-wise) and spatial (body-parts) dimensions, one could simply adopt a standard temporal-spatial attention structure, where temporal and spatial attention layers are alternately stacked. To verify this, we conducted an ablation study on HumanML3D, which shows slight improvements in both generation quality and efficiency. This is because applying sparsification to both temporal and spatial dimensions provides finer-grained sparsification granularity. These results are not included in the current version of the paper, as the spatial-temporal architecture has been widely discussed in prior works [1,2] and is not the main focus of our contribution. However, we will include them in the next version of our paper.

| Method | AIT(s) ↓ | R-Precision ↑ (Top 1) | R-Precision ↑ (Top 2) | R-Precision ↑ (Top 3) | FID ↓ | MM Dist ↓ | Diversity → | MModality ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Real | - | 0.511 ± .003 | 0.703 ± .003 | 0.797 ± .002 | 0.002 ± .000 | 2.974 ± .008 | 9.503 ± .065 | - |
| FlashMo | 0.027 | 0.562 ± .004 | 0.754 ± .005 | 0.847 ± .005 | 0.041 ± .002 | 2.711 ± .006 | 9.614 ± .056 | 2.812 ± .046 |
| FlashMo w/ temporal-spatial | 0.025 | 0.564 ± .003 | 0.758 ± .003 | 0.849 ± .004 | 0.040 ± .001 | 2.709 ± .009 | 9.608 ± .072 | 2.829 ± .051 |
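For reference, here is a hypothetical PyTorch sketch of the alternately stacked temporal and spatial attention described above; the block layout, tensor shapes, and module names are illustrative assumptions, not the ablated architecture itself.

```python
import torch
import torch.nn as nn

class TemporalSpatialBlock(nn.Module):
    """Alternately apply temporal attention (across frames, per joint) and
    spatial attention (across joints, per frame)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, joints, dim)
        b, t, j, d = x.shape
        # Temporal attention: tokens are frames, joints folded into the batch.
        xt = x.permute(0, 2, 1, 3).reshape(b * j, t, d)
        h = self.norm1(xt)
        xt = xt + self.temporal_attn(h, h, h)[0]
        x = xt.reshape(b, j, t, d).permute(0, 2, 1, 3)
        # Spatial attention: tokens are joints, frames folded into the batch.
        xs = x.reshape(b * t, j, d)
        h = self.norm2(xs)
        xs = xs + self.spatial_attn(h, h, h)[0]
        return xs.reshape(b, t, j, d)
```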

Once again, thank you for the valuable feedback. Your support means a lot and has helped us improve our work. We hope you will consider raising your score if you feel our response has addressed your concerns.

Reference:

[1] MoGenTS: Motion Generation based on Spatial-Temporal Joint Modeling (NeurIPS 2024)

[2] Motion Mamba: Efficient and Long Sequence Motion Generation (ECCV 2024)

Comment

Dear Reviewer,

Thank you for your valuable feedback and support. If we have addressed your concerns, we kindly ask you to consider raising your score. We are happy to address any further questions.

Best regards,

Authors #6198

Comment

Thank you for the clarification. After considering the additional details the author's rebuttal provided to all reviewers, I am still maintaining my original score, Borderline accept.

  1. The proposed method introduces Frequency-Aware Sparsification, which leads to faster inference. However, this improvement comes with a trade-off in quality.
  2. Also, the geometrically factorized interpolant appears to improve motion quality to some extent. Nonetheless, the method still falls short of the performance achieved by spatio-temporal masked motion models (MoGenTS) and by the phased consistency approach (MotionPCM).

Note that, as Reviewer 8uxk mentioned, the results on the MotionHub V2 pretraining may not be a fair basis for comparison, so I base my evaluation primarily on the results from HumanML3D.

Comment

Thanks for your prompt response and positive feedback. We would like to provide further clarification based on your comment.

Q2. Efficiency improvement comes with a trade-off in quality.

A2. First, it is important to clarify that our method achieves an extraordinary balance between generation quality and efficiency, both of which are critical and essential to the overall performance of a model. This has already been acknowledged in your original review: “Frequency-Aware Sparsification… This enables faster computation while balancing with quality.”

The trade-off between quality and efficiency is not a problem unique to sparsification; it is a general challenge for efficient AI. Token sparsification prunes redundant and less important tokens from the original set to improve efficiency, but this inherently introduces information loss, a limitation not exclusive to our method. Similarly, step distillation methods such as MotionLCM and MotionPCM accelerate inference by reducing the number of sampling steps, but rely on consistency models that require additional teacher training. Linear architectures, such as Motion Mamba, reduce computational complexity but often introduce more complex scanning mechanisms to maintain model capacity.

Hence, the goal of our efficient method is not simply to trade off quality for efficiency, but to achieve an extraordinary balance — which is exactly what our method accomplishes.

Q3. Falls short of the performance achieved by MoGenTS and MotionPCM, and the results on the MotionHub V2 pretraining may not be a fair basis for comparison.

A3. This question has been addressed in 8uxk A2.1. As mentioned in the rebuttal, MotionPCM and MoGenTS do not outperform FlashMo (without pretraining). Therefore, the comparison is fair, as FlashMo is also a train-from-scratch model in the first line of Table 1.

Below is a sub-table extracted from the original Table 1, which presents the comparison on HumanML3D.

| Method | AIT(s) ↓ | R-Precision Top-1 | Top-2 | Top-3 | FID ↓ | MM Dist ↓ | Diversity → | MModality ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MoGenTS | 0.181 | 0.529 | 0.719 | 0.812 | 0.033 | 2.867 | 9.570 | - |
| MotionPCM (1-step) | 0.031 | 0.560 | 0.752 | 0.844 | 0.044 | 2.711 | 9.559 | 1.772 |
| MotionPCM (2-step) | 0.036 | 0.555 | 0.749 | 0.839 | 0.033 | 2.739 | 9.618 | 1.760 |
| MotionPCM (4-step) | 0.045 | 0.559 | 0.752 | 0.842 | 0.030 | 2.716 | 9.575 | 1.714 |
| FlashMo (without pretrain) | 0.027 | 0.562 | 0.754 | 0.847 | 0.041 | 2.711 | 9.614 | 2.812 |

It is evident that FlashMo (without pretraining) achieves SOTA performance on 6 out of 8 metrics on HumanML3D, while MotionPCM achieves state-of-the-art results on at most two metrics (across two model variants), and MoGenTS does not achieve state-of-the-art on any metric.

Moreover, in terms of FID, the only method with close efficiency to FlashMo (0.027 AITs) is MotionPCM (1-step) at 0.031 AITs. However, its FID score is 0.044, which does not outperform FlashMo’s 0.041.

This performance is unrelated to the MotionHubV2 dataset, as FlashMo in this table does not use any pretraining.

We hope this clarification addresses your concerns, and we are happy to answer any further questions. Once again, thank you for your positive feedback and support.

Comment

Dear Reviewer uEGj,

Thank you for your consistent support for our work!

We welcome further discussion to see whether our response addresses your remaining concerns. To save your time, we provide a brief summary below:

1. FlashMo outperforms SOTA methods with comparable AIT (including FID).

The only method with comparable efficiency to FlashMo (0.027 AITs) is MotionPCM (1-step) at 0.031 AITs (even slower). FlashMo (sparse, no pretrain) achieves a better FID (0.041) than MotionPCM (1-step, FID 0.044). Moreover, FlashMo (full attention, no pretrain) achieves the lowest FID (0.029), outperforming all methods. In addition, our method achieves significant improvements in other metrics as well. (The detailed results are in the tables of 8uxk A4 & A5.)

2. FlashMo's performance gains from its advanced interpolant and architectural design, not simply from data-driven training.

(1) To showcase and ablate the performance gains from our method design, we adopt full attention with temporal-spatial attention. The results of FlashMo (full attention, no pretrain) demonstrate that our method design improves not only FID (0.029) but also overall generation quality. This improvement is unrelated to MotionHub V2, as no pretraining is used in this additional ablation.

(2) The other key contribution of FlashMo is scalability. We conducted a fair comparison between MoGenTS and ours on HumanML3D, both of which are pretrained on MotionHub V2. FlashMo (FID 0.041→ 0.029) demonstrates significantly better scalability than MoGenTS (FID 0.033 → 0.052).

We hope these answers address your remaining concerns, and we kindly hope you will consider raising your rating if your concerns have been mainly addressed.

Once again, thank you for your time and support! We wish you a successful run at NeurIPS!

Best regards,

Authors #6198

Review
Rating: 4

This paper proposes FlashMo, an efficient and scalable 3D human motion generation framework to deal with high computational cost and limited scalability due to the U-Net inductive bias. The authors propose a frequency-aware sparse attention mechanism that dynamically prunes low-frequency motion tokens without requiring custom kernel implementations. They also design a transformer-based backbone, MotionSiT, which incorporates a geometric factorized interpolant using Lie group geodesics over quaternion manifolds, enabling accurate and scalable modeling of joint rotations. Experiments on HumanML3D and KIT-ML show that FlashMo outperforms previous methods in both motion quality and efficiency.

Strengths and Weaknesses

Strengths:

  1. The frequency-aware sparse attention mechanism is well-motivated, practical, and integrates seamlessly with hardware-optimized exact attention.
  2. MotionSiT decouples temporal and spatial dimensions and applies Lie group geodesic interpolation, which enhances the structural integrity of motion sequences.
  3. The proposed method shows better performance and becomes more efficient.

Weaknesses:

  1. In Figure 2 (d), the Guide Token Pruning should be detailed in the figure for easier understanding.
  2. In the ablation study, the performance undergoes significant changes with slight variations in γ. Does this indicate that the method is not robust to hyperparameters? Additionally, are the same parameters applicable to another dataset?

Questions

  1. Can you explain why your method outperforms the Real data in R-Precision?

Limitations

There is no obvious negative societal impact of their work.

Final Justification

The authors have addressed most of my concerns, so I lean toward accepting this paper.

Formatting Concerns

There are no paper formatting concerns.

Author Response

We appreciate your valuable feedback and support. Our responses to the questions are as follows.

Q1: Figure 2(d) details.

A1: As shown in Figure 2(d) and Section 3.3, the input tokens 𝑋 are first projected into queries 𝑄 and keys 𝐾, which are used to compute attention scores. These attention scores indicate the importance of each token and guide the selection of a subset of tokens 𝑍, where low-importance tokens are pruned. This adaptive token selection reduces redundancy while preserving key information for subsequent attention computation. We will include a more detailed figure in the next version of the paper.
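As a rough illustration of this guided selection, here is a NumPy sketch; interpreting the threshold γ as a cumulative-importance cutoff and scoring tokens by the attention they receive are assumptions for illustration, not the exact rule in Section 3.3.

```python
import numpy as np

def prune_tokens(X: np.ndarray, Wq: np.ndarray, Wk: np.ndarray, gamma: float = 0.95):
    """Adaptively select a token subset Z from X using attention importance.

    X: (N, d) input tokens; Wq, Wk: (d, d) query/key projections.
    gamma: cumulative-importance threshold (illustrative interpretation of the
    attention threshold; the paper's exact selection rule may differ).
    """
    Q, K = X @ Wq, X @ Wk
    scores = Q @ K.T / np.sqrt(X.shape[1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    # Importance of each token = total attention it receives from all queries.
    importance = attn.sum(axis=0)
    order = np.argsort(-importance)
    cum = np.cumsum(importance[order]) / importance.sum()
    keep = order[: int(np.searchsorted(cum, gamma)) + 1]  # smallest set covering gamma mass
    return X[np.sort(keep)]  # pruned token subset Z, original order preserved

# Usage on random data
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 32))
Wq, Wk = rng.standard_normal((32, 32)), rng.standard_normal((32, 32))
Z = prune_tokens(X, Wq, Wk, gamma=0.9)
print(X.shape, "->", Z.shape)
```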

Q2: Method robustness to the hyperparameter γ.

A2: Thank you for bringing up this important question. The attention threshold γ is related to the number of adaptively selected tokens, as mentioned in Section 3.3 (Line 173). In token sparsification, model performance is sensitive to the number of pruned tokens, and pruning more tokens leads to greater information loss; a similar pattern can be observed in both vision and language models [1]. Hence, sparsification methods aim to prune redundant tokens while preserving as much performance as possible.

This exactly demonstrates the benefit of our adaptive token pruning, which significantly reduces information loss compared to methods using a fixed number of pruned tokens (e.g., a predefined attention mask or a hard token threshold). This advantage is also demonstrated in the experiment on KIT-ML in the following table, where the same sparsification hyperparameters remain applicable, which verifies the robustness of our method to these hyperparameters.

| Method | AIT(s) ↓ | R-Precision ↑ (Top 1) | R-Precision ↑ (Top 2) | R-Precision ↑ (Top 3) | FID ↓ | MM Dist ↓ | Diversity → | MModality ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Real | - | 0.424 ± .005 | 0.649 ± .006 | 0.779 ± .006 | 0.031 ± .004 | 2.788 ± .012 | 11.08 ± .097 | - |
| Full Attention | 0.065 | 0.451 ± .003 | 0.677 ± .003 | 0.805 ± .004 | 0.141 ± .006 | 2.705 ± .044 | 10.69 ± .062 | 3.431 ± .058 |
| β = 0.75, γ = 0.93 | 0.016 | 0.411 ± .002 | 0.637 ± .005 | 0.759 ± .004 | 0.310 ± .004 | 2.981 ± .057 | 10.55 ± .073 | 1.692 ± .054 |
| β = 0.75, γ = 0.95 | 0.018 | 0.426 ± .003 | 0.645 ± .003 | 0.766 ± .005 | 0.275 ± .002 | 2.861 ± .044 | 10.46 ± .049 | 2.075 ± .061 |
| β = 0.75, γ = 0.97 | 0.021 | 0.438 ± .004 | 0.665 ± .002 | 0.790 ± .003 | 0.181 ± .002 | 2.738 ± .050 | 10.84 ± .063 | 2.797 ± .069 |
| β = 0.50, γ = 0.93 | 0.024 | 0.430 ± .002 | 0.658 ± .004 | 0.785 ± .002 | 0.199 ± .005 | 2.740 ± .039 | 10.69 ± .077 | 2.524 ± .055 |
| β = 0.50, γ = 0.95 | 0.029 | 0.449 ± .002 | 0.670 ± .004 | 0.799 ± .002 | 0.152 ± .004 | 2.709 ± .005 | 10.64 ± .074 | 3.287 ± .042 |
| β = 0.50, γ = 0.97 | 0.036 | 0.447 ± .003 | 0.669 ± .002 | 0.797 ± .004 | 0.150 ± .002 | 2.711 ± .028 | 10.79 ± .056 | 3.279 ± .036 |
| β = 0.25, γ = 0.93 | 0.044 | 0.435 ± .004 | 0.664 ± .002 | 0.788 ± .005 | 0.164 ± .002 | 2.725 ± .041 | 10.62 ± .042 | 2.986 ± .053 |
| β = 0.25, γ = 0.95 | 0.050 | 0.449 ± .001 | 0.668 ± .003 | 0.793 ± .001 | 0.150 ± .004 | 2.709 ± .017 | 10.76 ± .058 | 3.286 ± .049 |
| β = 0.25, γ = 0.97 | 0.056 | 0.447 ± .002 | 0.669 ± .001 | 0.799 ± .003 | 0.148 ± .003 | 2.710 ± .029 | 10.81 ± .040 | 3.285 ± .051 |

Q3: Outperform the real data in R precision.

A3: This is a classic question. R-Precision evaluates the alignment between a generated motion and its corresponding text description. For each motion sample, a pool of 32 descriptions is created, including the ground-truth text and 31 randomly selected mismatched texts from the test set. The Euclidean distances between the motion feature and the text features in the pool are computed and ranked. If the ground-truth description appears in the top-k closest matches, it is counted as a successful retrieval. The final R-Precision score is the average retrieval accuracy at Top-1, Top-2, and Top-3 positions [2].
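A schematic of this retrieval protocol is sketched below; the feature extractor is taken as given, and the pooling and ranking details are simplified for illustration.

```python
import numpy as np

def r_precision(motion_feats, text_feats, top_k=(1, 2, 3), pool_size=32, seed=0):
    """Compute Top-k R-Precision as described above.

    motion_feats, text_feats: (N, d) paired motion/text embeddings from a
    pretrained feature extractor (assumed given); requires N >= pool_size.
    """
    rng = np.random.default_rng(seed)
    n = len(motion_feats)
    hits = {k: 0 for k in top_k}
    for i in range(n):
        # Pool: the ground-truth text plus 31 mismatched texts from the test set.
        negatives = rng.choice([j for j in range(n) if j != i],
                               size=pool_size - 1, replace=False)
        pool = np.concatenate([[i], negatives])
        dists = np.linalg.norm(text_feats[pool] - motion_feats[i], axis=1)
        rank = np.argsort(dists).tolist().index(0)  # position of the ground-truth text
        for k in top_k:
            hits[k] += int(rank < k)
    return {f"Top-{k}": hits[k] / n for k in top_k}
```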

Since the R-Precision calculation relies on the distances in the feature space, in some cases, the generated motion may align more closely with the semantics of the text in the embedding space than the ground-truth motion itself. This could happen if the generated motion is smoother, less noisy, or more "canonical" than the real motion, making it easier for the model to match it to the text. As a result, it's possible for the R-Precision of generated motion to slightly surpass that of the ground truth. You can also observe this pattern in other works, such as MoGenTS [3], MotionPCM [4], etc. Hence, in some works, such as MoMask [5], ground truth values are not mentioned in their tables.

Once again, thank you for the valuable feedback. Your support means a lot and has helped us improve our work. We hope you will consider raising your score if you feel our response has addressed your concerns.

Reference:

[1] HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models (AAAI 2026)

[2] Generating Diverse and Natural 3D Human Motions from Text (CVPR 2022)

[3] MoGenTS: Motion Generation based on Spatial-Temporal Joint Modeling (NeurIPS 2024)

[4] MotionPCM: Real-Time Motion Synthesis with Phased Consistency Model

[5] MoMask: Generative Masked Modeling of 3D Human Motions

Comment

The authors have addressed most of my concerns; this may help make the paper easier to understand.

Comment

Thanks for your prompt response and positive feedback. We're glad that we addressed your concerns!

Comment

Dear Reviewer,

Thank you for your valuable feedback and support. If we have addressed your concerns, we kindly ask you to consider raising your score. We are happy to address any further questions.

Best regards,

Authors #6198

Comment

Dear Reviewer GZdj,

Thank you for your positive feedback. We’re glad to hear that your concerns have been adequately addressed.

We kindly hope you will consider raising your rating to further champion our paper. Your support means a lot to us, and we are happy to answer any further questions.

Once again, thank you for your time and support! We wish you a successful run at NeurIPS!

Best regards,

Authors #6198

Review
Rating: 4

This paper presents FlashMo, a text-to-motion (T2M) generation approach that incorporates frequency-aware information. To enable an efficient backbone for both training and inference, the authors extend the $X_t$ interpolation in flow-matching ODEs, commonly used in the Euclidean space of the image domain, to the Lie group used to represent body joint rotations. The authors also introduce a new model called MotionSiT. Furthermore, they propose a pruning strategy for the attention score matrix based on Hi-Fi and Lo-Fi frequency patterns in the final feed-forward layer, aiming to improve training efficiency while preserving realistic, frequency-aware motion details. The model is pretrained on the large-scale MotionHubV2 dataset and evaluated on HumanML3D and KIT-ML, showing superior quantitative performance compared to existing T2M methods.

Strengths and Weaknesses

Strength

(1) The idea of incorporating the interpolation process to the SO(3) manifold is well-motivated. Since human motion is inherently represented through rotations, modeling it in Euclidean spaces is fundamentally inappropriate. Traditional approaches rely on Euclidean gradient projections to approximate the target, rather than walking on the geodesic on the rotation manifold directly. By operating directly within SO(3), the proposed method aligns more naturally with the underlying structure of motion data.

(2) The pruning of attention score matrix for the Lo-Fi heads efficiently speeds up training.

Weakness

(1) The sparse multi-head attention is not sufficiently motivated or evaluated. While the authors integrate a Lie Group interpolant with sparse attention, no ablation studies are provided to isolate and quantify the contribution of each component to overall performance. Although the frequency-aware sparsification is conceptually well-grounded based on observed patterns in attention heads, it remains unclear whether alternative designs would lead to motions with different frequency characteristics. Additionally, it would be valuable to understand whether models employing frequency-aware pruning exhibit more realistic dynamics or richer frequency details compared to those without it. Some qualitative results or visualizations could help clarify the rationale and effectiveness of this design choice.

(2) The evaluation of the proposed framework is somewhat unclear. In Table 1, FlashMo (ours) without pretraining appears to be trained solely on HumanML3D or KIT-ML. However, despite the strong performance of FlashMo w/ Pretrain, the non-pretrained variant does not outperform existing methods such as MotionPCM or MoGenTS. Given that MotionHubV2 is described as a large-scale dataset, I assume it contains a richer diversity of motions compared to HumanML3D, which raises the question of whether the final performance gains stem primarily from the proposed technical innovations or from the benefit of pretraining on a much larger dataset. This distinction is important for assessing the true contribution of the method itself. Additionally, I found no public access for MotionHubV2 for now which also hinders reproducibility.

Questions

The description of the "frequency-aware" head partition in Section 3.3, as well as the procedure for generating Figure 4, is unclear. Specifically, it is not well explained how the frequency patterns are computed or how attention heads are categorized into Hi-Fi and Lo-Fi. Additionally, the caption of Figure 4 mentions "100 motion features," but it is not specified what exactly these features represent. Clarifying these details would help readers better understand and reproduce the analysis.

Limitations

yes

Final Justification

I appreciate the authors’ efforts and have carefully read all of their responses. I think they addressed my concerns well so I will raise my score.

Formatting Concerns

NO

Author Response

We appreciate the reviewer's valuable feedback and respond to the concerns as follows.

Q1.1: Motivation of sparse multi-head attention.

A1.1: Our motivation of sparse multi-head attention is recognized by Reviewer GZdj, “The frequency-aware sparse attention mechanism is well-motivated…”.

Furthermore, in our paper, the motivation for sparse multi-head attention has been explicitly mentioned in lines 28-31, "Efficiency. Motion diffusion suffers from high computational cost, as well as long training and inference times. Existing efficient methods such as step distillation [1,2], step reduction [3], and linear models [4,5] either require additional training of teacher models or involve complex scanning mechanisms, resulting in overengineering and low efficiency." and lines 37-40, "We observe that dynamic motion, which is more important, exhibits higher frequency compared to static motion in the generation process, as shown in Figure 1. Hence, we design a frequency-aware sparsification mechanism that dynamically prunes tokens corresponding to low frequency motion, enhancing efficiency while preserving high frequency motion at the attention head level."

Q1.2: Ablation study.

A1.2: The ablation study of each component, including the geometric factorized interpolant and frequency-aware sparsification, is shown in Tables 2 and 3.

(1) In Table 2, our interpolant method significantly improves generation quality compared to other methods.

(2) In Table 3, our frequency-aware sparsification significantly improves efficiency while largely preserving generation quality compared to the full attention baseline.

Both generation quality and efficiency are non-negligible perspectives of overall motion generation performance, which is recognized by Reviewer uEGj, “This enables faster computation while balancing with quality”.

Q1.3: Frequency-aware sparsification for different architecture.

A1.3: Sparsification methods are often designed for a series of models with similar or shared architectures. e.g. Sparse VideoGen [6] introduces a semi-structured sparsification method by designing a predefined attention mask based on the observed attention patterns in CogVideoX [7] and HunyuanVideo [8], both of which share a common video DiT architecture.

Moreover, our frequency-aware sparsification is a structured sparsification method with adaptive token selection, without predefined attention masks, which expands the potential of our method to adapt to other architectures.

Q1.4: Exhibit more realistic dynamics or richer frequency.

A1.4: This question relates to the fundamental mechanism of token sparsification methods, which prune tokens from the original set, reducing redundant and less important information without introducing new content [9,10], i.e., token sparsification methods aim to improve efficiency while preserving important information as much as possible (which impacts generation quality); they do not enrich motion frequency or exhibit more realistic dynamics.

Q2.1: Evaluation and concerns about w/o pretrain.

A2.1: In Table 1 and Figure 6 of our paper, the performance of FlashMo w/ pretraining is to demonstrate the outstanding scalability of our method, which does not hinder the performance gains achieved through our method’s advancements.

In Table 1, FlashMo (without pretrain) achieves SOTA on 5 out of 8 metrics on HumanML3D and 4 out of 7 metrics on KIT-ML, while MotionPCM achieves SOTA on at most two metrics (1-step) and MoGenTS does not achieve SOTA on any metric on HumanML3D; similarly, on KIT-ML, MoGenTS achieves SOTA on only two metrics and MotionPCM does not achieve SOTA on any metric.

This has been widely recognized by Reviewer GZdj, 'The proposed method shows better performance and becomes more efficient,' and by Reviewer HnUJ, 'The proposed method achieves state-of-the-art performance with the lowest inference time, training time, model size, and GFLOPs.’

Q2.2: No public access for MotionHub V2.

A2.2: The MotionHub V2 dataset is publicly available on the GitHub repository of MotionLLAMA [11], provided via a BaiduPan link.

Q3: Frequency-aware head partition and figure 4.

A3: As defined in lines 156–158 of our paper, attention heads are partitioned based on their frequency into Hi-Fi and Lo-Fi groups, where β is the partition ratio, which is ablated in Table 3.

As for the procedure for computing the frequency pattern in the figure, the caption of Figure 4 in our paper mentions: "The frequency magnitude is computed using the Fast Fourier Transform (FFT) and averaged over 100 latent motion features." Specifically, the frequency magnitude of each attention head is calculated with a Fast Fourier Transform (FFT). We take the output feature maps from either the Hi-Fi or Lo-Fi attention head. For each feature map, we apply the FFT to convert the latent representation into the frequency domain. We then compute the magnitude of the frequency components and apply a logarithmic transformation for stability and better interpretability. To ensure a reliable estimate, we randomly select 100 latent motion features and average their frequency magnitudes. This averaged result reflects the typical frequency characteristics of each attention head and is used to analyze or visualize the frequency sensitivity of different groups. Adopting the FFT to calculate frequency magnitude is a common approach, and similar methods have also been introduced in works such as [12] and [13].
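For concreteness, a minimal NumPy sketch of this frequency-magnitude computation follows; the feature shapes, the log1p transform, and the averaging axis are illustrative assumptions.

```python
import numpy as np

def mean_log_frequency_magnitude(features: np.ndarray) -> np.ndarray:
    """Average log-magnitude spectrum over a batch of latent motion features.

    features: (B, T, d) array, e.g. B=100 latent motion features of length T
    taken from the output of a Hi-Fi or Lo-Fi attention head.
    Returns the (T//2 + 1, d) mean log-magnitude along the temporal axis.
    """
    spectrum = np.fft.rfft(features, axis=1)   # FFT along the temporal dimension
    magnitude = np.abs(spectrum)
    log_magnitude = np.log1p(magnitude)        # log transform for stability
    return log_magnitude.mean(axis=0)          # average over the sampled features

# Usage: compare the typical frequency profile of two heads
rng = np.random.default_rng(0)
hi_fi = mean_log_frequency_magnitude(rng.standard_normal((100, 196, 64)))
lo_fi = mean_log_frequency_magnitude(rng.standard_normal((100, 196, 64)) * 0.1)
print(hi_fi.shape, lo_fi.shape)
```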

Once again, we appreciate the reviewer’s valuable feedback. We kindly hope you will consider raising your score if we have adequately addressed your concerns.

Reference:

[1] MotionLCM: Real-time controllable motion generation via latent consistency model (ECCV 2024)

[2] MotionPCM: Real-time motion synthesis with phased consistency model.

[3] Emdm: Efficient motion diffusion model for fast and high-quality motion generation (ECCV 2024)

[4] Motion mamba: Efficient and long sequence motion generation. (ECCV 2024)

[5] Light-t2m: A lightweight and fast model for text-to-motion generation (AAAI 2025)

[6] Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity (ICML 2025)

[7] CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

[8] HunyuanVideo: A Systematic Framework For Large Video Generative Models

[9] Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers (CVPR 2023)

[10] PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models (CVPR 2025)

[11] MotionLLaMA: A Unified Framework for Motion Synthesis and Comprehension (also known as “VersatileMotion”)

[12] Improving Vision Transformers by Revisiting High-frequency Components (ECCV 2022)

[13] Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding (TPAMI 2025)

Comment

Dear Reviewer,

Thank you for your valuable feedback and support. If we have addressed your concerns, we kindly ask you to consider raising your score. We are happy to address any further questions.

Best regards,

Authors #6198

Comment

Thank you for your prompt response. Below is our detailed response to your comment.

Q4. Performance.

A4. Below is a sub-table extracted from the original Table 1 in our paper, which presents the comparison on HumanML3D, as the reviewer emphasized.

(1) The reviewer claimed that our method fails to outperform prior state-of-the-art methods with comparable AIT in terms of FID, which he/she regards as the most critical metric. However, this claim is not supported by the results.

In terms of FID, the only method with close efficiency to FlashMo (0.027 AITs) is MotionPCM (1-step) at 0.031 AITs (slower). However, its FID score is 0.044, which does not outperform FlashMo’s 0.041.

Moreover, the advantages of our method extend beyond FID. It is evident that FlashMo (without pretraining) clearly outperforms MoGenTS and MotionPCM, achieves SOTA performance on 6 out of 8 metrics on HumanML3D, while MotionPCM achieves state-of-the-art results on at most two metrics (but across two model variants), and MoGenTS does not achieve state-of-the-art on any metric.

The results in the table speak for themselves.

| Method | AIT(s) ↓ | R-Precision Top-1 | Top-2 | Top-3 | FID ↓ | MM Dist ↓ | Diversity → | MModality ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MoGenTS | 0.181 | 0.529 | 0.719 | 0.812 | 0.033 | 2.867 | 9.570 | - |
| MotionPCM (1-step) | 0.031 | 0.560 | 0.752 | 0.844 | 0.044 | 2.711 | 9.559 | 1.772 |
| MotionPCM (2-step) | 0.036 | 0.555 | 0.749 | 0.839 | 0.033 | 2.739 | 9.618 | 1.760 |
| MotionPCM (4-step) | 0.045 | 0.559 | 0.752 | 0.842 | 0.030 | 2.716 | 9.575 | 1.714 |
| FlashMo (without pretrain) | 0.027 | 0.562 | 0.754 | 0.847 | 0.041 | 2.711 | 9.614 | 2.812 |

(2) The reviewer claimed that FID is the most critical metric, disregarding our significant improvements in other metrics. However, this claim is one-sided and tenuous.

FID exhibits notable limitations in various aspects. For example, as ParCo [1] mentioned: "FID does not assess the alignment between textual descriptions and generated motions." Similarly, as mentioned in Motion-Agent [2]: "variable length can lead to higher FID scores." That is, FID can only assess the distance between the distributions of two sets of motions [2].

Hence, it is essential to evaluate motion generation comprehensively, considering multiple aspects such as motion distribution distance (FID), text matching (R-Precision), text-motion alignment (MM Dist), diversity (MModality), and efficiency (AITs), etc. Judging a model’s performance primarily based on FID while disregarding other metrics is misleading.

It is worth noting that many latest models published in 2025—such as Motion-Agent (ICLR 2025) [2], ReMoGPT (AAAI 2025) [3], ScaMo (CVPR 2025) [4], and others—did not achieve the FID levels required by the reviewer, as shown in the table below. However, this did not prevent them from generating high-quality motions or being accepted by top-tier conferences. (Their FIDs are highlighted in the table.)

| Method | R-Precision Top-1 | Top-2 | Top-3 | FID ↓ | MM Dist ↓ | Diversity → | MModality ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MoGenTS | 0.529 | 0.719 | 0.812 | 0.033 | 2.867 | 9.570 | - |
| MotionPCM (1-step) | 0.560 | 0.752 | 0.844 | 0.044 | 2.711 | 9.559 | 1.772 |
| Motion-Agent (ICLR 2025) | 0.515 | - | 0.801 | **0.230** | 2.967 | 9.908 | - |
| ReMoGPT (AAAI 2025) | 0.501 | 0.688 | 0.792 | **0.205** | 2.929 | 9.763 | 2.816 |
| ScaMo-3B (CVPR 2025) | 0.512 | 0.695 | 0.796 | **0.101** | 2.990 | 9.590 | - |
| FlashMo (without pretrain) | 0.562 | 0.754 | 0.847 | 0.041 | 2.711 | 9.614 | 2.812 |

Reference:

[1] ParCo: Part-Coordinating Text-to-Motion Synthesis (ECCV 2024)

[2] Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs (ICLR 2025)

[3] ReMoGPT: Part-Level Retrieval-Augmented Motion-Language Models (AAAI 2025)

[4] ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model (CVPR 2025)

Comment

The authors have addressed most of my concerns in their rebuttal, but I remain unconvinced about the model’s performance with and without pretraining. While the frequency-aware sparsification indeed accelerates training and inference, the non-pretrained model still fails to outperform prior state-of-the-art methods with comparable AITs (e.g., MotionPCM at 0.031–0.045 s).

Although the authors claimed in their rebuttal that they achieved SOTA results on most HumanML3D metrics, they do not lead on the most critical FID score. This raises doubts about whether the claimed key contribution truly delivers across the board. The only substantial improvement in FID comes from augmenting the training set with MotionHubV2 data (from 0.041 down to 0.029). For these reasons, I maintain my original rating.

Comment

Q5. Whether the claimed key contribution truly delivers across the board.

A5. (1) The reviewer acknowledged that 'frequency-aware sparsification indeed accelerates training and inference,' but expressed concerns about whether the remaining key contribution is truly effective. It is important to note that, beyond efficiency, another key contribution of our method is scalability. That is, both our geometric factorized interpolant design and MotionSiT architecture are aimed at improving scalability in a data-driven manner.

Since MotionPCM is not yet open-sourced, we conducted a fair comparison between MoGenTS and our method, both of which are pretrained on MotionHub V2 and evaluated on HumanML3D. As shown in the table below, our method demonstrates significantly better scalability than MoGenTS. Moreover, after pretraining, MoGenTS does not show performance improvement and even exhibits a decline in some metrics—including FID (0.033 to 0.052), which the reviewer obviously cares about most.

| Method | R-Precision Top-1 | Top-2 | Top-3 | FID ↓ | MM Dist ↓ | Diversity → | MModality ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MoGenTS | 0.529 | 0.719 | 0.812 | 0.033 | 2.867 | 9.570 | - |
| MoGenTS (pretrained) | 0.530 | 0.718 | 0.815 | 0.052 | 2.962 | 9.647 | - |
| FlashMo (pretrained) | 0.568 | 0.761 | 0.851 | 0.029 | 2.703 | 9.601 | 2.851 |

This verifies that our key contribution delivers not only efficiency but also scalability.

(2) The reviewer also claimed that "The only substantial improvement in FID comes from augmenting the training set with MotionHubV2 data (from 0.041 down to 0.029)", ignoring the performance boost from our interpolant and architecture design. However, this claim is not true.

To fully showcase and ablate the performance gains from our interpolant and architecture design, we adopt full attention with standard temporal-spatial attention, excluding the effect of sparsification. The results in the following table demonstrate that our interpolant and architecture improve not only FID (0.029 without pretrain) but also overall generation quality. This improvement is unrelated to MotionHub V2, as no pretraining is used in this setting.

| Method | R-Precision Top-1 | Top-2 | Top-3 | FID ↓ | MM Dist ↓ | Diversity → | MModality ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MoGenTS | 0.529 | 0.719 | 0.812 | 0.033 | 2.867 | 9.570 | - |
| MotionPCM (1-step) | 0.560 | 0.752 | 0.844 | 0.044 | 2.711 | 9.559 | 1.772 |
| MotionPCM (2-step) | 0.555 | 0.749 | 0.839 | 0.033 | 2.739 | 9.618 | 1.760 |
| MotionPCM (4-step) | 0.559 | 0.752 | 0.842 | 0.030 | 2.716 | 9.575 | 1.714 |
| FlashMo (full attention, without pretrain) | 0.567 | 0.761 | 0.850 | 0.029 | 2.707 | 9.613 | 2.841 |

We hope these additional information has addressed your remaining concerns. Thank you.

Comment

Dear reviewers 8uxk,

We welcome further discussion to see whether our response addresses your concerns. To save your time, we provide a brief summary below:

1. FlashMo outperforms SOTA methods with comparable AIT (including FID).

The only method with comparable efficiency to FlashMo (0.027 AITs) is MotionPCM (1-step) at 0.031 AITs (even slower). FlashMo (sparse, no pretrain) achieves a better FID (0.041) than MotionPCM (1-step, FID 0.044). Moreover, FlashMo (full attention, no pretrain) achieves the lowest FID (0.029), outperforming all methods. In addition, our method achieves significant improvements in other metrics as well.

2. FlashMo's performance gains from its advanced interpolant and architectural design, not simply from data-driven training.

(1) To showcase and ablate the performance gains from our method design, we adopt full attention with temporal-spatial attention. The results of FlashMo (full attention, no pretrain) demonstrate that our method design improves not only FID (0.029) but also overall generation quality. This improvement is unrelated to MotionHub V2, as no pretraining is used in this additional ablation.

(2) The other key contribution of FlashMo is scalability. We conducted a fair comparison between MoGenTS and ours on HumanML3D, both of which are pretrained on MotionHub V2. FlashMo (FID 0.041→ 0.029) demonstrates significantly better scalability than MoGenTS (FID 0.033 → 0.052).

We hope these answers address your remaining concerns, and we look forward to your positive feedback!

Once again, thank you for your time and support! We wish you a successful run at NeurIPS!

Best regards,

Authors #6198

Comment

AC chiming in.

Thanks to both Reviewer 8uxk and the authors for the active discussions.

I acknowledge that the authors have conducted extensive numerical evaluations to prove the superiority of the proposed setup. If Reviewer 8ukk can continue the constructive discussions with critical thinking, we deeply appreciate it.

I do wish the authors had provided qualitative ablation studies with videos (I know providing more videos is not allowed at this stage). As in the discussion on the FID metric, the numbers do not always reflect the perceptual quality of motions. For example, while the method may outperform others in numbers, I still notice jitters, discontinuities, and physical inaccuracies (e.g. foot slides). A user study was conducted to qualitatively compare with other methods, but not with different design choices of the architecture.

Comment

I agree that in motion generation, the actual motion quality is way more important than numbers. In this field, there are various metrics for evaluating different aspects of generated motions—for example, R-Precision and matching score for measuring text–motion alignment, and Multi-Modality and Diversity scores for assessing the variety of generated motions.

FID essentially compares the distributions of real and generated motions, thereby reflecting motion fidelity, semantic correctness, and diversity; therefore, I consider it the most comprehensive, important, and informative metric. However, the limitation of the current motion FID lies in the feature extractor most widely adopted in the community, introduced by Guo et al. in the original HumanML3D paper. This pre-trained motion extractor was trained on a limited dataset and captures only a restricted range of motion features, without tuning on text-motion pairs with similar semantics. Because of this, I am often skeptical about whether a lower Guo FID truly reflects better motion quality, especially now that leaderboard scores in motion generation have dropped below 0.04. At such extremely small values, I wonder whether further reductions meaningfully indicate improved quality.

So, for evaluating real motion quality, I would always expect to see more comprehensive video results covering a wide range of motion categories, both common daily motions such as walking, running, and jumping, and more complex, challenging motions such as dancing, flipping, and karate, etc. Unfortunately the authors provided only five groups of comparisons, some of which contain artifacts, as the AC has noted. At this stage, however, it seems impossible to include additional videos.

I remain open to further discussion of these points with the AC and the authors.

Comment

Dear AC,

Thank you for your continued support and patience, especially your encouragement regarding our extensive experiments (the authors are truly grateful and moved).

Meanwhile, we share the same understanding with AC that FID does not always reflect the perceptual quality of motions. We hope our response addresses 8uxk’s concerns, and we are happy to answer any further questions.

As AC mentioned, unfortunately we cannot upload videos at this stage, but we will include improved visualizations with full attention and different design choices in the next version of our paper.

Best regards,

Authors #6198

Comment

Dear reviewer 8uxk,

Thank you for your further discussion and critical thinking!

We’re glad to see that you mentioned the limitation of the current FID metric in the community. The authors agree with the reviewer and share the same concern that the current motion FID in the community heavily relies on the feature extractor provided by Guo et al., which has limitations in reflecting the quality of generated motions.

This exactly reflects the importance of evaluating motion generation from a comprehensive perspective, not only using FID, which we mentioned in the previous discussion. The advantages of our method extend beyond FID, with significant improvements also observed in other metrics.

From a deeper perspective, this might require modern evaluation metrics and benchmarks that are carefully designed to better reflect motion generation quality and can be seamlessly adapted to various recent baselines. However, this cannot be easily accomplished by a single work or paper, but may require support and effort from the broader community. Otherwise, it will just be an ordinary dataset and benchmark paper.

Both the reviewers and the AC also mentioned a good point regarding visualization. I must admit that artifacts such as jitter and foot sliding are common problems in motion generation tasks, and this is exactly the motivation for us to keep exploring novel methods to address these challenges. If you look at modern models such as FlowMDM, it is an excellent work in long motion generation with high quality and ease of use, but you can still observe the foot sliding problem in their teaser video. Similarly, recent SOTA methods such as MoGenTS set a good standard in text-to-motion, but they include only four comparison demos to MoMask and also show some small artifacts.

From another perspective, as mentioned in the paper, recent VQ-VAE-based methods achieve outstanding quantitative results but often show small artifacts due to tokenization, e.g., frame-wise noise arising from directly decoding discrete tokens. Diffusion models, however, seem to perform much better in this regard, which is also the reason why we aimed to design a new method based on the diffusion architecture. We agree with the AC and reviewers that providing more ablation videos and user studies comparing different architectural designs will better showcase the effectiveness of our method. We will continue to improve these aspects to make our work more comprehensive.

Once again, thank you for the further discussion and support of our work.

Best regards,

Authors #6198

Comment

Dear AC and all reviewers,

First, we would like to thank all reviewers and the AC for their valuable feedback and active discussions. Your feedback has helped us improve our work and make it more comprehensive.

We would especially like to thank AC pKvX for your critical thinking on the fundamental aspects of our method, including the potential of learning standard VAE and SiT in tangent space, which provided a novel view to our work and motivated us to try and verify different setups through experiments. Moreover, your active discussions on both methodology and experiments made this an unforgettable and enjoyable rebuttal experience, and made us believe that NeurIPS is a fair, polite, and friendly top-tier conference.

Our paper’s contributions have been recognized in various aspects, including “motivation” by GZdj (frequency-aware sparsification) and 8uxk (geometric interpolant design), “novelty” by HnUJ, “balancing between performance and efficiency” by HnUJ and GZdj, “geometric interpolant design” by GZdj and 8uxk, “scalability” by HnUJ and uEGj, “efficiency improvement via frequency-aware sparsification” by uEGj and 8uxk, “extensive experiments” by the AC, “well-written” and "no weakness of the method" by HnUJ.

The majority of reviewers (HnUJ, uEGj, GZdj) provided overall positive feedback and acknowledged that their concerns were addressed after the rebuttal and discussion.

The main concern after the rebuttal was the FID raised by 8uxk, which has been explained and validated in our subsequent discussion. The AC agreed with the authors that FID does not always reflect the perceptual quality of motions, and 8uxk later acknowledged the limitation of the current FID metric in the community.

Moreover, the AC proposed a potentially simpler design for our work, involving learning a vanilla VAE and SiT in tangent space. This provided a novel view to our paper. However, our justification in the later discussion indicated that this simpler design may encounter a non-unique mapping problem, as the mapping from the Lie algebra so(3) (tangent space) to the Lie group SO(3) is not one-to-one, introducing difficulty and ambiguity to VAE training. Our later experiments on the VAE verified this problem. The AC also provided another potential justification related to frequency filtering, which the authors believe is also a good point.

Besides, we will continue to improve our paper by providing more comprehensive videos, user studies, and better visualizations to showcase the effectiveness of our method.

Once again, thank you all for your effort and support!

Best regards,

Author #6198

Final Decision

Strengths

  • Applying SiT (scalable interpolant transformer) for motion generation with specific designs
    • Independent noise schedule to the temporal and spatial dimensions
    • Interpolant in the tangent space of the SO(3)
    • Sparse attention separating high-frequency and low-frequency signals

Weaknesses

  • Not enough visual results provided
    • Only 5 sample videos
    • No videos on ablations to justify the design choices
  • Contrary to the claim of achieving SOTA in numbers, the provided results exhibit discontinuities, jitters, and physical implausibility
  • The authors left out an important detail in VAE (in short, the VAE latent code is regularized to be SO(3)) until HnUJ and AC asked
    • AC followed up with a question on simplifying the design by just doing all the learning (VAE and SiT) in the tangent space of SO(3). The author's answer says this is not desirable because of the singularity of the tangent space. This does not answer anything because the same singularity exists in their formulation of using the SO(3) tangent space for the SiT interpolant representation.

While the reviews leaned positively after the discussions, the design choices are not well justified in math and the results. I recommend accepting the paper under the condition that the authors will revise the paper with more convincing reasons why the method needed the SiT interpolant in SO(3), with qualitative ablation studies.