VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models
Abstract
Reviews and Discussion
This paper explores a new video-based retrieval-augmented motion generation framework for motion large language models.
VimoRAG uses large-scale in-the-wild video databases to enhance 3D motion generation by retrieving relevant 2D human motion signals.
The authors try to address two key bottlenecks: (1) developing an effective motion-centered video retrieval model that distinguishes human poses and actions, and (2) mitigating the issue of error propagation caused by suboptimal retrieval results.
They design the Gemini Motion Video Retriever mechanism and the Motion-centric Dual-alignment DPO Trainer, enabling effective retrieval and generation processes.
Experimental results show that VimoRAG significantly boosts the performance of motion LLMs constrained to text-only input.
Strengths and Weaknesses
Strengths:
- This paper uses some nice plots for the pipeline, such as Figs. 1, 2, 3, and 4.
- The comparisons offer some insights into the proposed framework.
- Visualisations, while very limited, do show the effectiveness of the model, e.g., Fig. 5.
Weaknesses:
- The motivations of this work are not strong enough, and the listed contributions do not read like contributions.
- As a research paper, the related work section should present enough closely related works, with the differences between this work and existing works clearly highlighted. This paper only presents around 20 lines of related work, which is inappropriate.
- As the main pipeline figure, Fig. 2 should clearly show the novel designs with enough description to convey the insights; its caption only contains 8 words, and it is unclear how this work differs from existing works.
- It is suggested to add a notation section detailing the mathematical symbols and operations used in the paper.
- Some nice visualisations and discussions are presented in the appendix. The authors should consider moving those key insights into the main paper and thoroughly improving the paper. The current presentation and organisation need significant improvement.
- This paper also reads as somewhat machine generated.
Questions
Please refer to weaknesses.
Limitations
Limitations are provided in appendix section E.
Final Justification
After carefully re-reading the revised related work section and thoroughly examining the full paper again, I still feel this work lacks sufficient novelty.
While the proposed VimoRAG framework introduces some architectural refinements, such as the Gemini-MVR retriever and the McDPO training strategy, and uses large-scale in-the-wild video data, the core idea of retrieval-augmented motion generation is not new.
The central methodology closely follows the paradigm established by ReMoDiffuse, and the retrieval-augmented generation (RAG) concept has already been explored in that and several other works (e.g., MotionBank, MotionLLM, and the MotionGPT family).
The claimed novelty of scaling RAG to 2D in-the-wild video data is an incremental step, not a fundamental conceptual advance. The work essentially extends ReMoDiffuse from 3D motion retrieval to 2D video retrieval, with engineering solutions to known challenges (e.g., retrieval quality, modality gap, alignment), but without redefining the problem or offering a genuinely new approach.
Given the field's rapid progress, meaningful novelty should go beyond scaling existing methods or plugging in improved retrievers or trainers. Despite solid execution and competitive results, this paper falls short of that bar.
I therefore recommend rejection.
Formatting Issues
NA
Dear Reviewer waEi,
We sincerely thank you for recognizing the value of our experimental results. We apologize for any confusion caused by the presentation of our motivation and greatly appreciate the opportunity to provide further clarification.
Q1: The motivations are not strong enough; contributions do not sound like contributions.
A: We would like to restate our motivation and contributions more clearly. We aim to address a key challenge: existing motion datasets are small and have limited text diversity, making it hard for models to generate realistic motion from complex queries. We observe that human action videos are abundant and diverse. This leads to our main research question: Can 2D human motion signals from videos guide 3D motion generation? If so, what are the challenges, and how do we solve them?
We restate our specific contributions as follows:
- (The paradigm perspective) We are the first to demonstrate that 2D videos effectively guide 3D motion generation, providing strong generalization ability even beyond the original dataset domain.
- (The technical perspective) We find that existing text-to-video retrieval models underperform on human behavior videos. We analyze the potential reasons in detail and propose Gemini-MVR. We identify the issue of error propagation in cross-modal RAG settings and introduce McDPO training strategy.
- (The empirical perspective) Experiments confirm the effectiveness of the proposed paradigm across multiple benchmarks, while also demonstrating its potential for scalability.
We believe these contributions are valuable not only for the motion generation domain, but potentially for broader applications:
- Our text-to-human-video retrieval model has practical utility in applications such as sports clip retrieval and short video search. It outperforms strong baselines like InternVideo in this context, while maintaining a simple and extensible design.
- Error propagation remains a major limitation in RAG-based systems. McDPO proposes a novel solution by aligning both input and output spaces, providing a generalizable framework that could benefit other LLM applications involving retrieval-based generation.
Q2: Related work section is too brief.
A: We appreciate this suggestion to improve the paper’s quality. We focus on what we believe are the most closely related works, but we are happy to expand this section in the appendix (in the revised version) to provide a more comprehensive overview.
Q3: As a main pipeline figure, it is suggested to clearly show the novel designs with enough descriptions to show the insights.
A: We appreciate this helpful suggestion to improve the paper. Due to space constraints, we keep the figure captions concise and provide more detailed explanations in the Introduction. Although we believe the figures are already quite clear, we agree to revise them in the revised version. We think all these issues can be easily addressed without significant cost.
Q4: Add a notation section.
A: We appreciate this helpful suggestion to improve the paper. We will consider including a notation section in the revised version.
Q5: The authors should consider moving those key insights into the main paper and thoroughly improving the paper to make it better.
A: We appreciate this helpful suggestion to improve the paper. In our view, the content currently in the main paper is more important than that in the appendix. We have tried our best to present the most important content without exceeding the page limit. Of course, we would be happy to revisit this in the revised version.
Q6: This paper also reads a bit heavily machine generated.
A: We apologize for any unpleasant reading experience. However, all the content in this paper was written personally by the author team. We only used LLMs to polish the grammar and word choice of a few individual sentences.
We would be deeply grateful if you could reconsider the contributions of our paper in light of this clarification. If you have any further questions or concerns, we would be more than happy to discuss them in detail.
Best regards,
Authors
Regarding Q2: Could the authors provide the revised related work section, as mentioned in the appendix, to offer a more comprehensive overview?
Regarding Q3: What detailed explanations have been added to the Introduction? Additionally, could you share the revised figure captions?
Regarding Q4: Could the authors specify which notations have been added?
Providing these modifications now would strengthen the submission and allow the reviewer to clearly assess the improvements. The reviewer is particularly interested in understanding the specific changes made.
Thank you!
Dear Reviewer waEi,
We sincerely thank you for the opportunity to further elaborate on these details, and we are happy to present a more thorough explanation below.
Q2: The extended related work section.
A: We provide a detailed overview of representative related works. Due to the rebuttal length constraint, we have omitted the reference citations here.
Motion generation has long been a hot research topic in the related community, aiming to generate human-like 3D motion based on a given prompt, such as text, action, or incomplete motion. Text-to-motion is among the most significant task settings and has attracted substantial research attention. For instance, T2M-GPT explores a generative framework utilizing VQ-VAE and Transformer for motion generation. MDM introduces a diffusion-based generative model trained across multiple motion tasks. MLD enhances the latent diffusion model to produce motions conditioned on various inputs. These motion generation specialists demonstrate strong performance following in-house training.
In recent years, LLMs have demonstrated unprecedented levels of intelligence, driving the evolution from specialists to generalists. In the motion domain, motion language models have been proposed, where motion-aware encoders are connected to a central LLM, leading to motion generalists (also named motion LLMs) capable of perceiving various motions. Further, motion generators (e.g., diffusion or VQ-VAE models) are integrated to achieve unified motion LLMs for both comprehension and generation. To achieve robust motion manipulation capabilities, these motion LLMs require fine-tuning on large annotated datasets. However, compared to comprehension tasks, motion generation is more reliant on data, yet motion annotation datasets are often quite limited due to the high cost of annotation.
A very recent study, ReMoDiffuse, introduces a motion generation method based on the retrieval-augmented generation (RAG) paradigm. It performs text-to-text retrieval from a labeled 3D motion database to fetch related motion signals and enhance generation quality. However, as previously mentioned, existing motion databases are typically limited in scale. In contrast, large-scale video databases are more accessible and diverse. Motivated by this, we explore a human motion-centric video retrieval framework to support 3D motion generation, where motion-consistent 2D features extracted from videos are effectively transferred to guide the 3D motion synthesis. Compared to ReMoDiffuse, our approach introduces two key innovations. First, we leverage cross-modal text-to-video retrieval to eliminate the reliance on motion databases that require manually written textual descriptions. Second, we are the first to identify the issue of error propagation in motion-RAG frameworks, and propose a novel algorithm, McDPO, to address it.
Notably, several motion LLM studies have also explored human-related video tasks. Inspired by MotionBank, we construct our video corpus from multiple action-centric datasets. Unlike previous works that focus on building high-quality video collections, this work instead centers on validating the potential and robustness of the VimoRAG framework when retrieving from a wild video corpus, and on addressing the potential inconsistency between video input and the generated motion.
Q3: The detailed explanations and the revised figure captions.
A: The revised figure captions:
Overview of the VimoRAG pipeline: (1) text-to-video retrieval via Gemini-MVR, and (2) video-augmented motion generation guided by text and retrieved video. Gemini-MVR (Sec. 3.2) is designed to improve cross-modal human-centric video retrieval, while the McDPO training strategy (Sec. 3.3) mitigates error propagation caused by noisy retrievals.
We provide a brief explanation in the Introduction, with a more detailed explanation in Lines 39–73. The brief explanation is as follows:
As illustrated in Figure 2, VimoRAG first retrieves a relevant video from an unlabeled video database based on the input text. Both the text and the retrieved video are then fed into an LLM to generate motion tokens, which are finally decoded into a motion sequence via VQ-VAE. To enhance this pipeline, we propose two key components: Gemini-MVR for effective cross-modal human video retrieval, and McDPO, a novel training strategy to mitigate error propagation in this process.
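For concreteness, this inference flow can be summarized in the minimal sketch below (all function and module names, e.g., `retriever.retrieve` and `vqvae.decode`, are illustrative placeholders rather than our released implementation):

```python
# Illustrative sketch of the VimoRAG inference flow (placeholder names, not the actual code).
def generate_motion(text_prompt, video_db, retriever, llm, vqvae):
    # (1) Text-to-video retrieval with Gemini-MVR over the unlabeled video corpus.
    video = retriever.retrieve(text_prompt, video_db)
    # (2) The text and the retrieved video jointly condition the motion LLM,
    #     which autoregressively emits discrete motion tokens.
    visual_tokens = llm.project_visual(video)
    motion_tokens = llm.generate(text_prompt, visual_tokens)
    # (3) The VQ-VAE decoder maps the discrete tokens back to a 3D motion sequence.
    return vqvae.decode(motion_tokens)
```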
Q4: Could the authors specify which notations have been added?
A: Although we have explained the notations at multiple points in the paper, we appreciate the reviewer’s recommendation to include a dedicated notation section. Accordingly, we will collect these notations in a dedicated section in the revised version.
We hope this response helps clarify your concerns. We would be glad to discuss any further questions!
Best regards,
Authors
Thank you for the responses.
After carefully re-reading the revised related work section and thoroughly examining the full paper again, I still feel this work lacks sufficient novelty.
While the proposed VimoRAG framework introduces some architectural refinements, such as the Gemini-MVR retriever and the McDPO training strategy, and uses large-scale in-the-wild video data, the core idea of retrieval-augmented motion generation is not new.
The central methodology closely follows the paradigm established by ReMoDiffuse, and the retrieval-augmented generation (RAG) concept has already been explored in that and several other works (e.g., MotionBank, MotionLLM, and the MotionGPT family).
The claimed novelty of scaling RAG to 2D in-the-wild video data is an incremental step, not a fundamental conceptual advance. The work essentially extends ReMoDiffuse from 3D motion retrieval to 2D video retrieval, with engineering solutions to known challenges (e.g., retrieval quality, modality gap, alignment), but without redefining the problem or offering a genuinely new approach.
Given the field's rapid progress, meaningful novelty should go beyond scaling existing methods or plugging in improved retrievers or trainers. Despite solid execution and competitive results, this paper falls short of that bar.
Dear reviewer waEi,
We truly appreciate your thoughtful comments and fully respect your viewpoint. Nonetheless, we feel there might have been some misunderstandings about the technical contributions of our work — perhaps due to insufficient explanation on our part. We hope to clarify these aspects from a different angle, and thank you again for your time and attention.
Q1: The reviewer argues that the paper only incrementally extends ReMoDiffuse from 3D to 2D retrieval, using engineering tweaks without offering a fundamentally new approach or redefining the problem.
A: Our contribution goes beyond simply introducing the video-based motion RAG paradigm — this serves merely as the starting point of our investigation. As we previously stated, our work includes both technical and empirical contributions. Here, we would like to restate these aspects in a more concrete and detailed manner:
On technical contributions:
The two key technical contributions in our work are not merely solutions to pre-identified or well-known challenges. Rather, they are built upon newly observed, under-explored issues that we identify and analyze in depth.
Technical Contribution 1 – Gemini-MVR for Human-Centric Video Retrieval:
One key challenge we identified is that existing Video Foundation Models (VFMs), even when pre-trained on large-scale video data, perform poorly on human-centric video retrieval, particularly in capturing fine-grained human poses and actions. To our knowledge, this issue is rarely discussed in prior works.
The Gemini-MVR proposed in this paper achieves its improvements entirely through architectural design, with almost no hyperparameter tuning and significantly less training data than the backbone VFM, which was pretrained on massive datasets. The final performance even surpasses that of the fine-tuned VFM, which leads us to believe this is not merely an engineering solution. From another perspective, if we were to adopt engineering approaches (such as synthesizing large-scale domain-specific datasets), we believe the results could be even better.
Technical Contribution 2 – McDPO to Mitigate Error Propagation in Motion RAG:
Another challenge is error propagation from noisy or misaligned retrievals — an issue not examined in ReMoDiffuse or similar prior works. We tackle this with McDPO, a novel training strategy inspired by DPO, to guide the LLM in how and when to rely on retrievals.
To our knowledge, DPO has rarely been applied in RAG contexts. Our design of the reward function specifically addresses the nuanced decisions involved in whether to trust retrieved video content. This is far from an "engineering tweak" and represents a new direction for mitigating modality gap and alignment issues in cross-modal generative tasks.
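For reference, McDPO builds on the standard DPO objective, shown here in its generic form (our dual-alignment reward construction over retrieved videos is detailed in Sec. 3.3 and is not reproduced here):

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right],
$$

where $x$ is the conditioning input (in our setting, the text prompt together with the retrieved video), $y_w$ and $y_l$ are the preferred and dispreferred generations, $\pi_{\mathrm{ref}}$ is the frozen reference policy, and $\beta$ controls the deviation from it.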
Empirical Contributions:
As you generously acknowledged, our experiments demonstrate “solid execution and competitive results.” We’d like to emphasize that our empirical contributions provide new insights beyond performance improvements:
- We showcase the effectiveness of video-based motion RAG in diverse scenarios, including both cross-domain generalization and in-domain performance.
- We answer critical scalability questions: How does performance change with the size of the retrieval corpus? This is crucial for understanding the long-term potential of video-based motion RAG, yet remains unexplored in other works.
- We empirically show that error propagation is a significant challenge in motion RAG, which has been overlooked in other works.
Q2: The central methodology closely follows the paradigm established by ReMoDiffuse, and the RAG concept has already been explored in that and several other works (e.g., MotionBank, MotionLLM, and the MotionGPT family).
A: As noted earlier, while ReMoDiffuse introduced the RAG concept, our contribution goes far beyond a simple cross-modal extension. We conduct a deeper investigation within this paradigm, identify new challenges, and propose novel solutions. As for MotionBank, MotionLLM, and the MotionGPT family, to the best of our knowledge, none of these works explicitly explore RAG.
We sincerely apologize if the previous rebuttal failed to clearly convey our technical contributions due to space limitations. We truly hope the reviewer can reconsider the full scope of our work — beyond the paradigm itself — and recognize the value of the new problems we uncover and the methods we propose.
Best regards,
Authors
This paper proposes a novel video-based 3D motion RAG framework, named VimoRAG, which comprises the Gemini-MVR retriever and the McDPO trainer. By constructing a large-scale dataset called HcVD, the authors first train the Gemini-MVR retriever and then combine it with the McDPO training strategy to optimize the overall VimoRAG framework. The dataset preprocessing and organization are handled with notable care, leading to high-quality text–video pairs that strengthen the overall pipeline. McDPO, a key component of the training strategy in VimoRAG, successfully bridges the gap between 2D visual representations and 3D motion tokens. The paper articulates the framework and its underlying rationale with exemplary clarity, and the extensive experimental results show that VimoRAG surpasses most baselines, thereby validating the effectiveness of the proposed approach. However, despite the well-structured design of the HcVD dataset, the selection and filtering mechanism is not fully detailed. Additionally, several methodological details remain ambiguous. Therefore, I recommend an accept score for this paper.
Strengths and Weaknesses
Pros:
- This paper presents a novel motion generation framework based on video RAG, which leverages a large-scale dataset named HcVD to address the long-standing issue of data scarcity in the motion generation domain.
- The Gemini-MVR retriever, which is specifically designed for human-centric video retrieval, is trained using the proposed HcVD dataset to enhance the effectiveness of the RAG pipeline.
- A novel training strategy, McDPO, is introduced to effectively bridge the gap between 2D visual representations and 3D motion tokens.
Cons:
- Although HcVD is well organized, the selection and filtering mechanism involved in its construction is not fully described.
- Several methodological details remain ambiguous.
Questions
- As illustrated in the paper, HcVD is gathered from several existing video datasets. However, it remains unclear whether HcVD was constructed by directly including all videos from the source datasets or by selecting a subset of videos based on certain criteria. If the latter is the case, I would appreciate clarification on the selection standards applied during this process.
- As far as I understand, object-level representations are typically introduced in human–object interaction (HOI) tasks, where modeling object context is essential. However, VimoRAG, which can be regarded as a motion generation task, uses an object encoder within the Gemini-MVR retriever. This raises questions about the actual contribution of the object encoder to the overall framework. Would you please conduct an additional ablation study, accompanied by an explanatory discussion, to better clarify its impact on the overall generation performance?
- The paper mentions that the captions in HcVD are not involved in the RAG pipeline. Does it mean the captions are excluded from the McDPO training stage, or that they are not used during model inference and evaluation?
- In the experimental section, the HumanML3D dataset is used for evaluation. However, HumanML3D only contains motion and text data, without any video data. Given that VimoRAG relies on retrieving videos based on input text, how is the model evaluated on the HumanML3D dataset? Does this imply that VimoRAG can operate solely with text input and bypass the video retrieval stage during inference?
Limitations
Yes
Final Justification
The authors have addressed my questions. This is good work, and I will keep my original score. However, I must emphasize that although there are now many motion-related datasets available, a large portion of them rely on annotations generated by models such as GPT-4o, followed by insufficient human review. This has resulted in poor data quality in many datasets, which in turn harms the interests of researchers in the field (especially since reviewers often require comparisons against these flawed datasets). I hope you can carefully screen the data before releasing any public dataset.
Formatting Issues
No
Dear Reviewer SSnd,
We sincerely appreciate your recognition of the technical aspects of our work. We are more than happy to address each of the questions you raised in detail.
Q1: HcVD's selection and filtering mechanism is not fully described. Was HcVD constructed by directly including all videos from the source datasets, or by selecting a subset of videos?
A: Thank you for pointing this out. We do apply a filtering step when constructing HcVD. Specifically, we remove videos in which no human is detected in the sampled frames. This is briefly mentioned in lines 130–131 of the paper, and we will provide a more detailed explanation in the revised version.
Importantly, our design philosophy is to avoid heavy preprocessing of wild video data. Instead, we aim to handle the inherent noise via a stronger retriever (Gemini-MVR) and a noise-resilient training algorithm (McDPO), rather than by manual curation.
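For illustration, the human-presence filter amounts to something like the sketch below; the frame sampler, keypoint detector, and confidence threshold are passed in as assumptions and are not the exact settings used to construct HcVD:

```python
from typing import Callable, List, Sequence

# Sketch of the human-presence filter (helper callables and threshold are assumptions).
def keep_video(
    video_path: str,
    sample_frames: Callable[[str, int], Sequence],            # returns sampled frames
    detect_keypoint_scores: Callable[[object], List[float]],  # per-keypoint confidences for one frame
    num_frames: int = 8,
    min_conf: float = 0.5,
) -> bool:
    for frame in sample_frames(video_path, num_frames):
        scores = detect_keypoint_scores(frame)
        if scores and max(scores) > min_conf:
            return True   # at least one sampled frame contains a detected person
    return False          # no human detected in any sampled frame: drop the video
```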
Q2: Several methodological details remain ambiguous.
A: We apologize for any confusion. We will revise the paper to include additional methodological details wherever possible to ensure better clarity.
Q3: The impact of object-encoder on the overall generation performance.
A: The generation performance using only the object-level retriever has already been shown in Table 3 (Setting: Int+Mc). To further clarify the impact of object-level and action-level features on generation quality, we provide the generation results using the action-level retriever below:
| Retriever | FID ↓ | Top-1 ↑ | Top-2 ↑ | Top-3 ↑ |
|---|---|---|---|---|
| Gemini-MVR | 0.148 | 0.429 | 0.625 | 0.756 |
| InternVideo (object-level) | 0.205 | 0.433 | 0.638 | 0.736 |
| action-level | 0.342 | 0.426 | 0.630 | 0.731 |
As shown, using only action-level features yields worse performance than using only object-level features. This is primarily because action-level retrieval is less accurate. We provide retrieval performance below:
| Retriever | R@1 ↑ | R@5 ↑ | R@10 ↑ |
|---|---|---|---|
| action-level | 24.0 | 48.8 | 61.4 |
| object-level | 52.3 | 84.0 | 91.5 |
The object-level retriever is not intended to directly guide generation, but to compensate for the current limitations of action-level retrieval. If the latter were sufficiently strong, relying on object-level features might no longer be necessary.
Nonetheless, object interactions remain essential in some specific motion generation scenarios, especially in the IDEA-400 test set (see examples in Figures 12 and 13).
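For context, the two retrieval signals can be combined in a simple late-fusion manner as sketched below; the actual fusion used in Gemini-MVR is described in Sec. 3.2, and the fixed weighting here is only an illustrative assumption:

```python
# Generic late-fusion sketch for the object-level and action-level retrievers
# (the weighting is an assumption, not the Gemini-MVR design).
def rank_videos(query, videos, object_retriever, action_retriever, alpha=0.5):
    def fused_score(video):
        s_obj = object_retriever.score(query, video)   # object-level text-video similarity
        s_act = action_retriever.score(query, video)   # action-level text-video similarity
        return alpha * s_obj + (1 - alpha) * s_act
    return sorted(videos, key=fused_score, reverse=True)  # best match first
```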
Q4: Does it mean the captions are excluded from the McDPO training stage, or that they are not used during model inference and evaluation?
A: Captions in HcVD are only used to train and evaluate the text-to-video retriever. They are not used during McDPO training, model inference, or evaluation. During McDPO training, videos are retrieved from HcVD based on input motion descriptions (from motion datasets), not the video captions. The retrieval stage in VimoRAG uses text-to-video retrieval, not text-to-text retrieval, so HcVD captions are not involved. In other words, if an off-the-shelf text-to-human video retrieval model were available, we wouldn’t need to synthesize captions for HcVD at all.
Q5: Given that VimoRAG relies on retrieving videos based on the input text, how is the model evaluated on the HumanML3D dataset? Does this imply that VimoRAG can operate solely with text input and bypass the video retrieval stage during inference?
A: You're correct that HumanML3D does not contain videos. For both training and evaluation on HumanML3D, we retrieve videos from HcVD using the motion descriptions from HumanML3D, and then use the retrieved videos to condition the motion generation model. Thus, the model still operates in a video-retrieval-augmented manner. Like all existing methods, we only validate the generated motions on the HumanML3D dataset, even though our model retrieves videos during generation.
Note that VimoRAG can be slightly modified to support text-only input. We conduct the following additional experiment to evaluate this setting:
| Method | FID ↓ | Top-1 ↑ | Top-2 ↑ | Top-3 ↑ |
|---|---|---|---|---|
| VimoRAG (text-only) | 0.283 | 0.420 | 0.610 | 0.706 |
| VimoRAG | 0.131 | 0.452 | 0.655 | 0.764 |
While VimoRAG can operate with text-only input, using retrieved videos yields significantly better performance.
Please let us know if you have any further questions—we would be happy to discuss them!
Best regards,
Authors
- In-the-wild videos capture diverse and unconstrained human motions. VimoRAG uses large-scale in-the-wild video databases to enhance 3D motion generation by retrieving relevant 2D human motions.
- VimoRAG consists of two components: the Gemini-MVR retriever and the McDPO trainer. Gemini-MVR is an effective motion-centered video retrieval model, and the McDPO trainer mitigates the issue of error propagation caused by suboptimal retrieval results.
- Gemini-MVR includes two independent retrievers—an object-level retriever and an action-level retriever—to improve retrieval accuracy.
Strengths and Weaknesses
Strengths
- VimoRAG proposes to use video-based 3D motion RAG, which significantly alleviates the motion data scarcity bottleneck in existing methods.
- The research motivation of the paper is reasonable, and the method is relatively easy to understand.
Weaknesses
1. The paper mentions that 2D visual priors may not align well with the 3D motion space; I suggest imposing constraints on the Visual Projector to align the 2D visual and 3D motion spaces to further improve performance.
2. The experimental section lacks a direct comparison between retrieving 3D motion and retrieving videos within the current architecture, so there is no evidence that retrieving videos has a greater advantage than retrieving 3D motion.
Questions
See Weaknesses
Limitations
Yes, the authors discussed the limitations regarding latency, but I think other issues should also be covered.
Final Justification
All my concerns have been addressed.
Formatting Issues
There are no formatting issues.
Dear Reviewer 5LZb,
First of all, we sincerely appreciate your recognition of the motivation and other key aspects of our work. We are very glad to address your concerns.
Q1: The reviewer suggests imposing constraints on the Visual Projector to align the 2D visual and 3D motion spaces, to further improve performance.
A: Thank you for the insightful suggestion. We did consider imposing explicit constraints to directly align the 2D visual and 3D motion spaces. However, in our setting, this is technically difficult because the 3D motion is represented as discrete tokens (via VQ-VAE), which are not differentiable. As a result, there is no direct gradient path from the 3D motion tokens back to the 2D visual features, making end-to-end alignment infeasible.
Instead, we map both the 2D visual inputs and the 3D motion tokens into a shared language space and align them implicitly through next-token prediction. This allows us to bypass the non-differentiability issue while still enabling effective cross-modal fusion.
Even so, we find the reviewer’s suggestion very promising. In future work, we plan to explore the performance of directly projecting the hidden states of the LLM into the 3D motion space for finer alignment.
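To make the implicit alignment described above concrete, here is a minimal training-loss sketch under assumed tensor shapes and a HuggingFace-style causal LM interface (not our exact implementation):

```python
import torch
import torch.nn.functional as F

# Sketch: 2D visual features and discrete motion tokens are both mapped into the LLM's
# embedding space, and supervision is next-token prediction on the motion tokens,
# so no gradient needs to pass through the non-differentiable VQ-VAE quantizer.
def motion_lm_loss(llm, projector, video_feats, text_embeds, motion_ids):
    visual_embeds = projector(video_feats)                   # (B, Tv, D) via the Visual Projector
    motion_embeds = llm.get_input_embeddings()(motion_ids)   # (B, Tm, D); motion tokens are LLM vocab entries
    inputs = torch.cat([visual_embeds, text_embeds, motion_embeds], dim=1)
    logits = llm(inputs_embeds=inputs).logits                # (B, Tv+Tt+Tm, vocab)
    tm = motion_ids.size(1)
    motion_logits = logits[:, -tm - 1:-1, :]                 # positions that predict each motion token
    return F.cross_entropy(motion_logits.reshape(-1, motion_logits.size(-1)),
                           motion_ids.reshape(-1))
```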
Q2: The experimental section lacks a direct comparison of retrieving 3D motion and retrieving videos via the current architecture.
A: We appreciate this point. We do not include results for retrieving 3D motions using our architecture because it is not designed or pre-trained for this task, and a direct comparison might be unfair. Specifically:
- Our visual projector and LLM are jointly pre-trained on a large-scale 2D visual dataset (CC-595K). To our knowledge, there is no 3D motion-language projector with comparable pretraining scale.
- The retrieval pipelines for 2D videos and 3D motions differ substantially, and the retrieval quality for text-to-3D motion is still limited compared to 2D video, which would impact the generation results.
In fact, we compare our method against ReMoDiffuse, a strong retrieval-augmented 3D motion generation baseline. As shown in Table 2 (in-domain setting) in this paper, ReMoDiffuse achieves better results than our method. This is likely because it retrieves motions directly from the HumanML3D training set, which shares the same distribution as the test set. In contrast, our retrieval pool comes from a different domain, making the setting more challenging. However, in out-of-domain scenarios (Table 1 in this paper), our method outperforms ReMoDiffuse by a clear margin, demonstrating better generalization.
To further address the reviewer’s concern, we modify our pipeline to support 3D motion retrieval directly. We retrieve motion sequences by performing CLIP-based text-to-text retrieval on their associated textual labels, then encode the retrieved motions into motion tokens using VQ-VAE, and concatenate them with the text prompt. The results are shown below:
HumanML3D
| Retrieval Source | FID ↓ | Top-1 ↑ | Top-2 ↑ | Top-3 ↑ |
|---|---|---|---|---|
| Motions | 0.427 | 0.471 | 0.651 | 0.748 |
| Videos | 0.131 | 0.452 | 0.655 | 0.764 |
IDEA400 (zero-shot)
| Retrieval Source | FID ↓ | Top-1 ↑ | Top-2 ↑ | Top-3 ↑ |
|---|---|---|---|---|
| Motions | 4.640 | 0.108 | 0.168 | 0.242 |
| Videos | 2.388 | 0.113 | 0.193 | 0.270 |
Even in the in-domain setting, 3D-motion-based retrieval underperforms 2D-video-based retrieval in this architecture. This may be attributed to:
- Poor retrieval quality from the CLIP text encoder for motion descriptions.
- The difficulty in aligning motion tokens with the LLM input space, as LLMs are not trained on such token distributions, making both input and output alignment challenging.
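For completeness, the modified 3D-motion-retrieval variant evaluated above amounts to roughly the following sketch (the encoder helpers and the motion-database layout are illustrative assumptions):

```python
import numpy as np

# Sketch of the 3D-motion-retrieval variant: CLIP text-to-text retrieval over the
# motion labels, then VQ-VAE encoding of the retrieved motion into discrete tokens.
def retrieve_motion_tokens(query_text, motion_db, clip_text_encode, vqvae_encode):
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    q = clip_text_encode(query_text)
    # Pick the database entry whose textual label is closest to the input prompt.
    best = max(motion_db, key=lambda entry: cos(q, clip_text_encode(entry["label"])))
    # The retrieved 3D motion is tokenized and concatenated with the text prompt
    # before being fed to the LLM.
    return vqvae_encode(best["motion"])
```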
We hope this clarifies our design choice and provides a fair comparison.
We sincerely thank the reviewer once again for the valuable suggestions. We are more than happy to further discuss if there are any additional questions!
Best regards,
Authors
This paper presents VimoRAG, a novel video-based retrieval-augmented generation framework designed to address the motion data scarcity problem in existing approaches. The proposed method introduces two key plug-and-play components: (1) Gemini-MVR retriever: a human-centric video retrieval module that overcomes limitations of previous text-based retrieval systems; and (2) McDPO trainer: a robust training framework that mitigates error propagation in cross-modal motion generation pipelines. Experimental results show that VimoRAG delivers (1) significant performance gains in out-of-distribution (OOD) scenarios and (2) consistent improvements for vanilla motion LLMs in in-domain settings. By effectively bridging the video-motion modality gap, VimoRAG represents an important advance in data-efficient motion generation.
Strengths and Weaknesses
Strengths
-
Clear Presentation: The paper is well-written, logically structured, and easy to follow. The authors effectively address all questions raised in the introduction throughout subsequent sections.
-
Novel Contribution: The application of Retrieval-Augmented Generation (RAG) to motion generation presents an interesting approach to addressing data scarcity. While inherent modality gaps between 2D videos and 3D motion exist, the proposed solutions demonstrate meaningful progress.
-
Reproducibility: The availability of code and partial data significantly enhances the paper's reproducibility and practical utility.
Weaknesses & Questions
- Fundamental Mechanism of 2D-to-3D Transfer
While the results demonstrate empirical improvements, the theoretical foundation for how 2D videos (which lack 3D information) enhance motion generation remains unclear, given the limited category overlap between motion datasets (HumanML3D) and video datasets (Kinetics-400, NTU-RGBD-120, HMDB).
Could the authors provide a more detailed explanation of the underlying transfer mechanism, an analysis of which specific video features contribute most to motion generation, or a discussion of the potential limitations when category mismatches are more severe?
- Scalability Considerations
The current 400K video dataset, while substantial, shows limitations in category diversity (only 120-400 classes) and relevance to human motion tasks. Have the authors evaluated the noise-performance tradeoff when scaling to larger but noisier datasets? What strategies could maintain quality control with dataset expansion? Or is there a theoretical or empirical threshold for beneficial scale versus noise?
Questions
Please refer to the weaknesses.
Limitations
The limitations of this paper are discussed by the authors, and the authors provide their solutions.
Please refer to the weaknesses.
Final Justification
The authors addressed my concerns during the rebuttal process. The paper presents a well-motivated study, and its clear, logical organization enhances its readability. I believe the discussion will provide valuable insights to the motion generation research community.
Formatting Issues
No
Dear Reviewer BcNS,
We sincerely thank you for recognizing the manuscript’s writing and contributions, and we are happy to address the points raised.
Q1: Clarification on the theoretical foundation of 2D-to-3D transfer, particularly given limited category overlap between video and motion datasets.
A: We are happy to discuss this insightful question. During our early-stage investigation, we also recognized the potential concern about category mismatch. However, our analysis revealed that the HumanML3D dataset primarily consists of compositions of atomic motion units [1] (e.g., walking, running, lifting hands), rather than complex or domain-specific activities.
Our video corpus, though based in part on datasets with labeled categories (e.g., Kinetics-400), also contains a large number of web-sourced videos (e.g., from MotionX) with rich and diverse motion patterns [2]. While these videos are not easily classifiable into discrete categories, they encompass a wide range of atomic motion units that appear frequently across different actions.
We consider the 2D-to-3D mapping to be achieved via shared atomic action units present in both video and motion data. To verify this intuition, we conduct interpretability studies (see Appendix, Fig. 12 and Fig. 13). In the second example of Figure 12, for example, although the retrieved video does not exactly match the target motion (it misses the “swing arms” component), it includes “alternating raising knees”, which is highly relevant and beneficial for motion generation. We observe similar meaningful atomic-level alignments in many other examples.
[1] Examples in HumanML3D
1. a person runs and jumps forward.
2. a person walks in place.
3. a person jogs in place with their arms swinging.
4. a person walks forward slowly.
5. a person lifts both hands toward face and then lowers them to their sides.
[2] Examples in MotionX
1. The person is simulating chopping wood while seated. They rotate their torso, simulate grabbing and lifting an object with their hands, bring it overhead, and then perform a striking motion downward as if impacting a log between their legs. This action is repeated, emulating the motion of splitting wood with an axe.
2. The person is performing a stationary basketball shooting motion. Starting from a standing position, they bend their knees to generate power, raise the ball with both hands in front of them, extend their arms upwards while jumping slightly, and then follow through with one hand to release the ball, mimicking a basketball shot.
Q2: On the noise-performance tradeoff when scaling to larger video datasets.
A: We conduct an experiment on the HumanML3D validation set to evaluate the tradeoff between data scale and quality. We add 4K noisy videos in which human poses are often missing. The results are summarized as follows:
| Dataset | FID ↓ | Top-1 ↑ | Top-2 ↑ | Top-3 ↑ |
|---|---|---|---|---|
| Base | 0.148 | 0.429 | 0.625 | 0.756 |
| +Noise | 0.172 | 0.430 | 0.635 | 0.740 |
We observe a slight performance drop with noisy data, but the degradation remains limited. In our pipeline, we apply basic quality control by using a keypoint detector to remove videos without human presence (as mentioned in Line 130-131), which filters out about 1% of the data.
While more sophisticated filtering is possible, our focus lies in building a robust retrieval module and a noise-resistant training framework that can tolerate imperfect data. This design ensures scalability without heavy reliance on manual dataset curation.
We'd be happy to discuss further with you if you have any new questions.
Best regards,
Authors
Dear authors,
Thank you for your detailed responses to the comments --- my concerns have been fully addressed. The paper is well-organized, clearly written, and supported by extensive experimental validation. While the related work section could be expanded slightly, this does not significantly detract from the overall quality of the work. Given these strengths, I have increased my score from 4 to 5.
Once again, we sincerely thank you for your valuable feedback and for the time you have invested in improving our manuscript!
This paper proposes a framework that leverages retrieved videos to augment 3D motion generation. A video retriever is trained on a large-scale dataset constructed by the authors, followed by an overall optimization to bridge 2D video and 3D motion while mitigating error propagation. Extensive empirical evidence demonstrates the effectiveness of the proposed method over baselines. The effort of building the data and framework with detailed analysis provides valuable insights to the community.