Controllable Human-centric Keyframe Interpolation with Generative Prior
Abstract
Reviews and Discussion
The main goal of the paper is to propose a system for performing frame interpolation focused on human videos. This task is particularly challenging as human motion tends to be complex and highly articulated, which leads to a lot of ambiguity, making it hard to model. The proposed solution is to reduce this ambiguity with per-frame 3D and 2D human poses (or control signals). The work introduces a framework to condition a pre-trained video diffusion model on such poses and also introduces a dataset focused on human-centric keyframe interpolation with accompanying annotations for 2D and 3D human poses (SMPL-X parameters).
Strengths and Weaknesses
Strengths
-
The core contribution of the paper, PoseFuse3D, seems novel and technically sound for adding 3D and 2D control signals for interpolation. Overall, the idea of resolving motion ambiguity with per-frame 3D poses seems valuable and can be used in future work.
-
The CHKI dataset appears to be a high-quality contribution. Descriptions of the data collection, filtering and annotation pipeline seem to indicate care and rigor in the dataset preparation, making it likely to be useful to the community.
-
The paper is well written and the experiments are conducted thoughtfully (except one detail as highlighted in the weaknesses section). I appreciate the care taken to standardize the backbone (i.e., use the SVD backbone when comparing against FCVG), making it easier to identify the improvements due to this framework.
Weaknesses
-
It is not clear if the groundtruth intermediate frames are used to extract line features for FCVG when evaluating against FCVG in Tables 1 and 2. If not, this is a weakness in the evaluation, as motion ambiguity should be removed for FCVG as well for a fair comparison to this method.
-
In line with the previous point, I think the approach suggested in section 1.4 of the supplementary material is a critical evaluation criterion for this method. The current method relies on extracting the control signals from the groundtruth intermediate frames. However, keyframe interpolation as a problem relies on just the start and end frames and does not have access to the intermediate frames. So, to support the claim of a keyframe interpolation method, this work should show performance on a typical keyframe interpolation task.
Questions
-
Can you report numbers on an approach similar to what you describe in section 1.4 of the supplementary material? This will help address point 2 of the weaknesses.
-
Can you clarify if lines extracted from groundtruth intermediate frames are used for evaluating FCVG (instead of the interpolations from the start and end frames)? If not, can FCVG be evaluated in such a set-up? This will help address point 1 of the weaknesses.
Limitations
yes.
Final Justification
The paper solves an important sub-problem in keyframe interpolation, performs rigorous experiments, and makes novel contributions that can be utilized by future work as well. Through their rebuttal, the authors sufficiently addressed the weaknesses I initially identified, hence I raise my rating to Accept.
Formatting Concerns
None.
We really appreciate your constructive comments. Please find our response below.
Q1: [Weakness 1] It is not clear if the groundtruth intermediate frames are used to extract line features for FCVG when evaluating against FCVG in Tables 1 and 2. If not, this is a weakness in the evaluation, as motion ambiguity should be removed for FCVG as well for a fair comparison to this method.
A1: As described in the 'Evaluation Protocol' paragraph (Sec. 5) of our main paper, we would like to clarify that intermediate ground-truth control signals are leveraged for evaluating all controllable interpolation methods, including FCVG. We will revise the paper to clarify this point more explicitly in the updated version.
Q2: [Weakness 2] In line with the previous point, I think the approach suggested in section 1.4 of the supplementary material is a critical evaluation criterion for this method. The current method relies on extracting the control signals from the groundtruth intermediate frames. However, keyframe interpolation as a problem relies on just the start and end frames and does not have access to the intermediate frames. So, to support the claim of a keyframe interpolation method, this work should show performance on a typical keyframe interpolation task.
A2: We fully agree that it is important to evaluate interpolation methods in scenarios where no ground-truth intermediate control signals are available. We define this setting as `in-the-wild interpolation' in our supplementary material. In response to your suggestion, we conducted additional evaluations on the CHKI-Video test set for both FCVG and our proposed method, using only the start and end frames as inputs, without any intermediate information. The results are summarized in the table below:
| Methods | PSNR ↑ | PSNR_bbox ↑ | PSNR_mask ↑ | LPIPS ↓ | LPIPS_bbox ↓ | LPIPS_mask ↓ | HA ↑ |
|---|---|---|---|---|---|---|---|
| FCVG | 18.47 | 10.54 | 9.31 | 0.2607 | 0.1321 | 0.0921 | 0.9284 |
| PoseFuse3D-KI | 19.30 | 10.98 | 9.69 | 0.2031 | 0.1194 | 0.0859 | 0.9289 |
We observe that under this setting, PoseFuse3D-KI consistently outperforms FCVG across all metrics, demonstrating the robustness and effectiveness of our method for in-the-wild interpolation. We will add this table in the updated version of our supplementary material.
Q3: [Question 1] Can you report numbers on an approach similar to what you describe in section 1.4 of the supplementary material? This will help address point 2 of the weaknesses.
A3: Thanks for your suggestion. We have conducted quantitative evaluations for the in-the-wild interpolation setting (described in section 1.4 of the supplementary material). The resulting table is presented in A2.
Q4: [Question 2] Can you clarify if lines extracted from groundtruth intermediate frames are used for evaluating FCVG (instead of the interpolations from the start and end frames)? If not, can FCVG be evaluated in such a set-up? This will help address point 1 of the weaknesses.
A4: All evaluations of FCVG presented in both the main paper and the supplementary material were conducted with intermediate ground-truth control signals. We will revise the paper to clarify this point more explicitly in the updated version.
Thank you for clarifying that FCVG is also evaluated with groundtruth control signals. It is also good to see that the proposed method does better than FCVG in the in-the-wild interpolation case as well. I don't have any other follow-up questions.
We would like to thank the reviewer for taking the time to evaluate both our submission and rebuttal. We are pleased to hear that your concerns have been resolved.
The paper introduces PoseFuse3D-KI, a controllable human-centric keyframe interpolation framework that integrates 3D human guidance signals into the diffusion process for human-centric keyframe interpolation (CHKI). This novel framework aims to improve interpolation fidelity, especially for articulated human motion, by leveraging 3D SMPL-X models alongside 2D poses for precise spatial and structural control. The paper proposes a new dataset, CHKI-Video, annotated with both 2D poses and 3D SMPL-X parameters, and demonstrates the effectiveness of the model through extensive experiments, achieving state-of-the-art performance on keyframe interpolation tasks.
Strengths and Weaknesses
Strengths:
- Novel Contribution: The integration of 3D human guidance with 2D poses for keyframe interpolation is a significant improvement over existing methods, addressing the challenges of complex human motion and occlusions. In addition, a new dataset is presented in this paper.
- Comprehensive Evaluation: The introduction of a new dataset, CHKI-Video, specifically designed for evaluating controllable human-centric keyframe interpolation, strengthens the paper's contributions and allows for fair comparison with existing methods.
- State-of-the-art Performance: The proposed PoseFuse3D-KI framework outperforms existing techniques, showing improvements in PSNR and reductions in LPIPS, indicating its effectiveness in generating high-quality human-centric interpolations.
Weaknesses:
- Limited Discussion on Scalability: While the paper presents impressive results, there is limited discussion about how well the method scales with different video lengths or more complex human actions, particularly in large-scale applications.
- Reproducibility Concerns: The code and data are not currently available, and while the authors plan to release them, this limits the reproducibility of the results.
- Computational Efficiency: Although the paper provides implementation details, there is no clear discussion about the computational efficiency of the framework or how it scales with increasing complexity or dataset size.
- Potential Overfitting: While the method shows strong performance on the CHKI-Video dataset, it would be useful to see how the model generalizes to other datasets or human motion tasks to avoid potential overfitting.
- Limited Practical Expandability: The reliance on 3D motion information (e.g., SMPL-X parameters) may hinder the method’s applicability in real-world scenarios where such annotations are difficult or expensive to obtain. The lack of fallback strategies for 2D-only input scenarios limits the applicability of this framework.
Questions
- How does the framework perform on a wider range of human actions beyond sports and everyday activities covered in CHKI-Video? Could the method be generalized to other domains, such as robotics or virtual character animation?
- What is the maximum temporal gap (in terms of frames or seconds) that the proposed method can effectively interpolate?
- Given that the model combines 3D and 2D signals, how does the performance degrade in scenarios with extreme occlusion or very fast motion? In other words, how does the model address the change in background after a large change in character position?
- Regarding the 3D auxiliary information aspect, does the proposed method require the SMPLX parameters for all frames, or only for the first and last frames? If the former is the case, can it be argued that the method actually draws on some information from ground truth rather than being truly effective?
- In some cases where human SMPLX parameters are not available, does the method proposed in this paper still maintain a high level of performance? In other words, did the authors consider scenarios where 3D information is not available?
- Has the performance been evaluated using SSIM or other structural similarity metrics? Including such results could offer a more complete view of visual quality.
- How much GPU resources are required in the training phase of the proposed method? What about the inference time?
Limitations
- Dependency on Specific Datasets: The results presented are based on the CHKI-Video dataset, which may limit the generalizability of the findings. The method's performance on other datasets or real-world video sources remains untested.
- Dependency on SMPLX Parameters: The method proposed by the authors, although capable of generating higher quality videos, is overly reliant on 3D information and lacks extensibility.
- Future Work on Data and Code Availability: Although the authors plan to release the code and data after acceptance, the current lack of access limits the paper's ability to be fully reproducible and verifiable by the research community.
Final Justification
The authors provide an answer to the question of how to obtain SMPL-X parameter information for intermediate frames, resolving my previous main concerns. The interpolation algorithm is a viable solution. Considering the content of the entire paper, I believe that this paper has no major shortcomings overall, its contributions are clear, and it is worthy of consideration for acceptance.
Formatting Concerns
no
Thank you for the constructive comments. Please find our response below.
Q1: [Weakness 1] Limited Discussion on Scalability: … how well the method scales with different video lengths or more complex human actions, particularly in large-scale applications. [Question 2] What is the maximum temporal gap (in terms of frames or seconds) that the proposed method can effectively interpolate?
A1: Thank you for your suggestion to evaluate the scalability of our method across different video lengths and temporal gaps. We have conducted additional evaluations of PoseFuse3D-KI on varying temporal gaps, and the results are summarized in the table below:
| Metric | 24 frames | 48 frames | 72 frames | 96 frames |
|---|---|---|---|---|
| PSNR ↑ | 23.86 | 21.46 | 20.37 | 19.44 |
| PSNR_mask ↑ | 15.93 | 13.80 | 12.68 | 11.89 |
| LPIPS ↓ | 0.1144 | 0.1566 | 0.1882 | 0.2208 |
| LPIPS_mask ↓ | 0.0332 | 0.0432 | 0.0514 | 0.0577 |
As with most interpolation methods, performance decreases as the temporal gap increases. Nevertheless, PoseFuse3D-KI maintains competitive performance across a wide range of gaps, especially for 24 and 48 frames. At 96 frames, while performance degradation is observed (e.g., PSNR drops below 20 dB and LPIPS exceeds 0.2000), the method still provides reasonable results. These results demonstrate the robustness of PoseFuse3D-KI in long-range controllable interpolation settings.
Q2: [Weakness 2] Reproducibility Concerns: The code and data are not currently available, … limits the reproducibility of the results. [Limitation 3] Future Work on Data and Code Availability: …, the current lack of access limits the paper's ability to be fully reproducible and verifiable by the research community.
A2: We fully agree that open-sourcing code and data is important for the research community. We will try to make our database and codebase publicly available as soon as possible.
Q3: [Weakness 3] Computational Efficiency: … discussion about the computational efficiency of the framework or how it scales with increasing complexity or dataset size.
A3: Thank you for your valuable feedback regarding computational efficiency. We have conducted a runtime comparison between our method and existing baselines. Specifically, we measured the time required to interpolate 25 frames at a fixed resolution on an NVIDIA A800 GPU. The results are shown below:
| Methods | Runtime (s) |
|---|---|
| GI | 975 |
| FCVG | 523 |
| PoseFuse3D-KI (Ours) | 212 |
Our method achieves the fastest runtime among the evaluated approaches, indicating comparatively superior computational efficiency.
Q4: [Weakness 4] Potential Overfitting: …, it would be useful to see how the model generalizes to other datasets or human motion tasks to avoid potential overfitting. [Limitation 1] Dependency on Specific Datasets: …The method's performance on other datasets or real-world video sources remains untested.
A4: We appreciate the reviewer for this valuable suggestion. To further evaluate generalizability, we conducted both qualitative and quantitative experiments on the FCVG test set, as described in Section 1.5 of the supplementary material. The results, shown in Table 1 and Figure 2, demonstrate that PoseFuse3D-KI consistently outperforms all baseline methods in terms of quantitative metrics and produces robust human-centric interpolation in challenging cases, such as the 'Breaking Dance' case in Figure 2. These results support the generalization capability of our method beyond the CHKI-Video dataset.
Q5: [Weakness 5] Limited Practical Expandability: The reliance on 3D motion information (e.g., SMPL-X parameters) may hinder the method’s applicability in real-world scenarios … The lack of fallback strategies for 2D-only input scenarios limits the applicability of this framework. [Question 5] In some cases where human SMPLX parameters are not available … In other words, did the authors consider scenarios where 3D information is not available? [Limitation 2] Dependency on SMPLX Parameters: The method proposed by the authors, although capable of generating higher quality videos, is overly reliant on 3D information and lacks extensibility.
A5: Thank you for the insightful suggestion. We address scenarios where only start and end frames are available, without ground-truth 3D annotations, as in-the-wild interpolation, described in Section 1.4 of the supplementary material. In this setting, we estimate SMPL-X parameters from the input frames and apply linear joint interpolation to generate intermediate SMPL-X controls. Figure 1 in the supplementary material illustrates the qualitative results, showing the effectiveness of this strategy. Additionally, we conducted quantitative comparisons with FCVG under this setting. The results are shown below:
| Methods | PSNR ↑ | PSNR_bbox ↑ | PSNR_mask ↑ | LPIPS ↓ | LPIPS_bbox ↓ | LPIPS_mask ↓ | HA ↑ |
|---|---|---|---|---|---|---|---|
| FCVG | 18.47 | 10.54 | 9.31 | 0.2607 | 0.1321 | 0.0921 | 0.9284 |
| PoseFuse3D-KI | 19.30 | 10.98 | 9.69 | 0.2031 | 0.1194 | 0.0859 | 0.9289 |
Our method consistently outperforms FCVG across all metrics, demonstrating its effectiveness for the in-the-wild interpolation setting. Furthermore, our strategy has the potential to incorporate advanced techniques such as text-to-motion models for generating intermediate human model controls, enabling more flexible and effective interpolation for complex motion.
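To make this concrete, the following is a minimal sketch of what such endpoint-based SMPL-X control generation could look like. This is not the authors' released implementation; the function names are hypothetical, and we assume per-joint rotations are interpolated on the rotation manifold (slerp) while translation, shape, and expression parameters are interpolated linearly.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_smplx_pose(pose_start, pose_end, num_frames):
    """Interpolate per-joint SMPL-X rotations between two keyframes.

    pose_start, pose_end: (J, 3) axis-angle rotations, e.g. estimated by a
    monocular SMPL-X regressor on the start and end frames.
    Returns (num_frames, J, 3) axis-angle poses, endpoints included.
    """
    ts = np.linspace(0.0, 1.0, num_frames)
    out = np.empty((num_frames, pose_start.shape[0], 3))
    for j in range(pose_start.shape[0]):
        key_rots = Rotation.from_rotvec([pose_start[j], pose_end[j]])
        out[:, j] = Slerp([0.0, 1.0], key_rots)(ts).as_rotvec()
    return out

def interpolate_linear(x_start, x_end, num_frames):
    """Plain linear interpolation for translation / shape / expression params."""
    ts = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - ts) * x_start[None] + ts * x_end[None]
```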
Q6: [Question 4] Regarding the 3D auxiliary information aspect, does the proposed method require the SMPLX parameters for all frames, or only for the first and last frames? If the former is the case, can it be argued that the method actually draws on some information from ground truth rather than being truly effective?
A6: Our PoseFuse3D-KI framework requires intermediate SMPL-X controls to guide the interpolation process. However, these controls do not need to be ground-truth. As described in Section 1.4 of the supplementary material and elaborated in A5, the intermediate SMPL-X parameters can be obtained via simple interpolation or possibly generated using text-to-motion models. We emphasize that the method remains effective even in the absence of ground-truth 3D annotations. This is further supported by the quantitative results we present in A5 under the in-the-wild interpolation setting. This demonstrates the strong performance of our method even when only the first and last frames are available.
Q7: [Question 1] How does the framework perform on a wider range of human actions … Could the method be generalized to other domains, such as robotics or virtual character animation?
A7: While PoseFuse3D-KI is designed to perform plausible interpolation for real-world human motions with complex dynamics, it can also be applied to other video domains such as robotics and virtual character animation. In our tests, it has demonstrated promising results in these settings. However, due to limitations of the discussion format, we are currently unable to include the qualitative results here. These examples will be added to the supplementary materials in the revised version of the paper.
Q8: [Question 3] … how does the model address the change in background after a large change in character position?
A8: We would like to clarify that PoseFuse3D-KI is designed to address controllable human-centric interpolation, with a focus on plausibly interpolating foreground humans with control signals. For the background, our framework relies on the underlying adapted video diffusion model to synthesize the in-between frames, without explicit control signals. While this approach works well in many cases, we acknowledge that it may lead to artifacts under challenging conditions such as large background changes. Addressing this limitation is an interesting direction for future research and could extend the applicability of our method to more general video interpolation tasks. We will include this limitation in the revised version of our paper.
Q9: [Question 6] Has the performance been evaluated using SSIM or other structural similarity metrics? Including such results could offer a more complete view of visual quality.
A9: Thank you for the valuable suggestion. We have computed SSIM scores as additional evaluation metrics to complement Table 2 in the main paper. The results are presented below:
| Methods | SSIM ↑ | SSIM_bbox ↑ | SSIM_mask ↑ |
|---|---|---|---|
| GIMM-VFI | 0.7776 | 0.6070 | 0.4985 |
| GI | 0.5875 | 0.4378 | 0.3483 |
| Wan2.1-KI | 0.7460 | 0.5852 | 0.4853 |
| FCVG | 0.7429 | 0.6735 | 0.6210 |
| PoseFuse3D-KI | 0.8049 | 0.7308 | 0.6802 |
We observe that PoseFuse3D-KI achieves the highest SSIM scores across all image regions, further demonstrating its strong controllable interpolation performance and visual quality. We will include this table in the revised version of our paper for a more complete view of the interpolation quality.
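For readers unfamiliar with the region-restricted variants of these metrics, a hedged sketch follows; the exact bbox/mask definitions used in the paper's evaluation are not reproduced here, so the details below are assumptions.

```python
import numpy as np
from skimage.metrics import structural_similarity

def masked_psnr(pred, gt, mask):
    """PSNR over human-mask pixels only.

    pred, gt: (H, W, 3) float arrays in [0, 1]; mask: (H, W) boolean.
    """
    mse = np.mean((pred[mask] - gt[mask]) ** 2)
    return 10.0 * np.log10(1.0 / mse)

def bbox_ssim(pred, gt, bbox):
    """SSIM restricted to a human bounding box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    return structural_similarity(
        gt[y0:y1, x0:x1], pred[y0:y1, x0:x1],
        channel_axis=2, data_range=1.0)
```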
Q10: [Question 7] How much GPU resources are required in the training phase of the proposed method? What about the inference time?
A10: PoseFuse3D-KI is trained using 4 NVIDIA A100 GPUs, each with 80 GB of memory. During inference, it takes 212 seconds to interpolate 25 frames at a fixed resolution on an NVIDIA A800 GPU. A more detailed discussion on runtime performance is provided in A3.
Dear reviewer Ayns,
Can you comment on the rebuttal and engage in discussions before finalizing the score? Perhaps your deleted comment had replies to the rebuttal?
AC
The authors provide an answer to the question of how to obtain SMPL-X parameter information for intermediate frames, resolving my previous main concerns. In this way, the interpolation algorithm seems to be a viable solution. Considering the content of the paper and the rebuttal, I think this paper has no major shortcomings overall, and therefore it is worthy of consideration for acceptance.
We would like to thank the reviewer for taking the time to evaluate both our submission and rebuttal. We are glad to hear that your concerns have been resolved and appreciate your recommendation.
This paper introduces PoseFuse3D-KI for controllable human-centric keyframe interpolation. The core of this framework is the PoseFuse3D control model, which integrates 3D human guidance signals (from SMPL-X models) and 2D pose information into a video diffusion model to generate intermediate frames. The key innovation is an SMPL-X encoder that processes 3D model data directly, rather than relying solely on 2D renderings, to better handle complex motions and occlusions. To evaluate their method, the authors built a new dataset, CHKI-Video, containing over 180,000 frames from 2,614 video clips, annotated with 2D poses and 3D SMPL-X parameters. Experiments show that PoseFuse3D-KI outperforms existing methods on this new benchmark.
Strengths and Weaknesses
Strengths
-
The proposed SMPL-X encoder, which directly processes 3D model information instead of just its 2D renderings, is a novel approach to preserving geometric details, especially for occluded parts.
-
The paper contributes the CHKI-Video dataset, a valuable resource for the specific task of controllable human-centric interpolation, as it includes both 2D pose and 3D SMPL-X annotations which are missing in many existing interpolation datasets.
Weaknesses
-
My major concern about this paper is the novelty claimed in "our encoder processes features directly in 3D and integrates projected features through feature aggregation". Although the claimed novelty mainly focuses on the effectiveness of directly utilizing 3D SMPL information, the experiments in Table 1 fail to support this claim: the performance gain of VE+SE over VE/VE+DN is marginal. The overall framework can be viewed as a combination of two established research lines: diffusion-based video interpolation and pose-guided video generation, e.g., Champ. Although the authors want to emphasize the contribution of direct learning from SMPL parameters instead of visualizations as in Champ, the experiments do not support this claim. Besides, there is an important ablation study missing: only using SE. If only using SE is way worse than only using VE+DN, it would indicate that the proposed module is not effective at all compared to Champ (which could be regarded as VE+DN).
-
As we know, 2D pose detectors are generally more accurate than SMPL pose estimators. The paper relies on the SMPLer-X tool for automated 3D model fitting, supplemented by "manual verification". However, it lacks a detailed description of this verification process, such as the scope of manual checks or the final error rate. The quality and accuracy of the underlying SMPL-X annotations are critical for training a robust 3D-aware model, and any systematic errors from the automated tool could compromise the model's performance and the validity of the evaluation.
-
The paper creates a new dataset with 2,614 videos by sourcing videos from Pexels, but it does not discuss or compare its contribution to other recent datasets like HumanVid [a] or MiraData [b], which also use Pexels videos for video-related tasks. Both mentioned datasets are larger in scale than this submission's. A discussion of why a new dataset was necessary instead of augmenting existing ones would strengthen the justification for this contribution. For example, HumanVid [a] already curated 20K human-centric videos from Pexels; does this dataset share a large portion of videos with this submission?
[a] Wang, Zhenzhi, Yixuan Li, Yanhong Zeng, Youqing Fang, Yuwei Guo, Wenran Liu, Jing Tan et al. "Humanvid: Demystifying training data for camera-controllable human image animation." In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track. 2024.
[b] Ju, Xuan, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, and Ying Shan. "Miradata: A large-scale video dataset with long durations and structured captions." Advances in Neural Information Processing Systems 37 (2024): 48955-48970.
Questions
Please see the weaknesses. My major concern about this paper is that the experiments fail to support its main contribution: "effectively extracts control signals from 3D SMPL-X and fuses 2D signals, allowing precise and informative control", considering that 2D signals are well-studied in Champ [50].
Limitations
Yes
Final Justification
There are still some concerns about the small performance gain of VE+SE (the proposed method) over VE+DN (the baseline method). The authors also have not fully justified the necessity of the proposed dataset compared to existing ones, or the quality control of the SMPL annotation process. Yet, this paper has provided extensive experiments and ablation studies to analyze the proposed method, which could be useful for the community to further explore this direction. So I think it is acceptable to recommend the exposure of this paper to the community. I have increased my rating from borderline reject to borderline accept. But I am not willing to defend acceptance of this paper against other reviewers; instead, I think the final recommendation for this paper should be achieved by aggregating the ratings of all reviewers.
Formatting Concerns
None
Thank you for the constructive comments. Please find our response below.
Q1: [Weakness 1] Although the claimed novelty mainly focuses on the effectiveness of directly utilizing 3D SMPL information, the experiments in Table 1 fail to support this claim: the performance gain of VE+SE over VE/VE+DN is marginal. … Although the authors want to emphasize the contribution of direct learning from SMPL parameters instead of visualizations as in Champ, the experiments do not support this claim.
A1: We respectfully disagree with the reviewer. We argue that our results support the effectiveness of our method. Specifically, in Table 1 of the main paper, our 'VE+SE' consistently outperforms the 'VE+DN' baseline across all quantitative metrics, with a notable 0.21 dB improvement on the foreground-focused PSNR metric. More importantly, our SMPL-X encoder is explicitly designed to model occluded body parts, information typically discarded in the rendering process used for visualization. This advantage is evident in the qualitative results. For example, as shown in Figure 3 of the main paper, only our strategy 'VE+SE' correctly handles the challenging occlusions, such as the occluded arm interpolation in the Skateboarding and Jumping rows. Therefore, both the quantitative and qualitative results demonstrate the effectiveness of our proposed SMPL-X encoder.
Q2: [Weakness 1] Besides, there is an important ablation study missing: only using SE. If only using SE is way worse than only using VE+DN, it would indicate that the proposed module is not effective at all compared to Champ (which could be regarded as VE+DN).
A2: Thank you for your suggestion to include this ablation study. We would like to clarify that we have already included an experiment using only SE in Section 1.6 of the supplementary material. This variant, denoted as 'Non-Vis', is used to ablate the importance of visual encoding in PoseFuse3D-KI. The corresponding results are provided in Table 2 of the supplementary material.
Furthermore, we would like to emphasize that it is inappropriate to directly compare 'Non-Vis' with 'VE+DN', as our proposed method is designed as an integrated framework. While the SMPL-X encoder is responsible for preserving occluded information, our visual encoder incorporates 3D human rendering to provide direct shape and appearance cues. Therefore, a more meaningful comparison is between our method 'VE+SE' and 'VE+DN', whose results are shown in Table 1 and Figure 3 of the main paper. As discussed in A1, these comparisons demonstrate the effectiveness of our proposed SMPL-X encoder in modeling occluded human motion.
Q3: [Weakness 2] As we know, 2D pose detectors are generally more accurate than SMPL pose estimators. The paper relies on the SMPLer-X tool for automated 3D model fitting, supplemented by "manual verification". However, it lacks a detailed description of this verification process, such as the scope of manual checks or the final error rate. The quality and accuracy of the underlying SMPL-X annotations are critical for training a robust 3D-aware model, and any systematic errors from the automated tool could compromise the model's performance and the validity of the evaluation.
A3: As described in Section 4 of the main paper and Section 1.4 of the supplementary material, our annotation pipeline involves both 2D human keypoint annotation using Sapiens and 3D SMPL-X model fitting using SMPLer-X. To ensure annotation quality, we conduct manual verification at two stages: 1) Pre-annotation verification is applied to the outputs of the human detection stage to filter out low-quality detections before proceeding with annotation. 2) Post-annotation verification is performed on the annotated data to correct or discard erroneous fits. Thanks to the high re-projection accuracy of SMPLer-X, most fitted SMPL-X models closely align with the human body evidence in the image upon projection, which supports their use as reliable supervision in our human-centric keyframe interpolation task. Furthermore, the effectiveness of the resulting dataset is validated through both quantitative and qualitative results presented in our paper. We believe this demonstrates that the SMPL-X annotations are of sufficiently high quality for training and evaluation.
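As an illustration of the kind of automatic check that could complement such manual verification (this is our own sketch, not the authors' pipeline; the thresholds and function names are hypothetical), a re-projection consistency test against the 2D keypoints might look like:

```python
import numpy as np

def reprojection_error(joints_3d, cam_K, keypoints_2d, conf, conf_thresh=0.5):
    """Mean pixel distance between projected SMPL-X joints and 2D keypoints.

    joints_3d: (J, 3) joints in camera coordinates from the SMPL-X fit;
    cam_K: (3, 3) camera intrinsics; keypoints_2d: (J, 2) detector output;
    conf: (J,) detector confidences. conf_thresh is a hypothetical value.
    """
    proj = joints_3d @ cam_K.T                  # project to the image plane
    proj = proj[:, :2] / proj[:, 2:3]           # perspective divide
    valid = conf > conf_thresh
    return float(np.linalg.norm(proj[valid] - keypoints_2d[valid], axis=1).mean())

def flag_for_review(per_frame_errors, pixel_thresh=15.0):
    """Flag a clip for manual inspection if its mean error exceeds a threshold."""
    return float(np.mean(per_frame_errors)) > pixel_thresh
```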
Q4: [Weakness 3] The paper creates a new dataset with 2,614 videos by sourcing videos from Pexels, but it does not discuss or compare its contribution to other recent datasets like HumanVid [a] or MiraData [b], which also use Pexels videos for video-related tasks. Both mentioned datasets are larger in scale than this submission's. A discussion of why a new dataset was necessary instead of augmenting existing ones would strengthen the justification for this contribution. For example, HumanVid [a] already curated 20K human-centric videos from Pexels; does this dataset share a large portion of videos with this submission?
A4: Thank you for the suggestion to clarify our motivation for creating a new dataset. Our dataset is specifically designed for the Controllable Human-centric Keyframe Interpolation (CHKI) task, with a focus on complex human motions, human-centric annotations, and distant keyframe inputs. These features are not directly addressed by existing datasets. As described in Section 4 of the main paper and Section 1.4 of the supplementary material, our dataset is sourced from both the SportsSlomo dataset and the Pexels website. Specifically, to ensure motion complexity and diversity, we begin with videos from the SportsSlomo dataset, which contains challenging scenarios. Then, we further expand the dataset in diversity by retrieving additional videos from Pexels using a curated list of motion-related keywords. This targeted video collection strategy provides more precise control over the dataset content, which is crucial for training and evaluation in the CHKI setting. While there may be some overlap in Pexels-sourced content with datasets such as HumanVid and MiraData, our dataset is independently curated with a distinct task focus. Unlike those datasets, our design prioritizes motion complexity, keyframe temporal sparsity, and annotation consistency specific to CHKI. We appreciate the reviewer bringing up these related datasets and will cite both HumanVid and MiraData in the revised version, along with a discussion highlighting the distinctions and motivation behind our dataset construction.
Thanks for the author rebuttal. For Q1 and Q2, although the performance gain of VE+SE over VE+DN is not significant, the authors have provided extensive experimental results and ablation studies to support their claim. The reviewer appreciates the effort put into the quantitative results. For Q4, the reviewer's advice is to leverage videos from the two mentioned datasets to scale up this submission's dataset, as they have already collected a large amount of high-quality videos.
The reviewer has a follow-up question about the dataset curation process. According to the rebuttal, how do the authors verify results in "2) Post-annotation verification is performed on the annotated data to correct or discard erroneous fits"? Could the authors provide statistics on the error scale or error rate, and the recognition methods (including their threshold parameters) used in this process? Or is this process performed by annotators? Or can only the quantitative results of models trained on this dataset support the effectiveness of this dataset? As this paper claims a dataset contribution, the quality of the dataset is important for the community.
Thanks for the feedback. We appreciate the reviewer's recognition of the effectiveness of our proposed method. Please find our response to the dataset-related questions below.
Q1: For Q4, the reviewer's advice is to leverage videos in the two mentioned datasets for scaling up this submission's dataset, as they has already collected large amount of high-quality videos.
A1: Thanks for your suggestion to scale up the data. We would like to clarify that the current scale of our dataset is sufficient for the training and evaluation of a controllable human-centric interpolation task.
- First, our dataset contains over 2.5k video clips, which is approximately five times larger than the interpolation dataset collected by FCVG (~500 clips), representing a significant scale-up in this domain.
- Second, the dataset has been proven to be effective by both qualitative and quantitative results of models trained on it.
Therefore, the scale of our dataset is sufficient for the current usage. While future work may involve leveraging larger datasets, our current dataset already provides a strong foundation for the proposed task.
Q2: The reviewer has a follow-up question about the dataset curation process. According to the rebuttal, how the authors to verify results in "2) Post-annotation verification …
A2: Thanks for your question. We would like to make further clarifications on the annotation process.
- First, we employ the state-of-the-art 3D human estimator to generate SMPL-X annotations. This automatic fitting approach is consistent with prior works, such as Human4DiT [1], and provides reliable data, especially for training.
- Second, the post-annotation verification involves annotators. This verification manually checks the annotations to avoid extremely erroneous fits, where the projection of the human shifts largely or presents an entirely different pose from the human in the image. Thanks to the accurate detections and the high re-projection accuracy of SMPLer-X estimations, such extreme errors were not observed during the verification. To build a solid evaluation dataset, we further manually filtered the test split of our dataset, focusing on minor errors such as slight arm shifting in the image. In this test-set manual verification step, we preserved 162 clips out of 187, corresponding to an error rate of 13.4% for minor errors according to the annotators' judgment. This verification helps ensure the quality of both the train and test splits of the proposed dataset.
- Third, as presented in the paper and our previous discussion, both the qualitative and quantitative results of models trained on the dataset have proven the effectiveness of the dataset.
Therefore, our dataset is constructed with ensured quality and can be further leveraged for potential future usage of the community.
[1] Human4DiT: 360-degree Human Video Generation with 4D Diffusion Transformer.
Dear Authors,
As reminded by the emails sent by the NeurIPS organizers, I will update some information about my final justification here to confirm that I have participated in this discussion and updated my final rating of this submission.
There are still some concerns about the small performance gain of VE+SE (the proposed method) over VE+DN (the baseline method). The authors also have not fully justified the necessity of the proposed dataset compared to existing ones (although they were not originally designed for interpolation, it would be very simple to adapt them) or the quality control of the SMPL annotation process. Yet, this paper has provided extensive experiments and ablation studies to analyze the proposed method, which could be useful for the community to further explore this direction. So I think it is acceptable to recommend the exposure of this paper to the community. I have increased the rating from borderline reject to borderline accept. But I am not willing to defend acceptance of this paper against other reviewers. Instead, I hope the final recommendation for this paper can be achieved by aggregating the ratings of all reviewers.
Dear reviewer,
We would like to thank the reviewer for taking the time to evaluate both our submission and participate in the discussion. We also sincerely appreciate the reviewer for raising the final rating and fully respect the reviewer’s decision and justification.
We would like to make a summary of our rebuttal and response for a better presentation to the readers.
-
Effectiveness of SMPL-X Encoder (SE) [Rebuttal A1-A2]. As a part of our proposed method, it is designed to supply additional spatial information and model occluded body parts, information typically discarded in the rendering process. We have conducted comprehensive experiments to support its effectiveness:
- Quantitatively, in Table 1 of the main paper, our 'VE+SE' consistently outperforms the 'VE+DN' variant across all quantitative metrics, with a notable 0.21 dB improvement on the foreground-focused PSNR metric.
- In the qualitative results, the advantage of our SMPL-X Encoder (SE) is also evident. For example, as shown in Figure 3 of the main paper, only our strategy 'VE+SE' correctly handles the challenging occlusions, such as the occluded arm interpolation in the Skateboarding and Jumping rows.
-
The necessity of our dataset [Rebuttal A4]. Our dataset is specifically designed for the Controllable Human-centric Keyframe Interpolation (CHKI) task, with a focus on complex human motions, human-centric annotations, and distant keyframe inputs. Specifically, to ensure motion complexity and diversity, we begin with videos from the SportsSlomo dataset, which contains challenging scenarios. Then, we further expand the dataset's diversity by retrieving additional videos from Pexels using a curated list of motion-related keywords. This video collection strategy provides more precise control over the dataset content, which is crucial for training and evaluation in the CHKI setting.
-
The scale of our dataset [Response A1]. Our dataset contains over 2.5k video clips, which is approximately five times larger than the interpolation dataset collected by FCVG (~500 clips), representing a significant scale-up in this domain. Our dataset of the current scale has been proven to be effective by both qualitative and quantitative results of models trained on it.
-
SMPL-X annotation and manual verifications [Rebuttal A3, Response A2]. We employ the state-of-the-art 3D human estimator to generate SMPL-X annotations. This automatic fitting approach is consistent with prior works, such as Human4DiT. We have clarified that manual verifications have been conducted to ensure annotation quality. These verifications include:
- Pre-annotation manual verification. It is applied to the outputs of the human detection stage to filter out low-quality detections before proceeding with annotation.
- Post-annotation manual verification. This step involves annotators manually checking the annotations to avoid extremely erroneous fits, where the projection of the human shifts largely or presents an entirely different pose from the human in the image. Thanks to the accurate detections and the high re-projection accuracy of SMPLer-X estimations, such extreme errors were not observed during the verification. To build a solid evaluation dataset, we further manually filtered the test split of our dataset, focusing on minor errors such as slight arm shifting in the image. In this test-set manual verification step, we preserved 162 clips out of 187, corresponding to an error rate of 13.4% for minor errors according to the annotators' judgment. This verification helps ensure the quality of both the train and test splits of the proposed dataset.
As presented in the paper and our previous discussion, both the qualitative and quantitative results of models trained on the dataset have proven the effectiveness of our annotated dataset. Therefore, our dataset is constructed with ensured quality and can be further leveraged for potential future usage of the community.
This paper aims to improve the performance of Human-centric Keyframe Interpolation (HKI) by proposing PoseFuse3D, which introduces 3D-informed control signals SMPL (rendering and 3D information) to guide HKI.
Strengths and Weaknesses
Strengths:
- Existing keyframe interpolation methods cannot effectively address human scenarios, making the scenario addressed in this paper meaningful.
- The introduction of SMPL-X compensates for the insufficient 3D information in DWPose.
- It is innovative to both utilize the rendering information of SMPL-X and propose an SMPL-X Encoder to directly encode its 3D information.
- The results of this paper are state-of-the-art.
Weaknesses:
- Introducing human studies would better evaluate the performance of generative interpolation models.
- I found that most cases provided by the authors target a single person, but when extending to many humans, such as a group of people (more than 5), can PoseFuse3D-KI still work? Can SMPL still estimate accurately?
- How does the performance fare when humans are not positioned at the center of the frame?
- Is there any insight into the design of the SMPL-X Encoder?
- Do you need to consider copyright when using Pexels data?
Questions
see weaknesses.
Limitations
yes.
Final Justification
In the rebuttal, the authors have: 1) addressed my concerns regarding data openness and licensing, 2) provided further explanation of the design insights behind the SMPL-X Encoder, and 3) offered clarifications regarding multi-person scenarios, scenarios where people are not at the center, and scenarios with camera movement. I believe my concerns have been satisfactorily resolved. I consider this paper to provide valuable insights for the field of human-centric generative video interpolation, and therefore I maintain my score as accept.
Formatting Concerns
no
Q1: [Weakness 1] Introducing human studies would better evaluate the performance of generative interpolation models.
A1: Thank you for the valuable suggestion to include a human user study. To further evaluate the perceptual quality of our method, we conducted a user study involving 40 participants across 20 interpolation scenarios. Participants were asked to choose their preferred output among the results generated by the baseline methods and our proposed PoseFuse3D-KI. The summarized results are as follows:
| Methods | GIMM-VFI | GI | FCVG | PoseFuse3D-KI (Ours) |
|---|---|---|---|---|
| User Preference (%) | 1.75 | 5.13 | 6.63 | 86.50 |
Our method received a strong majority of user preferences (86.5%), consistently outperforming all baselines. These results further demonstrate the perceptual effectiveness of PoseFuse3D-KI.
Q2: [Weakness 2] I found that most cases provided by the authors target a single person, but when extending to many humans, such as a group of people (more than 5), can PoseFuse3D-KI still work? Can SMPL still estimate accurately?
A2: Thank you for the valuable question. As our method focuses on the plausible interpolation of human images with complex motion, we limit the number of humans in our training dataset. However, we have qualitatively validated the effectiveness of PoseFuse3D-KI in test cases involving more than five people. Unfortunately, due to limitations of the discussion format, we are currently unable to include the qualitative results here. These examples will be included in the supplementary materials of the revised version of the paper.
Q3: [Weakness 3] How does the performance fare when humans are not positioned at the center of the frame?
A3: We appreciate the reviewer for raising this question. Our method generalizes well to scenarios where humans are not centered in the frame, maintaining consistent and plausible interpolation performance. Such cases are included in the test set of our proposed dataset and are part of the quantitative evaluations presented in the paper. We will include visualizations of these examples in the revised version of the paper.
Q4: [Weakness 4] Is there any insight into the design of the SMPL-X Encoder?
A4: The SMPL-X encoder is introduced to supply additional spatial information, particularly in occluded regions where rendering operations result in information loss. While the rendering process transforms the 3D model into 2D images, it only preserves visible (non-occluded) information during projection. However, occluded information is crucial for modeling motion and generating plausible interpolations. Discarding it often leads to implausible results, as illustrated in Figure 3 of the main paper. To address this, the SMPL-X encoder directly processes 3D information and uses attention mechanisms to aggregate it into the image space, effectively preserving important cues that are lost during rendering.
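To illustrate the idea, a hedged sketch of attention-based 3D-to-2D aggregation follows. This is our illustrative reading, not the paper's actual architecture; all dimensions, layer choices, and names are assumptions.

```python
import torch
import torch.nn as nn

class SMPLX3DAggregator(nn.Module):
    """Hypothetical sketch: aggregate 3D SMPL-X features into image space.

    Per-vertex 3D tokens act as keys/values and a 2D feature grid provides
    queries, so occluded vertices can still contribute, unlike a rendered
    2D projection that discards them.
    """
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.point_mlp = nn.Sequential(          # lift xyz to feature tokens
            nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, grid_feats, verts_3d):
        # grid_feats: (B, H*W, dim) queries from a 2D feature map
        # verts_3d:   (B, V, 3) SMPL-X vertices in camera space
        tokens = self.point_mlp(verts_3d)                  # (B, V, dim)
        fused, _ = self.attn(grid_feats, tokens, tokens)   # cross-attention
        return grid_feats + fused                          # residual fusion

# Example shapes (SMPL-X has 10,475 vertices):
# agg = SMPLX3DAggregator(dim=256)
# out = agg(torch.randn(1, 32 * 32, 256), torch.randn(1, 10475, 3))
```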
Q5: [Weakness 5] Do you need to consider copyright when using Pexels data?
A5: Thank you for raising this important question. All Pexels content is provided under the Pexels license, which allows free use. The videos we collected are modified and used purely for research purposes, not distributed on other platforms or sold. Therefore, our usage complies fully with Pexels’ licensing terms.
Thank you for your response. Most of my concerns have been addressed.
Additionally, I have the following questions:
- Will the CHKI-Video dataset be open-source?
- Is the interpolation effective when there is camera motion? I believe it's important to discuss broader application scenarios and limitations in the revision.
We thank the reviewer for the valuable feedback. Please find our response below.
Q1: Will the CHKI-Video dataset be open-source?
A1: Thanks for the question. We believe that open-sourcing code and data is important for the research community, and we will make both our dataset and codebase open-source for future research.
Q2: Is the interpolation effective when there is camera motion? I believe it's important to discuss broader application scenarios and limitations in the revision.
A2: Thanks for pointing out the camera motion cases.
- Focusing on human-centric interpolation, our proposed method is effective for the foreground human under camera motion and can still perform plausible human interpolation. Such camera-motion cases are included in the test set of our CHKI-Video. We will add related qualitative results in the revised version of the paper.
- While our proposed method focuses on plausibly interpolating foreground humans, our approach works well in many cases for the background interpolation under camera motion, a qualitative example (surfing, 3rd case in Part 1) can be observed in the supplementary video. We acknowledge that there might be artifacts under challenging conditions of extreme camera motion. Addressing this limitation is an interesting direction for future research and could extend the applicability of our method to more general video interpolation tasks. We will also include this limitation in the revised version of our paper.
Thank you for your response.
I hope that in your revision, you will: 1) add human studies, 2) further clarify the design insights of the SMPL-X Encoder, and 3) provide clarifications regarding multi-person scenarios, scenarios where people are not at the center, and scenarios with camera movement.
I will maintain my score.
We would like to thank the reviewer for taking the time to evaluate both our submission and rebuttal. We appreciate your constructive comments and will include the indicated points in the revised version of the paper.
Strengths
- Video keyframe interpolation technique focusing on humans by making use of the 2D-rasterized DWPose and SMPL-X, and 3D features from the specially-designed SMPL-X encoder, which maps 3D and 2D-projected SMPL pose information to a conditioning feature
- Dataset collection and curation strategy
Weaknesses
- Just another variant of the ControlNet application on top of an off-the-shelf video generation model
- Uses simple linear pose interpolation between keyframes, completely ignoring the difficulty of the pose interpolation/in-betweening problem and previous papers on this topic
The contributions are relatively incremental and weak, but may be worth sharing with the community as another idea for improving human video generation, under the condition that the authors incorporate feedback from the reviewers in the final revision.
Therefore, I agree with reviewers to accept this paper.