Rating: 6.0 / 10 (Poster) · 4 reviewers · min 4, max 8, std 1.4
Individual ratings: 4, 6, 8, 6
Confidence: 5.0 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 3.0
NeurIPS 2024

TPC: Test-time Procrustes Calibration for Diffusion-based Human Image Animation

Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

This paper proposes Test-time Procrustes Calibration, which enhances the robustness of diffusion-based image animation systems in the real-world unseen domain by guiding correspondences of human shapes between reference image and target poses.

Abstract

Keywords

Human Image Animation · Diffusion Model · Procrustes Analysis

Reviews and Discussion

Official Review (Rating: 4)

This paper proposes a diffusion guidance method, Test-time Procrustes Calibration (TPC), for human image animation. TPC incorporates an auxiliary branch into the diffusion conditioning process and provides a calibrated reference-image latent. Experimental results show the effectiveness of the proposed method.

Strengths

  1. This paper studies an interesting problem in human image animation: when the given human image and pose image have shape differences, the performance of the original model is poor.
  2. The structure of this paper is well-organized and easy to follow.
  3. The experimental results show the effectiveness of the proposed method.

Weaknesses

  1. In Figure 1a, I acknowledge that there may be a large shape difference between the given image that needs to be animated and the guidance image. In this case, I think the most direct and useful approach is to align the two images, which I don't think is very difficult. So why do we need the proposed TPC method?
  2. In line 54, the method of injecting guided images into the model should be more than just cross-attention. Would TPC work if the human image animation methods do not use cross-attention to guide image injection into the model?
  3. The last paragraph of the introduction introduces the proposed TPC approach. However, this part analyzes the gains that TPC can bring, and I think it is better suited to the method or experiment sections.
  4. As for the TED man in the demo of the experimental results, why does the animated video only show the man's upper body?
  5. For the unseen domain shown in the provided demo, I think in human image animation, we'll give the model images that need to be animated and guided images. Also, the base model is trained on a lot of images of human bodies, and the model should have seen something like this before, so calling it an unseen domain isn't accurate. For this kind of scene, there is a big gap between the character identity in the generated video and the given image, and TPC seems to have no way to help the original model maintain the character identity.

Questions

Please see the above.

Limitations

  1. The necessity of TPC
  2. The experimental results do not help the original model retain the details of the input image.
  3. Introduction needs to be reorganized
Author Response

Thank you for your valuable comments. We address your concerns below.

[Q1] I think the most direct and useful approach is to align the two images, which I don't think is very difficult. So why do we need the proposed TPC method?

[A1] The direct way of inputting an aligned reference image is sub-optimal, as it fails to maintain temporal consistency and to preserve the background. Figure J of our attached PDF shows the experimental results for this. For right-to-left walking motions, a single reference image cannot be aligned with both target positions, resulting in temporal inaccuracies, and at the same time the background is not preserved.

The TPC method addresses these issues by performing frame-wise calibrations to match all target frames and providing these as additional conditions for the diffusion model. To preserve the background, TPC filters out backgrounds in the calibrated frames (Line 214). The Procrustes approach used in TPC is more effective for calibration than other transformations, as demonstrated in Figure G and Table C, which show qualitative and quantitative superiority over methods like Affine [1,2], Perspective [3,4], and Global Pose Normalization (GPN) [5].
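As a concrete illustration of this frame-wise pipeline, here is a minimal Python sketch (our own paraphrase of the description above, not the released code); `warp` and `human_mask` are caller-supplied placeholders standing in for the Procrustes warp and the segmentation step:

```python
import numpy as np

def calibrate_sequence(ref_img, ref_kps, tgt_kps_per_frame, warp, human_mask):
    """Frame-wise calibration: warp the reference toward each target frame's
    pose, then keep only the foreground so the background cannot interfere."""
    calibrated = []
    for tgt_kps in tgt_kps_per_frame:
        frame = warp(ref_img, ref_kps, tgt_kps)     # e.g., a Procrustes warp
        mask = human_mask(frame)                    # e.g., a human mask, HxW in {0, 1}
        calibrated.append(frame * mask[..., None])  # filter out the background
    return calibrated  # additional conditions for the diffusion model
```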

Table C. Ablation study of various transformation methods for calibration, reported in a format of foreground/background (validation splits, average score of compositional alignment/misalignment). PS: Perspective.

| Method | L1↓ (×E-04) | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | FID-VID↓ | FVD↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| w/o calibration | 4.51/3.57 | 27.81/29.15 | 0.689/0.758 | 0.304/0.259 | 38.3/27.7 | 37.2/29.4 | 184/162 |
| Affine | 3.92/3.25 | 28.61/29.32 | 0.709/0.767 | 0.281/0.249 | 29.1/26.1 | 36.7/27.6 | 177/153 |
| PS | 3.46/3.06 | 28.88/29.37 | 0.715/0.769 | 0.279/0.247 | 28.6/25.9 | 35.2/27.4 | 171/151 |
| GPN | 3.12/2.83 | 29.06/29.48 | 0.731/0.776 | 0.274/0.244 | 27.3/25.6 | 31.5/26.9 | 164/145 |
| TPC (Ours) | 3.08/2.79 | 29.12/29.50 | 0.734/0.782 | 0.273/0.241 | 26.9/25.1 | 31.2/26.4 | 162/142 |

[1] FedMed-ATL: Misaligned Unpaired Cross-Modality Neuroimage Synthesis via Affine Transform Loss, ACM MM'22

[2] Adaptive Affine Transformation: A Simple and Effective Operation for Spatial Misaligned Image Generation, ACM MM'22

[3] Shape-Preserving Half-Projective Warps for Image Stitching, CVPR'14

[4] Image stitching with perspective-preserving warping, ISPRS'16

[5] Everybody Dance Now, ICCV'19
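For intuition, the transform families compared above can be estimated with standard tooling. Below is a minimal OpenCV sketch (our illustration, not the paper's code; the function name and point arrays are assumptions): a full affine and a homography are shape-distorting, while the 4-DoF partial affine (uniform scale, rotation, translation) is the shape-preserving family that Procrustes analysis recovers.

```python
import cv2

def calibrate_variants(ref_img, ref_pts, tgt_pts):
    """Warp ref_img toward the target with three transform families.

    ref_pts, tgt_pts: matched 2-D keypoints, float32 arrays of shape (N, 2), N >= 4.
    """
    h, w = ref_img.shape[:2]

    # Shape-distorting: full 6-DoF affine (shear and anisotropic scale allowed).
    A, _ = cv2.estimateAffine2D(ref_pts, tgt_pts)
    affine = cv2.warpAffine(ref_img, A, (w, h))

    # Shape-distorting: 8-DoF perspective homography.
    H, _ = cv2.findHomography(ref_pts, tgt_pts)
    perspective = cv2.warpPerspective(ref_img, H, (w, h))

    # Shape-preserving: 4-DoF similarity (uniform scale + rotation + translation),
    # the same family of transforms that Procrustes analysis solves for.
    S, _ = cv2.estimateAffinePartial2D(ref_pts, tgt_pts)
    similarity = cv2.warpAffine(ref_img, S, (w, h))

    return affine, perspective, similarity
```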

[Q2] In line 54, the method of injecting guided images into the model should be more than just cross-attention. Would TPC work if the human image animation methods do not use cross-attention to guide image injection into the model?

[A2] Yes, it works. Figure K demonstrates that even feature addition, a simpler injection method, is effective, though not as effective as cross-attention. In particular, the addition method is effective when the shapes are positionally aligned between the calibrated frame and the target frame. Table D demonstrates the quantitative effectiveness of both injection methods.

Table D. Ablation study of different injection methods for applying TPC, reported in a format of foreground/background (validation splits, average score of compositional alignment/misalignment).

| Method | L1↓ (×E-04) | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | FID-VID↓ | FVD↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| w/o TPC | 4.51/3.57 | 27.81/29.15 | 0.689/0.758 | 0.304/0.259 | 38.3/27.7 | 37.2/29.4 | 184/162 |
| w/ TPC (addition) | 3.16/2.92 | 29.01/29.37 | 0.727/0.771 | 0.283/0.249 | 28.2/25.9 | 32.6/26.9 | 171/152 |
| w/ TPC (cross-attention) | 3.08/2.79 | 29.12/29.50 | 0.734/0.782 | 0.273/0.241 | 26.9/25.1 | 31.2/26.4 | 162/142 |
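To make the two injection routes concrete, here is a schematic PyTorch sketch (our own illustration, not the authors' implementation); the module name and tensor shapes are assumptions:

```python
import torch
import torch.nn as nn

class CalibrationInjection(nn.Module):
    """Illustrative fusion of a calibrated reference latent into denoising features."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, h: torch.Tensor, c: torch.Tensor, mode: str = "cross-attention"):
        # h: (B, N, D) flattened spatial tokens of the denoising features
        # c: (B, M, D) tokens of the calibrated reference latent
        if mode == "addition":
            # Feature addition (requires M == N): simple, and effective mainly
            # when the calibrated frame is positionally aligned with the target.
            return h + c
        # Cross-attention: queries from the denoising features, keys/values
        # from the calibrated latent, so the alignment need not be positional.
        out, _ = self.attn(query=h, key=c, value=c)
        return h + out
```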

[Q3] The last paragraph of the introduction is more suitable for methods or experiments.

[A3] Yes, we will move it to the experiment section.

[Q4] As for the TED man in the demo (Figure 8) of the experimental results, why does the animated video only show the man's upper body?

[A4] This is because the input target motion only includes the upper body. The skeleton's green and blue branches, connected to the pelvis, are part of the upper body. For reference, please see the full-body skeletons in the upper sample of the woman walking.

[Q5] For the unseen domain shown in the provided demo, I think in human image animation, we'll give the model images that need to be animated and guided images. Also, the base model is trained on a lot of images of human bodies, and the model should have seen something like this before, so calling it an unseen domain isn't accurate. For this kind of scene, there is a big gap between the character identity in the generated video and the given image, and TPC seems to have no way to help the original model maintain the character identity.

[A5] The unseen domains we aim to address are samples containing compositional misalignment (i.e., samples of human shape misalignments in terms of scale and rotation between the reference and target images). Human image animation datasets (e.g., TikTok, TED-talks) ensure this alignment, but real-world scenarios often involve compositional misalignments, making current systems vulnerable to these samples. To address this problem, we introduce a calibration branch for the diffusion model, demonstrating its effectiveness at test-time without additional training. To quantitatively validate our approach, we have collected an additional test set specifically for compositional misalignment samples, alongside the original test set (as mentioned in Line 221, with source links available in our supplementary material).

We acknowledge the redundant use of the term "unseen" in Figure 10 and the demo. This term actually refers to the reference image obtained from the Text-to-Image (T2I) model. We will update the terminology to "T2I-synthesized" to avoid confusion.

Comment

Dear Reviewer KKQ2:

Thanks for reviewing this work. Would you mind checking the authors' feedback to see whether it resolves your concerns, or whether you have further comments?

Best, AC

Comment

Thank you for your feedback. It seems you may have some reservations about giving us a score of 5, possibly due to concerns about novelty (which currently led to a rescaling to 4). We would like to clarify that the system presented in our manuscript is a generalized version of diffusion-based human image animation systems and is not intended to replicate AnimateAnyone. As shown in Figure B of our attached PDF, the image encoder and pose encoder in our system correspond to the Appearance Encoder and ControlNet in MagicAnimate, and to the CLIP Encoder and ControlNet in DisCo. Moreover, the experiments in Table 1 of our manuscript demonstrate our module's effectiveness across four different systems. We hope this clarification addresses some of your concerns, and we ask you to reconsider the contributions of our proposed module.

Comment

Thank you for your response. Based on the comments of other reviewers, I decided to adjust the score from 3 to 4. The main reason is that, like reviewer mDg5, I find this method to be an enhanced version of the retargeting technique proposed in AnimateAnyone, which is not novel enough. In addition, most of the videos in the demo contain artifacts, such as the flicker of the woman's fingers at 42s, the flicker of clothes at 49s, and the shaking of the background at 54s.

Official Review (Rating: 6)

In this paper, the authors propose TPC, an alignment algorithm for human image animation systems. In current systems, optimal precision is achieved only when the physical compositions of the human shapes in the reference image and the target pose frame are aligned; misalignment leads to a significant decline in fidelity and consistency. To address this issue, the authors propose Test-time Procrustes Calibration (TPC). TPC provides a calibrated reference image for the diffusion model, enhancing its ability to understand the correspondence between human shapes in the reference and target images. This technique can be applied to any diffusion-based image animation system in a model-agnostic manner, improving effectiveness at test time without requiring additional training.

Strengths

(1) TPC bridges the gap in shapes between reference and target by providing a correspondence-guidance latent. With this guidance condition, diffusion-based animation systems achieve robustness to fidelity variations and maintain temporal consistency among frames.

(2) The TPC is simple and works in a model-agnostic manner without additional training.

Weaknesses

(1) The implementation details of Iterative Propagation are not sufficiently comprehensive; I suggest the authors provide an algorithm table depicting the concrete process of the proposed Iterative Propagation.

(2) There are skeleton alignment algorithms in previous works, such as Everybody Dance Now. The authors should conduct additional quantitative and qualitative comparison experiments between TPC and these previous skeleton alignment algorithms.

(3) According to my observations of your provided demos, the backgrounds of some rotation cases encounter considerable changes during the animation process. The background details don’t align with the given reference image, such as the layout and color distribution. For instance, in the galaxy case, the positions of the stars in the background of the animated video are not the same as those in the original given reference image.

Questions

(1) In some particular situations, the protagonist in the given reference image only shows half of the whole body, while the driven target pose shows the whole body parts. Can TPC handle this sort of situation? Could you please give some quantitative and qualitative experiment results regarding this situation?

(2) I am unsure how you can estimate or ensure the accuracy of the subset X. Could you please provide some visual comparison results in terms of different subsets with different numbers of key points? Additionally, could you provide the details of the common filtering process?

Limitations

The authors have adequately addressed the limitations and potential negative societal impact of their work.

Author Response

Thank you for your valuable comments. We address your concerns below.

[Q1] I suggest an algorithm table of the proposed Iterative Propagation.

[A1] Figure E of our attached PDF presents the algorithm table for Iterative Propagation. We will incorporate this into our manuscript. Thank you.

[Q2] The authors should conduct additional quantitative and qualitative comparison experiments between TPC and other previous skeleton alignment algorithms (e.g., Everybody Dance Now [1]).

[A2] Our investigation identified several transformation (i.e., alignment) methods, which can be categorized into two types: (1) shape-distorting methods (e.g., Affine [2,3], Perspective [4,5]) and (2) shape-preserving methods (e.g., TPC (Ours), Global Pose Normalization (GPN) [1]). Qualitative results in Figure G show that shape-distorting transformations can accurately align the targeted region but tend to cause significant information loss in other areas due to distortion, leading to low fidelity. Shape-preserving methods generally performed better, but GPN struggles with rotational motions because it cannot model rotation. Quantitative analysis in Table B confirms that TPC is the most effective. We will include these comparisons in our manuscript. Thank you.

Table B. Ablation study of various transformation methods for calibration, reported in a format of foreground/background (validation splits, average score of compositional alignment/misalignment). PS: Perspective.

| Method | L1↓ (×E-04) | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | FID-VID↓ | FVD↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| w/o calibration | 4.51/3.57 | 27.81/29.15 | 0.689/0.758 | 0.304/0.259 | 38.3/27.7 | 37.2/29.4 | 184/162 |
| Affine | 3.92/3.25 | 28.61/29.32 | 0.709/0.767 | 0.281/0.249 | 29.1/26.1 | 36.7/27.6 | 177/153 |
| PS | 3.46/3.06 | 28.88/29.37 | 0.715/0.769 | 0.279/0.247 | 28.6/25.9 | 35.2/27.4 | 171/151 |
| GPN | 3.12/2.83 | 29.06/29.48 | 0.731/0.776 | 0.274/0.244 | 27.3/25.6 | 31.5/26.9 | 164/145 |
| TPC (Ours) | 3.08/2.79 | 29.12/29.50 | 0.734/0.782 | 0.273/0.241 | 26.9/25.1 | 31.2/26.4 | 162/142 |

[1] Everybody Dance Now, ICCV'19

[2] FedMed-ATL: Misaligned Unpaired Cross-Modality Neuroimage Synthesis via Affine Transform Loss, ACM MM'22

[3] Adaptive Affine Transformation: A Simple and Effective Operation for Spatial Misaligned Image Generation, ACM MM'22

[4] Shape-Preserving Half-Projective Warps for Image Stitching, CVPR'14

[5] Image stitching with perspective-preserving warping, ISPRS'16

[Q3] The backgrounds of some rotation cases don’t align with the reference image. For instance, in the galaxy case, the positions of the stars in the background of the animated video are not the same as those in the original reference image.

[A3] Thank you for your observation. The galaxy case was mistakenly generated without applying the background mask in our method, which led to the observed artifacts in background preservation. As noted in our implementation details (Line 214), we eliminate background intervention in calibrated images. When the background mask is correctly applied, it ensures that background details align with the reference image. Figure H shows the corrected sample with proper background preservation. We will update this and apologize for any confusion it caused.

[Q4] In some particular situations, the protagonist in the given reference image only shows half of the whole body, while the driven target pose shows the whole body parts. Can TPC handle this sort of situation? Give some quantitative and qualitative experiment results.

[A4] Yes, Figure 13 in the Appendix of our paper illustrates the resulting frames in such scenarios (specifically, the sample of a woman wearing white clothes). Existing models generate unnatural lower bodies from the upper-body reference; however, when integrated with TPC, they seamlessly blend the lower body with the upper body's style. For quantitative analysis, we collected such samples as an additional test set in which the reference and target human shapes are not aligned in terms of rotation or scale, indicating compositional misalignment. Table 1 provides a quantitative evaluation on these compositional misalignment samples; TPC significantly improves their quality to a level comparable to that of composition-aligned samples.

[Q5] I am unsure how you can estimate the accuracy of the subset X. Provide some visual comparison results in terms of different subsets with different numbers of key points and provide the details of the common filtering process.

[A5] Yes, Figure I in our attached PDF shows visual comparisons with different numbers of keypoints, demonstrating optimal performance with 6 keypoints. Using 4 or 5 keypoints was also effective in this sample.

The selection of the optimal keypoint subset involves two steps (code available in our supplementary material).

  1. Filtering keypoints: Keypoints from the reference and target images are filtered to retain only the commonly visible ones. Points with a low prediction score (< 0.3; invisible points usually score below 0.3) are removed (using OpenPose for keypoint prediction). The remaining points common to both images form the common sets X and Y, representing the same body parts.

  2. Selecting the optimal subset: From the common sets, we generate all possible subsets with four or more points (for computational efficiency). Using Procrustes Analysis, we obtain the transformation parameters (i.e., scaling, rotation, translation) from each subset x ⊂ X to the corresponding subset y ⊂ Y and transform the reference image into a calibrated image. We then measure the overlap accuracy of the human shapes between the calibrated and target images using the pixel-wise IoU score (with SAM for segmentation). The subset with the highest score is selected as the optimal subset x*. For computational efficiency, this process also supports a batch-wise implementation.
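A minimal NumPy sketch of this two-step procedure follows (our own reconstruction from the description above, not the released code). Here `procrustes_transform` is the closed-form similarity fit (Umeyama's solution), `ref_kps`/`tgt_kps` are (K, 2) NumPy keypoint arrays, and the subset search scores each candidate with a caller-supplied `overlap_iou` callback, which stands in for the SAM segmentation and pixel-wise IoU step; the 0.3 threshold follows the text.

```python
import itertools
import numpy as np

def procrustes_transform(x, y):
    """Least-squares similarity transform (scale s, rotation R, translation t)
    mapping points x (N, 2) onto y (N, 2), via Umeyama's closed-form solution."""
    mx, my = x.mean(axis=0), y.mean(axis=0)
    xc, yc = x - mx, y - my
    cov = yc.T @ xc / len(x)                      # 2x2 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(2)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # guard against reflections
        S[1, 1] = -1.0
    R = U @ S @ Vt
    var_x = (xc ** 2).sum() / len(x)
    s = (D * np.diag(S)).sum() / var_x
    t = my - s * R @ mx
    return s, R, t

def select_optimal_subset(ref_kps, tgt_kps, ref_conf, tgt_conf, overlap_iou,
                          thr=0.3, min_size=4):
    """Step 1: keep keypoints visible (score >= thr) in BOTH images.
    Step 2: fit a Procrustes transform on every subset of >= min_size points
    and keep the one whose warped human shape best overlaps the target."""
    common = [i for i in range(len(ref_kps))
              if ref_conf[i] >= thr and tgt_conf[i] >= thr]
    best_subset, best_score = None, -1.0
    for k in range(min_size, len(common) + 1):
        for subset in map(list, itertools.combinations(common, k)):
            s, R, t = procrustes_transform(ref_kps[subset], tgt_kps[subset])
            score = overlap_iou(s, R, t)  # e.g., IoU of warped vs. target masks
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score
```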

Comment

Dear Reviewer c8u6:

Thanks for reviewing this work. Would you mind checking the authors' feedback to see whether it resolves your concerns, or whether you have further comments?

Best, AC

Comment

I have carefully read the author's response. It has addressed all of my concerns. I will keep my original score.

Official Review (Rating: 8)

This paper starts from an interesting problem: what happens when the motion condition and the reference image are not well aligned. It analyzes the robustness of an existing human animation network under different levels of misalignment and tries to find the underlying cause through attention maps. The paper then proposes TPC, a test-time calibration module, to mitigate this issue. The module provides a calibrated latent feature as an additional condition to the diffusion model. The calibrated latent is optimized via Procrustes Warping, and Iterative Propagation is proposed to improve temporal consistency among calibrated features.

Strengths

This paper is among the most innovative works I've seen in the human animation area. It is a curiosity-driven work that starts with an interesting problem, and the solution is neat and elegant.

The presentation is clear and easy to follow. The analysis in Fig 1 and Fig 2 is very interesting and insightful.

Technically, using keypoints as the evidence to align shapes is quite direct and reasonable. Using Procrustes analysis to optimize a transformation is sound.

Weaknesses

I did not find an obvious weakness, except that the demo results still contain some artifacts. But this is mainly due to the baselines MA and DisCo. I recommend that the authors try some newer, higher-quality methods as baselines.

The method is insightful, but it applies only to human animation, a small area. It would greatly enhance the value of this paper if it could be applied to more general tasks.

Questions

What if we directly calculate the transformation between the reference image and the target pose from some heuristic evidence? For example, consider the torso, i.e., the keypoint set {6,7,13,12}, since the torso stays rigid to some extent compared to the arms and legs. Then use the vector (mid(6,7), mid(13,12)) to compute the rotation angle, the perimeter for the scale, and the center coordinate for the shift. I know this design is not optimal, but how does such a simple idea perform? I noticed the "Linear" and "Affine" settings in Table 2, but they do not seem to be exactly the same thing.

Can this method potentially be applied to broader tasks that require spatial alignment between input variables?

Limitations

The Broader Impacts and Ethics Statement have been adequately addressed.

Author Response

We appreciate your recognition of our work. We address your concerns below.

[Q1] Demo results still contain some artifacts. But it is mainly due to the baseline MA or Disco. I recommend the authors to try some new methods as the baseline that have higher quality.

[A1] Yes, as shown in Figure F of our attached PDF, applying TPC to more recent models, such as Champ (to appear at ECCV'24) and MimicMotion (arXiv'24), also enhances their inference, yielding better quality. We will incorporate this into our manuscript. Thank you!

[Q2] It would greatly enhance the value of this paper if it can be applied to some more general tasks. Can this method be potentially applied to broader tasks that require spatial alignment between some input variables?

[A2] Yes, as shown in Figure C, we confirmed that our proposed calibration concept also applies to other image-to-image tasks (e.g., virtual try-on) and video-to-video tasks (e.g., video editing), enhancing their quality. We will incorporate this as an additional section on plug-and-play applications in our manuscript.

[Q3] What if we directly calculate the transformation between the reference image and the target pose from some heuristic evidence? For example, consider the torso keypoint set {6,7,13,12}, using the vector (mid(6,7), mid(13,12)) for the rotation angle, the perimeter for the scale, and the center coordinate for the shift. How does such a simple idea perform? The "Linear" and "Affine" settings in Table 2 do not seem to be exactly the same thing.

[A3] Figure D illustrates the resulting frames based on the set of keypoints on the torso. This method is as effective as our original method when the torso is clearly visible in the target motion (i.e., the first frame). However, it shows vulnerabilities when the torso is not visible (i.e., the third frame). In fact, our initial approach also used keypoints of rigid body parts (e.g., the torso) for transformations, but visibility issues arose with various movements. To address this, we proposed finding the optimal keypoints for each frame, considering the keypoints commonly visible in both the target and the reference.
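For readers curious how the torso heuristic above looks in code, here is a small sketch (our own illustration of the reviewer's idea, not anything from the paper); the keypoint indices follow the reviewer's example, inputs are (K, 2) NumPy arrays, and the perimeter-ratio scale and center-difference shift are the simplifications the reviewer described:

```python
import numpy as np

TORSO = [6, 7, 13, 12]  # reviewer's example: shoulder and hip keypoint indices

def torso_similarity(ref_kps, tgt_kps):
    """Similarity parameters from the (roughly rigid) torso quad alone:
    rotation from the mid(shoulders)->mid(hips) vector, scale from the
    perimeter ratio, and translation from the center difference."""
    r, t = ref_kps[TORSO], tgt_kps[TORSO]

    def axis(p):       # vector from mid(6,7) to mid(13,12)
        return (p[2] + p[3]) / 2.0 - (p[0] + p[1]) / 2.0

    def perimeter(p):  # closed polygon through the four torso points
        return sum(np.linalg.norm(p[i] - p[(i + 1) % 4]) for i in range(4))

    va, vb = axis(r), axis(t)
    angle = np.arctan2(vb[1], vb[0]) - np.arctan2(va[1], va[0])
    scale = perimeter(t) / perimeter(r)
    shift = t.mean(axis=0) - r.mean(axis=0)
    return angle, scale, shift
```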

Comment

Dear Reviewer rFbR:

Thanks for reviewing this work. Would you mind checking the authors' feedback to see whether it resolves your concerns, or whether you have further comments?

Best, AC

Comment

The authors have addressed my concerns well, with sufficient analysis and visualization. I have no further questions and will keep my original rating.

Official Review (Rating: 6)

This work proposes a method that combines an existing human image animation method with Procrustes Warping, improving the robustness of the image animation approach. In addition to the explicit image warping, the work also proposes an iterative propagation method to improve temporal consistency. Experiments show that the proposed method effectively enhances animation performance.

Strengths

  1. The proposed method is well motivated; the observation of the misalignment between the reference image and the target pose is straightforward and insightful.
  2. The experiments in this work are solid, both the comparisons and ablations are extensive and detailed. Compared with the baseline methods, the improvements are noticeable.
  3. The writing is clear and easy to follow.

Weaknesses

  1. As far as I know, Procrustes alignment is popular in the evaluation of 2D/3D human pose estimation. This work introduces this simple technique into human image animation, but technically the contribution is not novel enough.
  2. There is no detailed description of the training process of this method; it would be better to say more about the training process in the main text.
  3. From Table 2 we can see that the best M is 20, but the only evaluation metrics shown in the table are SSIM and FVD; it would be better to show more results to help readers understand the ablation comprehensively.
  4. Is there any trade-off observed between temporal smoothness and motion precision? For example, if M=1, is the temporal consistency the best while the pose precision is the worst?

Questions

How do you apply TPC to DisCo and MagicAnimate? I am a little confused, since Figure 5 only depicts the structure of AnimateAnyone.

Limitations

Discussed.

Author Response

Thank you for your valuable comments. We have addressed your concerns to the best of our ability.

[Q1] Procrustes alignment is popular in evaluating 2D/3D human pose estimation. This work introduces a simple technique into human image animation, but technically the contribution is not novel enough.

[A1] We are not claiming novelty for Procrustes alignment itself, but for the test-time calibration concept for diffusion models, which aligns the target subject in the reference image to the pose in the pose video. To the best of our knowledge, ours is the first work to address the issue of human image animation systems performing poorly in real-world scenarios, identifying compositional misalignment as the cause. When incorporated into any diffusion-based image animation system, our calibration significantly enhances its robustness in real-world applications.

The Procrustes approach is one of our experimental attempts to implement this calibration concept. Table 2 in our manuscript also provides results from other attempts, including affine and linear transformation approaches. The Procrustes approach proved to be the most effective and our technical contributions (i.e., optimal keypoints selection and iterative propagation) focus on optimizing the calibration.

[Q2] There is no detailed description of the training process of this method; it would be better to say more about the training process in the main text.

[A2] The proposed test-time method has no separate training procedure; it generates videos directly during inference. No training is required, as mentioned in Lines 18 and 99.

[Q3] From Table 2 we can see that the best M is 20, but the evaluation metrics shown in the table are SSIM and FVD; it would be better to show more results to help readers understand the ablation comprehensively.

[A3] As mentioned in Line 288 and also shown in Table 2, M=30 provides the best setting. Table A below provides the ablation study with additional metrics for both single frames (i.e., PSNR, SSIM, LPIPS, FID, L1 error) and videos (i.e., FID-VID, FVD). We will incorporate this into Table 2 of our manuscript. Thank you.

Table A. Ablation study of Iterative Propagation (IP) according to the number of groups M, reported in a format of foreground/background (validation splits, average score of compositional alignment/misalignment).

| Method | L1↓ (×E-04) | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | FID-VID↓ | FVD↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| w/ IP (M=1) | 3.92/3.25 | 28.61/29.32 | 0.709/0.767 | 0.281/0.249 | 29.1/26.1 | 36.7/27.6 | 177/153 |
| w/ IP (M=10) | 3.46/3.06 | 28.88/29.37 | 0.715/0.769 | 0.279/0.247 | 28.6/25.9 | 35.2/27.4 | 171/151 |
| w/ IP (M=20) | 3.12/2.83 | 29.06/29.48 | 0.731/0.776 | 0.274/0.244 | 27.3/25.6 | 31.5/26.9 | 164/145 |
| w/ IP (M=30) | 3.08/2.79 | 29.12/29.50 | 0.734/0.782 | 0.273/0.241 | 26.9/25.1 | 31.2/26.4 | 162/142 |
| w/ IP (M=40) | 3.10/2.81 | 29.08/29.46 | 0.728/0.777 | 0.275/0.243 | 27.0/25.3 | 31.5/26.8 | 165/145 |
| w/ IP (M=50) | 3.09/2.79 | 29.09/29.43 | 0.731/0.777 | 0.274/0.242 | 26.9/25.2 | 31.8/27.2 | 170/149 |
| w/o IP (M = video length) | 3.11/2.80 | 29.09/29.47 | 0.731/0.778 | 0.275/0.243 | 27.0/25.2 | 33.2/27.1 | 169/150 |

[Q4] Is there any trade-off observed between temporal smoothness and motion precision? For example, if M=1, is the temporal consistency the best while the pose precision is the worst?

[A4] Yes, we observed a slight trade-off between consistency and precision within an effective operating region for Iterative Propagation (IP). Figure A in our attached PDF presents a sensitivity analysis of IP with respect to M in terms of synthesis precision (LPIPS) and temporal consistency (FVD). We identified the effective operating region as 20 < M < 50 (for an average 120-frame video), where IP shows a slight trade-off.

However, when we set M = 1, both precision and consistency were at their lowest. This outcome is attributed to the insufficiency of calibrated images, since only a single calibrated frame is selected to propagate to all the target frames at each denoising step. The selected calibrated image fails to provide accurate calibration for the target frames due to significant temporal misalignment; consequently, precision is reduced, and the series of inaccuracies also degrades consistency. The trend in consistency therefore led us to choose M = 30 as the optimal setting.

[Q5] How do you apply TPC in DisCo and MagicAnimate?

[A5] In Figure B of our attached PDF, we illustrate the applications of TPC on top of DisCo and MagicAnimate. For DisCo, the image encoder in Figure 5 is the CLIP Encoder, and the pose encoder is the pose ControlNet of DisCo. For MagicAnimate, the image encoder corresponds to the Appearance Encoder of MagicAnimate, and the pose encoder corresponds to ControlNet.

Comment

Thanks for the detailed response. I have carefully read the rebuttal, and it has addressed all of my concerns; thus, I will increase my rating. The reason for not selecting a higher score is that this method is essentially an enhanced version of the retargeting technique proposed in AnimateAnyone, and its novelty is not strong enough. In general, the analysis and experiments in this work are solid, which makes me vote for a weak accept.

Author Response

We have uploaded a PDF file containing figures. Please refer to this PDF along with our rebuttal for a clear understanding.

Comment

Dear Reviewers:

Thanks for reviewing this work. Since we have mixed reviews (4, 6, 6, 8), I would like to start an internal discussion. Specifically, Reviewer KKQ2 raised two concerns: 1) novelty, and 2) artifacts in the results. Can the other reviewers share your thoughts on these two questions?

Best, AC

Final Decision

This paper was reviewed by four experts in the field. The initial reviews were mixed, as reviewers were concerned about the experimental validation. The main concerns were: 1) the novelty of the proposed approach, as alignment is a common solution in 2D/3D human pose estimation, and 2) the motivation for Procrustes Calibration, as pre-calibration is also a possible alternative solution.

After discussion, the second concern was mostly resolved, and reviewers remained concerned only about the novelty of this work. Still, all reviewers agreed that this is well-presented work, the idea is simple and effective, and the performance is good, even though the approach is not fully novel. Therefore, the area chair suggests acceptance.

We recommend that the authors carefully read all reviewers' final feedback and revise the manuscript as suggested for the final camera-ready version. We congratulate the authors on the acceptance of their paper!