W1: Experimental details

FaceShot is a training-free portrait animation framework that does not have training parameters. We use a single H800 to generate animation results at 512 512 resolution. Furthermore, we provide the time overhead of incorporating the appearance-guided landmark matching module and the relative motion transfer module for various frame counts, as shown in the table below:

frames	Target Matching	Motion Transfer
50	0.860(s)	0.382(s)
100	0.858(s)	0.751(s)

Target Matching: Detecting the target image landmarks using the appearance-guided landmark matching module. The time cost of landmark matching remains almost identical regardless of the number of frames because the matching is required only once for the target image given any driving video. The 0.8-second time includes both the DDIM inversion and the argmin operation in Eq(4).

Motion Transfer: Transferring landmark motion using relative landmark motion transfer module with very low time cost.

Q1: Details of appearance gallery

First, the target image is cropped into five facial parts i.e., eyes, mouth, nose, eyebrows and face boundary as:

Next, each part is matched to the closest domain in the appearance gallery by calculating the average CLIP image score for each domain:

where

(markdown in openreview does not support some latex input, so we split this formula into two lines), where represents the number of domain in part , denotes the number of images in given domain and CLIP-S denotes the clip image score. Finally, the reference image is formulated as .

Q2: Details of relative landmark motion transfer

For better understanding, we provide an illustration of relative landmark motion transfer in Figure R5. Specifically, our module consists of two stages: global motion transfer and local motion transfer. For global motion transfer, we focus on the overall positional changes of the entire face, represented by the discrepancy in the origin and angle of the rectangular coordinate systems between the -th frame and the -th frame. Next, we perform similar operations on each local facial part, but incorporating a scale factor to constrain the translation of the origin . Finally, we use the transformation of landmark points within the corresponding rectangular coordinate system as the final local translation for each part.

How to get the angle of the global rectangular coordinate system ?

The angle of the global rectangular coordinate system at frame is calculated using two endpoints of face boundary as:

How to determine the origin and angle of the local rectangular coordinate system of each part of the face?

It is determined by the endpoints of each part. The origin is calculated as follows:

When motion occurs, the origin and angle of different local rectangular coordinate systems in the reference and target image will also change. How to deal with this?

As shown in Figure R5, the local motion is defined as the changes / discrepancies in the origin and angle of different local coordinate systems between two reference images. When motion occurs, we add the discrepancies of origin and angle to target image's local rectangular coordinate system as the motion transfer process.

Q3: Details of CABench

We provide the types and quantities of characters in CABench as follows:

Anime Characters (small eyes)	Anime Characters (normal eyes)	Anime Characters (big eyes)	Emojis	Animals	3D characters	Human-like Characters	Toys
8	8	5	4	7	6	5	3

Additionally, we present the specific distribution of emotions and intensities in the driven videos, as shown below:

Intensity	Neutral	Calm	Happy	Sad	Angry	Fearful	Disgust	Surprised
normal	2	2	2	1	1	1	2	1
strong	1	1	1	2	2	2	1	2

In CABench, we have included a total of 46 images and 24 driven videos, with each video consisting of 110 to 127 frames. All videos (.mp4) and images (.jpg) are processed into a resolution of 512 512.