EcoFace: Audio-Visual Emotional Co-Disentanglement Speech-Driven 3D Talking Face Generation
Abstract
Reviews and Discussion
This paper describes a framework for disentangling facial expressions of emotion from facial motions related specifically to speech. The goal is to later re-synthesize expressive talking faces driven by speech. Furthermore, a triplet loss is included to ensure that different expressions can be separated in the representation, and that differences in the magnitude of the same expression can be disambiguated. To account for differences in the way that different people display facial expressions, the generator is conditioned on a learned identity embedding.
Strengths
The problem being tackled is important and challenging. We are highly sensitive to discrepancies in generated facial motion that accompanies speech. This work tackles the problem by disentangling facial motion related to emotion from facial motion related to speech. Furthermore, accounting specifically for the degree of expression is effective.
The work uses open data to aid reproducibility. Furthermore, code will be made available.
The approach is effective. The performance against the baselines is good. The demo videos are impressive. Talking head generators often cannot generalize to things like singing (because the sustained gestures look unnatural), but the approach here does an excellent job. Furthermore, the differences in articulation across different languages are very well captured. For example, the lip rounding in the articulation of French speech is very impressive.
The ablations show the importance of the design of the components of the system.
I appreciate the inclusion and attention to detail in Section 4.3.
Weaknesses
A limitation of the objective metrics, such as LVE, is that they do not account for the type of error. A larger LVE in the articulation of, say, /k/ might be insignificant, but a large error in the articulation of, say, /b/ is highly problematic. I realize these are standard metrics, but I still see them as problematic for this reason.
See the questions/suggestions below.
Questions
How is the level of emotion in the training data assigned?
On line 135 — is an element of s \in R^{n} the probability of that expression being present?
In Equation (7) how is alpha set?
In the paragraph following Equation (7), should z_{l} be z_{i} to match the equation?
In Equation (11), does each term contribute equally to the loss? There is no weighting to account for things like differences in scale?
On line 477 — you mentioned the margin that you beat the baselines by being 30.9%, 29.4%, and 41.24%. This is not a margin though is it? Are these not the scores that your approach attained? The margin would be the difference.
Suggestions: Change references to "speaker-specific" to "speaker-aware". Speaker-specific typically means that you train a different model for each individual speaker.
On line 135 you mention "where each a \in R^{D} has D sampled audio". I think you mean that each a \in R^{D} is a D-dimensional feature vector. I am not sure what "D sampled audio" means.
In Section 3.1 it would help to specify that the speech features are latent features and not typical speech features, e.g., from a Mel-filter bank.
In Section 3.1 it was not clear why, for T frames of audio, there are 2T frames of emotion-related and 2T frames of content-related features. Maybe explain this when these are introduced.
The equations should flow as part of the sentences. Throughout the paper you end a sentence before providing the equation.
Line 146 — replace low-latitude with low-dimensional.
On line 232, you refer to "the voiceless phase". Voiceless has a specific meaning in speech. To be clear, you should avoid using this term and refer to "periods where speech is not present". (For reference, voiceless speech is speech produced when the vocal cords are apart.)
Details of Ethics Concerns
N/A
Dear reviewer, we sincerely appreciate your careful and accurate understanding of our approach. We hope our response fully resolves your concerns.
Q1: A limitation of the objective metrics, such as LVE, is that they do not account for type of error. A larger LVE in the articulation of, say, /k/ might be insignificant, but a large error in the articulation of, say, /b/ is highly problematic. I realize these are standard metrics, but I still see these as problematic for this reason.
A1: We agree with your insightful comment regarding the limitation of objective metrics like LVE. We chose to use LVE because it is a widely accepted evaluation standard. However, we also recognize the issue you pointed out: LVE does not account for the type of error. Specifically, while LVE may show a large error in the articulation of /k/ due to the mouth's opening being different from the ground truth, this may not necessarily indicate an error in pronunciation. In contrast, LSE-C and LSE-D address this issue more effectively, as they assess the matching of continuous lip movements to the speech in the time domain. These metrics can distinguish cases where a large LVE might be insignificant, like with the articulation of /k/, from cases where a large error is truly indicative of a pronunciation mistake, such as with the articulation of /b/ where the lips are closed.
When the LVE is large in both cases, relying on LVE alone can lead to misjudgment, but using LSE-C and LSE-D allows us to avoid such situations. We also see potential improvements for this metric in mesh-driven evaluations. Since the SyncNet expert used in Wav2Lip is trained with both audio and image inputs, a possible next step would be to train a specialized SyncNet expert using the 3D lip vertices from the mesh and the audio as inputs. This approach, if applied to large-scale datasets, could significantly enhance the evaluation of mesh-based talking faces.
Q2: How is the level of emotion in the training data assigned?
A2: In the RAVDESS dataset, the emotion intensity is divided into two categories. Therefore, we assign emotion intensity labels of 0 and 1 to the training data. These labels are primarily used in the calculation of triplet loss, where we select samples with the same intensity label as the anchor as positives, and samples with different intensity labels as negatives.
Q3: On line 135 — is an element of s \in R^{n} the probability of that expression being present?
A3: In line 135, s is a one-hot embedding vector. Assuming there are n speakers, s is a vector of length n containing only 0s and 1s. If we want to condition the model on the speaking style of the i-th person (where 0 ≤ i < n), then s_i = 1 and all other elements of s are set to 0.
Q4: In Equation (7) how is alpha set?
A4: In Equation (7), we set α to 0.5. This choice was made after exploring alternative values, such as α=0.25 and α=0.7. When α=0.25, positive and negative samples become too close, reducing the model's ability to effectively separate subtle differences in emotion intensity. On the other hand, with α=0.7, the final loss converges to a larger value compared to α=0.5. Furthermore, t-SNE visualizations of the encoded feature space revealed that the separation of emotion intensities was similar for both α=0.5 and α=0.7. This indicates that at α=0.5, the model has already achieved an upper limit in separating different emotion intensities, and further increasing α does not yield additional benefits. We have included t-SNE visualizations in the appendix, illustrating the distribution of features for different emotion intensities in a single emotion space. These plots demonstrate that the features are well-separated, with distinct clusters for the two intensities within the same emotion class under the α=0.5 setting.
Q5: In the paragraph following Equation (7), should z_{l} be z_{i} to match the equation?
A5: Thank you for pointing out this issue. You are correct, and I have revised the notation in the paragraph following Equation (7) to use z_i, ensuring consistency with the equation.
Q6: In Equation (11), does each term contribute equally to the loss? There is no weighting to account for things like differences in scale?
A6: In Equation (11), since all coefficients are set to 1, each term contributes equally to the loss. This choice is made because our main research focus is on evaluating the effectiveness of emotion disentanglement rather than optimizing the performance of the generative module. Furthermore, our goal is to ensure that the model performs well in its basic configuration without relying on hyperparameter tuning. By assigning the same weights to all terms, we can assess the basic capabilities of the model.
We also experimented with different weightings to explore potential performance improvements. For instance, when the weight of L_emo (the last term in Equation 11) was increased to 2 while keeping the other terms at 1, the VE-FID metric on the RAVDESS dataset improved slightly from 21.57 to 21.08. However, increasing the weight to 3 led to minimal further improvement in VE-FID (21.07), but it adversely affected lip synchronization metrics, with LSE-D increasing from 9.616 to 10.212 and LSE-C decreasing from 1.010 to 0.996. This suggests that overemphasizing the emotion term can compromise lip synchronization accuracy. Similarly, when we adjusted the weight of L_sync (the second-to-last term in Equation 11) to 2 or 3, we observed negligible changes in the LSE-D and LSE-C metrics. It is evident that our model is not sensitive to the loss coefficients.
Q7: On line 477 — you mentioned the margin that you beat the baselines by being 30.9%, 29.4%, and 41.24%. This is not a margin though is it? Are these not the scores that your approach attained? The margin would be the difference.
A7: Thank you for pointing this out. What I intended to convey was the proportion of votes our model received relative to the total votes, and we have updated the wording in the paper to reflect this more accurately.
Q8: On line 135 you mention "where each a \in R^{D} has D sampled audio". I think you mean that each a \in R^{D} is a D-dimensional feature vector. I am not sure what "D sampled audio" means.
A8: Thank you for pointing this out. To clarify, in this context, D refers to the sampling rate of the audio. In our experiments, the audio is sampled at 16 kHz, meaning 16,000 audio samples are collected per second. Each sample corresponds to the amplitude of the audio signal at a specific point in time. Subsequently, we flatten the audio along the time dimension into a matrix of shape (time×16000,1), from which we extract both content and emotional features.
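For concreteness, a minimal sketch of the reshaping described above; the array names and the 3-second clip length are illustrative assumptions, not the authors' exact pipeline:

```python
import numpy as np

sr = 16_000                              # 16 kHz: 16,000 audio samples per second
duration_s = 3                           # e.g., a 3-second clip
audio = np.random.randn(duration_s, sr)  # placeholder waveform arranged as (time, 16000)
audio_flat = audio.reshape(-1, 1)        # flatten along time -> (time * 16000, 1)
print(audio_flat.shape)                  # (48000, 1); content and emotion features are extracted from this
```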
Q9: In Section 3.1 it was not clear that for T frames of audio, why there are 2T frames of emotion-related and 2T frames of content-related features. Maybe explain this when these are introduced.
A9: Thank you for pointing this out. To clarify, the feature lengths mentioned here are measured in frame counts. For instance, if a video has T frames at a frame rate of 25 fps, then both the emotion-related and content-related features will have lengths of 2T. It is important to note that the T referred to here (the video frame count) differs from the T in Q9, which refers to the audio duration in time. This distinction was not adequately explained in the original text, and we appreciate your feedback in pointing out this potential confusion. We have revised the relevant part in the paper.
Q10: Suggestions Change references to “speaker-specific” to “speaker-aware”. & In Section 3.1 specify the speech features are latent features. & The equations should flow as part of the sentences. & replace low-latitude with low-dimensional. & avoid using “the voiceless phase”.
A10: Thank you for pointing these out; we have fixed the errors in the revised version.
Thank you authors for addressing my questions. I will maintain my score.
Dear Reviewer puML,
Thanks for your acknowledgment of our rebuttal!
Again, we'd like to thank you for your expertise and engaged discussion that improved the soundness and completeness of the paper.
The paper proposes EcoFace, a framework for generating 3D talking faces using speech signals. The framework first constructs an audiovisual emotion space that is independent of the speech content. Using a Variational Autoencoder, the framework then generates FLAME parameters (a low-dimensional representation of facial movements). The model’s encoder is conditioned on both speech and emotion features to create a latent representation that captures expressive nuances. Finally, a decoder that is conditioned on the speaker is used for the facial animation generation. Experimental results show better performance in both emotional expressiveness and lip-sync accuracy when compared with state-of-the-art methods.
Strengths
- Novel approach for emotion disentanglement using speech and visual information.
- Comprehensive evaluation through quantitative/qualitative metrics, user studies, and ablation experiments.
Weaknesses
- Some methodological choices are unclear, and additional evaluation details are needed (see questions).
- Speaker-specific modeling. Generation can only be performed for speakers the model has seen during training.
Questions
- You mention in the introduction that your model can discriminate the signal of different types and intensities of emotion features. How does your model interpret varying intensities of emotion and how does the contrastive-triplet loss contribute to this?
- Sec. 4.1: To evaluate the methodology on the test set for unseen subjects you condition on all training identities. Why use all identities rather than a subset?
- Although you retrained FaceFormer and CodeTalker on RAVDESS and HDTF, you did not retrain EMOTE, which seems to provide better results than your approach on MEAD, the dataset it was trained on. For a fair comparison, could you perform this experiment by retraining EMOTE on RAVDESS too?
- Table 3: Can you show the results on each dataset separately and for each emotion? Since your model was trained on RAVDESS it may contain some bias towards that dataset when compared with EMOTE.
- Sec. 4.3, "Effect of emotion embeddings": Have you tried to extract content features from emotional speech signals? How would the model perform in this more challenging experiment?
Q4: Although you retrained FaceFormer and CodeTalker on RAVDESS and HDTF, you did not retrain EMOTE, which seems to provide better results than your approach on MEAD, the dataset it was trained on. For a fair comparison, could you perform this experiment by retraining EMOTE on RAVDESS too?
A4: Thank you for highlighting this important point. We appreciate the opportunity to clarify our experimental choices and to address your suggestion. Initially, we chose not to retrain EMOTE because our aim was to evaluate the generalization capability of our model, which was trained on the RAVDESS dataset, to a new dataset like MEAD. Both RAVDESS and MEAD are emotion-rich datasets featuring audio-visual recordings of actors expressing different emotions. By not retraining EMOTE, we intended to measure how well our model could adapt to a dataset it had not been trained on, compared to EMOTE’s performance on the dataset it was specifically trained on (MEAD).
After receiving your valuable feedback, we revisited this aspect to better understand why our model underperforms on MEAD compared to EMOTE. We observed that, due to the influence of speaker styles, some test samples in the MEAD dataset exhibit relatively subdued emotional expressions in speech. This weakens the performance of our model, which relies on extracting emotional features from speech. In contrast, EMOTE directly specifies emotions through labels for generation, making it immune to such effects.
In response to your suggestion, we have retrained EMOTE on both the RAVDESS and HDTF datasets. The results of this experiment are presented in the table below.
| | RAVDESS | | | | HDTF | | | VOCASET | | | MEAD | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Methods | VE-FID↓ | LVE↓ | LSE-D↓ | LSE-C↑ | LVE↓ | LSE-D↓ | LSE-C↑ | LVE↓ | LSE-D↓ | LSE-C↑ | VE-FID↓ | LVE↓ | LSE-D↓ | LSE-C↑ |
| Emote (retrained) | 27.01 | 3.65 | 10.152 | 0.914 | 3.61 | 10.967 | 0.801 | 4.01 | 10.911 | 0.688 | 28.96 | 6.02 | 10.010 | 0.703 |
| Ours | 21.57 | 2.19 | 9.616 | 1.010 | 2.61 | 10.253 | 0.823 | 3.86 | 10.757 | 0.743 | 32.44 | 5.21 | 9.113 | 0.709 |
We acknowledge that EMOTE’s use of explicit labels provides a more direct mapping for emotion-driven generation. However, our approach focuses on learning nuanced emotional details from speech sequences, which enhances the granularity of emotional dynamics. These strengths are reflected in the superior performance of our model on RAVDESS.
Q5: Table 3: Can you show the results on each dataset separately and for each emotion? Since your model was trained on RAVDESS it may contain some bias towards that dataset when compared with EMOTE.
A5: Thank you for pointing this out. We have now provided the results for each dataset and emotion separately.
Here are the results of the User study on the Vocaset:
| Methods | Full-face | Lip-sync | Emotion expression |
|---|---|---|---|
| Faceformer | 16.24% | 14.57% | 6.16% |
| CodeTalker | 14.03% | 12.97% | 5.02% |
| Emote | 8.47% | 14.28% | 20.69% |
| EmoTalk | 15.31% | 14.14% | 11.56% |
| UniTalker | 17.98% | 21.86% | 13.49% |
| Ours | 27.97% | 22.18% | 43.08% |
Here are the results of the User study on the RAVDESS:
| Methods | Full-face | Lip-sync | Emotion expression |
|---|---|---|---|
| Faceformer | 9.52% | 12.53% | 7.36% |
| CodeTalker | 9.09% | 10.85% | 5.48% |
| Emote | 11.21% | 12.79% | 21.03% |
| EmoTalk | 15.92% | 12.01% | 12.27% |
| UniTalker | 21.52% | 18.09% | 13.74% |
| Ours | 32.74% | 33.73% | 40.12% |
Here are the results of the User study for each emotion on the RAVDESS:
| Methods | Happy | Surprised | Sad | Angry | Fearful | Disgust |
|---|---|---|---|---|---|---|
| Faceformer | 7.31% | 6.22% | 9.09% | 8.08% | 9.00% | 4.21% |
| CodeTalker | 5.85% | 6.69% | 5.45% | 4.05% | 5.00% | 5.78% |
| Emote | 21.95% | 20.57% | 20.91% | 22.22% | 19.50% | 21.05% |
| EmoTalk | 10.24% | 11.96% | 13.18% | 11.11% | 12.50% | 14.73% |
| UniTalker | 15.12% | 12.91% | 12.72% | 11.11% | 14.50% | 16.31% |
| Ours | 39.53% | 41.65% | 38.65% | 43.43% | 39.50% | 37.92% |
To further address the potential bias, we conducted the same user study comparing the retrained EMOTE model with our original model on the RAVDESS dataset. The results of this additional user study are presented below.
| Methods | Full-face | Lip-sync | Emotion expression |
|---|---|---|---|
| Emote (retrained) | 37.67% | 42.15% | 32.88% |
| Ours | 62.33% | 57.85% | 67.12% |
As shown in the table, our model outperforms both the original and the retrained EMOTE. This highlights the generalizability and effectiveness of our approach, despite differences in training datasets.
Q6: Sec. 4.3, "Effect of emotion embeddings": Have you tried to extract content features from emotional speech signals? How would the model perform in this more challenging experiment?
A6: Thank you for raising this important question. In our original setup, we chose to extract the content features from neutral speech because HuBERT produces highly similar content features for speech signals with identical content but varying emotions.
To evaluate the model's performance under this more challenging scenario, we conducted an additional experiment in which the content features were extracted from emotional speech signals rather than neutral speech. Specifically, we selected one speech sample from each of the seven emotional categories (excluding neutral) to extract the speech content. For the emotion features, we used speech and video samples corresponding to the remaining emotions and combined these with the extracted content as input to generate results. The generated outputs were then evaluated using the LSE-C, LSE-D, and VE-FID metrics. The results are presented in the table below.
| Emotional features extracted from | LSE-C↑ | LSE-D↓ | VE-FID↓ |
|---|---|---|---|
| Audio | 1.075 | 9.899 | 26.54 |
| Video | 1.101 | 9.983 | 28.13 |
| GT | 1.103 | 9.977 | / |
As shown, even when extracting content from emotional speech signals and subsequently performing emotion swapping for generation, our model maintained robust performance across the evaluation metrics. This demonstrates the resilience and adaptability of our approach in handling emotionally infused inputs.
Furthermore, we have updated the qualitative results for this experiment in the appendix of the revised paper and on our webpage. We hope these supplementary experiments address your concerns and enrich your understanding of our model's capabilities.
We are grateful for your positive review and valuable comments, and we hope our response fully resolves your concerns.
Q1: Speaker-specific modeling. Generation can only be performed for speakers the model has seen during training.
A1: Thank you for pointing out this insightful weakness. We greatly appreciate your suggestion and agree that the ability to generalize to unseen speakers is an important area for improvement.
In our current work, we focus on emotionally expressive talking face generation, where a key challenge lies in the fact that emotional expressions vary significantly across individuals. For instance, during expressions of happiness, the degree of zygomatic muscle activation (e.g., smiling amplitude) can differ between individuals, leading to varied smile dynamics. The speaker-specific nuances are critical for generating realistic and emotionally expressive talking faces. As such, our current design emphasizes speaker-specific modeling, allowing the system to capture and reproduce these fine-grained variations. While this approach achieves strong performance for known speakers, we acknowledge that it limits the generalization to unseen subjects.
To address this concern, we explored a generalization strategy proposed in the UniTalker[1]. Specifically, we introduced a pivot identity into our model, which acts as a generalized speaker label. During training, we replace the ground truth identity label with this pivot identity label with a probability of 10%. This augmentation encourages the model to learn a generalized representation that is not strictly tied to individual speaker characteristics. We retrained our model and performed quantitative analyses on the test set with pivot identity as a condition, and the results are tabulated below.
| | RAVDESS | | | | HDTF | | | VOCASET | | | MEAD | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Methods | VE-FID↓ | LVE↓ | LSE-D↓ | LSE-C↑ | LVE↓ | LSE-D↓ | LSE-C↑ | LVE↓ | LSE-D↓ | LSE-C↑ | VE-FID↓ | LVE↓ | LSE-D↓ | LSE-C↑ |
| Ours | 21.57 | 2.19 | 9.616 | 1.010 | 2.61 | 10.253 | 0.823 | 3.86 | 10.757 | 0.743 | 32.44 | 5.21 | 9.113 | 0.709 |
| Ours (w pivot identity) | 21.53 | 2.11 | 9.601 | 1.123 | 2.48 | 10.167 | 0.833 | 3.65 | 10.698 | 0.782 | 33.78 | 5.73 | 9.132 | 0.721 |
We observed that similar results were achieved using the pivot identity. Notably, for unseen subjects, we can use the pivot identity as a condition to quickly fine-tune our EMG decoder, adapting it to the speaking style of new individuals. This approach helps improve the model's generalization capability.
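For clarity, here is a minimal sketch of the label-replacement augmentation described above; `PIVOT_ID`, the helper function, and where the 10% replacement is applied are illustrative assumptions rather than the authors' exact implementation:

```python
import random

PIVOT_ID = 0  # hypothetical index reserved for the generalized pivot identity

def sample_style_label(true_identity: int, p_pivot: float = 0.10) -> int:
    """With 10% probability, replace the ground-truth identity label with the
    pivot identity so the model also learns a speaker-agnostic style condition."""
    return PIVOT_ID if random.random() < p_pivot else true_identity

# At inference for an unseen speaker, condition on PIVOT_ID directly, or use it
# as the starting point to quickly fine-tune the EMG decoder as noted above.
```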
We appreciate your feedback, which motivated us to explore this improvement. We believe this addition strengthens the contribution of our work and opens up exciting future directions, such as further balancing generalization and speaker-specific detail preservation.
[1]Fan X, Li J, Lin Z, et al. UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model[J]. arXiv preprint arXiv:2408.00762, 2024.
Q2: You mention in the introduction that your model can discriminate the signal of different types and intensities of emotion features. How does your model interpret varying intensities of emotion and how does the contrastive-triplet loss contribute to this?
A2: In our model, the contrastive-triplet loss is a combination of two distinct components: the contrastive loss and the triplet loss, which are defined in equations (4) and (7) of the paper, respectively. Specifically, the contrastive loss minimizes the distance between latent features of the same emotion while maximizing the distance between different emotions based on the ground-truth emotion labels. This encourages the model to learn well-separated emotion-specific latent spaces. The triplet loss, on the other hand, operates in a single emotion space with the ground-truth intensity labels (which are classified into two categories: 0 for lower intensity and 1 for higher intensity). For each cluster of features corresponding to the same emotion, the triplet loss seeks to enforce a finer distinction in intensity by comparing features of different intensities. In detail, for each emotion feature in the batch, we treat it as an anchor and then search for positive examples (i.e., features from the same emotion with the same intensity) and negative examples (i.e., features from the same emotion but with a different intensity). The triplet loss pushes the anchor away from the negative example while ensuring that the anchor-positive distance remains smaller than the anchor-negative distance by at least a specified margin. This process facilitates the differentiation of emotional intensities by pushing features of different intensities apart while keeping features of the same intensity closer together.
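For reference, the standard triplet objective this paragraph describes can be written as follows (the precise notation is given in Equation (7) of the paper; here d(·,·) stands for whatever distance that equation uses and α for its margin):

$$\mathcal{L}_{\mathrm{tri}} = \max\big(d(z_a, z_p) - d(z_a, z_n) + \alpha,\ 0\big),$$

where z_a, z_p, and z_n denote the anchor, a positive sample (same emotion, same intensity), and a negative sample (same emotion, different intensity), respectively.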
Our motivation for combining the contrastive and triplet losses to model emotional intensity is informed by the well-established Valence-Arousal (VA) model, a widely used framework in emotion research. The VA model suggests that emotions can be represented as combinations of valence and arousal, with arousal being particularly key in distinguishing between different emotional intensities. This implies that intensity assessment is inherently tied to the emotional type assessment (via arousal), where emotions of the same type share a similar valence but can differ subtly in arousal. Therefore, our approach first allows the model to learn separate latent spaces for each emotion, effectively distinguishing the arousal and valence ranges for different emotions via the contrastive loss. Afterward, the triplet loss works within each emotion's latent space to further differentiate emotions based on their intensity by focusing on the subtle differences in arousal. This two-step process ensures that our model can effectively discriminate between both emotional types and their varying intensities.
We have included visualizations and a more detailed analysis of the intensity differentiation in the Appendix B of the paper. We hope this explanation clarifies the contribution of the contrastive-triplet loss in modeling emotional intensity, and we appreciate your further review of the updated manuscript.
Q3: Sec. 4.1: To evaluate the methodology on the test set for unseen subjects you condition on all training identities. Why use all identities rather than a subset?
A3: We suspect that there might have been a minor misinterpretation regarding our approach to conditioning on all training identities for unseen characters in the test set. Allow us to clarify the methodology and the motivation behind this choice.
- Explanation of Conditioning on All Training Identities. In our experiments, the training datasets (RAVDESS and HDTF) include 24 and 131 identities, respectively. We selected 90% of the identities as training identities (140) and 10% as test identities (unseen). We then randomly selected 10% of the training identities' data as the validation set and used the rest for training. During the quantitative evaluation, for each speech input, we iteratively conditioned the model on all 140 training identities as the style condition, generating 140 results for each input. The quantitative metrics were calculated for each generated result, and the final score was obtained by averaging across all these results (a sketch of this evaluation loop is given after the references below).
- Adherence to Common Practice. This evaluation strategy follows conventions used in prior works such as FaceFormer[1] and CodeTalker[2], which also condition on all training identities to comprehensively assess model performance. This ensures consistency with the evaluation methodologies of state-of-the-art studies.
- Motivation for Conditioning on All Identities. Using all training identities for evaluation provides a more holistic perspective on the model’s generalization and stability. While using a subset, or even a single identity, could also reflect model performance (and can even yield better quantitative results, such as the pivot-identity results in A1), it may not adequately capture potential overfitting to specific styles. Overfitting could result in issues like misaligned lip synchronization, distorted facial expressions, or inappropriate emotional displays. Furthermore, although speaking styles may vary, the fundamental emotional expressions (e.g., happiness, sadness) are generally consistent across individuals apart from minor stylistic nuances. Similarly, lip synchronization should remain unaffected by the choice of identity. By conditioning on all identities, we can observe whether the model’s performance is robust across diverse styles and does not exhibit identity-specific artifacts or biases.
[1]Fan Y, Lin Z, Saito J, et al. Faceformer: Speech-driven 3d facial animation with transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 18770-18780.
[2]Xing J, Xia M, Zhang Y, et al. Codetalker: Speech-driven 3d facial animation with discrete motion prior[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 12780-12790.
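A minimal sketch of the evaluation loop described in the first bullet above; `model`, `compute_metrics`, and the data structures are hypothetical placeholders, not the released code:

```python
import numpy as np

def evaluate_unseen_subjects(model, test_set, train_identities, compute_metrics):
    """Condition every test clip on all training identities and average the metrics."""
    scores = []
    for speech, gt_motion in test_set:                 # clips from unseen speakers
        for identity in train_identities:              # all 140 training identities as style condition
            pred_motion = model.generate(speech, style_condition=identity)
            scores.append(compute_metrics(pred_motion, gt_motion, speech))  # e.g., LVE, LSE-D, LSE-C
    return np.mean(scores, axis=0)                     # final score: average over all generated results
```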
Dear Reviewer,
Thank you very much for updating your rating based on the additional results and revisions we provided! We greatly appreciate your recognition of the improvements made to the paper, and your feedback has been invaluable in enhancing the quality of our work. Once again, thank you for your time and thoughtful suggestions!
I would like to thank the authors for thoroughly addressing my concerns and providing the requested additional results. Based on this I have decided to update my rating from 5 to 6.
This paper addresses the challenge of separating emotional information from speech content in facial animation, which often leads to feature confusion and emotion weakening. This work, EcoFace, introduces audio-visual co-disentanglement and a speaker-specific motion generator to address these issues. By using an audio-visual loss for emotion consistency and a contrastive-triplet loss for distinct emotional space, EcoFace creates a low-dimensional motion distribution that captures speaker-specific styles for personalized animation.
Strengths
- The paper proposes a novel framework to disentangle emotion and speech content in the face synthesis field.
- This work is well-written and well-organized.
- The results show the effectiveness of the proposed framework.
Weaknesses
Major Concerns:
- LSE-C Range Consistency. There appears to be a discrepancy in the LSE-C values reported in this manuscript compared to those commonly cited in the community. For instance, DINet [https://arxiv.org/pdf/2303.03988] lists an HDTF LSE-C value (as GT) of 8.9931, while the value reported in this work is 0.824. Clarification regarding this disparity is needed, as it may suggest a difference in evaluation methodology or algorithm implementation.
- Definition Clarity for L_emo and L_sync in Ablation Study. The explanation of L_emo and L_sync in Table 4 is missing, which makes it harder to interpret the results and understand the contributions of each component.
- Validity and Interpretation of VE-FID Metric. Introducing VE-FID as a new metric to measure emotion expression is a great try. However, some results in Table 1 and Table 4 raise questions. For example, the comparison between EcoFace (21.57) and EmoTalk (51.98) shows a large difference, even when EcoFace lacks L_emo (30.88). Does this suggest that EcoFace may outperform EmoTalk on emotion-related metrics even without the emotion-related modules or guidance?
- Lack of Results for Speaker-Specific Generation. One of the key contributions is implementing speaker-specific generation; however, the absence of results supporting this feature weakens its impact. Providing empirical evidence or case studies to demonstrate speaker-specific generation’s effectiveness would strengthen the claim.
Minor Issues:
- On Page 3, Lines 160-161, please use “Wav2Vec 2.0” instead of “Wav2Vec2”.
- In Figure 1(a) on Page 4, the method name is currently missing.
- On Page 10, in Figure 4(a) and Figure 4(c), Line 518, the label/term should be corrected from “EcoTalk” to “EcoFace.”
- Demographic Information in User Study
Questions
- What is the demographic information in the user study?
We are grateful for your positive review and valuable comments, and we hope our response fully resolves your concerns.
Q1: LSE-C Range Consistency There appears to be a discrepancy in the LSE-C values reported in this manuscript compared to those commonly cited in the community. For instance, DINet [https://arxiv.org/pdf/2303.03988] lists an HDTF LSE-C value (as GT) of 8.9931, while the value reported in this work is 0.824. Clarification regarding this disparity is needed, as it may suggest a difference in evaluation methodology or algorithm implementation.
A1: Thank you for pointing out the discrepancy in the LSE-C values. As in several recent works such as FaceTalk [1] and DEEPTalk [2], the LSE-C and LSE-D metrics have been increasingly used to measure lip synchronization accuracy on 3D mesh representations. Following these studies, we adopted the same metrics to evaluate our model's performance. However, as mentioned in these papers, the images rendered from a 3D mesh show a grey head model, which is fundamentally different from real human head images. This difference leads to variations in LSE-C values when compared to results evaluated on real human heads, such as those reported in DINet. Despite these discrepancies, the effectiveness of the LSE-C metric in evaluating lip synchronization remains consistent across these approaches.
[1]Aneja S, Thies J, Dai A, et al. Facetalk: Audio-driven motion diffusion for neural parametric head models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 21263-21273.
[2]Kim J, Cho J, Park J, et al. DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation[J]. arXiv preprint arXiv:2408.06010, 2024.
Q2: Definition Clarity for L_emo and L_sync in Ablation Study. The explanation of L_emo and L_sync in Table 4 is missing, which decreases in interpreting the results and understanding the contributions of each component.
A2: Thank you for raising this important issue. L_emo is the last term of Equation 11 and L_sync is the second-last term of Equation 11. I have now revised the manuscript to clearly specify each component.
Q3: Validity and Interpretation of VE-FID Metric Introducing VE-FID as a new metric to measure emotion expression is a great try. However, some results in Table 1 and Table 4 raise questions. For example, the comparison between EcoFace (21.57) and EmoTalk (51.98) shows a large difference, even when EcoFace lacks L_emo (30.88). Does this suggest that EcoFace may outperform EmoTalk on emotion-related metrics even without the emotion-related modules or guidance?
A3: Thank you for your insightful comment regarding the VE-FID metric. Although we did not use L_emo, our emotion encoder (EDE) already disentangles emotion features into an independent emotion space. This allows the EMG module to still receive clear emotion-related signals, albeit without the explicit emotional guidance from L_emo. Thus, while the absence of L_emo leads to a slight reduction in emotional generation quality, the well-disentangled emotion features in our model still enable it to perform relatively better in terms of emotional expressiveness compared to EmoTalk in this specific metric. This suggests that the disentanglement of emotional features in our model contributes significantly to its improved performance even without explicit emotional guidance (L_emo).
Q4: Lack of Results for Speaker-Specific Generation One of key contributions is implementing speaker-specific generation; however, the absence of results supporting this feature weakens its impact. Providing empirical evidence or case studies to demonstrate speaker-specific generation’s effectiveness would strengthen the claim.
A4: Thank you for raising this important point. The motivation behind designing speaker-specific generation stems from our observation that different speakers exhibit distinct speech styles and emotional expression details. If the model is trained to fit data from all speakers without considering these differences, it can lead to issues such as lip-syncing inconsistencies (e.g., pursed lips, twisted lips) and a reduction in emotional expressiveness.
To demonstrate the impact of this design choice, we first trained our model without considering speaker-specific conditions and evaluated it on the test sets of the RAVDESS and HDTF datasets. The results, as shown below, indicate a slight weakening in emotional performance, but more notably, there is a significant reduction in lip-sync accuracy. We have also included visual examples of such cases in the Appendix D, where abnormal scenarios are highlighted.
| | RAVDESS | | | | HDTF | | |
|---|---|---|---|---|---|---|---|
| Methods | VE-FID↓ | LVE↓ | LSE-D↓ | LSE-C↑ | LVE↓ | LSE-D↓ | LSE-C↑ |
| Ours (w/o speaker specific) | 25.68 | 3.53 | 10.552 | 0.801 | 3.47 | 11.891 | 0.711 |
| Ours | 21.57 | 2.19 | 9.616 | 1.010 | 2.61 | 10.253 | 0.823 |
Q5: What is the demographic information in the user study?
A5: The user study involved 20 participants, selected to ensure diversity and the rigor of the evaluation. All participants were either native English speakers or possessed advanced English proficiency, enabling accurate comprehension of speech content and emotional expression. The gender distribution was balanced, with 55% male and 45% female participants, and ages ranged from 22 to 45 years. Additionally, 30% of the participants had professional backgrounds related to computer vision. The remaining 70% were general users from various industries such as education and IT services, ensuring that the study results reflected both expert insights and the subjective experiences of everyday users.
Q6: Use “Wav2Vec 2.0” instead of “Wav2Vec2”. & In Figure 1(a) on Page 4, the method name is currently missing. & On Page 10, in Figure 4(a) and Figure 4(c), Line 518, the label/term should be corrected from “EcoTalk” to “EcoFace.”
A6: Thanks for your careful and accurate reading of our paper. We have fixed the errors in the revised version.
Dear Reviewer kKfU,
Thank you again for your time and effort in reviewing our work! We would appreciate it if you can let us know if our response has addressed your concern. As the end of the rebuttal phase is approaching, we look forward to hearing from you and remain at your disposal for any further clarification that you might require.
Thanks in advance,
Paper 10320 authors
EcoFace addresses issues in 3D facial animation: feature confusion (lack of clear signals for content and emotion), emotion weakening (difficulty in controlling emotion intensity), and the mean-face problem (limited control over individual speaker’s style of expression). The authors tackle these issues by introducing explicit signals for emotional motion representation and intensity control using audio-visual loss and a contrastive triplet loss to distinguish emotion intensities. They also generate speaker-specific, stylized facial animations with a lip-sync discriminator.
Strengths
The work presents the following strengths:
- EcoFace introduces an audio-visual emotion disentanglement mechanism to supervise the discrepancy between emotional information captured by facial motion and the audio stream, effectively capturing information from both audio and video.
- EcoFace controls within-emotion intensity by using an emotional triplet loss.
Weaknesses
- The GT videos on the webpage show severe issues, such as unnatural mouth closure and a jerky appearance, which don’t match the original implementation. Since EcoFace is trained on meshes obtained using EMOCA, the authors should clarify how their model produces improved animations compared to the GT animations on which it was trained, as shown in their demo videos.
- The EMOTE comparison videos on the webpage don’t match the exemplar videos from other methods, such as FaceTalk, 3DiFACE, and EMOTE. Could the authors explain the observed differences?
- Triplet loss in eq 7 operates within a single emotion space. Since it focuses on a single emotion, how does EcoFace ensure that emotion intensities for high arousal, high valence (e.g., surprise) are disentangled from those of high arousal, low valence (e.g., anger or fear)?
- Since EcoFace claims to generate speaker-specific animations, it would help to provide more details on the number of identities in the training set and any animation results for unseen subjects for better evaluation.
- Training details are unclear. With a batch size of 30, how are typical forward passes structured to use 2N pairs for computing contrastive loss in Eq. 4, and how is triplet loss computed per emotion in Eq. 7?
Questions
- L219: What kinds of augmentations were applied to the video frames and audio samples?
- L225: The latent representations are averaged over the sequence. How is it ensured that continuous emotional information along the sequence is not lost? Could the authors provide more details on the contrastive loss in Equation 4, and clarify if other methods besides averaging were considered, such as mapping to a lower dimension with a learnable layer?
- fig 1a: The method name is missing in the caption.
- fig 1b: The triangle and circle shapes representing features lack labels indicating what they refer to.
- fig 4 and Ablation Studies (pages 9,10): "Ecotalk" is used instead of "EcoFace."
- What was the rationale for training the sync expert independently? How are the FLAME region landmarks obtained—are they input as rendered crops or as 3D vertices?
Q2: The EMOTE comparison videos on the webpage don’t match the exemplar videos from other methods, such as FaceTalk, 3DiFACE, and EMOTE. Could the authors explain the observed differences?
A2: Thank you for bringing this issue to our attention. We acknowledge that the EMOTE comparison videos on the webpage did not perfectly align with the exemplar videos from other methods, such as FaceTalk, 3DiFACE, and EMOTE. This was a challenge noted by other users as well. We initially attempted several solutions discussed here for the EMOTE model but were unable to resolve the discrepancies. However, during the rebuttal period, the authors of the EMOTE paper shared a new approach to address this issue. After implementing their suggested solution, we were able to successfully align the comparison results. We have since re-conducted the quantitative experiments, and the updated results are provided below.
| | RAVDESS | | | | HDTF | | | VOCASET | | | MEAD | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Methods | VE-FID↓ | LVE↓ | LSE-D↓ | LSE-C↑ | LVE↓ | LSE-D↓ | LSE-C↑ | LVE↓ | LSE-D↓ | LSE-C↑ | VE-FID↓ | LVE↓ | LSE-D↓ | LSE-C↑ |
| Emote (origin) | 34.01 | 3.23 | 10.452 | 0.884 | 4.07 | 11.407 | 0.736 | 3.92 | 11.174 | 0.651 | 19.96 | 4.91 | 10.049 | 0.692 |
| Emote (update) | 31.79 | 3.22 | 10.133 | 0.911 | 3.90 | 10.953 | 0.803 | 3.95 | 11.010 | 0.702 | 18.19 | 4.82 | 9.913 | 0.701 |
| Ours | 21.57 | 2.19 | 9.616 | 1.010 | 2.61 | 10.253 | 0.823 | 3.86 | 10.757 | 0.743 | 32.44 | 5.21 | 9.113 | 0.709 |
The updated metrics show that while EMOTE has improved on various benchmarks, our model still outperforms EMOTE. The webpage and the corresponding sections in the paper have been updated to reflect these revised results.
Q3: Triplet loss in eq 7 operates within a single emotion space. Since it focuses on a single emotion, how does EcoFace ensure that emotion intensities for high arousal, high valence (e.g., surprise) are disentangled from those of high arousal, low valence (e.g., anger or fear)?
A3: Thank you for your insightful questions. Although arousal primarily reflects emotional intensity, it also intrinsically encodes features related to emotional type. This means that there is some correlation between emotion type and emotion intensity on the arousal dimension. For example, emotions such as surprise and anger may exhibit similar levels of arousal, but their differences in emotion type on the valence dimension place them in different quadrants of the VA model.
Building on this idea, we employ contrastive loss to construct independent feature spaces for each emotion. This design ensures that even when emotions like surprise and anger have similar arousal levels, their arousal features belong to distinct distributional spaces, as illustrated in Figure 3(b) of the paper. By effectively disentangling arousal representations based on emotion type, we achieve differentiation in emotion intensity across emotion categories.
Once these independent emotion-specific spaces are established, we further refine intensity distinctions within the same emotion type using triplet loss. This step allows us to model subtle variations in arousal intensity, enabling precise differentiation of emotion intensity levels.
To further support this explanation, we have provided visualizations and detailed analyses of intensity differentiation in the Appendix B.1. We hope this clarifies the rationale behind our approach and demonstrates the effectiveness of combining contrastive and triplet loss for disentangling emotion intensities.
Q4: Since EcoFace claims to generate speaker-specific animations, it would help to provide more details on the number of identities in the training set and any animation results for unseen subjects for better evaluation.
A4: In our experiments, the training datasets (RAVDESS and HDTF) include 24 and 131 identities, respectively. We used 90% of the identities as training data (140 identities) and 10% as test data (unseen identities). Additionally, we randomly selected 10% of the training identities for the validation set, with the remaining data used for training.
We focus on speaker-specific generation because related studies (e.g., FaceFormer[1], CodeTalker[2]) have found that different speakers exhibit different speaking styles, and that allowing the model to fit data from all speakers indiscriminately can lead to issues such as unnatural mouth movements. Therefore, we adopt a speaker-specific approach, as suggested in these articles, to address the aforementioned issues and mitigate the potential risk of emotional averaging. We have provided both quantitative and qualitative analyses in Appendix D to further justify the necessity of a speaker-specific approach.
However, since our model is designed to be speaker-specific, it currently cannot generate speech animations that accurately capture the speaking style of unseen speakers, which is a limitation. Nonetheless, inspired by UniTalker[3], we could incorporate pivot-identity conditioning into the training process. By doing so, for an unseen identity, we could use this pivot identity as the condition to quickly fine-tune the decoder in our EMG module, enabling it to adapt to the speaking style of the new speaker.
[1]Fan Y, Lin Z, Saito J, et al. Faceformer: Speech-driven 3d facial animation with transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 18770-18780.
[2]Xing J, Xia M, Zhang Y, et al. Codetalker: Speech-driven 3d facial animation with discrete motion prior[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 12780-12790.
[3]Fan X, Li J, Lin Z, et al. UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model[J]. arXiv preprint arXiv:2408.00762, 2024.
Thank you for your review and constructive feedback. We hope our revisions fully address your concerns.
Q1: The GT videos on the webpage show severe issues, such as unnatural mouth closure and a jerky appearance, which don’t match the original implementation. Since EcoFace is trained on meshes obtained using EMOCA, the authors should clarify how their model produces improved animations compared to the GT animations on which it was trained, as shown in their demo videos.
A1: We appreciate the reviewer’s observation. It is indeed true that the mesh estimations generated by EMOCA occasionally exhibit distortions, especially under exaggerated expressions, such as a surprise emotion with intensity level 2. However, in our demo videos, it is evident that models like FaceFormer and CodeTalker, which were retrained on the RAVDESS dataset using the same EMOCA-estimated meshes, also produce smoother results.
Upon further investigation, we identified certain cases in the RAVDESS dataset where estimation anomalies were more prevalent, particularly for Actor 12. When conditioning on this actor, mesh-based autoregressive methods like FaceFormer and CodeTalker occasionally generated distorted outputs due to the influence of these anomalies during training. We provide a video here that illustrates a comparative analysis of GT, FaceFormer, and our generated results for Actor 12 in the validation set. It can be observed that GT meshes for this actor already exhibit noticeable distortions, and FaceFormer’s autoregressive approach is affected by these anomalies, resulting in warped or misaligned outputs. In contrast, our model effectively mitigates such issues. This is primarily because we adopt a latent motion distribution space with flow-based priors for training, rather than relying on a direct autoregressive method like FaceFormer.
Although EMOCA occasionally introduces anomalous distortions, most of its mesh estimations are accurate, forming a distribution with a high density around well-estimated cases and a lower density for extreme outliers. Our approach leverages this property by focusing on fitting the overall motion distribution space. By incorporating flow priors, our model smooths the distribution sampling at each time step. Consequently, during inference, the sampled motion trajectories tend to align closely with the high-probability regions of the distribution, effectively avoiding anomalies like unnatural mouth closures or jerky facial movements.
We hope this explanation clarifies how our model achieves improved animation quality compared to the GT meshes it was trained on.
Q5: Training details are unclear. With a batch size of 30, how are typical forward passes structured to use 2N pairs for computing contrastive loss in Eq. 4, and how is triplet loss computed per emotion in Eq. 7?
A5: Our typical forward pass is structured as follows:
Contrastive Loss Calculation:
1. Feature Extraction and Averaging. In the first step, we extract emotion features from the input (video/audio) along the time dimension. This results in a tensor of shape (batch_size, time_steps, feature_dim), where feature_dim = 1024. Since the emotion features across time steps for the same video/audio are expected to belong to the same emotion space, we take the average along the time dimension to obtain a fixed-size feature vector for each sample. This results in a tensor of shape (batch_size, feature_dim) where each sample represents the averaged emotion feature over time.
2. Constructing 2N Pairs for Contrastive Loss. The Supervised Contrastive Loss involves comparing each sample with every other sample in the batch to form both positive and negative pairs. The loss calculation follows these key steps:
- Masking Positive Pairs: We first create a mask to identify which samples belong to the same class, i.e., positive pairs. The mask matrix has values of 1 for pairs of samples with the same label, and 0 for pairs with different labels. This mask is of shape (batch_size, batch_size).
- Pairwise Similarity Calculation: We compute the pairwise cosine similarities between the features of all samples in the batch. The computation is done using the dot product, and the result is scaled by the temperature parameter: similarity[i, j] = (features[i] · features[j]) / temperature
- Numerical Stability: To ensure numerical stability, we subtract the maximum value in each row of the similarity matrix: similarity[i, j] = similarity[i, j] - max(similarity[i, :])
- Masking Self-Comparisons: To avoid self-comparisons (i.e., comparing a sample to itself), the diagonal of the similarity matrix is set to zero: similarity[i, i] = 0 for all i
- Log-Probability Calculation: The log-probability for each sample is computed using the softmax function applied to the similarity matrix, with the self-comparisons excluded: log_prob[i] = log(softmax(similarity[i, :])) for each sample i
- Mean Log-Probability for Positive Pairs: We then compute the mean log-probability for the positive pairs by taking the sum of log-probabilities weighted by the mask, and dividing by the number of positive pairs for each sample: mean_log_prob_pos[i] = (sum(log_prob[i] * mask[i, :])) / (sum(mask[i, :]))
3. Contrastive Loss Calculation. Finally, the contrastive loss is computed as follows:
- The temperature-scaled mean log-probabilities for the positive pairs are averaged over the batch and scaled by the temperature and base temperature: loss = - (temperature / base_temperature) * mean_log_prob_pos, loss = average(loss) across all samples. This ensures that the model pulls together the features of positive pairs (samples with the same label) and pushes apart the features of negative pairs (samples with different labels).
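As a concrete reference, here is a minimal PyTorch sketch of steps 1-3 above, following the standard supervised contrastive loss formulation; the function and variable names are illustrative, and the features are assumed to be L2-normalized:

```python
import torch

def supervised_contrastive_loss(features, labels, temperature=0.07, base_temperature=0.07):
    """features: (batch_size, feature_dim) time-averaged emotion features (assumed L2-normalized).
    labels: (batch_size,) emotion-class labels."""
    batch_size = features.shape[0]

    # Positive-pair mask: 1 where two samples share the same emotion label.
    labels = labels.view(-1, 1)
    mask = torch.eq(labels, labels.T).float()

    # Pairwise similarities scaled by the temperature.
    logits = features @ features.T / temperature

    # Numerical stability: subtract the per-row maximum.
    logits = logits - logits.max(dim=1, keepdim=True).values.detach()

    # Exclude self-comparisons from the positives and from the softmax denominator.
    logits_mask = 1.0 - torch.eye(batch_size, device=features.device)
    mask = mask * logits_mask

    # Log-probability of each pair, normalized over all other samples in the batch.
    exp_logits = torch.exp(logits) * logits_mask
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True) + 1e-12)

    # Mean log-probability over each anchor's positive pairs.
    mean_log_prob_pos = (mask * log_prob).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

    # Final loss, scaled by temperature / base_temperature as described above.
    return (-(temperature / base_temperature) * mean_log_prob_pos).mean()
```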
Triplet Loss Calculation:
4. Emotion-Based Sample Selection. For each feature in the batch calculated in step 1, we first identify the samples that belong to the same emotion category using the emotion category label (labels[i]). These samples are grouped together, excluding the current sample (anchor). Then, within this group of samples, we distinguish between the positive samples (those with the same emotion intensity) and the negative samples (those with different emotion intensity). This step ensures that the positive and negative samples are selected from the same emotion.
5. Generating Triplets. For each sample (anchor), we form triplets by selecting:
- Anchor (a): The feature vector of the current sample.
- Positive (b): A feature vector from samples with the same emotion category and intensity.
- Negative (c): A feature vector from samples with the same emotion category but different intensity.
6. Loss Calculation. For each triplet, we calculate the standard Triplet Loss: L_tri = max(d(a, b) - d(a, c) + α, 0), where d denotes the distance used in Equation (7) and α is the margin.
7. Per Emotion Calculation. The triplet loss for each sample is calculated as described above, but it is done per emotion because we specifically select positive and negative samples based on both the emotion category and emotion intensity. As a result, the loss is computed in a manner that respects the emotional structure of the data. We compute the total loss for the batch by summing the losses from all the triplets and then averaging them over the batch size.
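And a corresponding sketch of steps 4-7; for brevity it averages the anchor-positive and anchor-negative distances within the batch rather than enumerating every individual triplet, and all names and the distance choice are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def intensity_triplet_loss(features, emotion_labels, intensity_labels, margin=0.5):
    """Within each emotion class, use same-intensity samples as positives and
    different-intensity samples as negatives for every anchor in the batch."""
    losses = []
    for i in range(features.shape[0]):
        same_emotion = emotion_labels == emotion_labels[i]
        same_emotion[i] = False                                    # exclude the anchor itself
        pos = same_emotion & (intensity_labels == intensity_labels[i])
        neg = same_emotion & (intensity_labels != intensity_labels[i])
        if pos.any() and neg.any():
            anchor = features[i:i + 1]
            d_pos = torch.cdist(anchor, features[pos]).mean()      # anchor-positive distance
            d_neg = torch.cdist(anchor, features[neg]).mean()      # anchor-negative distance
            losses.append(F.relu(d_pos - d_neg + margin))          # standard triplet hinge
    return torch.stack(losses).mean() if losses else features.new_zeros(())
```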
We hope that this explanation clarifies how the Contrastive Loss and Triplet Loss are computed and how they contribute to emotion-based learning in our model.
Q6: L219: What kinds of augmentations were applied to the video frames and audio samples?
A6: The main purpose of the applied augmentations is to enhance the robustness and generalization ability of the model. For the video data, following [1], we applied a perspective transformation technique to simulate different viewpoints. This allows the model to capture emotional details from faces in non-frontal perspectives, improving its ability to handle variations in facial orientation. For the audio data, since the RAVDESS dataset was recorded in a quiet environment, we followed [2] in adding noise to the audio samples. This augmentation simulates a more realistic, "wild" audio environment, enabling the model to better generalize to real-world conditions where background noise is often present.
[1] Dong J, Wang X, Zhang L, et al. Feature re-learning with data augmentation for video relevance prediction[J]. IEEE Transactions on Knowledge and Data Engineering, 2019, 33(5): 1946-1959.
[2] Han T, Huang H, Yang Z, et al. Supervised contrastive learning for accented speech recognition[J]. arXiv preprint arXiv:2107.00921, 2021.
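For illustration only, a minimal sketch of the two kinds of augmentation mentioned in A6 above; the transform parameters, SNR value, and dummy shapes are assumptions, not the settings from [1] and [2]:

```python
import torch
import torchvision.transforms as T

perspective_aug = T.RandomPerspective(distortion_scale=0.3, p=0.5)  # simulate non-frontal viewpoints

def add_noise(waveform: torch.Tensor, snr_db: float = 20.0) -> torch.Tensor:
    """Add white Gaussian noise to a 1-D waveform at the given signal-to-noise ratio."""
    signal_power = waveform.pow(2).mean()
    noise_power = signal_power / (10 ** (snr_db / 10))
    return waveform + torch.randn_like(waveform) * noise_power.sqrt()

frame = torch.rand(3, 224, 224)        # dummy video frame (C, H, W)
audio = torch.randn(3 * 16_000)        # dummy 3-second, 16 kHz waveform
aug_frame = perspective_aug(frame)     # perspective-transformed frame
aug_audio = add_noise(audio)           # noisy audio
```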
Q7: L225: The latent representations are averaged over the sequence. How is it ensured that continuous emotional information along the sequence is not lost? Could the authors provide more details on the contrastive loss in Equation 4, and clarify if other methods besides averaging were considered, such as mapping to a lower dimension with a learnable layer?
A7: Thank you for your insightful question. We employ the audio-visual loss to ensure that emotional details along the sequence dimension are preserved; the detailed calculation of the contrastive loss is described in response A5. To clarify, the emotional input to our EMG module (which is also the output of the EDE) consists of sequence-level feature representations. These representations are averaged over the sequence dimension only when computing the contrastive and triplet losses.
The goal of the contrastive loss is to learn a disentangled emotional space at the emotion type level. While employing a learnable layer could allow for finer-grained learning across the sequence dimension, at this stage, our focus is on constructing a global emotional space. Within the same emotional category, audio/video data often exhibit minor variations in emotional expression across different time points. However, these variations typically do not significantly affect the overall emotional representation. Hence, we use averaging as a representation of the emotional feature for constructing the disentangled emotional space.
It is important to note that our EDE module is explicitly designed to capture sequence-level disentangled emotional representations. While the emotional space construction in the contrastive loss calculation does not consider the sequence dimension, potentially leading to a loss of sequence-specific emotional details, we address this issue by introducing an audio-visual loss. This loss leverages the rich emotional details present in the visual modality to enable interaction and exchange of emotional features during audio-visual joint training. By doing so, it takes full advantage of the complementarity between modalities, allowing the model to incorporate finer-grained emotional details from the visual modality into the learning of audio emotional features. This approach effectively mitigates the potential weakening or loss of sequence-level emotional details that could arise from averaging during the contrastive loss calculation, further enhancing the expressive power and completeness of emotional features.
As shown in Fig. 4(b) of the paper, when training without the audio-visual loss, we observed exactly the problem of lost emotional detail described in your comment: the generated outputs reflect only a static overall emotional expression and lack finer details such as eyebrow movements. After incorporating the audio-visual loss, our results exhibit dynamic details, such as eyebrow motion, akin to the ground truth (GT). Additionally, as demonstrated in the demos on our webpage, we encourage the reviewer to focus on the regions around the eyes and the nasal muscles, which clearly showcase our model's ability to capture subtle emotional details.
We hope this explanation addresses your concerns and provides a clear understanding of the rationale behind our design choices. Please let us know if further clarification is needed.
Q8: What was the rationale for training the sync expert independently? How are the FLAME region landmarks obtained—are they input as rendered crops or as 3D vertices?
A8: Thank you for your question. Below is the clarification for the two points you raised:
- Rationale for training the Sync Expert independently. The decision to train the lip-sync expert independently is motivated by several factors:
- Common practice in top-tier works: Approaches such as Wav2Lip[1] and GeneFace[2] also train independent lip-sync experts. This practice has proven effective in achieving high-quality lip-syncing performance.
- Focus on temporal accuracy: Unlike mesh-based losses used in some related works, we wanted to focus on assessing lip-sync accuracy in the temporal domain, rather than simply relying on vertex-based loss (such as GT_vertex - pred_vertex). Training a dedicated lip-sync expert allows for the use of a strong supervision signal, which ensures that the model learns more effectively.
- Limitations of face reconstruction loss: Pixel-based face reconstruction losses cannot accurately constrain audio-to-mouth synchronization because they are computed over the entire image, whereas the lip region occupies only a small portion (less than 4%) of it, making it difficult to focus on lip detail. Additionally, during face reconstruction training, lip movements are typically optimized only in the later stages, leaving them under-supervised early on. The LVE loss, which evaluates lip error frame by frame, also lacks temporal context and fails to capture the quality of dynamic lip movements. The independent sync expert was therefore introduced to improve the accuracy and temporal consistency of lip-syncing.
- Obtaining FLAME region landmarks. The FLAME model, which consists of 5,023 vertices, provides a 68-point facial landmark embedding that includes the lip region. We use the 68-point mask provided on the FLAME website to extract the lip region; specifically, the 20 landmarks corresponding to the mouth give 60 coordinate values (20 × 3 = 60), which are used as input to our model for accurate lip synchronization (a minimal extraction sketch is given below).
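A minimal sketch of this extraction is shown below, assuming the 68 landmarks follow the standard iBUG 68-point ordering in which indices 48–67 correspond to the mouth region.

```python
import numpy as np

MOUTH_IDX = np.arange(48, 68)           # 20 mouth landmarks in the 68-point convention

def extract_lip_features(landmarks_3d: np.ndarray) -> np.ndarray:
    """landmarks_3d: (68, 3) FLAME facial landmarks.
    Returns the flattened mouth landmarks: 20 points x 3 coords = 60 values."""
    lips = landmarks_3d[MOUTH_IDX]       # (20, 3)
    return lips.reshape(-1)              # (60,)
```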
[1] Prajwal K R, Mukhopadhyay R, Namboodiri V P, et al. A lip sync expert is all you need for speech to lip generation in the wild[C]//Proceedings of the 28th ACM International Conference on Multimedia. 2020: 484-492.
[2] Ye Z, Jiang Z, Ren Y, et al. GeneFace: Generalized and high-fidelity audio-driven 3D talking face synthesis[J]. arXiv preprint arXiv:2301.13430, 2023.
Q9: fig 1a: The method name is missing in the caption. & fig 1b: The triangle and circle shapes representing features lack labels indicating what they refer to. & fig 4 and Ablation Studies (pages 9,10): "Ecotalk" is used instead of "EcoFace."
A9: Thanks for your careful and accurate reading of our paper. We have fixed the errors in the revised version.
Thank you for your comprehensive response. I appreciate the authors' efforts in providing updated evaluations for Q2 and the work involved in addressing it. While most of my questions have been thoroughly addressed, I still have concerns regarding the lack of a quantified explanation in Q1, Q3, and Q4. Additionally, for Q4, the lack of a demonstration of the proposed approach (e.g., pivot identity conditioning as suggested by the authors) for speaker-specific generation weakens the impact of this key contribution, a concern I share with other reviewers. Given these concerns, I will maintain my ratings.
Thank you for your continued valuable feedback and for carefully reviewing the revised manuscript. We appreciate your thoughtful comments and are glad that most of the concerns have been addressed. We understand that a few questions remain unresolved, and we have conducted further experiments to provide clearer answers.
Quantified explanation in Q1
To better evaluate the stability of the model when dealing with anomalous data in the training set, we aim to identify anomalous ground truth data within the RAVDESS and HDTF datasets and compare the generation stability of Faceformer, Codetalker, and our model.
- Error calculation. To select anomalous data from the training set, we use the camera parameters estimated by EMOCA to obtain the predicted landmark locations in image space. Specifically, after driving the FLAME model we obtain 68 3D landmarks and project them into the image using the camera parameters. We then use an off-the-shelf detector (e.g., face_alignment) to estimate the 68 landmark locations on the GT image. The L2 error is computed for every landmark, and the maximum over all landmarks serves as the anomaly indicator. We define this error as LE (landmark error): $\mathrm{LE} = \max_{i \in \{1,\dots,68\}} \| \hat{p}_i - p_i \|_2$, where $\hat{p}_i$ is the $i$-th projected landmark and $p_i$ is the corresponding landmark estimated on the GT image.
- Identifying Anomalous Data in the Training Set. We use the Interquartile Range (IQR) method to detect anomalous data: a sample is considered anomalous if its LE exceeds the threshold $Q_3 + 1.5\,\mathrm{IQR}$, where $Q_3$ is the third quartile and $\mathrm{IQR} = Q_3 - Q_1$ (a minimal sketch of the LE and IQR computation is given at the end of this response).
- Comparison of Generation Stability. Using the identified anomalous data, we compute the LE for FaceFormer, CodeTalker, and our model. As there are no anomalous samples in the HDTF dataset, it is omitted from the results below. Despite the anomalies in the training data, our model still produces noticeably more stable generations.
| Method | LE on RAVDESS |
|---|---|
| GT | 681.759 |
| FaceFormer | 455.312 |
| CodeTalker | 408.980 |
| Ours | 185.125 |
- Another way to induce stability. EmoTalk [1] proposes a smoothing loss to reduce the impact of anomalous data during training. We also adopted this approach and trained our model with this loss to observe any improvement when handling anomalous data. The results are as follows. Although introducing the smoothing loss yields a slight improvement, the difference is minimal, suggesting that our generative method is inherently robust to extreme cases and already mitigates the effects of anomalies in the data.
| Method | LE on RAVDESS |
|---|---|
| GT | 681.759 |
| Ours | 185.125 |
| Ours w/ smoothing loss | 178.722 |
[1] Peng Z, Wu H, Song Z, et al. EmoTalk: Speech-driven emotional disentanglement for 3D face animation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 20687-20697.
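For reference, the sketch below illustrates the LE computation and IQR-based anomaly detection described above, assuming the predicted and GT 2D landmarks have already been obtained (e.g., via EMOCA camera projection and face_alignment); names are illustrative.

```python
import numpy as np

def landmark_error(pred_lmk_2d: np.ndarray, gt_lmk_2d: np.ndarray) -> float:
    """LE: maximum per-landmark L2 error between the projected FLAME landmarks
    and the 68 landmarks estimated on the GT image (both of shape (68, 2))."""
    return float(np.linalg.norm(pred_lmk_2d - gt_lmk_2d, axis=1).max())

def iqr_outliers(errors: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Flag samples whose LE exceeds the IQR upper fence Q3 + k * IQR."""
    q1, q3 = np.percentile(errors, [25, 75])
    threshold = q3 + k * (q3 - q1)
    return errors > threshold
```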
Quantified explanation in Q4
After training with the pivot identity, the performance metrics under the condition of using the pivot identity are as follows.
|  | RAVDESS |  |  |  | HDTF |  |  | VOCASET |  |  | MEAD |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Methods | VE-FID↓ | LVE↓ | LSE-D↓ | LSE-C↑ | LVE↓ | LSE-D↓ | LSE-C↑ | LVE↓ | LSE-D↓ | LSE-C↑ | VE-FID↓ | LVE↓ | LSE-D↓ | LSE-C↑ |
| Ours | 21.57 | 2.19 | 9.616 | 1.010 | 2.61 | 10.253 | 0.823 | 3.86 | 10.757 | 0.743 | 32.44 | 5.21 | 9.113 | 0.709 |
| Ours (w/ pivot identity) | 21.53 | 2.11 | 9.601 | 1.123 | 2.48 | 10.167 | 0.833 | 3.65 | 10.698 | 0.782 | 33.78 | 5.73 | 9.132 | 0.721 |
As mentioned earlier, the use of a pivot identity allows us to quickly fine-tune our decoder for an unseen person, enabling us to capture the specific speaking style of that person efficiently. For example, when testing on an unseen person, we use Actor 24 in RAVDESS as a case study. The results before and after the fine-tuning (within 20 minutes) are as follows.
| Actor-24 | VE-FID↓ | LVE↓ | LSE-D↓ | LSE-C↑ |
|---|---|---|---|---|
| Pivot identity | 23.13 | 2.32 | 9.685 | 1.110 |
| Fine-tuned (speaker-specific) | 20.55 | 2.21 | 9.695 | 1.088 |
We provide a brief demonstration here, where it can be observed that after a short fine-tuning period we are able to capture the speaker's eye expression characteristics while maintaining the overall emotional expression and the accuracy of lip-sync.
We hope this clarification helps to address the concern. Once again, thank you for your time and consideration. If there are any additional questions or if further details are needed, we would be more than happy to provide further explanations and address any remaining concerns.
Quantified explanation in Q3
As mentioned in our previous response, when disentangling the feature spaces of different emotions, the intensity of each emotion is already embedded within these features. Therefore, even if two emotions are similar in terms of arousal, their intensities can still be independently distinguished.
To quantitatively analyze whether the features of different emotions and different intensity levels are sufficiently distinguishable, we calculated the Euclidean distance matrix over the feature vectors of each emotion-intensity category. The distance between two categories $E_i$ and $E_j$ is computed as $d(E_i, E_j) = \frac{1}{|E_i|\,|E_j|} \sum_{x \in E_i} \sum_{y \in E_j} \| x - y \|_2$, i.e., the average Euclidean distance over all pairs of feature vectors drawn from the two categories.
We consider seven emotions (calm, happy, surprise, angry, disgust, fear, and sad), each with two intensity levels, giving $7 \times 2 = 14$ categories in total.
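A minimal sketch of how such a distance matrix can be computed is given below, assuming the pooled emotion features are grouped by a combined (emotion, intensity) label; names are illustrative. The resulting 14 × 14 matrix is reported in the table that follows.

```python
import numpy as np

def emotion_distance_matrix(features: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """features: (N, D) pooled emotion feature vectors
    labels:   (N,)   integer labels for the 14 emotion-intensity categories
    Returns the (K, K) matrix of average pairwise Euclidean distances."""
    cats = np.unique(labels)
    dist = np.zeros((len(cats), len(cats)))
    for i, ci in enumerate(cats):
        for j, cj in enumerate(cats):
            fi, fj = features[labels == ci], features[labels == cj]
            d = np.linalg.norm(fi[:, None, :] - fj[None, :, :], axis=-1)   # all pairwise distances
            if ci == cj:
                # exclude the zero self-distances within the same category
                mask = ~np.eye(len(fi), dtype=bool)
                dist[i, j] = d[mask].mean()
            else:
                dist[i, j] = d.mean()
    return dist
```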
| Calm_level1 | Calm_level2 | Happy_level1 | Happy_level2 | Sad_level1 | Sad_level2 | Angry_level1 | Angry_level2 | Fear_level1 | Fear_level2 | Disgust_level1 | Disgust_level2 | Surprise_level1 | Surprise_level2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Calm_level1 | 0.235 | 0.361 | 1.654 | 1.785 | 1.372 | 1.486 | 2.076 | 2.203 | 2.164 | 2.201 | 1.873 | 1.889 | 2.289 | 2.339 |
| Calm_level2 | 0.361 | 0.251 | 1.749 | 1.867 | 1.389 | 1.493 | 2.208 | 2.324 | 2.233 | 2.268 | 1.956 | 1.973 | 2.405 | 2.452 |
| Happy_level1 | 1.654 | 1.749 | 0.253 | 0.313 | 1.349 | 1.327 | 1.348 | 1.388 | 1.288 | 1.242 | 1.502 | 1.438 | 1.153 | 1.165 |
| Happy_level2 | 1.785 | 1.867 | 0.313 | 0.229 | 1.411 | 1.363 | 1.372 | 1.377 | 1.279 | 1.211 | 1.553 | 1.479 | 1.183 | 1.179 |
| Sad_level1 | 1.372 | 1.389 | 1.349 | 1.411 | 0.236 | 0.324 | 1.751 | 1.832 | 1.135 | 1.174 | 1.304 | 1.309 | 1.694 | 1.731 |
| Sad_level2 | 1.486 | 1.493 | 1.327 | 1.363 | 0.324 | 0.259 | 1.789 | 1.849 | 1.095 | 1.114 | 1.349 | 1.343 | 1.692 | 1.721 |
| Angry_level1 | 2.076 | 2.208 | 1.348 | 1.372 | 1.751 | 1.789 | 0.214 | 0.313 | 1.533 | 1.483 | 1.241 | 1.179 | 1.301 | 1.274 |
| Angry_level2 | 2.203 | 2.324 | 1.388 | 1.377 | 1.832 | 1.849 | 0.313 | 0.142 | 1.567 | 1.494 | 1.296 | 1.222 | 1.349 | 1.305 |
| Fear_level1 | 2.164 | 2.233 | 1.288 | 1.279 | 1.135 | 1.095 | 1.533 | 1.567 | 0.253 | 0.337 | 1.528 | 1.487 | 1.321 | 1.325 |
| Fear_level2 | 2.201 | 2.268 | 1.242 | 1.211 | 1.174 | 1.114 | 1.483 | 1.494 | 0.337 | 0.169 | 1.539 | 1.487 | 1.313 | 1.306 |
| Disgust_level1 | 1.873 | 1.956 | 1.502 | 1.553 | 1.304 | 1.349 | 1.241 | 1.296 | 1.528 | 1.539 | 0.227 | 0.306 | 1.168 | 1.182 |
| Disgust_level2 | 1.889 | 1.973 | 1.438 | 1.479 | 1.309 | 1.343 | 1.179 | 1.222 | 1.487 | 1.487 | 0.306 | 0.255 | 1.122 | 1.129 |
| Surprise_level1 | 2.289 | 2.405 | 1.153 | 1.183 | 1.694 | 1.692 | 1.301 | 1.349 | 1.321 | 1.313 | 1.168 | 1.122 | 0.173 | 0.255 |
| Surprise_level2 | 2.339 | 2.452 | 1.165 | 1.179 | 1.731 | 1.721 | 1.274 | 1.305 | 1.325 | 1.306 | 1.182 | 1.129 | 0.255 | 0.196 |
As the results show, features of different intensity levels within the same emotion are close yet still distinguishable, while features of different emotions are clearly separated regardless of intensity level. Even for arousal-similar pairs such as "surprise" and "angry," both the emotion categories and the intensity levels remain separable.
Dear Reviewer AbM6,
Thank you again for your time and effort in reviewing our work! We would appreciate it if you can let us know if our response has addressed your concern. As the end of the rebuttal phase is approaching, we look forward to hearing from you and remain at your disposal for any further clarification that you might require.
Thanks in advance,
Paper 10320 authors
Thank you again for the elaborative discussion and for addressing my concerns. I have increased my rating to 6. Kindly ensure the updated material is incorporated into the paper.
Dear Reviewer,
Thank you for your positive feedback and for increasing the rating! We greatly appreciate your recognition and will ensure the updated material is carefully incorporated into the paper as suggested. Once again, thank you for your time and thoughtful suggestions!
The submission proposes EcoFace, a method for generating 3D talking heads from input speech. To overcome issues exhibited by prior work, the authors introduce explicit signals for emotional motion representation and intensity control using audio-visual loss and a contrastive triplet loss to distinguish emotion intensities. The submission received final ratings of 6, 6, 6, 8. The ACs did not find enough reason to overturn the positive consensus and recommend acceptance.
Additional Comments from the Reviewer Discussion
Some key weaknesses pointed out by the reviewers include:
- Lack of generalization to unseen speakers
- Lack of details and comparisons
The authors provided detailed answers to all the concerns, and the reviewers ended up increasing the rating given to the submission.
Accept (Poster)