CemiFace: Center-based Semi-hard Synthetic Face Generation for Face Recognition
diffusion model to generate semi-hard samples for synthetic face recognition
Abstract
Reviews and Discussion
This paper proposes CemiFace, a novel diffusion-based approach for generating synthetic face images with varying levels of similarity to their identity centers. The authors argue that semi-hard negative samples, those with moderate similarity to the center, are crucial for training effective face recognition models. The core of CemiFace lies in its ability to control the similarity between generated images and the input (identity center) during the diffusion process. This is achieved by injecting a similarity controlling factor (m) as a condition that regulates the similarity level. The paper presents a comprehensive analysis of the relationship between sample similarity and face recognition performance, showing that semi-hard samples, generated with m close to 0, achieve the best accuracy. CemiFace demonstrates significant improvements over previous methods in terms of accuracy, particularly on pose-sensitive datasets. The paper further validates its effectiveness through qualitative visualizations and ablation studies that examine the impact of various factors, including training data, inquiry data, and the similarity controlling factor. Overall, this paper contributes a valuable approach to generating synthetic face datasets for face recognition with enhanced discriminative power. The method shows promise in mitigating privacy concerns associated with collecting and using real-world face data while maintaining robust recognition performance.
Strengths
- Discovery of the importance of similarity control in synthetic face generation: CemiFace is motivated by the discovery that face images with a certain degree of similarity to their identity centers are highly effective for training FR models. This is an important discovery for the community of synthetic dataset generation.
- Unique use of similarity control: CemiFace introduces a similarity controlling factor (m) within the diffusion process, enabling the generation of faces with varying levels of similarity to the input image. This provides fine-grained control over the generated data distribution, which is a unique feature compared to existing methods.
- Comprehensive analysis of similarity: The authors present a thorough analysis of the impact of different similarity levels on face recognition performance, validating their hypothesis about the importance of semi-hard samples. This analysis provides valuable insights into the relationship between data distribution and model effectiveness.
- Rigorous experimental evaluation: The paper conducts comprehensive experiments across various benchmark datasets and data volumes, comparing CemiFace with other state-of-the-art synthetic face generation methods. The ablation studies provide a detailed understanding of the influence of different parameters and factors on the model's performance.
- Robustness of CemiFace: The experiments demonstrate the robustness of CemiFace to different training data, inquiry data, and similarity controlling factors. The method consistently achieves superior results, demonstrating its effectiveness and generalizability.
Weaknesses
- An in-depth discussion of why face images with a certain similarity are more beneficial as a training dataset for the face recognition model would strengthen the paper. For example, an analysis such as comparing the similarity to that of the real dataset and checking whether the difficulty of the CemiFace synthetic dataset becomes closer to that of the real dataset would be nice. Other analyses that offer insights into why certain similarity control is important would also be welcome.
Questions
- Written in the weaknesses section.
Limitations
- This is a well-written paper with a meaningful discovery on training with synthetic datasets. The paper would be better positioned at NeurIPS if it offered more insightful analysis of why similarity control is beneficial, on top of the empirical benefits.
W1: Thank you for your insightful suggestion and positive feedback. We assume the benefits of the semi-hard training face images could be attributed to:
**(1)** easy training samples are typically images where the face is clear, well-lit, and faces the camera directly; training on such easy samples would not allow the trained FR models to generalize to face images with large pose/age/expression variations and the different lighting conditions/backgrounds that frequently occur in real-world applications. AdaFace[3] also mentioned that easy samples could be beneficial to early-stage training, while hard sample mining is needed for achieving generalized and effective FR models;
**(2)** hard samples normally contain noisy data. Specifically, FaceNet[28] demonstrated that mining the hardest samples with a large batch size leads to difficult convergence and produces inferior performance, because training with very hard samples may not allow FR models to learn effective features, but instead makes them focus on cues unrelated to facial identity (a minimal sketch of such semi-hard mining is given below this list);
**(3)** Semi-hard samples generated by CemiFace mostly contain large-pose faces but less face-unrelated noise. We also evaluated the training epochs needed to reach the highest AVG performance for easy samples, semi-hard samples and extreme hard samples. Easy samples take 10 epochs to reach the best AVG and 20 epochs to drive the training loss to 0; semi-hard samples take much longer (38 epochs) to reach the highest AVG, with a final training loss around 3; and FR models trained on extreme hard samples could not converge.
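For reference, a minimal sketch of FaceNet-style semi-hard negative mining (our own illustrative code under simplified assumptions, not from the paper) is:

```python
import torch

def semi_hard_negatives(anchor, positive, negatives, margin=0.2):
    """FaceNet-style semi-hard mining sketch: keep negatives that are farther
    from the anchor than the positive, but still within the margin, i.e.
    d(a, p) < d(a, n) < d(a, p) + margin. Inputs are L2-normalized embeddings:
    anchor/positive of shape (D,), negatives of shape (N, D)."""
    d_ap = (anchor - positive).pow(2).sum(-1)                # squared distance to the positive
    d_an = (anchor.unsqueeze(0) - negatives).pow(2).sum(-1)  # squared distances to all negatives
    mask = (d_an > d_ap) & (d_an < d_ap + margin)            # the semi-hard band
    return negatives[mask]
```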
Since the real inquiry center is not available, we further calculated the average similarity (computed with ArcFace for fairness, i.e., avoiding information from AdaFace and CosFace) between randomly paired images, based on 200 randomly selected identities from CASIA-WebFace, DCFace and CemiFace. Real face images in CASIA-WebFace give an average similarity of 0.51, while face images generated by our CemiFace have a similarity of 0.48. By contrast, face images generated by DCFace deviate more from the real face images, with a much easier average similarity of 0.57.
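A minimal sketch of this measurement (our own illustrative code; `embed` stands for a hypothetical pretrained ArcFace encoder returning one embedding per image) could be:

```python
import random
import torch.nn.functional as F

def avg_pair_similarity(images_by_id, embed, n_ids=200, pairs_per_id=10):
    """Sample identities, draw random image pairs within each identity, and
    average the cosine similarity of their (normalized) embeddings."""
    sims = []
    for ident in random.sample(list(images_by_id), n_ids):
        imgs = images_by_id[ident]
        for _ in range(pairs_per_id):
            a, b = random.sample(imgs, 2)
            ea = F.normalize(embed(a), dim=-1)
            eb = F.normalize(embed(b), dim=-1)
            sims.append((ea * eb).sum().item())
    return sum(sims) / len(sims)
```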
We have also added an experiment to demonstrate the actual similarity to the inquiry data in the PDF Tab. 2. It shows that our generated face images exhibit larger distances (lower similarity) to the inquiry centers than those of DCFace. We have also counted the number of face images belonging to different similarity groups for CemiFace and DCFace in the PDF Tab. 3, indicating that our CemiFace tends to generate images with lower similarities to their identity centers (i.e., all samples are semi-hard), while DCFace contains more easy samples.
Thank you for the rebuttal. My questions have been answered. Also, after reading the other reviews and rebuttals, it seems that the authors have sufficiently answered the questions. I hold my position as it is, as I believe it is a paper with a meaningful discovery on training with synthetic datasets.
Dear Reviewer Crof,
Thank you for your appreciation of our work and your efforts in reviewing this manuscript. We will try to include the insightful discussion you suggested in the revised version.
Best regards, The Authors of Paper 11025
The paper introduces an approach called CemiFace for generating synthetic face images to enhance face recognition (FR) models. The paper provides the first in-depth analysis of how FR model performance is influenced by samples with varying levels of similarity to the identity center, focusing particularly on center-based semi-hard samples. The authors propose a unique diffusion-based model that can generate face images with different levels of similarity to the identity center. This model can produce infinite center-based semi-hard face images for synthetic face recognition (SFR). The method can be extended to leverage large amounts of unlabeled data for training, providing an advantage over previous methods. Experimental results demonstrate that CemiFace significantly outperforms existing SFR methods, reducing the GAP-to-Real error by half and showcasing promising performance in synthetic face recognition.
Strengths
- Focusing on center-based semi-hard samples to enhance face recognition performance is a fresh problem formulation that addresses a notable gap in current methodologies.
- The paper provides a solid experimental validation of its proposed approach. The authors investigate factors affecting performance degradation in synthetic face recognition and offer a hypothesis about the importance of mid-level similarity samples.
Weaknesses
- The method for determining GtR remains unclear. Justification regarding how the proposed model yields a low GtR is absent. Is this low GtR attributed to the utilization of real inquiry images? If so, what measures guarantee that the synthetic facial images remain uncorrelated with the real facial images? In other words, I have a reservation that the method may not generate "true" synthetic data but instead relies heavily on an inquiry image. Therefore, it is reasonable to see why a low GtR is obtained.
- Figure 5 demonstrates that different identities (such as different genders) can be obtained with different m, even with the same input query. There seems to be no way to control the "number of identities" generated from this model. If so, how was the supervised loss applied to train a face recognition model?
- How can one ensure high inter-class and large intra-class variations as required for SFR?
- B.3.3: The assertion that high-quality data is not indispensable for achieving markedly accurate facial recognition performance is somewhat counterintuitive and perplexing.
- The method's reproducibility raises concerns, particularly with respect to the training of the model, which lacks clarity. Specifically, the functions F_1 and F_2 in Equations (6) and (7), as well as the role of C_att, are not explicitly defined, and these elements are absent from Figure 3.
- The proposed model generally lacks controllable factors to generate true synthetic face images that favor high inter-class and intra-class variations.
Questions
See above
Limitations
No issues are found in this aspect.
W1: The low GtR is not attributed to real inquiry data. For instance, even with the same synthetic DDPM inquiry data, our CemiFace surpasses the previous state-of-the-art method DCFace (clearly illustrated in the upper part of Tab. 4), and the DDPM-inquired setting provides performance close to real data. In fact, the low GtR is attributed to the way the facial dataset is constructed, i.e., containing center-based semi-hard samples.
To calculate GtR with a fair comparison to other SOTA methods, we train on the DCFace released data, reproducing an AVG performance close to that of their paper by using the CosFace loss. This is because the SFR training code of previous methods is not available, which prevents us from reproducing their results directly. Notably, our reproduced DCFace result is higher than that reported in [23].
W2: Only images belonging to one identity can be obtained from each input query, regardless of the input m value. Although face images generated from the same query exhibit large differences, our approach assigns them the same identity label for the later SFR model training. Consequently, the number of identities is fully decided by the number of inquiry face images, which is mentioned in Fig. 1: "With our proposed CemiFace, each inquiry image finally forms a novel subject." We will further explain this in the revision.
W3: (1) High inter-class variations: each inquiry face image is selected to be highly independent of other inquiry images. Specifically, we follow DCFace in using a pre-trained FR model to keep samples whose pairwise similarity is lower than a threshold of 0.3. We also elaborate on this in lines 287-289.
(2) High intra-class variations: high intra-class variations are ensured by (a) changing the similarity condition m, as a small input similarity results in the generated semi-hard images belonging to the same identity having long distances to the identity center; and (b) the face images of the same identity generated by CemiFace being distributed in all directions from the identity center, which can be observed in the t-SNE plot in supplementary material Fig. 7. This is guaranteed by the randomly sampled Gaussian noises input to the diffusion model, which exhibit large variation. As a result, both properties ensure that the generated face images of the same identity are almost evenly distributed on a sphere with a relatively large radius, and thus they have high intra-class variations.
W4: Although our approach achieves a slightly worse FID than the previous state-of-the-art DCFace, face images generated by our approach still result in better face recognition performance, suggesting that the semi-hard face images generated by our CemiFace compensate for their slightly worse FID compared to the DCFace images, which mix easier and hard samples. CemiFace still significantly outperforms DigiFace in FID. On the other hand, high-quality inquiry data is essential, as discussed in Section 4.2.2 (lines 263-282, Table 4) and supplementary material Sections B.1 & B.2.
We rephrase the last sentence in B.3.3 as: 'Our method doesn’t intend to generate images similar to the distribution of CASIA-WebFace, but to construct a discriminative dataset that is conducive to providing highly accurate FR performance'
W5: F_1 and F_2 are two stacked linear layers. Here, F_1 projects the input similarity m to a latent feature C_sim, and then F_2 projects the feature concatenating C_sim and the identity embedding to a condition vector C_att that includes both the input similarity and identity information. We have provided more details about Figure 3 and the training pipeline in the general response G1 and the PDF.
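For illustration, a minimal sketch of this condition branch (our own code; the layer dimensions are assumptions, not from the paper) is:

```python
import torch
import torch.nn as nn

class ConditionModule(nn.Module):
    """Sketch of the condition branch: F_1 projects the scalar similarity m to
    a latent feature C_sim; F_2 projects the concatenation of C_sim and the
    identity embedding to the condition vector C_att. Dimensions are assumed."""
    def __init__(self, id_dim=512, sim_dim=128, cond_dim=512):
        super().__init__()
        self.f1 = nn.Linear(1, sim_dim)                  # F_1: m -> C_sim
        self.f2 = nn.Linear(sim_dim + id_dim, cond_dim)  # F_2: [C_sim, id_emb] -> C_att

    def forward(self, m, id_emb):
        c_sim = self.f1(m.unsqueeze(-1))                     # (B, sim_dim)
        c_att = self.f2(torch.cat([c_sim, id_emb], dim=-1))  # (B, cond_dim)
        return c_att
```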
W6: we kindly disagree. High inter-class variation is ensured by filtering the inquiry data with a cosine similarity lower than 0.3 provided by pretrained FR model, following DCFace (mentioned in lines 288-289). As for high intra-class variations, our proposed diffusion model can generate high-variation samples belonging to each identity. Please refer to the answer of W3.
Thank you for your response. The authors have addressed my questions satisfactorily.
Dear Reviewer 5xKY,
Thank you for your positive feedback in reviewing our paper. We will try to change the manuscript according to your suggestion in the updated version.
Best Regards, The Authors of Paper 11025
The paper proposes a novel approach named CemiFace to address privacy concerns in face recognition technology. CemiFace is a diffusion-based method that generates synthetic face images with controlled similarity to a subject's identity center, enhancing the discriminative quality of the samples. This approach allows for the creation of diverse and effective datasets for training face recognition models without the need for large-scale real face images, thus mitigating privacy risks. CemiFace outperforms existing synthetic face recognition methods, significantly reducing the performance gap compared to models trained on real datasets. The paper also discusses the potential limitations and privacy implications of the approach, highlighting the need for ethical considerations in synthetic face generation for face recognition applications.
Strengths
- The use of a diffusion-based model for generating semi-hard samples is an innovative approach that has not been extensively explored in the field of face recognition.
- The approach can be extended to use unlabeled data for training, which is an advantage over previous methods that often require some form of supervision.
Weaknesses
- The paper is not well organized. It should be reorganized to make it easier for the reader to understand its contributions and technical details.
- Eq. 10 seems to be inconsistent with its description. According to the description, it is highly related to the time step.
- Fig. 3 is hard to understand. The training losses are not illustrated in the figure.
- Despite aiming to reduce privacy issues, CemiFace still uses a pre-trained model that could have been derived from datasets without user consent, raising ethical and privacy concerns.
Questions
Refer to weakness
Limitations
Refer to weakness
W1: In the last part of the Introduction section, we have clearly and specifically listed four contributions of our work: 1. a new and crucial finding; 2. a technical contribution (i.e., the CemiFace face image generator) inspired by the finding; 3. an application contribution of our proposed technical approach; and 4. the effectiveness of our approach. Based on your suggestion, we will additionally add a paragraph at the beginning of the Method section to guide readers, as follows: In Section 3.1, we first investigate the relationship between sample similarity and effectiveness in training FR models, presenting the finding that samples with certain similarities to their identity centers (i.e., center-based semi-hard samples) are more effective for training FR models on a real dataset, and we subsequently devise a toy experiment to validate it. Inspired by this finding, in Section 3.2 we propose CemiFace, a novel conditional diffusion model that produces images with various levels of similarity to an inquiry image. Specifically, Section 3.2.1 introduces how we construct the similarity condition fed to the diffusion model to guide the generation, and discusses the SimMat loss that requires the generated sample to exhibit a certain degree of similarity to the inquiry image. In Section 3.2.2, we then present how to use our diffusion model to generate a synthetic face dataset given a fixed similarity condition and a set of inquiry images.
W2: As clearly presented in line 174 (above Eq. 10), Eq. 10 is not related to the time step; instead, it represents the loss for reconstructing the identity of the inquiry image, which is the first part of our Time-step Dependent loss. As presented in line 177, Eq. 12 (rather than Eq. 10) defines our SimMat loss, inspired by the Time-step Dependent loss (DCFace[23]), which makes the sample have a different similarity property at different time steps t. We will rephrase lines 173-174 as: we employ the Time-step Dependent loss [23] with different time steps t in Eq. 12, consisting firstly of an identity loss for recovering the identity of the original inquiry image x, which is applied to reproduce the original facial embedding at the corresponding time steps.
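As a purely schematic illustration (a form we assume here for exposition; the paper's exact Eq. 12 may differ), such a time-step dependent similarity loss can be written as a weighted deviation of the generated sample's similarity from the target m:

$$
\mathcal{L}_{\mathrm{SimMat}}(t) \;=\; \gamma(t)\,\Big(\cos\big(e(\hat{x}_0),\, e(x)\big) - m\Big)^{2},
$$

where $e(\cdot)$ denotes the pretrained FR embedding, $\hat{x}_0$ the estimated clean image, and $\gamma(t)$ a time-step dependent weight.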
W3: Please refer to the general response G1, where we have rephrased the pipeline and Fig. 3.
W4: Ethical and privacy issues are the top priority of our study. While a crucial goal of developing SFR is to eliminate the privacy issues that lie in face recognition datasets, current SFR approaches are not capable of avoiding privacy risks while providing effective recognition performance. Compared with the previous SOTA approaches DCFace[23] and IDiffFace[27], our method already reduces the privacy risk by avoiding the use of labelled face recognition images for model training. These points are clearly presented in supplementary material Section C. More importantly, we emphasize that the model development protocol in this paper strictly follows the protocol of previous peer-reviewed related studies (e.g., DCFace and IDiffFace) published in top journals and conferences. Thus, we believe our study would not trigger ethical and privacy-related issues.
Thank you for answering my questions. I have no further questions.
Dear Reviewer L3p9,
Thank you for your positive feedback in reviewing our paper. We will re-organize Fig.3 and the overall structure to make it easier for the reader according to your suggestions.
Best Regards, The Authors of Paper 11025
The paper titled "CemiFace: Center-based Semi-hard Synthetic Face Generation for Face Recognition" addresses a critical issue in face recognition (FR) related to privacy and performance degradation when using synthetic face images. The authors propose a diffusion-based approach, CemiFace, which generates facial samples with varying levels of similarity to an identity center. This method aims to enhance the discriminative quality of synthetic samples, thereby improving the performance of FR models trained on these datasets.
Strengths
Introducing Similarity controlling factor in synthetic face generation using a diffusion based approach.
Weaknesses
(a) Due to the introduction of this similarity control conditioning in the diffusion process, there must be a change in total sampling time (it will certainly also depend on the number of time steps considered in the diffusion process). An illustration/analysis of the computational complexity of the proposed algorithm is needed.
(b) It seems the overall process is dependent on how (i.e., using which method) the value of m is determined during the diffusion process.
(c) Complete pseudo-code for the proposed method would have helped the reader to understand the whole process.
(d) Figure 3 could have been elaborated in much more detail.
Questions
When comparing the proposed work with other SOTA methods, how did you generate the results of the SOTA methods?
Limitations
Limitations were mentioned only in the last few sentences of the conclusion section but not otherwise stated separately.
W1: We first define some basic calculation complexities:
Time Step $T$: the total number of time steps required for a complete diffusion process.
UNet Complexity $O(U)$: the cost of one pass of the UNet model, which accepts the input image and outputs the estimated noise.
Pretrained Face Recognition Model $O(F)$ and Similarity Condition $O(C)$: the identity embedding and the similarity condition are processed by two sequential linear layers.
The backpropagation and weight-updating time complexity for the UNet is given as $O(U)$. Since the pretrained face recognition model is fixed, there is no computational complexity associated with it during backpropagation.
Forward Diffusion Process:
The forward diffusion process consists solely of adding noise to the clean image over $T$ time steps, resulting in a calculation complexity of $O(T)$.
Denoising Process:
The denoising process also consists of $T$ time steps to gradually denoise a noisy image. This process involves the UNet and the conditions to output the estimated noise, resulting in a calculation complexity of $O(F + C + T \cdot U)$.
Training Step:
A training step includes a forward diffusion step, a denoising step, and model training (including loss calculation and backpropagation). Note that the model training only involves one time step. The overall complexity of training a CemiFace model by one step is given by $O\big(N(2U + 3F + C)\big)$,
where $N$ is the number of samples; note that the loss calculation involves the similarity comparison between the estimated image $\hat{x}$ and the original image $x$, thus the pretrained FR model is adopted 3 times in each training iteration. For a standard diffusion model, the complexity is $O(2NU)$.
Accumulated Training Iterations:
When accumulating all the training iterations $I$, the calculation complexity becomes $O\big(I \cdot N(2U + 3F + C)\big)$.
Generating One Sample:
To generate one sample, the process involves denoising random noise from time step $T$ to 0. Since the forward diffusion process and loss calculation are no longer needed, the complexity is $O(F + C + T \cdot U)$.
Generating a Dataset:
For generating a dataset with $K$ identities and $S$ samples for each identity, the complete calculation complexity is $O\big(K(F + C + S \cdot T \cdot U)\big)$.
On a single A100 GPU, (1) generating a sample with 50 time steps takes 0.9 seconds; (2) one CemiFace training step (one inquiry image) takes 0.6 seconds; (3) training for one epoch takes 45 minutes; (4) it takes 10 epochs to converge, with a total time of 7.5 hours; and (5) it takes 16 hours to generate a dataset of 10,000 identities, each with 50 images. We have mentioned the computational cost in supplementary material Sec. A.1, lines 473-480.
W2: The overall process depends not only on m but also on the input inquiry image. The conditions sent to the diffusion model are determined by m and the inquiry data using the linear projections F_1 and F_2. The selection of inquiry images follows the rule that they should be unblurred, non-occluded, appropriately posed and independent of each other (discussed in Sec. 4.2.2, supplementary material B.1 and B.2); this is also required by SFR to ensure high inter-class variation. The m used during generation then directly determines the final similarity property of the samples inside each identity group; we find the optimal m is the scalar 0, and mixing m values during generation brings worse performance (lines 232-243).
W3: The pseudo-code is given in the PDF, Algorithms 1 and 2.
W4: The revised figure is appended in the PDF Fig. 1, while its description is provided in the general response G1, which will be added to the main text.
Q1: In Tab. 6, the results of all competitors except DCFace are obtained from their original publications. Here, we additionally train an SFR model based on the released synthetic face images/dataset generated by the SOTA DCFace model, using the CosFace loss, to facilitate a fair comparison with ours. This is because DCFace hasn't released its AdaFace-based SFR training code and details, and thus we were not able to reproduce it for our SFR model training. For this SFR training, as described in lines 215-229, we use IR-SE-50 as the backbone and the CosFace loss for learning facial embeddings, where the hyperparameters are set the same as in standard CosFace.
L1: Although most papers only discuss their limitations in the conclusion section due to page limitations, our submission details limitations not only in the conclusion but also in: (1) Sec. 4.2.2 (lines 276-295) and Sec. B.1 and B.2 of the supplementary material (i.e., high-quality inquiry data is essential); (2) Sec. C of the supplementary material (i.e., privacy issues); and (3) Sec. D.1 of the supplementary material (i.e., the training of our CemiFace depends on the pre-trained FR model, and thus its performance also relies on the performance of this model).
Dear Authors,
Thanks for your detailed explanation on all my queries. I don't have any further questions.
With Regards, Reviewer fjPL
Dear Reviewer fjPL,
Thank you very much for your kind reply. We have noticed that the rating has not been changed. Please kindly let us know if there is any additional concern we can clarify. We sincerely appreciate your valuable suggestions and will try our best to address them.
Best regards, The authors of Paper 11025
Dear Authors,
I have now changed my scoring from 4 to 5.
With Regards, fjPL
The paper proposes a new Face Recognition diffusion-based generation method. The diffusion process is completed with a semi-hard constraint on the synthetic reconstructed image: for each inquiry image of the (real) training set, the reconstructed image after the forward-backward diffusion process must have a specific cosine similarity with the inquiry image. As is usual for such methods in Face Recognition, the resulting synthetic dataset is then used for training a Face Recognition model. This model is evaluated across diverse real datasets.
Strengths
The tackled problem is quite hard and, at the same time, much needed. Current SOTA Face Recognition generation methods lead to a significant performance gap compared to real Face Recognition datasets (of the same size). The idea of controlling the similarity to design semi-hard samples is also interesting.
Weaknesses
- In Fig. 1, is the displayed similarity really the cosine similarity? In the generated samples, the line with perfect similarity (equal to 1) seems to provide synthetic images which would not have a perfect similarity with the inquiry images displayed above the hypersphere.
- [minor] In Eq. 3, the probability distribution of epsilon is not specified.
- The authors should cite explicitly the works that use the training loss (Eq. 2) in this precise form, as there are alternative loss functions for diffusion models. A discussion on the reasons for this particular choice of diffusion loss might be a plus (e.g., in the appendix).
- [minor] Although lines 114-116 are accurate, they are misleading to the reader. The widely known representation of Face Recognition embeddings is that they lie on a hypersphere of dimension N, where each embedding is a point of the hypersphere. Those embeddings are clustered by identity on this sphere and the identity centers are roughly at the centers of those clusters. The hypersphere mentioned in this paper is a hypersphere of dimension N-1, where the identity center is at the center of the sphere.
- In lines 120-125, the authors should detail the range of similarities to the identity center for each of the 5 splits of the CASIA training set. Only the average similarity of each split is specified.
- [minor] Figure 3 should be explained a bit more than just by its caption.
- [major] Lines 155-162 are not well written and it is hard to understand how the margin m is used to guide the diffusion process. In particular, F_1 and F_2 are not defined, while some unused F is mentioned. C_sim seems to be a vector of unknown size. Also, the temporal guidance is too briefly described.
- [minor] Some hyperparameters' values (alpha_t/beta_t, lambda) are not specified.
- The right part of Fig. 4 displays two curves that do not have the same meaning for the x-axis. For AVG, the similarity is a constrained similarity (m) for training CemiFace (i.e., a similarity between a real inquiry image and a synthetic image). For CASIA, it is the similarity between one real image and its identity center (not a real image). To sum up, for AVG it is a similarity between 2 images, while for CASIA it is between 1 image and its identity center. Thus, comparing the two curves does not seem meaningful.
- [major] The CosFace loss is used to train on synthetic datasets, while AdaFace is used to produce (identity-oriented) embeddings for the CemiFace training set generation. There should be only one model for both tasks, for fair comparisons. In Table 6, training on CASIA with AdaFace gives better results than with CosFace, so one could attribute the good performance of CemiFace to the fact that the authors used a stronger model (AdaFace) to generate the synthetic dataset than the model used to train on this dataset (CosFace). In addition, there should be a part studying the impact of this AdaFace choice (i.e., another loss), at least in the appendix.
- The ROC curves on IJB-B/IJB-C for all synthetic methods of Table 6 would be a plus, as the accuracy is easily saturated and not really used in industrial use-cases. Previous papers (related works) provide such ROC plots.
Questions
- Could you explain the last sentence of Section 4.2.1 (lines 260-261)?
- In Section 4.2.2, why is the range of training m equal to [0,1] while the previous subsection concludes with an optimal range of [-1,1]?
Limitations
- [major] The SimMat loss seems to be an interesting idea to lead towards m-similarity to the inquiry image during training. But the derivation of the MSE loss (Eq. 3) assumes that the reconstructed image should be the inquiry image, and not a new image having an m-similarity with the inquiry image. I may be wrong here, but I think the diffusion loss of Eq. 3 is mathematically valid only if the forward diffusion process is symmetric to the reverse diffusion process, which is not the case here.
- [major] In Section 4.2.1, there should be a part studying the difference between the required m and the estimated m post-training. That means that, for any required m, it is easy to compute the estimated m, i.e., the similarity between the resulting reconstructed image and the inquiry image. This estimated m must be quite different from the required m, because Figure 4 shows that the best m is m=0, meaning that there is a 90-degree angle between the inquiry image and the synthesized image. If m were truly equal to 0, the performance of the model would be very poor. So there must be a difference between the required m and the real estimated m (post-training). This is also reflected in the conclusion saying that the best setting for m is to train CemiFace with m randomly sampled from [-1,1]. A similarity truly equal to -1 would produce a model with an astonishingly poor performance.

**Update:** I have increased my score from 4 to 5.
W1: The displayed similarities are the input cosine similarities, based on which the displayed face images were generated. However, the actual similarities between the generated images and their inquiry images may not be exactly the same as the input cosine similarities, since DL models typically cannot generate perfect/exact outputs. The actual cosine similarities between generated and inquiry images are provided in the PDF Tab. 2, measured by the IR-50 network pretrained using the AdaFace loss (line 154).
W2: The $\epsilon$ is a random Gaussian noise image fed to our diffusion model (line 145). We will explain this in the revision.
W3: We assume the mentioned training loss is Eq. 3 rather than Eq. 2. We will modify and cite as: we follow the previous SOTA SFR studies (DCFace[23] and IDiffFace[27]) to choose the same generic diffusion loss [20,21,22], ensuring the reproducibility of our approach and its fair comparison with DCFace and IDiffFace.
W4: Your description and ours reflect the same theory, described in different forms. We treat all face images of each subject as lying on an (N-1)-dimensional sphere, with its center representing the subject-level identity center. The spheres of all subjects can then be combined into an N-dimensional sphere, where each subject-level sphere forms a cluster. We will follow your suggestion to rephrase these sentences.
W5: The ranges for each group are listed in the Tab. 1 of the PDF. Here, the splitting boundary for each group varies across different identities.
W6: Please refer to the general response G1.
W7: The F should be F_1 and F_2, representing the linear projection operations; F_1 and F_2 are two stacked linear layers. Here, F_1 projects the input similarity m to a latent feature C_sim, and then F_2 projects the feature concatenating C_sim and the identity embedding to a condition vector C_att. Subsequently, C_att is fed to our diffusion model to control its face image generation via cross-attention (CA) as:
$$
CA(Q,K,V,K_{c},V_{c}) = \mathrm{SoftMax}\left(\frac{QW_{q}\left([K,K_{c}]W_{k}\right)^{T}}{\sqrt{d}}\right)W_{v}[V,V_{c}]
$$
where C_att is treated as the key $K_c$ and value $V_c$ (same as DCFace) to influence the latent representation extracted from the input noisy image (treated as the query $Q$, key $K$ and value $V$) for generating the final face image. Please also refer to the general response G1 for details.
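A schematic single-head PyTorch version of this conditioned cross-attention (our own simplification; the projection weights, shapes and single-token condition are assumptions) is:

```python
import torch
import torch.nn.functional as F

def conditioned_cross_attention(z, c_att, Wq, Wk, Wv):
    """Sketch of CA(Q,K,V,Kc,Vc): the UNet latent z supplies Q/K/V; the
    condition c_att is appended as an extra key/value token (Kc/Vc), so the
    softmax attends jointly over image tokens and the condition token.
    z: (B, N, d); c_att: (B, 1, d); Wq/Wk/Wv: (d, d) projection matrices."""
    q = z @ Wq                             # queries from the image latent
    kv = torch.cat([z, c_att], dim=1)      # [K, Kc] / [V, Vc]
    k, v = kv @ Wk, kv @ Wv
    d = q.shape[-1]
    attn = F.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)
    return attn @ v                        # (B, N, d) conditioned latent
```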
W8: $\beta_t$ starts from 0.0001 and increases to 0.02, controlled by a time-step-$t$-based linear schedule, while $\lambda$ is specified in line 101.
W9: Sorry for the confusion. We put two curves in the same figure due to the limited space. We will separate the two curves into two smaller figures.
W10: As DCFace hasn't released its AdaFace-based SFR training code and details, we were not able to reproduce it for our model training. Thus, lines 216-220 and lines 300-311 fairly compare ours with DCFace by adopting the same pre-trained AdaFace model to train our diffusion generator, and then employing the same CosFace loss for training both our and DCFace's SFR models. Results show that our CemiFace still outperforms the SOTA DCFace. Based on your suggestion, Tab. 4 in the PDF additionally provides results achieved using CosFace throughout. Specifically, we apply a model pre-trained with CosFace to train both our generator and the DCFace generator, and employ the same CosFace loss for their SFR model training. Due to the limited rebuttal time, we only include the results for the 0.5M setting, but will present results for all data volumes in the revision.
W11: We provide ROC curves for CASIA, DCFace and CemiFace at 3 data volumes in the PDF Fig. 2. The FR model trained on our CemiFace-generated dataset achieves the best curve (largest area), while the real face dataset provides inferior results to these SFR methods (CemiFace and DCFace). When FAR=1e-3, the TAR achieved by our self-implementation on the real dataset is around 90, which exceeds previous works (ArcFace gives around 60 in [a]). Importantly, ROC curves on IJB-B and IJB-C are typically provided neither in the related SFR studies [23,24,25,27] nor in many discriminative FR methods (e.g., CosFace[1], Face Transformer[b], BoundaryFace[c]) when using CASIA-WebFace for training.
[a]Federated Learning for Face Recognition with Gradient Correction
[b]Face Transformer for Recognition
[c]BoundaryFace: A mining framework with noise label self-correction for Face Recognition
Q1: We rephrase it as: We did not provide the AVG result for setting the similarity interval (Table 2) close to zero (continuous similarities), as this training setting leads our model to generate the same images when inputting different m values. We assume that an extremely small similarity interval prevents the model from effectively learning the differences between the levels of similarity.
Q2: It is a typo and should be [-1,1]
L1: Eq. 3 does not directly compare the input and reconstructed images. Instead, it compares the input noise $\epsilon$ and the estimated noise $\epsilon'$ (line 98). The overall loss in Eq. 13 adopts $\lambda$ to balance the diffusion loss, which compares input and estimated noises for generating clean images (Eq. 3), against the similarity between the generated samples and the inquiry image (Eq. 12). Given the following relationship between the estimated $\hat{x}_{t-1}$ and the real $x_{t-1}$, the process is symmetric and valid if the noise $\epsilon'$ is accurately estimated:
$$
\hat{x}_{t-1} \approx \frac{x_{t} - \sqrt{1 - \alpha_{t}}\,\epsilon'}{\sqrt{\alpha_{t}}} = x_{t-1}
$$
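For completeness, this relation follows from the standard single-step DDPM forward process (with $\alpha_t = 1 - \beta_t$):

$$
x_{t} = \sqrt{\alpha_{t}}\, x_{t-1} + \sqrt{1-\alpha_{t}}\,\epsilon
\;\;\Longrightarrow\;\;
x_{t-1} = \frac{x_{t} - \sqrt{1-\alpha_{t}}\,\epsilon}{\sqrt{\alpha_{t}}},
$$

so substituting the estimated noise $\epsilon'$ for $\epsilon$ yields $\hat{x}_{t-1}$.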
L2: Please refer to W1, where the produced images have an actual average similarity of 0.2854, making the generated dataset semi-hard. We will update this table in the revised version.
Dear Reviewer tciQ,
We are deeply grateful for the time and effort you spent reviewing our paper. We have carefully tried to address each of your questions based on your valuable feedback.
Could you please take a brief moment (approximately 2-3 minutes) to review our responses? We appreciate your feedback, regardless of whether our answers have addressed your primary concerns. We are willing to provide additional details if you need further explanation.
Best regards, The authors of Paper 11025
Thank you for your detailed answer.
I will increase my score to 5, as you correctly answered half of my major concerns. However, I am still not convinced by your answer to Weakness 10 and Limitation 1, so my score will not be higher.
Dear Reviewer tciQ,
We deeply appreciate your inspiring suggestions. We will try our best to improve the manuscript in the updated version according to your comments.
Best Regards, The Authors of Paper 11025
We thank all reviewers for their valuable feedback. Reviewers acknowledged that our: (i) method is new/innovative (tciQ, L3p9, 5xKY, Crof), interesting (tciQ), effective for SFR (fjPL, L3p9, 5xKY, Crof), and addresses privacy concerns (fjPL, L3p9, Crof); (ii) discovery is important (Crof); and (iii) experiments are solid (5xKY, Crof) and comprehensive (tciQ).
General response to all reviewers: we denote the Author Rebuttal pdf file as the PDF
G1: Improvement and explanation of the Fig. 3 (@tciQ& @fJPL& @L3p9& @5xKY):
We have updated the Fig. 3 in the PDF with more detailed elaboration. Besides, we will also modify and add the following contents to the beginning of the Sec. 3 as:
Methodology overview: As illustrated in the left side of Fig. 3, the training process starts by adding a noise $\epsilon$ to the clean input image using Eq. 4 (resulting in $x_t$). Meanwhile, the similarity condition $m$ is fed to the linear layer $F_1$, whose output $C_{sim}$ is then concatenated with the inquiry identity condition to generate a joint representation. This joint representation, including both identity and similarity conditions, is then processed by the linear projection layer $F_2$ to output the combined condition representation $C_{att}$ (the lower right part of Fig. 3).
The $C_{att}$ is further processed by a cross-attention operation with the intermediate latent representation of the diffusion UNet learned from the input noisy image, as:
$$
CA(Q,K,V,K_{c},V_{c}) = \mathrm{SoftMax}\left(\frac{QW_{q}\left([K,K_{c}]W_{k}\right)^{T}}{\sqrt{d}}\right)W_{v}[V,V_{c}]
$$
where $C_{att}$ is treated as the key $K_c$ and value $V_c$ (same as DCFace) to influence the generated face images, and $Q$, $K$, $V$ are the query, key and value representing the latent feature of the UNet. Consequently, the diffusion UNet outputs the estimated noise $\epsilon'$ for denoising the image into a clean estimated image $\hat{x}$ (Eq. 9). Based on the obtained estimated image $\hat{x}$, the original image $x$ and the condition $m$, the whole model is optimized by the loss defined in Eq. 13 in an end-to-end manner.
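A schematic training step following this overview (our own sketch under simplified assumptions; the loss weighting and exact equation forms differ in the paper) might look like:

```python
import torch

def cemiface_training_step(unet, cond_module, fr_embed, x0, t, m, alphas_cum):
    """One schematic training step: noise the inquiry image, build the
    condition from (m, identity embedding), predict the noise, recover the
    estimated clean image, and combine a diffusion loss with a similarity
    term. `fr_embed` is the fixed pretrained FR model."""
    eps = torch.randn_like(x0)
    a_t = alphas_cum[t].view(-1, 1, 1, 1)
    x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * eps            # forward noising (Eq. 4 style)
    c_att = cond_module(m, fr_embed(x0))                      # similarity + identity condition
    eps_hat = unet(x_t, t, c_att)                             # estimated noise
    x0_hat = (x_t - (1 - a_t).sqrt() * eps_hat) / a_t.sqrt()  # estimated clean image (Eq. 9 style)
    diff_loss = ((eps - eps_hat) ** 2).mean()                 # generic diffusion loss (Eq. 3 style)
    cos = torch.cosine_similarity(fr_embed(x0_hat), fr_embed(x0), dim=-1)
    sim_loss = ((cos - m) ** 2).mean()                        # pull similarity towards m
    return diff_loss + sim_loss                               # balanced by lambda in the paper
```

Note that the FR model is invoked three times per step here, consistent with the complexity discussion in the rebuttal above.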
At the inference stage (the upper right part of Fig. 3), random noise $x_T$ and the time step $t$ are first fed to a CD block, jointly with the similarity $m$ and the inquiry image, which undergo the linear layers $F_1$ and $F_2$ and concatenation to produce the condition representation $C_{att}$. This results in an estimated noise $\epsilon'$. Then, a denoising step is adopted to generate $x_{t-1}$ from $x_t$ using Eq. 12 in DDIM[21] for efficient inference speed. This process is repeatedly conducted on the obtained denoised latent images until $t=0$, where $x_0$ is treated as the final generated face image.
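A minimal sketch of this sampling loop (deterministic DDIM-style update with $\eta = 0$; the schedule handling and function signatures are our assumptions, not the paper's exact Eq. 12) is:

```python
import torch

@torch.no_grad()
def generate_face(unet, c_att, alphas_cum, T=50, img_shape=(3, 112, 112)):
    """Iteratively denoise Gaussian noise into a face image conditioned on
    c_att. `alphas_cum` is a tensor of cumulative alphas of length T + 1,
    with alphas_cum[0] close to 1."""
    x = torch.randn(1, *img_shape)                              # x_T: random Gaussian noise
    for t in reversed(range(1, T + 1)):
        eps = unet(x, torch.tensor([t]), c_att)                 # estimated noise
        a_t, a_prev = alphas_cum[t], alphas_cum[t - 1]
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean image
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps  # deterministic DDIM step
    return x                                                    # x_0: generated face
```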
Here, we assign the same identity label to all face images generated from the same inquiry image. To ensure high inter-class variation, our inquiry images are filtered by a pretrained FR model (IR-101 trained on the WebFace4M[11] dataset with AdaFace), which enforces that the similarity between each pair of inquiry images is lower than 0.3.
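A greedy sketch of this filtering (our own illustrative code; `embed` stands for the pretrained FR model, and the greedy strategy is an assumption, with the exact procedure following DCFace) is:

```python
import torch.nn.functional as F

def filter_inquiry_images(candidates, embed, thresh=0.3):
    """Keep a candidate inquiry image only if its cosine similarity to every
    already-kept image is below `thresh`, enforcing inter-class independence."""
    kept, kept_embs = [], []
    for img in candidates:
        e = F.normalize(embed(img), dim=-1)
        if all((e * k).sum() < thresh for k in kept_embs):
            kept.append(img)
            kept_embs.append(e)
    return kept
```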
Five experts in the field reviewed this paper. Their recommendations are 4 x Borderline Accepts and a Very Strong Accept. Overall, the reviewers appreciated the paper because it proposes an innovative diffusion-based generation method for face recognition that addresses privacy concerns and shows effectiveness in comprehensive experimental evaluations. The idea of controlling the similarity in synthetic face generation to generate semi-hard samples for training is interesting. Based on the reviewers’ feedback and the authors’ satisfactory rebuttal that addresses most concerns, I recommend it for acceptance. The reviewers raised some important issues and concerns in the Weaknesses that should be addressed in the final camera-ready version of the paper. In particular, the final version should significantly improve the overall structure and clarity of the paper (e.g., Figure 3). It should also provide an analysis of the computational complexity of the proposed method, and a more insightful analysis of why similarity control is beneficial. The authors are encouraged to make the necessary changes to the best of their ability. We congratulate the authors on the acceptance of their paper!