ReF-LDM: A Latent Diffusion Model for Reference-based Face Image Restoration
Abstract
Reviews and Discussion
- This paper addresses face image restoration using multiple reference images, employing the CacheKV technique to utilize these images effectively. To overcome the issue of loss of appearance information in latent diffusion models, a timestep-scaled identity loss is proposed.
- The design choices for CacheKV are explained with experimental results in Section 5.2.1, and extensively discussed in Section 3.2.2.
- Additionally, the superiority of the proposed methodology is demonstrated through various experiments using datasets with different levels of degradation, such as FFHQ-Ref-Severe and FFHQ-Ref-Moderate.
Strengths
- The proposed framework effectively utilizes multiple reference images without information loss and provides a suitable framework for leveraging this information within latent diffusion models.
- Sufficient experiments validate each proposed method, and both quantitative and qualitative results support the claims.
Weaknesses
- The use of a pre-trained diffusion model redundantly during training and inference results in slower speeds and higher computational costs, posing an obstacle for applications such as video face restoration.
Questions
- How is the time embedding selected when extracting CacheKV from reference images?
- I wonder about the results of blind face image restoration with ReF-LDM when it is given zero reference images.
Limitations
Please refer to the weakness section.
Q4-1. The redundant use of the diffusion model will cause slower speed and more computation, limiting applications like video face restoration.
During inference, the CacheKV mechanism pre-computes and saves the keys and values (KVs) for each reference image with only one pass of the diffusion network; the saved reference KVs are then reused in every step of the multi-step diffusion denoising process. In the same way, for video face restoration, we can pre-compute the CacheKV once and reuse it for all frames of a video without redundant computation.
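To make the reuse pattern concrete, here is a minimal sketch in PyTorch-style pseudocode. The names (`precompute_cachekv`, `save_kv_into`, `cached_kv`, the `vae`/`unet`/`scheduler` interfaces) are our own illustrative stand-ins, not the actual ReF-LDM implementation:

```python
import torch

@torch.no_grad()
def precompute_cachekv(unet, vae, ref_images, t0_emb):
    """One U-net pass per reference image; hooks save each self-attention
    layer's keys/values (the layer outputs themselves are discarded)."""
    cache = {}  # layer name -> list of (K, V) tensors, one entry per reference
    for ref in ref_images:
        z_ref = vae.encode(ref)                  # clean reference latent
        unet(z_ref, t0_emb, save_kv_into=cache)  # hypothetical hook argument
    return cache

@torch.no_grad()
def restore_video(unet, vae, scheduler, lq_frames, ref_images, t0_emb):
    # References are encoded exactly once; the same cache is then reused
    # for every denoising step of every frame.
    cache = precompute_cachekv(unet, vae, ref_images, t0_emb)
    restored = []
    for lq in lq_frames:
        z = torch.randn_like(vae.encode(lq))     # start from noise
        for t in scheduler.timesteps:            # multi-step denoising
            eps = unet(z, t, lq_cond=lq, cached_kv=cache)
            z = scheduler.step(eps, t, z)        # schematic update rule
        restored.append(vae.decode(z))
    return restored
```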
Q4-2. How is the time embedding selected when extracting CacheKV from reference images?
We choose the time embedding of timestep 0 (the same one used by the main diffusion U-net) when extracting CacheKV. Our motivation is that the reference images are clean, so we use the embedding of the last and cleanest diffusion timestep, t = 0. We also experimented with a learnable time embedding, but the results were similar to using that of timestep 0, so we stick to the original choice. The motivation for using timestep 0 is mentioned in paper L93-94; we will explain it more clearly in the revision.
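For concreteness, a two-line sketch of this choice, assuming a diffusers-style U-net with `time_proj`/`time_embedding` modules (the names are illustrative, not the paper's actual code):

```python
import torch

t0 = torch.zeros(batch_size, dtype=torch.long)    # timestep 0 for every reference image
t0_emb = unet.time_embedding(unet.time_proj(t0))  # same embedding modules as the main diffusion pass
```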
Q4-3. The results with no reference image.
We show the outputs of ReF-LDM without reference images in PDF figure 5, which suggests that ReF-LDM can restore an LQ image to a reasonable HQ image even with no reference. The reason may be twofold:
- our denoising U-net has been pretrained with only the LQ condition (paper L218-220)
- we drop the reference condition during training with probability 0.1 (paper L419).
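As a side note, a minimal sketch of this reference-dropout step during training (our own illustrative naming; analogous to condition dropout in classifier-free guidance):

```python
import random

def maybe_drop_references(ref_images, p_drop=0.1):
    # With probability 0.1, train on the LQ condition alone, so the model
    # also learns to restore without any reference images.
    return [] if random.random() < p_drop else ref_images
```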
Thank you for the more detailed information. I will keep my score.
To enhance the performance of blind face image restoration, this paper proposes a reference-based method with a mechanism named CacheKV, which adapts an LDM to restore a low-quality (LQ) image conditioned on several high-quality (HQ) reference images. As no dataset is available for reference-based image restoration, the authors construct a reference-based dataset, FFHQ-Ref, specifically tailored for this task.
Strengths
- The authors collected a new dataset, which could facilitate related research topics.
- The paper proposes CacheKV, which effectively leverages the information provided by the reference images, significantly improving the performance of blind face image restoration.
- The proposed method performs better than other reference-based methods.
Weaknesses
- The paper does not explicitly explain how CacheKV works.
- How does it use varying numbers of reference images when calculating QKV?
- How does CacheKV provide a more aligned feature space compared to cross-attention?
- There are many GAN-based blind face image restoration papers (reference-based and non-reference-based); however, the authors did not compare with them.
Questions
What will happen if we use a face image from another identity as the reference image?
Limitations
See the weakness section.
Q3-1. Explain how CacheKV works.
We include an illustration in PDF figure 3 (a simplified version of paper figure 2) and explain as follows.
During inference,
- We first extract the CacheKV from the reference images. Specifically, for each reference image, we encode the image into a latent with the VAE-encoder. Then, we pass the latent through the same U-net used by the main diffusion process, where we save the reference-image keys and values of each self-attention layer.
- In the main diffusion process, the saved reference-image keys and values (i.e., CacheKV) can be accessed at each denoising timestep t. Specifically, in each self-attention layer of the main U-net, we concatenate the reference-image keys (of the corresponding attention layer) with the main keys along the token dimension, and the same goes for the values. By doing so, the main U-net can utilize the keys and values of the reference images while denoising.
We appreciate your question and will refine the description (paper L90-105) in the revision.
Q3-2. How does CacheKV use varying numbers of reference images when calculating QKV?
We separately pass each reference image through the VAE-encoder and the U-net. For each attention layer in the U-net, we append and save the keys/values from all the reference images along the token dimension, so the cached keys/values simply grow from one reference's tokens to the concatenation of all references' tokens. Similarly, in the main diffusion process, we concatenate the reference keys/values with the main U-net keys/values along the token dimension. We appreciate your question and will explain this more clearly in the revision; a minimal sketch of the concatenation follows.
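The sketch below is a single-head simplification with our own illustrative naming (`to_q`, `to_k`, `to_v` are the layer's linear projections; the actual implementation uses multi-head attention):

```python
import torch
import torch.nn.functional as F

def self_attention_with_cachekv(x, to_q, to_k, to_v, cached_kvs=None):
    """One self-attention layer of the main U-net (single-head sketch).

    x:          (B, L, C) main-branch tokens at the current denoising step
    cached_kvs: list of (K_ref, V_ref) pairs, each (B, L_ref, C), one pair per
                reference image, precomputed by the single CacheKV pass
    """
    q, k, v = to_q(x), to_k(x), to_v(x)
    if cached_kvs:
        # Any number of references: concatenate along the token dimension.
        k = torch.cat([k] + [k_ref for k_ref, _ in cached_kvs], dim=1)
        v = torch.cat([v] + [v_ref for _, v_ref in cached_kvs], dim=1)
    # Queries come only from the main branch, so the output stays (B, L, C).
    return F.scaled_dot_product_attention(q, k, v)
```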
Q3-3. Why does CacheKV provide a more aligned feature space than the ablation setting CrossAttention?
We include an illustration of CacheKV and CrossAttention in PDF figure 3.
- CacheKV uses the same U-net as the main diffusion process. So in the main diffusion process, each attention layer can access the reference-image keys and values of the corresponding attention layer (which is an aligned feature space).
- CrossAttention only encodes reference images with the VAE-encoder and passes the same reference-image latents to all cross-attention layers. The VAE's latent space is less aligned with the U-net of the main diffusion process.
We appreciate your question. We will include the illustrations and refine the explanation in the revision.
Q3-4. No comparison with GAN-based papers.
We have compared with some recent GAN-based BFR methods in Table 6, where DMDNet, DAEFR, and VQFR use a discriminator and GAN loss, and VQFR and CodeFormer use a pretrained VQGAN. We did not run some earlier GAN-based models on our benchmarks (e.g., 2021 GFP-GAN, 2020 ASFNet), as the papers we have compared with (e.g., 2024 DAEFR, 2022 CodeFormer, 2022 DMDNet) outperformed these earlier methods. If the reviewer suggests specific papers to compare with, we will try to provide additional comparisons during the reviewer-author discussion period.
Q3-5. What will happen if we use reference images from another identity?
We include example results of using input reference images from two different people in PDF figure 4. ReF-LDM outputs reasonable high-quality images based on the LQ image, with facial features affected by the reference identity (e.g., wrinkles). We appreciate the reviewer's interesting question and will include related results in the revision.
Dear reviewer ZeBa,
We appreciate your valuable review. We hope our responses address your concerns. Please let us know if you still have any further questions. Thank you.
Best,
Authors
- This paper proposed ReF-LDM for reference-based face image restoration with CacheKV to utilize the input reference images.
- This paper introduced a timestep-scaled identity loss, which considers the characteristics of diffusion models and helps ReF-LDM better learn the discriminating features of human identities.
- This paper constructed FFHQ-Ref, a dataset comprising 20,405 high-quality face images and their corresponding reference images, to facilitate the advance of reference-based face image restoration.
Strengths
- This paper proposed ReF-LDM, which incorporates the CacheKV mechanism and the timestep-scaled identity loss, to effectively utilize multiple reference images for face restoration.
- This paper constructed the FFHQ-Ref dataset, which surpasses the existing dataset in both quantity and quality, to facilitate the research in reference-based face restoration.
- Evaluation results demonstrate that ReF-LDM achieves superior performance in face identity similarity over state-of-the-art methods.
Weaknesses
- The idea of replacing the keys/values in self-attention layers with those of reference images has been widely used in style-transfer work. This work follows the same scheme, extracting style features (keys/values) from multiple reference images and inserting them during denoising. However, from the results, it seems that the style of the reference images has little effect on the generation. I wonder about the reason behind this.
- At the first stage of training, i.e., training an LDM with only the LQ condition, which loss function is adopted?
- Since reference images aim to provide ID information during restoration, why not just inject face embeddings of multiple HQ faces as a condition? What are the advantages of the proposed method over the above strategy?
- For the ablation study of the timestep-dependent identity loss, have the authors tried to apply such a loss after T steps?
- The proposed dataset does not contain new data, which is basically a re-labeling and cleaning work on FFHQ. Overall, the novelty is not significant.
Questions
Please refer to the weakness section and try to answer these five questions during rebuttal.
Limitations
Limitations are discussed in this paper.
When the face region is occluded by other objects, the proposed model may generate artifacts. For certain face poses (e.g., side faces), the reconstructed eyes may appear unnatural.
Q2-1. Different impacts of reference images in comparison to style transfer works.
The reference images affect model outputs differently because face image restoration and style transfer have different objectives. In our task, the model output is an HQ image, and the model inputs are reference images and an LQ image (which is synthesized by applying degradation to the HQ image). Therefore, our model is encouraged, by the LDM loss and the timestep-scaled identity loss, to follow the image layout and style of the input LQ image and to integrate the distinctive facial features (e.g., eye shape) of the input references.
Q2-2. Loss function used to pretrain LDM with only LQ input.
We use only the typical LDM loss (without the proposed timestep-scaled identity loss) to pretrain LDM. We will state this clearly in the revision.
Q2-3. Why not use face ID embeddings of reference images as network condition?
We conducted an additional ablation experiment that uses a pretrained ArcFace to encode the input reference images. The example images (in PDF figure 2) and the lower face similarity metric (IDS↑ CacheKV 0.65 vs. ArcFace 0.24) suggest that it is more challenging for ReF-LDM to integrate the reference identity information with ArcFace than with CacheKV. This difficulty may arise because the ArcFace embedding space is very different from that of the main diffusion U-net. In contrast, CacheKV uses the same diffusion U-net to provide reference-image keys and values to the corresponding attention layers, making them easier for the main diffusion U-net to integrate. Another intuitive explanation is that ArcFace is trained to distinguish different identities; thus, it is a good discriminator and provides effective supervision as the identity loss. However, ArcFace is not trained to generate a face image, so it may not contain enough appearance information for the restoration task.
Q2-4. Ablation study of timestep-scaled identity loss, apply loss after T steps.
In a training iteration of LDM-based models, we typically sample a timestep t, pass the input through the denoising network once, and then compute the loss. To compute the identity loss on a multi-step prediction, we would need to pass an input through the denoising network recurrently, for up to T passes. This would greatly increase the required computational time and GPU memory, so we did not experiment with such a setting.
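To make the conventional one-step computation concrete, here is a minimal sketch combining it with our timestep-dependent scale. It assumes an ε-prediction LDM; `arcface` is assumed to return normalized identity embeddings, and names such as `cached_kv` are illustrative rather than the paper's actual code:

```python
import torch

def timestep_scaled_id_loss(unet, vae, arcface, z0, t, alphas_cumprod,
                            lq, cached_kv, id_emb_hq):
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(z0)
    z_t = abar.sqrt() * z0 + (1 - abar).sqrt() * noise          # forward diffusion
    eps_hat = unet(z_t, t, lq_cond=lq, cached_kv=cached_kv)     # single U-net pass
    z0_hat = (z_t - (1 - abar).sqrt() * eps_hat) / abar.sqrt()  # one-step x0 estimate
    x0_hat = vae.decode(z0_hat)
    cos = torch.cosine_similarity(arcface(x0_hat), id_emb_hq, dim=-1)
    # sqrt(alpha_bar_t) downweights the loss when the one-step estimate
    # comes from a large, noisy timestep.
    return (abar.sqrt().flatten() * (1.0 - cos)).mean()
```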
Q2-5. The proposed FFHQ-Ref dataset contains no new images.
While FFHQ-Ref is indeed based on the images collected by FFHQ dataset, we believe its value lies in the boost of image quality and quantity over previous reference-based dataset (compared in paper Table 1). In addition, the careful curation and organization for the task of reference-based restoration (i.e., ID-based data splitting and FPS-sorted reference images) also enable fair and practical evaluation for future works in this field.
Dear reviewer 5Xiy,
We appreciate your valuable review. We hope our responses address your concerns. Please let us know if you still have any further questions. Thank you.
Best,
Authors
I have read the comments from other reviewers and the rebuttal from the authors. The answers to Q1, Q2, and Q3 resolved my concerns. Thanks for providing the figures (Q3) in the pdf. The answers to Q4 and Q5 are not very satisfying. I raised the rating from 4: Borderline reject to 5: Borderline accept.
Thank you for the feedback.
Regarding Q4 (ablation study of timestep-scaled identity loss, apply loss after T steps), we would like to provide a further response:
Although computing identity losses on multi-step predictions might bring further improvement, it requires considerably more GPU memory for storing the extra gradients caused by multiple U-net passes. In contrast, our solution of applying a timestep-dependent loss scale requires no extra computational resources.
We attempted to run the experiment using four-step predictions but encountered an out-of-memory issue on our currently available 3090 GPUs. We will try to use GPUs with larger memory or a smaller batch size to compare the multi-step ablation setting with the conventional one-step setting in the revision.
This paper investigates a reference-based face restoration model built on a latent diffusion model. It proposes the ReF-LDM method, which involves a CacheKV module and a timestep-scaled identity loss, and constructs an FFHQ-Ref dataset consisting of 20,405 HQ face images with corresponding reference images. Experiments are conducted for evaluation.
Strengths
- A reference-based face restoration model built on a latent diffusion model.
- An FFHQ-Ref dataset consisting of 20,405 HQ face images with corresponding reference images.
- A CacheKV module and a timestep-scaled identity loss.
Weaknesses
- While the task of reference-based face restoration is interesting, from Table 6 the improvement of ReF-LDM over no-reference methods is limited. This issue may be alleviated by further improving the fusion mechanism or introducing more plausible metrics.
- In the ablation study (Table 3), the effect of the timestep-scaled identity loss is given. However, more experiments are required to assess the results of other possible variants of the timestep-scaled identity loss.
- From the visual results, it seems that ReF-LDM may change the illumination of the degraded images (see Figs. 8 and 10).
Questions
One major concern is that the novelty seems limited. For example, the authors are encouraged to discuss CacheKV in relation to MasaCtrl [1], and the timestep-scaled identity loss in relation to the identity loss used in customized generation [2]. [2] is a concurrent work and can also be ignored.
[1] MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing, arXiv 2023.
[2] PuLID: Pure and Lightning ID Customization via Contrastive Alignment, arXiv 2024.
For the other questions, please refer to the weakness.
Limitations
Yes. The authors have addressed the limitations.
Q1-1. Limited improvement over no-reference methods in Table 6.
ReF-LDM achieves a significant boost in the face similarity metric (IDS↑ 0.675 vs. 0.323 for the best no-reference method, CodeFormer). As suggested by the reviewer, we evaluated an additional metric, FID (with the FFHQ 70k images as the target distribution), and found that ReF-LDM has better FID, especially for severely degraded LQ images (FID↓ 37.4 vs. 49.1 for the best no-reference method, DAEFR). We appreciate the reviewer's suggestion and will include the FID evaluation in the revision.
| FID-FFHQ↓ | FFHQ-Ref-Severe | FFHQ-Ref-Moderate |
|---|---|---|
| CodeFormer | 51.5 | 38.8 |
| VQFR | 53.0 | 36.8 |
| DAEFR | 49.1 | 33.9 |
| ReF-LDM (ours) | 37.4 | 32.9 |
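As an aside, one possible way to compute such FID scores, assuming the restored outputs and the FFHQ 70k images are saved as image folders and using the clean-fid package (the paths are illustrative):

```python
from cleanfid import fid

# FID between restored outputs and the FFHQ 70k target distribution.
score = fid.compute_fid("outputs/ffhq_ref_severe", "data/ffhq70k")
print(f"FID-FFHQ: {score:.1f}")
```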
We agree that we do not have the best LPIPS scores and that additional fusion modules (like CodeFormer's CFT or VQFR's dual decoders) may further improve LPIPS; in this work, however, we focus on a simple network design that integrates an input LQ image and multiple reference images into an LDM-based model.
Q1-2. More ablation studies of variants of timestep-scaled identity loss.
We conducted an additional experiment with a variant setting (timestep-dependent scale = 1 when t < 500, otherwise 0), which shares the same motivation as the default setting: downscale the identity loss when a larger, noisier timestep t is sampled. The example images (in PDF figure 1) and the metrics in the table below suggest that this variant also works, but the default √ᾱₜ scale yields better identity similarity and quality. We appreciate the reviewer's suggestion and will train and include more variants of timestep-dependent scales in the revision; a sketch of the scale variants follows the table.
| scale setting | IDS↑ | NIQE↓ |
|---|---|---|
| no ID loss | 0.52 | 4.56 |
| constant | 0.69 | 6.56 |
| default √ᾱₜ | 0.65 | 4.38 |
| variant t<500 | 0.61 | 4.43 |
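For clarity, the scale settings compared above can be sketched as follows (illustrative code; `alphas_cumprod` denotes the standard cumulative noise schedule ᾱ, and `t` is a batch of sampled timesteps):

```python
import torch

def id_loss_scale(t, alphas_cumprod, variant="default"):
    if variant == "default":
        return alphas_cumprod[t].sqrt()           # sqrt(alpha_bar_t): smooth decay with noise level
    if variant == "t<500":
        return (t < 500).float()                  # hard cutoff: loss only at less-noisy timesteps
    return torch.ones_like(t, dtype=torch.float)  # "constant": plain, unscaled identity loss
```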
Q1-3. Illumination Changes.
Thank you for pointing this out. We checked the input reference images used in the examples where the illumination changed. In the case of Figure 8, the reference images have strong orange lighting, and our output has a warmer illumination. Therefore, the cause may be the influence of the reference images. We will include case studies and discuss possible future work (such as an illumination-preservation loss or a different model architecture design) in the revision.
Q1-4. Our CacheKV vs. related work MasaCtrl.
MasaCtrl proposes a training-free method to exploit StableDiffusion for reference-based text-to-image synthesis. It also utilizes the denoising U-net to extract keys and values (KVs) from a reference image (termed the source image in their paper). However, we differ in several aspects:
- CacheKV passes the references through the U-net only once and reuses the saved KVs in each main diffusion denoising step without redundant computation, while MasaCtrl passes the reference through the U-net for every diffusion step.
- We have an ablation experiment, "spatial-concatenation", that is closer to MasaCtrl, as it also passes references through the U-net for each step (paper L125-135). As shown in paper Table 2, our CacheKV (without the redundant U-net passes) is more efficient, especially when there are multiple reference images.
- Our model implicitly learns how to naturally integrate the references' facial features into the LQ image's layout, while MasaCtrl explicitly uses object masks (computed from the text condition's cross-attention scores, e.g., "dog") to prevent access to irrelevant reference features and messy results.
We appreciate the suggestion to discuss the related work MasaCtrl and will include it in the revision.
Q1-5. Our timestep-scaled identity loss vs. concurrent work PuLID (Arxiv 2024).
The concurrent work PuLID shares the same motivation as ours: avoid applying the identity loss to a flawed one-step output from a larger, noisier timestep t. PuLID exploits a few-step distilled diffusion model and runs four denoising steps to obtain a better output for the identity loss, adopting the off-the-shelf SDXL-Lightning for its SDXL-based model. However, for a non-SD-based work (like our ReF-LDM), PuLID's solution would require an additional few-step diffusion distillation training process. In contrast, we propose a simple and effective solution by applying a timestep-dependent scale to the identity loss.
The rebuttal has addressed my major concerns. I have raised my score to Borderline accept.
Response to all reviewers and area chairs: a brief summary
We sincerely thank all reviewers for their careful reading and insightful comments. We are encouraged by the positive feedback and recognition of our work's key strengths noted by the reviewers:
- A new reference-based face restoration model based on a latent diffusion model. [E2Jp, 5Xiy, G6n4]
- CacheKV mechanism, effectively integrating multiple reference images. [E2Jp, ZeBa]
- An effective timestep-scaled identity loss. [E2Jp, 5Xiy]
- New FFHQ-Ref dataset, surpassing existing reference-based datasets. [E2Jp, 5Xiy, ZeBa]
- Superior performance in face identity similarity over previous methods. [5Xiy, ZeBa]
- Experiments to validate each proposed component. [G6n4]
We have responded to each reviewer individually to address their comments. Below is a brief summary.
- Improvements: We address the limited improvement over no-reference methods by highlighting the superior IDS and presenting an additional FID metric that shows better performance.
- Ablation studies: We conduct additional experiments on a variant of the timestep-scaled identity loss and demonstrate CacheKV's advantage over face embeddings.
- CacheKV explanation: We provide a more explicit explanation of how CacheKV works and its advantages over cross-attention.
- Comparison to related work: We clarify the differences between our designs and those of MasaCtrl and PuLID.
- Dataset value: We defend the value of FFHQ-Ref, despite it not containing new images, by highlighting its improved quality and organization for reference-based face tasks.
- Additional experiments: We present results of using reference images from different identities and in no-reference scenarios.
- Computational efficiency: We explain how CacheKV can be optimized for applications like video face restoration.
Again, we thank all reviewers and area chairs!
Best,
Authors
This paper proposes a latent diffusion model, named ReF-LDM, for reference-based face image restoration. The authors design a CacheKV mechanism to effectively and efficiently integrate information from multiple reference images. They also propose a timestep-scaled identity loss to better preserve identity in the restored image. Finally, they construct a new dataset based on the FFHQ dataset for reference-based face restoration.
The authors have clearly explained the working principle of their CacheKV mechanism and its difference from other similar designs. They have also included a further ablation study for their timestep-scaled identity loss. These have resolved most of the concerns raised by the reviewers. The major remaining concern is that the overall novelty of this work seems minor.