Average rating: 5.0 / 10 · Poster · 4 reviewers (min 3, max 6, std 1.2)
Individual ratings: 6, 3, 6, 5
Confidence: 3.3
Correctness: 2.5 · Contribution: 2.5 · Presentation: 2.5
NeurIPS 2024

FashionR2R: Texture-preserving Rendered-to-Real Image Translation with Diffusion Models

Submitted: 2024-05-14 · Updated: 2024-11-06

Abstract

Keywords

rendered-to-real, photo-realism enhancement, image-to-image translation, diffusion model

Reviews and Discussion

Official Review
Rating: 6

This paper proposes a diffusion-based method for enhancing the realism of fashion images generated through computer graphics pipelines. To this end, the paper introduces a Texture Preserving Attention Control for improving the appearance of the textures in garments in the generated images, and a domain-knowledge injection for improving the realism of the images. The paper presents SynFashion, a synthetic dataset for research on this topic, and provides an open-source implementation. Qualitative and quantitative analyses are provided.

Strengths

  • The paper tackles an important problem, which is that of enhancing the realism of images generated with computer graphics applications.
  • This paper presents a new dataset for research on this topic. The dataset looks very interesting, comprising a diverse set of textures and garments. I believe this dataset is a salient contribution of this work.
  • The proposed methods are sound, well motivated in the context of the application, and well presented.
  • Supplementary material provides additional insights on the results of these methods.
  • Code is provided which should enhance reproducibility.
  • Evaluation provides qualitative and quantitative results against some baselines, showing improved realism and preservation of the texture in the garments.

Weaknesses

  • Some of the contributions of this work are very tied to a specific application, which is that of generating images of textured garments. While the methods are sound, I believe this somewhat narrow application strongly limits the scope of this work. It is unclear how the proposed contributions, particularly the Texture-Preserving Attention Control could be leveraged for other applications.
  • The analysis of the related work is somewhat lacking. Please see "Questions" for a few suggestions.
  • The paper is sometimes lacking in clarity, being overly convoluted in its writing.
  • The proposed method is very computationally costly. While this is acknowledged by the authors, I think it is a significant limitation of this work. The long computational times make the method unsuitable for many applications, including interactive virtual try-on. Further, I believe that a more powerful computer graphics engine (e.g., Unreal Engine 5) could achieve better and faster results than the baselines utilized by this paper. In this regard, a state-of-the-art graphics pipeline could make the method in this paper less relevant, as diffusion-based approaches are slower and less controllable than traditional graphics software.
  • I believe the title "Make Fashion Real" is confusing. I understand what it tries to mean, but it is not clear and makes little sense.

Questions

  • Are there any applications for which the proposed modules could be used, that are not specific to garment generation?

Missing references:

  • SMPLitex: A Generative Model and Dataset for 3D Human Texture Estimation from Single Image (BMVC23)
  • Single-image Full-body Human Relighting (EGSR21)
  • Learning to relight portrait images via a virtual light stage and synthetic-to-real adaptation (TOG22)
  • Paint-it: Text-to-texture synthesis via deep convolutional texture map optimization and physically-based rendering (CVPR24)

Limitations

The paper is up-front about its limitations. I appreciate this part of the paper.

Author Response

We thank the reviewer for acknowledging the importance of the problem, the contribution of our dataset, the soundness of the method and our presentation, the effective evaluation, and the additional information in the code and supplementary material. The suggestions and questions also inspire us to improve our work. Below we address them in detail:

Some of the contributions of this work are very tied to a specific application. It is unclear how the proposed contributions, particularly the TAC could be leveraged for other applications. Are there any applications for which the proposed modules could be used, that are not specific to garment generation?

In the general response, we further elaborate on the scientific contributions of our work. In regard to other applications, our method can be applied to more general rendered-to-real translation problems. We choose fashion images because they combine highly articulated, complicated, structured, and meaningful content, and are also grounded in more direct commercial applications. In 3D games and animation, our method also has potential if the problem of computational cost is mitigated in the future, as the reviewer mentioned and as detailed in our discussion below. Besides, our method could also be applied to other synthetic-to-real image generation applications, such as collecting data for model training in autonomous driving or robotics.

The proposed TAC module enables training-free control during the diffusion generation process. This feature allows for its application to a broader range of tasks, including text-to-image and image-to-image generation, such as personalized generation and image editing. As model parameter counts continue to grow, training-free control can be utilized in an even wider variety of scenarios.

The analysis of the related work is somewhat lacking. Please see "Questions" for a few suggestions.

We thank the reviewer for suggesting these related works and will add discussion in our paper.

The proposed method is very computationally costly and makes it unsuitable for many applications, including interactive virtual try-on. A more powerful computer graphics engine (e.g., Unreal Engine 5) could achieve better and faster results than the baselines utilized by this paper, and could make the method in this paper less relevant.

We provide an analysis of computational cost compared to other SOTA methods in the rebuttal to Reviewer SbYx. Our method cannot handle real-time applications for now, but has potential for improvement with future integration of SD Turbo or SD Lightning. Note that compared to other diffusion-based methods such as VCT, our method requires much less memory and time during testing, as we do not need to perform additional optimization for each test image.

As for rendering baselines, we build the 3D projects with Style3D Studio and use its integrated, rasterization-based rendering tool. Using UE5 could potentially improve the rendering quality but would not diminish the effectiveness of our method. To verify this, we use a more advanced ray-tracing renderer (based on V-Ray) to obtain rendered images, and our method consistently demonstrates its advantages in realism. Two visual examples are shown in Figure 2 of the attached PDF. We do not conduct a large-scale evaluation due to lack of time.

The paper is sometimes lacking in clarity, being overly convoluted in its writing. The title "Make Fashion Real" is confusing.

We thank the reviewer for raising these issues and will refine our writing. We are considering changing "Make Fashion Real" to "FashionR2R" to better reflect the main focus of our paper.

Comment

Dear R-L3jv,

Could you check the authors' responses, and share your thoughts?

Thank you! Your AC

Official Review
Rating: 3

This paper proposes a method to translate synthetic renderings of fashion shots into photorealistic images. It is based on Stable Diffusion; SD is first fine-tuned on the target domain; then a negative guidance token for the rendered domain is learnt; then the image is translated by DDIM-inversion in original SD followed by reconstruction with the fine-tuned SD under the learnt negative guidance (i.e., encouraging the image to look 'less synthetic'). Results are presented on two datasets, FaceSynthetics and SynFashion; the latter is proposed in the present work. Quantitative and qualitative results show some improvement over four baselines.
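For concreteness, a minimal sketch of this inversion-plus-guided-reconstruction recipe is given below. Every helper and variable name here is a hypothetical placeholder rather than the authors' actual code, and the guidance combination assumes the standard negative-prompt (classifier-free guidance) form:

```python
import torch

@torch.no_grad()
def rendered_to_real(z0, base_unet, real_unet, scheduler, pos_emb, neg_domain_emb,
                     guidance_scale=7.5):
    """Hypothetical sketch: invert the rendered-image latent z0 with the original
    UNet, then reconstruct it with the real-domain fine-tuned UNet while steering
    away from the learned 'rendered domain' embedding."""
    # DDIM inversion with the base model (placeholder helper, not a real API).
    z = ddim_invert(base_unet, scheduler, z0, pos_emb)
    # Reverse process with the fine-tuned model under negative domain guidance.
    for t in scheduler.timesteps:
        eps_pos = real_unet(z, t, encoder_hidden_states=pos_emb).sample
        eps_neg = real_unet(z, t, encoder_hidden_states=neg_domain_emb).sample
        eps = eps_neg + guidance_scale * (eps_pos - eps_neg)  # assumed CFG-style form
        z = scheduler.step(eps, t, z).prev_sample
    return z  # decode with the VAE to obtain the translated image
```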

Strengths

  • The proposed pipeline is novel. The individual components are fairly standard (DDIM inversion; fine-tuning; dreambooth-style token optimisation), but combined in a sensible (and clearly motivated) way to solve the relevant task.

  • Curated qualitative results presented in the main paper are significantly better than the four baselines (CUT, SANTA, VCT, UNSB), with good adherence to textural details in garments and significantly more realism than the source renderings.

  • Quantitative metrics on two synthetic datasets (FaceSynthetics and the proposed SynFashion) are better than the four baselines. In particular, SSIM is higher (meaning details from the rendered images are preserved) while KID is lower (meaning the distribution of generated images is more realistic).

  • The paper is generally well-written, well-structured, and clear

Weaknesses

  • The paper feels rather insubstantial in terms of scientific contribution. It is not very surprising that the presented techniques can be combined into an effective pipeline, and there are not many insights for the reader. While it appears to be a sound piece of engineering, interest for the NeurIPS community is probably not so great.

  • Compounding the above point, the ablation study combines together under 'DKI' the two separate ideas of fine-tuning on the target domain, and learning a guidance signal 'away' from the source domain; this makes it unclear what their relative benefits are. Moreover, there are no quantitative ablation results, only two visual examples

  • Only SSIM and KID are given as metrics – why no other reconstruction metric (e.g. LPIPS, which is better correlated with human perception), and why not FID (which is still much more widely employed than KID)?

  • Measures of spread / statistical significance are not present, despite the authors answering 'yes' to the relevant checklist question. The lack of variance measures in the quantitative results makes it impossible to draw comparisons from them – particularly for SSIM, where the difference between the proposed method and the best baseline is exceedingly small. For KID, 7500 images is on the low side for robustness of the metric.

  • The method appears somewhat 'conservative', in the sense that it favors preserving appearance over enforcing realism. See for example the outputs in fig. 8, where the model makes only minimal changes to the renderings, failing to reach photorealism for most of the rows

Questions

  • The description of TAC (L211-226) is somewhat unclear; in particular it should be stated up-front where the two sets of self attention features are coming from. Are the CG-domain features from the DDIM inversion pass on the input image? And the R-domain those in the reverse process, which differ due to the negative guidance?

  • What are the standard deviations on the quantitative results?

  • What are the quantitative results from the ablations, including removing source & target DKI separately?

  • Figure 6 is not in fact an ablation experiment – it doesn't remove any component of the model to demonstrate importance; it merely shows the effect of varying certain hyperparameters. Please adjust the text/caption accordingly

  • Table 2 in App. A3 would be much easier to read if it were transposed. Also it is wrongly named in the text as "Table A3"

  • Reference to Table 3 at L279 is also broken

Limitations

There is adequate discussion of limitations, and a very brief discussion of broader impacts.

Author Response

We thank the reviewer for the recognition and encouragement regarding our novel pipeline, the significantly more realistic qualitative results, the effective quantitative results, and the generally well-written paper. We would also like to thank the reviewer for the rigorous analysis and suggestions that help us improve our paper. Below we address the concerns and questions in detail:

Scientific contribution, insights, and interests for the community

We believe this is a misunderstanding, perhaps due to a lack of elaboration in our paper. In the general response, we further explain our scientific contributions. We do not intend to compose an engineering pipeline that merely manipulates diffusion generation, but rather to study a fundamental research problem, which is not obviously achievable and could be interesting to the NeurIPS community.

Quantitative results from the ablations

We conduct an ablation study on both datasets in a drop-one-out manner. The quantitative results are in the tables below, and two visual examples are shown in Figure 2 of the attached PDF.

| FaceSynthetic Dataset | w/o source DKI | w/o target DKI | w/o TAC | Ours |
| --- | --- | --- | --- | --- |
| SSIM (std) | 0.8572 (0.0592) | 0.8447 (0.0634) | 0.7199 (0.0849) | 0.8314 (0.0679) |
| LPIPS (std) | 0.1072 (0.0285) | 0.1137 (0.0313) | 0.2529 (0.0703) | 0.1206 (0.0349) |
| KID (std) | 77.376 (2.063) | 78.927 (2.134) | 69.349 (1.485) | 73.831 (1.973) |

| SynFashion Dataset | w/o source DKI | w/o target DKI | w/o TAC | Ours |
| --- | --- | --- | --- | --- |
| SSIM (std) | 0.9033 (0.0654) | 0.8971 (0.0556) | 0.7937 (0.0741) | 0.8809 (0.0553) |
| LPIPS (std) | 0.0587 (0.0194) | 0.0635 (0.0217) | 0.1827 (0.0472) | 0.0671 (0.0251) |
| KID (std) | 58.520 (1.902) | 60.186 (1.623) | 51.392 (1.083) | 54.720 (1.362) |

Without source DKI (embedding), the model tends to simply reconstruct the input rendered image from the DDIM inversion noise.

Without target DKI (finetuning), the rendering effect slightly decreases, but the output is still not real enough due to the lack of concentrated knowledge of real humans and clothing. Lacking either component of DKI makes the output closer to the rendered input (higher SSIM and lower LPIPS), but much less real (higher KID).

Without TAC, the semantic structure such as face identity and clothing design can significantly deviate from the input. Although the output is clearly real (lower KID), it is almost completely different from the intended human and clothing (lower SSIM and higher LPIPS).

Other reconstruction metrics (e.g. LPIPS, FID), standard deviations

As discussed in our paper and agreed upon by Reviewer SbYx, defining and quantifying "realism" remains an open question. SSIM and KID roughly reflect the objective of enhancing photorealism while maintaining consistency. We follow the reviewer's suggestion and add LPIPS, which is closer to human perception, to evaluate texture preservation with respect to the rendered images. The results are shown in the table below. We follow [2] and adopt KID rather than FID, for KID's unbiased nature and lower sensitivity to sample size. Note that a good SSIM or LPIPS score does not guarantee better photorealism if the KID score is worse.
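For illustration, these per-image and distribution-level metrics are typically computed with common open-source packages (recent scikit-image for SSIM, the lpips package, and torch-fidelity for KID). The sketch below is not the authors' evaluation script, and the directory names are placeholders:

```python
import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity as ssim
from torch_fidelity import calculate_metrics

lpips_fn = lpips.LPIPS(net='alex')  # perceptual distance; lower is better

def to_tensor(img_uint8):
    """HWC uint8 in [0, 255] -> NCHW float in [-1, 1], as expected by lpips."""
    t = torch.from_numpy(img_uint8).permute(2, 0, 1).unsqueeze(0).float()
    return t / 127.5 - 1.0

def paired_scores(rendered_imgs, output_imgs):
    """Per-image SSIM/LPIPS between rendered inputs and translated outputs."""
    ssims, lpipss = [], []
    for r, o in zip(rendered_imgs, output_imgs):
        ssims.append(ssim(r, o, channel_axis=-1, data_range=255))
        with torch.no_grad():
            lpipss.append(lpips_fn(to_tensor(r), to_tensor(o)).item())
    return np.array(ssims), np.array(lpipss)

# Distribution-level realism: KID between translated outputs and real photos
# (directory names are placeholders).
kid = calculate_metrics(input1='outputs_dir', input2='real_photos_dir', kid=True,
                        kid_subset_size=1000)['kernel_inception_distance_mean']
```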

We'd also like to thank the reviewer for pointing out the missing statistical-significance information; we add the standard deviations of the results in the tables. The relatively low standard deviations demonstrate the robustness of our method.

Additionally, as suggested by Reviewer SbYx, we conduct user studies to augment the quantitative evaluation. Our results in terms of "real human perception" turn out to be overwhelmingly better than those of other methods. Please refer to the related rebuttal for more details.

| FaceSynthetic Dataset | CUT | SANTA | VCT | UNSB | Ours |
| --- | --- | --- | --- | --- | --- |
| SSIM (std) | 0.6635 (0.0786) | 0.6176 (0.1039) | 0.8072 (0.0723) | 0.8176 (0.0696) | 0.8314 (0.0679) |
| LPIPS (std) | 0.3649 (0.0725) | 0.3872 (0.0788) | 0.0957 (0.0270) | 0.2287 (0.0685) | 0.1206 (0.0349) |
| KID (std) | 80.553 (2.447) | 90.390 (2.929) | 74.445 (2.273) | 76.389 (2.465) | 73.831 (1.973) |

| SynFashion Dataset | CUT | SANTA | VCT | UNSB | Ours |
| --- | --- | --- | --- | --- | --- |
| SSIM (std) | 0.8467 (0.0670) | 0.7408 (0.0824) | 0.8070 (0.0846) | 0.8909 (0.0542) | 0.8809 (0.0553) |
| LPIPS (std) | 0.1699 (0.0603) | 0.2940 (0.0669) | 0.1783 (0.0577) | 0.1299 (0.0403) | 0.0671 (0.0251) |
| KID (std) | 59.238 (1.599) | 61.636 (1.628) | 59.489 (1.499) | 59.496 (1.453) | 54.720 (1.362) |

Preserving appearance or enforcing realism.

The results we present in the paper do place more emphasis on preserving appearance, under the assumption that the human model identity and clothing design should remain largely consistent with the input. However, the realism enhancement of our method could be more 'aggressive' given more tolerance for modifications to the input content. In Figure 6, we show an example of this trade-off under different hyper-parameters.

The description of TAC is somewhat unclear.

We will modify the description of TAC. The CG-domain self-attention features are derived from the reverse sampling process starting from the noisy latent, which is obtained by performing DDIM inversion on the input image latent. In contrast, the R-domain self-attention features differ due to the incorporation of negative domain guidance and the self-attention injection.
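A schematic of this dual-branch process, as we read it from the clarification above, might look like the following. The wrapper functions that expose and overwrite shallow-layer self-attention features are hypothetical stand-ins, and the exact TAC mechanism may differ in detail:

```python
import torch

@torch.no_grad()
def sample_with_tac(z_inverted, unet_with_cache, unet_with_injection, scheduler,
                    pos_emb, neg_domain_emb, tac_ratio=0.9, guidance_scale=7.5):
    """Hypothetical sketch of the dual-branch reverse process described above."""
    z_cg = z_inverted.clone()   # CG branch: plain reconstruction of the rendered input
    z_r = z_inverted.clone()    # R branch: realism-enhanced generation
    n = len(scheduler.timesteps)
    for step, t in enumerate(scheduler.timesteps):
        # CG branch: the placeholder wrapper also returns the shallow-layer
        # self-attention features produced at this timestep.
        eps_cg, shallow_feats = unet_with_cache(z_cg, t, cond=pos_emb)
        z_cg = scheduler.step(eps_cg, t, z_cg).prev_sample

        # R branch: inject the cached CG features for the first `tac_ratio` of steps,
        # and steer away from the learned rendered-domain embedding.
        inject = shallow_feats if step < int(tac_ratio * n) else None
        eps_pos = unet_with_injection(z_r, t, cond=pos_emb, injected=inject)
        eps_neg = unet_with_injection(z_r, t, cond=neg_domain_emb, injected=inject)
        eps_r = eps_neg + guidance_scale * (eps_pos - eps_neg)  # assumed guidance form
        z_r = scheduler.step(eps_r, t, z_r).prev_sample
    return z_r
```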

Figure, table captions and typos

We thank the reviewer for pointing out these issues and will carefully modify our paper.

[2] Richter, Stephan R et al. "Enhancing photorealism enhancement." IEEE Transactions on Pattern Analysis and Machine Intelligence 45.2 (2022)

Comment

I thank the authors for the detailed rebuttal. The additional ablation results certainly help to clarify the benefit of each component.

How are the standard deviations here calculated? Are they variance over training random seeds, or over test inputs for a fixed model, or something else? It would seem that the improvement versus UNSB is not statistically significant -- can you comment further on this?

Comment

Dear R-cE2m,

Could you check the authors' responses? Do you have further questions for the authors?

Thanks, Your AC

Comment

We thank the reviewer for the follow-up.

The standard deviations here show the variance over test inputs for a fixed model. For each method, the standard deviations of SSIM/LPIPS are calculated as $\sigma = \sqrt{\sum_i (x_i - \mu)^2 / N}$, where $x_i$ is the SSIM/LPIPS value for test input $i$, $N$ is the number of test input samples (7500), and $\mu$ is the corresponding mean of the metric over the $N$ test inputs. For KID, we adopt the public API torch-fidelity [1]. We do not conduct calculations over training random seeds due to lack of time, and have not noticed obvious instability over training seeds in our experiments.
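In code, this aggregation is simply the population standard deviation over the per-image scores (the input values below are toy numbers, not real results):

```python
import numpy as np

def mean_and_std(per_image_scores):
    """Population mean/std over per-image scores (e.g. 7500 SSIM or LPIPS values)."""
    x = np.asarray(per_image_scores, dtype=np.float64)
    return x.mean(), x.std()   # np.std uses ddof=0, matching the formula above

mu, sigma = mean_and_std([0.83, 0.85, 0.81, 0.84])   # toy values for illustration
print(f"{mu:.4f} ({sigma:.4f})")
```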

Additionally, we would like to kindly remind the reviewer that defining and quantifying "realism" is still an open question, which means the metrics do not accurately align with the objective. A better score on a stand-alone metric (SSIM/LPIPS for preservation of the rendered input, or KID for realism enhancement) does not always lead to an advantage in reaching the goal of the task, due to the trade-off between these two goals. Given only roughly aligned metrics, further statistical analysis might be a bit less informative than in other well-defined problems. Indeed, we have to admit that we cannot claim statistical significance in this work but only roughly demonstrate stability and generalization ability, which is probably why such information is missing in many generation/translation works. We thank the reviewer for noting this issue and will include a discussion in our paper.

In the statistical comparisons, our SSIM is indeed very close to UNSB's, but we do have a much lower LPIPS (0.1206 vs. 0.2287 for FaceSynthetics, 0.0671 vs. 0.1299 for SynFashion) and a much lower KID (73.831 vs. 76.389 for FaceSynthetics, 54.720 vs. 59.496 for SynFashion). As the reviewer suggested, LPIPS is better correlated with human perception than SSIM. The user studies in the response to Reviewer SbYx also show a significant statistical preference for our results over UNSB.

[1] High-fidelity performance metrics for generative models in PyTorch (Zenodo, 2020), github.com/toshas/torch-fidelity, Version 0.3.0, DOI: 10.5281/zenodo.4957738

Comment

Thank you once again for taking the time to review our response. As the discussion stage is coming to an end, we would like to ask if all the issues have been satisfactorily addressed. If there are any remaining concerns, we would be more than happy to discuss them further. If you find that your concerns have been resolved, we would greatly appreciate it if you could consider raising your score. Thanks!

If the reviewer is interested in further analysis between our work and UNSB, we believe one main advantage of our method is that we propose to control certain attributes concerning color and geometric information through self-attention injection with our TAC. This leverages the prior in the diffusion UNet structure and, to some extent, "disentangles" the gaps between the rendered-image and real-image domains. Note that the gaps between these two domains include not only fine-grained material properties and illumination, which are directly related to the "realism" we want to "enhance", but also other variables such as human poses and clothing styles that we want to maintain and preserve. This makes it difficult for general unpaired image-to-image translation methods to handle without specific designs, especially for higher-resolution images.

Official Review
Rating: 6

The paper presents an advancement in the field of rendered-to-real image translation, particularly for fashion images. Its novel approach, focus on texture preservation, and contribution of a new dataset make it a valuable addition to the literature.

Strengths

The Texture-preserving Attention Control (TAC) mechanism effectively maintains fine-grained details in clothing textures, which is crucial for realistic image generation.

Introduction of the SynFashion dataset fills a gap in the availability of high-quality rendered fashion images, facilitating further research in this area.

Extensive experimental results demonstrate the superiority of the proposed method over existing techniques, highlighting its effectiveness in achieving realistic image translation.

Weaknesses

The paper acknowledges that defining and quantifying "realism" remains an open question. More robust metrics or user studies could provide better insights into the perceived realism of the generated images.

Questions

Have you conducted any user studies to assess the perceived realism and quality of the generated images compared to existing methods?

Can you discuss the trade-offs involved in maintaining texture details versus achieving overall visual coherence in the generated images?

Limitations

The two-stage process involving Domain Knowledge Injection (DKI) and Realistic Image Generation (RIG) can be computationally intensive. Training and deploying the framework may require substantial computational resources, which could limit its applicability in real-time or resource-constrained environments.

Author Response

We appreciate the reviewer's positive comments and the acknowledgement of the TAC design of our method, the value of our SynFashion dataset, and the experimental results. The suggestions on user studies and further discussion are very helpful. Below we address them in detail:

The paper acknowledges that defining and quantifying "realism" remains an open question. More robust metrics or user studies could provide better insights into the perceived realism of the generated images.

Defining and quantifying "realism" is indeed an open question, given the nuanced change between the input rendered image and its realistic counterpart, which is unknown and non-unique. We compute SSIM and LPIPS (as suggested by Reviewer cE2m; please refer to the related rebuttal) to roughly evaluate texture preservation relative to the rendered images, and use KID to measure the distributional distance between the outputs and real photos. However, these metrics still cannot accurately reflect the objective of translation toward photorealism.

We thank the reviewer for suggesting user studies to provide better quantitative insights regarding perceived realism, image quality, and consistency with the input rendered images. We follow [1] StyleDiffusion (ICCV 2023) in style transfer and compare our method to previous works in pairs. Specifically, we sample 100 image pairs from each dataset; each pair consists of one image generated by our method and a corresponding image generated by another randomly chosen method, shown side by side in random order. The users are asked to choose (1) which result is more realistic, (2) which result has overall better image quality, and (3) which result has better consistency with the rendered image.

We obtain approximately 2,000 votes for each question from 20 users and show, in the tables below, the percentage of votes in which existing methods are preferred over ours. Lower numbers indicate that our method is preferred over the competitor. Our method achieves overwhelming preference in overall realism and image quality, as well as a clear advantage in consistency.

| FaceSynthetic Dataset | CUT | SANTA | VCT | UNSB |
| --- | --- | --- | --- | --- |
| Overall realism | 0.529% | 0.922% | 5.952% | 4.511% |
| Image quality | 0.529% | 1.383% | 14.286% | 6.767% |
| Consistency | 13.175% | 12.304% | 20.714% | 21.278% |

| SynFashion Dataset | CUT | SANTA | VCT | UNSB |
| --- | --- | --- | --- | --- |
| Overall realism | 8.994% | 3.333% | 2.041% | 9.821% |
| Image quality | 6.878% | 5.238% | 6.122% | 9.821% |
| Consistency | 16.931% | 11.571% | 18.367% | 26.607% |
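To make the numbers concrete: each cell is the share of pairwise votes in which the competing method was chosen over ours, so lower is better for our method. A toy tally (with made-up vote counts, not the real data) looks like this:

```python
# Hypothetical vote counts for one (dataset, question, baseline) cell.
votes_for_baseline = 7
votes_for_ours = 1893
total = votes_for_baseline + votes_for_ours
preference_for_baseline = 100.0 * votes_for_baseline / total
print(f"{preference_for_baseline:.3f}% of votes preferred the baseline")  # 0.368%
```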

Can you discuss the trade-offs involved in maintaining texture details versus achieving overall visual coherence in the generated images?

Maintaining certain parts of an image during generation, such as texture details, can lead to potential visual incoherence; one example is inpainting with diffusion models. In our work, the proposed TAC tackles this issue by resorting to attention injection from a dual diffusion branch. Empirically, we observe that the shallow layers of the diffusion UNet preserve texture details while deeper layers are more related to semantic information, and the self-attention injected from shallow layers blends smoothly with the other features in the generation branch. In Figure 6, we show the effect of different ratios of TAC control steps during diffusion generation. We can see that coherence in the generated images is not an obvious issue in our framework.

The two-stage process involving Domain Knowledge Injection (DKI) and Realistic Image Generation (RIG) can be computationally intensive. Training and deploying the framework may require substantial computational resources, which could limit its applicability in real-time or resource-constrained environments.

Computational cost and resource requirements are indeed relevant issues for our method, as for many diffusion-based methods. We thank the reviewer for mentioning this and will add it to our discussion of limitations and future work. Besides, we conduct an analysis of computational cost compared to other SOTA methods. We test the inference time and resource consumption for a 512x512 image on an RTX 3090, as shown in the table below. Note that compared to VCT, which is also diffusion-based, our method requires much less memory and time during testing, as we do not need to perform additional optimization for each test image.

| | CUT | SANTA | VCT | UNSB | Ours |
| --- | --- | --- | --- | --- | --- |
| Memory required (GB) | 3.3 | 4.5 | 22 | 7.4 | 7.7 |
| Testing time (s) | 0.38 | 0.33 | 62.47 | 0.53 | 7.98 |
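For reference, numbers like these can be obtained with standard PyTorch utilities; the generic profiling helper below is our sketch rather than the authors' script, where `run_once` is any callable that performs one 512x512 translation:

```python
import time
import torch

def profile_inference(run_once, device="cuda"):
    """Return wall-clock time (s) and peak GPU memory (GB) for one call of run_once."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)
    start = time.perf_counter()
    run_once()
    torch.cuda.synchronize(device)
    elapsed = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    return elapsed, peak_gb
```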

[1] Wang, Zhizhong, Lei Zhao, and Wei Xing. "Stylediffusion: Controllable disentangled style transfer via diffusion models." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

Comment

Thank you for the thorough response and for implementing the suggestions regarding user studies to evaluate the perceived realism, image quality, and consistency of your method. The approach taken, following StyleDiffusion and comparing against previous works, provides valuable quantitative insights into the strengths of the proposed method.

The discussion on the trade-offs between maintaining texture details and achieving overall visual coherence is insightful. The empirical findings regarding the preservation of texture details in shallow layers of the diffusion UNet, coupled with attention injection, are particularly interesting and contribute to understanding how these aspects are effectively balanced in your framework.

Regarding computational considerations, the analysis comparing memory consumption and inference times with other state-of-the-art methods on a 512x512 image provides a clear perspective on the efficiency of your approach, especially in comparison to similar diffusion-based techniques like VCT.

I appreciate the authors' diligent efforts and valuable contributions to the field. My rating of the paper will remain the same.

Comment

Thank you for taking the time to review our rebuttal. We appreciate your positive re-evaluation and look forward to integrating your insights into the final version of our paper.

Official Review
Rating: 5

This paper proposes a modified framework for generating realistic fashion photos. The framework enhances existing diffusion models through a knowledge-injection pre-training process and an attention control mechanism during the generation process. Additionally, the authors introduce a new dataset of high-resolution texture patterns for digital clothes. This dataset, combined with the improvements in texture handling, enables the new framework to perform better than other frameworks.

Strengths

  1. The authors created a high-resolution dataset with more than ten thousand digital cloth texture patterns, which fills a gap in relevant fields.
  2. Compared to other frameworks, this framework performs better in terms of clothing texture and facial naturalness for a given source image. It is an improvement in the application of AIGC to everyday life.
  3. The framework impressively transforms a clearly unrealistic image containing human faces into a plausible and realistic one.

Weaknesses

  1. The diffusion structure itself is not updated; the method relies on pre-training and post-processing.
  2. The approach reads more like an engineering application than an academic contribution.
  3. When translating images containing human faces and meaningful backgrounds, the framework seems to lose or even change the background.
  4. The structure and grammar of the paper could be further improved for better clarity and readability.

Questions

  1. There is a known problem with diffusion models where generated images exhibit significant randomness given the same segmentation mask. How does the authors' work address this problem?
  2. There is some confusion: in the introduction, for DKI pre-training, the authors state that they apply a text-to-image diffusion model; however, in the rest of the paper, the pre-trained model seems to be trained on images. How should this difference be understood?
  3. Does this framework perform differently on various texture patterns? For example, what is the outcome when a textual texture, such as a sentence, is given?

Limitations

The authors have addressed some limitations of their framework; however, it would be more comprehensive if they added an in-depth discussion of potential negative societal impacts and provided possible solutions.

Author Response

We thank the reviewer for recognizing our contributions in the proposed dataset and the "impressive" performance of our framework, as well as for the valuable feedback and suggestions. Below we address the concerns and questions in detail:

The diffusion structure itself is not updated; the method relies on pre-training and post-processing.

The main contribution of our proposed method is the novel transfer of domain knowledge with established T2I diffusion models, combined with specific content control through attention injection. The TAC module is an exploration of the diffusion structure and a slight modification of it to achieve fine-grained texture preservation. Our training and inference process is standard yet convenient, requiring no additional optimization or hand-crafted post-processing during testing.

The approach reads more like an engineering application than an academic contribution.

As discussed in the general response, our paper studies an important scientific problem, brings a new perspective, proposes a novel pipeline, establishes a comprehensive benchmark, raises some essential questions, has the potential to attract more resources, and can serve as an inspiration for future work.

The framework seems to lose or even change the background.

Changing image content is a well-known side-effect of image-to-image translation with generative models, especially for diffusion models.

In our paper, we propose the TAC module to mitigate this problem through attention injection from specific layers of the diffusion UNet. In particular, high-level semantic structure such as face identity and clothing structure, as well as low-level texture patterns, are largely preserved.

The background can consist of semantic layout structure and texture details, which lie in a similar distribution in image space and are thus preserved by our method, as shown in Figure 4.

In Figure 6, the background changes significantly when the denoising strength increases and the ratio of TAC control steps decreases; this is to show the effective range of the hyper-parameters and the trade-off between content preservation and realism enhancement.

The structure and grammar of the paper could be further improved for better clarity and readability.

We appreciate the suggestion and will modify our paper to improve its clarity and readability.

There is a known problem with diffusion models where generated images exhibit significant randomness given the same segmentation mask. How does the authors' work address this problem?

Given the same input, the randomness decreases when the denoising strength is lower and less noise is added to the input. However, to leverage the generative power for image translation, the denoising strength cannot be too low, as shown in Figure 4. In our work, we propose the TAC module to help preserve fine-grained structural image content, which decreases the related randomness while better exploiting the diffusion model's generative power in other layers. Empirically, we found that a combination of denoising strength 0.3 and TAC ratio 0.9 achieves the best balance.
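As a rough illustration of how these two hyper-parameters might enter a 50-step sampler, assuming the common img2img convention in which the denoising strength sets the fraction of the schedule that is actually inverted and re-denoised (this mapping is our assumption, not a statement of the authors' implementation):

```python
num_steps = 50                  # total scheduler steps (illustrative)
denoising_strength = 0.3        # reported sweet spot: less added noise, less randomness
tac_ratio = 0.9                 # fraction of denoising steps with TAC injection active

active_steps = int(denoising_strength * num_steps)   # 15 steps are inverted / re-denoised
tac_steps = int(tac_ratio * active_steps)            # injection on the first 13 of them
print(active_steps, tac_steps)                       # 15 13
```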

There is a confusion in model training.

In Introduction Line 41, we mention that the T2I diffusion model is trained on real photos with captions derived from BLIP. In Method Line 166, we note that the text embedding is also an input during model training. We will further clarify this in the paper and in Figure 1.

Does this framework perform differently on various texture patterns, such as a sentence?

Our framework robustly handles various patterns, including textual textures. The second row in Figure 3, the last row in Figure 8, and two additional examples in Figure 1 of the attached PDF provide visual examples.

Add some in-depth discussion of potential negative societal impacts and provide possible solutions.

We thank the reviewer for this suggestion and will add more discussion of potential negative societal impacts, including potential violation of portrait rights, racial bias in generation, NSFW content, etc. Possible solutions include, but are not limited to, using authorized, diverse, and balanced training data and training detection models to prevent inappropriate content generation.

Comment

Thanks for the careful rebuttal; the authors' feedback has addressed my concerns well. I think the importance of the contribution has been fully explained, and the authors have also acknowledged the flaws and problems of the method and task. So I am willing to increase my rating to 5.

Comment

Dear R-m8pA,

Could you take a look at the authors' responses, and share your thoughts? Do you have further questions for the authors?

Thanks, Your AC

Comment

Thank you for updating your score. We look forward to incorporating your thoughtful feedback into the final version of the paper.

Author Response

We thank all the reviewers for their valuable time and insightful comments, which help us improve our paper. In particular, we are encouraged by the positive feedback:

  1. "The paper tackles an important problem". (Reviewer L3jv)
  2. The proposed high-quality SynFashion dataset "fills a gap in relevant fields", "looks very interesting" and "is a salient contribution of this work", "facilitating further research in this area". (Reviewer m8pA, L3jv, SbYx)
  3. "The proposed methods are sound and well-motivated", "novel", with "fairly standard components but combined in a sensible way". The TAC mechanism is "effective and crucial". (Reviewer L3jv, cE2m, SbYx)
  4. "The framework impressively achieved transforming a clearly unrealistic image containing human faces into a plausible and realistic one". "Extensive experimental results demonstrate the superiority of the proposed method over existing techniques". "Curated qualitative results presented in the main paper are significantly than four baselines" and "quantitative metrics on two synthetic datasets are better". (Reviewer m8pA, SbYx, cE2m, L3jv)
  5. "The paper is generally well-written, well-structured, and clear". "The proposed methods are well-motivated and well-presented." (Reviewer cE2m, L3jv)

Additionally, we are grateful for and very much inspired by the suggestions and questions. The shared concerns are about the academic/scientific contribution of this work (Reviewers m8pA, cE2m) and more robust quantitative evaluation (Reviewers SbYx, cE2m). Below we address these issues in detail:

The academic/scientific contribution of this work.

  1. This paper studies an important scientific problem. Modeling and producing lifelike clothed human images has attracted researchers' attention from different areas for decades.

    · In computer graphics, researchers decompose and model the imaging process of a camera in the real physical world. The related variables include 3D geometry, material (texture, reflectance, elasticity, etc.), illumination, and camera parameters. All these variables are crucial and are fed into a ray-tracing algorithm to obtain the final image. To achieve photorealistic rendering, the variables and computation need to be as accurate as possible. However, this takes enormous time when the content is complicated, and may not yet be possible due to the limitations and somewhat plateaued ability of current techniques, especially for clothing.

    · From the perspective of machine learning, researchers directly model the distribution of clothed-human images. This is an implicit way to model the complicated variables and can leverage abundant knowledge through training on large datasets. With recent advancements in generative AI, learning-based methods have achieved impressive quality in generated human images compared to rendering-based methods. However, the well-known limitations of generative methods lie in generation diversity, stability, and, most importantly, controllability and editability.

    · This paper intends to build on the controlled basis of rendering and proposes to leverage the rich knowledge from abundant data and the generative power of learning-based methods to reach the goal of lifelike clothed-human generation. We believe this is a fundamental and difficult scientific problem that deserves researchers' attention.

  2. The proposed method raises some fundamental questions regarding current diffusion-based methods. Diffusion-based methods have demonstrated their power in various image generation and translation problems, such as style transfer and super-resolution, as discussed in our related work. However, it was previously unknown whether they are able to model the gaps between the variables in rendering and in the physical world. This is particularly difficult due to the high expectations on the realism of the output images and on the consistency at all levels of information with the rendering, given no paired data for training.

    · Our proposed DKI adopts a positive finetuning for generation and a negative embedding for guidance, which achieves realism enhancement to some extent. Our TAC further imposes texture and structure preservation without sacrificing too much realism.

    · We believe this is not only an effective and novel discovery, but it also raises some fundamental questions about diffusion models: What is the inherent knowledge encoded by the UNet attention layers? Can we disentangle these structures to explicitly model the gaps between the variables in rendering and in the physical world? These questions might be non-trivial to answer, but our work could serve as an inspiration for future research in this area.

  3. The proposed method is sound, the results are impressive and significant, and the dataset is a salient contribution, as acknowledged by the reviewers. We establish a comprehensive benchmark to help pave the way for future work to look deeper into the remaining problems.

  4. Our work is grounded in real-world applications, which are close to commercial impact and can potentially attract more resources to help develop this area. We believe this could also be beneficial to the NeurIPS community.

More robust quantitative evaluation.

As suggested by Reviewers SbYx and cE2m, we have added user studies and LPIPS to the quantitative evaluation. The detailed results are provided in the individual responses.

We also include experiments on textual textures as suggested by Reviewer m8pA, a more advanced rendering method as suggested by Reviewer L3jv, and a detailed visual ablation study on source DKI and target DKI as suggested by Reviewer cE2m; please refer to the attached PDF file.

As for other suggestions and questions, we answer them in the rebuttal to each reviewer individually. We sincerely look forward to further discussion and feedback, and thank you for your time in advance!

Comment

Dear authors and reviewers,

I have read the reviews and the rebuttal. My concerns regarding computational cost and analysis of related work have been partly addressed. However, some reviewers are unconvinced about the contribution in the paper. I am leaning positive about this submission provided the changes promised are included in the final version of the paper. However, I am curious about the thoughts of the other reviewers, particularly those that were less positive.

Best,

Comment

Thank you for taking the time to read our response and for being positive about our submission. The reviewers have updated their comments and we have engaged in discussion. If there are any further questions or additional suggestions, please feel free to share them.

Final Decision

After the rebuttal, most reviewers agree that the paper tackles an important problem and that the presented dataset will stimulate future research in this direction. In the meantime, the algorithm proposed in this paper is effective and well motivated, and extensive experiments are conducted to verify the proposed approach. The paper is well-written.

Agreeing with the reviewers, the AC accepts the paper and encourages the authors to include the discussion and results presented during the rebuttal stage in the final version.