PaperHub

Overall rating: 6.0 / 10 (Poster · 3 reviewers · min 6, max 6, std 0.0)
Individual ratings: 6, 6, 6
Confidence: 3.7 · Correctness: 2.7 · Contribution: 2.7 · Presentation: 2.3

ICLR 2025

InstantSwap: Fast Customized Concept Swapping across Sharp Shape Differences

Submitted: 2024-09-23 · Updated: 2025-02-22
TL;DR

We propose InstantSwap, a novel training-free customized concept swapping framework, which enables efficient concept swapping across sharp shape differences.

Abstract

Recent advances in Customized Concept Swapping (CCS) enable a text-to-image model to swap a concept in the source image with a customized target concept. However, existing methods still face the challenges of inconsistency and inefficiency. They struggle to maintain consistency in both the foreground and background during concept swapping, especially when the shape difference between objects is large. Additionally, they either require time-consuming training processes or involve redundant calculations during inference. To tackle these issues, we introduce InstantSwap, a new CCS method that aims to handle sharp shape disparities at speed. Specifically, we first extract the bounding box (bbox) of the object in the source image automatically based on attention map analysis and leverage the bbox to achieve both foreground and background consistency. For background consistency, we remove the gradient outside the bbox during the swapping process so that the background is free from modification. For foreground consistency, we employ a cross-attention mechanism to inject semantic information into both source and target concepts inside the box. This helps learn semantic-enhanced representations that encourage the swapping process to focus on the foreground objects. To improve swapping speed, we avoid computing gradients at each timestep and instead calculate them periodically to reduce the number of forward passes, which substantially improves efficiency with only a slight sacrifice in performance. Finally, we establish a benchmark dataset to facilitate comprehensive evaluation. Extensive evaluations demonstrate the superiority and versatility of InstantSwap.
Keywords
Text-to-image generation · Image editing · Customized concept swapping

Reviews and Discussion

Review (Rating: 6)

This paper proposes a training-free customized concept swapping framework. It derives bounding boxes from attention maps to help preserve background information while optimizing the latents. The gradient is updated periodically for a better trade-off between quality and inference time. A semantic-enhanced module is further proposed to improve foreground consistency. Both quantitative and qualitative experiments are conducted to validate the effectiveness of this approach.

Strengths

  • The experimental results are comprehensive and promising.

  • A benchmark dataset designed for CCS task is proposed.

  • The writing is fluent and easy to understand.

Weaknesses

  • It would be better to give an introduction to the customization methods. This would also help readers understand the difference brought by integrating a customization method into image editing.

  • Since this approach needs to modify the cross-attention, I wonder if it could be applied to DiT-based architectures, like SD3.

  • It seems that the semantic-enhanced operation enhances the semantics of the source/target object while mitigating the object’s interaction with surrounding objects and the background. Will it make the image unnatural?

  • I am curious why P2P fails completely at this task. From the results displayed in the P2P paper, I would expect P2P to be able to fulfill the task of changing objects. Is it because the customization method does not fit well with P2P?

Questions

Please see the questions in the weaknesses section.

Comment

We sincerely thank the reviewer for their insightful comments and recognition of our work. We greatly appreciate the acknowledgment of our comprehensive and promising experimental results, fluent writing, and the benchmark we constructed. We have refined the paper, incorporated additional experiments, and clarified the following points in the revised version. The changes relevant to your concerns are marked in brown and we will remove these colors in the final version.

W1: Detailed introduction on customization methods

Thanks for pointing this out. We have added a detailed introduction of the customization methods we used in Section 3.2.3 and Appendix L.

  • Specifically, in the target branch of our method, we first convert the target concept into semantic space with DreamBooth. We use a specific rare token (e.g., sks) to represent the concept.

  • During concept swapping, we can use this rare token in the target prompt to semantically represent our target customized concept and perform the customized concept swapping.
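
To make the prompt construction above concrete, here is a toy illustration (the prompts, concept words, and helper function are our own examples; only the `sks` identifier convention comes from DreamBooth):

```python
# Illustrative only: building source/target prompts for customized concept swapping.
# "sks" is the rare identifier token that DreamBooth binds to the customized concept.

def build_target_prompt(source_prompt: str, source_concept: str, target_class: str) -> str:
    """Swap the source concept phrase for the DreamBooth identifier plus its class noun."""
    return source_prompt.replace(source_concept, f"sks {target_class}")

source_prompt = "a cat sitting on a wooden bench"   # describes the source image
target_prompt = build_target_prompt(source_prompt, "cat", "dog")
print(target_prompt)  # -> "a sks dog sitting on a wooden bench"
```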

W2: Apply our method to DiT-based architectures

  • Thank you for your illuminating suggestions. Our SECR enhances cross-attention maps of concepts using semantic information. This technique can be applied to any generative model with a cross-attention layer, regardless of architecture, to enrich semantic information in regions of interest.

  • Since there is no explicit cross-attention layer in SD3 [4] and it only contains self-attention layers, we have included a detailed analysis in Appendix M to explore the architecture of SD3 and investigate how our SECR could be applied to it.

  • We focus on the interaction between images and text within SD3. The results indicate that the attention layer in SD3 can be decomposed into a cross-attention operation between image and text and a self-attention operation over the image or text tokens (a toy sketch of this decomposition is given after this list). Our SECR can be applied to the cross-attention part of the attention layer in SD3 to enhance semantic information in regions of interest.

  • However, since SD3 is trained based on Rectified Flow, existing score distillation methods (e.g., SDS [5], DDS [6]) cannot be directly integrated with it. In future work, we will continue to explore this potential direction.
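
For intuition on the decomposition mentioned above, here is a minimal, self-contained PyTorch sketch (toy shapes and names, not SD3's actual implementation): joint attention over the concatenated image and text tokens can be sliced into blocks, one of which plays the role of image-to-text cross-attention.

```python
import torch

# Toy joint attention over concatenated image and text tokens (MM-DiT style).
n_img, n_txt, d = 16, 8, 64
q = torch.randn(n_img + n_txt, d)            # queries for [image tokens; text tokens]
k = torch.randn(n_img + n_txt, d)            # keys for the same concatenated sequence

attn = torch.softmax(q @ k.t() / d ** 0.5, dim=-1)   # (n_img + n_txt, n_img + n_txt)

img_to_img = attn[:n_img, :n_img]   # image self-attention block
img_to_txt = attn[:n_img, n_img:]   # image-to-text block: the "cross-attention" part
txt_to_img = attn[n_img:, :n_img]   # text-to-image block
txt_to_txt = attn[n_img:, n_img:]   # text self-attention block
print(img_to_txt.shape)             # torch.Size([16, 8])
```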

[4] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. In ICML, 2024.

[5] Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2023.

[6] Delta denoising score. In ICCV, 2023.

W3: Impact of semantic-enhanced operation on the naturalness of images

We employ the semantic-enhanced operation on foreground concepts within the cross-attention layers to derive an enhanced representation of the source and target objects. This enhanced representation effectively interacts with the features of surrounding objects and the background in the self-attention layers, mitigating inconsistencies and unnaturalness in the target image.
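
As a rough, purely illustrative sketch of this idea (the scaling rule, names, and shapes below are our assumptions, not the paper's exact SECR operation):

```python
import torch

def enhance_concept_attention(attn_map: torch.Tensor, bbox_mask: torch.Tensor,
                              concept_idx: int, scale: float = 1.5) -> torch.Tensor:
    """Boost the cross-attention of the concept token inside the detected bbox.

    attn_map:  (H*W, n_tokens) cross-attention map.
    bbox_mask: (H*W,) binary mask, 1 inside the bounding box.
    The multiplicative boost is an assumed stand-in for the semantic enhancement.
    """
    enhanced = attn_map.clone()
    boosted = attn_map[:, concept_idx] * scale
    enhanced[:, concept_idx] = torch.where(bbox_mask.bool(), boosted, attn_map[:, concept_idx])
    return enhanced
```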

We have provided detailed results in Appendix N. The qualitative results in Fig. 23 show that:

  • Our method seamlessly inpaints objects in the background that the foreground concept occludes (upper left examples).
  • It generates reflections of the target concept in water (upper right examples).
  • It accurately places the target concept on background objects (lower left examples).
  • It also generates natural shadows of the target concept that are consistent with the environmental lighting angles (upper left and lower right examples).

Moreover, considering that image naturalness is a subjective human assessment, we use the human preference metric [7] to quantify the naturalness of images. A higher score indicates a better alignment with human preferences. The results show that our method is the only one that receives a positive human evaluation score, demonstrating a significant advantage over other methods.

| Method | Human Preference Score ↑ |
|---|---|
| SDS | -1.41 |
| DDS | -0.21 |
| CDS | -0.01 |
| PhotoSwap | -0.86 |
| PnPInv+P2P | -0.14 |
| Ours | 0.51 |

[7] Imagereward: Learning and evaluating human preferences for text-to-image generation. In NeurIPS, 2023.

Comment

W4: P2P in customized concept swapping

P2P seems to be able to fulfill the task of changing objects?

  • P2P mainly focuses on synthesized image editing and struggles with real image editing. This is because P2P relies on inverting the source image into the noise space. This inversion results in significant distortion for real images as stated in the second paragraph of the Conclusion in the original P2P paper [8]:

"The current inversion process results in a visible distortion over some of the test images."

  • In Fig. 11 of the P2P paper, P2P struggles to accurately reconstruct real images even with DDIM inversion, making effective editing even more difficult.

Does P2P fit well with customization methods?

  • P2P can fit well with customization methods. One of our baselines, PhotoSwap [9], combines P2P and DreamBooth.

  • PhotoSwap adopts null-text inversion [10] to mitigate inversion distortion in real-image scenarios. It can effectively handle customized concept swapping without obvious shape differences.

Why does P2P fail completely at this task?

  • P2P itself and P2P-based methods (e.g., PhotoSwap and PnPInv [11] + P2P in our paper) heavily rely on self and cross attention maps to maintain background consistency. Although this improves their background consistency, it limits their ability to change objects with shape differences.

  • However, in the task of customized concept swapping, the target concept is highly customized and naturally has a sharp shape difference from the source concept. This presents a significant challenge for P2P-based methods.

  • As shown in Fig.6 in our paper, P2P-based methods can only edit within the existing shape and cannot change the shape of the source concept, leading to their failure in this task.

[8] Prompt-to-prompt image editing with cross attention control. In ICLR, 2023.

[9] Photoswap: Personalized subject swapping in images. In NeurIPS, 2023.

[10] Null-text inversion for editing real images using guided diffusion models. In CVPR, 2023.

[11] Pnp inversion: Boosting diffusion-based editing with 3 lines of code. In ICLR, 2024.

Comment

Dear Reviewer VFJh:

Thank you once again for dedicating your valuable time to reviewing our paper and providing constructive comments! As the end of the discussion period approaches, we kindly ask if our responses have satisfactorily addressed your concerns. If you have any further inquiries about our method, or if you still have any concerns about our paper, please do not hesitate to inform us. We are more than willing to engage in timely discussions with you.

Sincerely,

The Authors

Comment

Dear Reviewer VFJh:

We would like to thank you again for your valuable feedback on our paper.

As the period for the Author-Reviewer discussion is closing very soon, we would like to use this opportunity to kindly ask if our responses sufficiently clarify your concerns. We sincerely appreciate your time and consideration.

Best Regards,

The Authors

Comment

Dear Reviewer VFJh:

We notice that all other reviewers have posted their post-rebuttal comments to our response but we still have not yet received any further information from you. We greatly appreciate your initial comments and fully understand that you may be extremely busy at this time.

As the deadline for the Reviewer-Author discussion phase is fast approaching (less than two days left), we respectfully ask whether we have addressed your questions and concerns adequately. Your feedback would be greatly appreciated, and we would be delighted to engage in further discussions if needed.

Best Regards,

The Authors

Comment

Most of my concerns have been addressed in the authors' responses. Thus, I will increase my rating.

Comment

Thanks for your response. We appreciate your engagement in our discussion and your recognition of our efforts!

Review (Rating: 6)

Recent advances in Customized Concept Swapping (CCS) enable text-to-image models to swap concepts, but existing methods struggle to maintain foreground and background consistency, particularly with large shape disparities, and often require time-intensive processes. InstantSwap addresses these challenges by using bounding box analysis and cross-attention mechanisms to enhance both foreground and background consistency while limiting modifications to the background and enhancing foreground focus. This method reduces computation time by periodically calculating gradients, improving efficiency with minimal performance loss. Extensive evaluations show InstantSwap's effectiveness and adaptability.

Strengths

The motivation is clear.

The results appear promising and solid.

The experiments are thorough.

The writing is easy to follow.

Weaknesses

For each concept replacement, the method first needs to train a DreamBooth model and then perform score distillation, which is time-intensive.

Both the source and reference branches use a DreamBooth-tuned UNet. It would be beneficial to validate the method using textual inversion to demonstrate its generalization capability.

What about the failure cases?

It's interesting that the method can handle concepts with significant shape changes. If the original image's concept is very small, resulting in a small bounding box, how does the target image’s foreground region expand without additional processing?

Questions

Please see the weaknesses section.

Comment

We sincerely thank the reviewer for their insightful comments and recognition of our work, especially for acknowledging our comprehensive experiments, solid results, clear motivation, and smooth writing. We have refined the paper, incorporated additional experiments, and clarified the following points in the revised version. The changes relevant to your concerns are marked in blue and we will remove these colors in the final version.

W1: Customization is time-intensive

  • The overall time added is very mild. As shown in the table below, DreamBooth requires about 8.5 minutes to learn a single concept. Meanwhile, our method can be combined with other more efficient customization methods such as Custom Diffusion [2] which only requires about 3.5 minutes for each concept.
| | DreamBooth | Custom Diffusion |
|---|---|---|
| Training Time | ~8.5 min | ~3.5 min |

We have added the above discussion in Appendix J. The visual results of our method combined with Custom Diffusion can be found in Fig.22.

[2] Multi-concept customization of text-to-image diffusion. In CVPR, 2023.

W2: Combine our method with Textual Inversion

Thanks for your suggestion. Our method can indeed be combined with other customization methods such as Textual Inversion (TI) [3]. We have included results in Appendix K and present our quantitative results here. Please refer to our modified manuscript for more qualitative results and analysis.

| Method | CLIP-I ↑ | PSNR ↑ | LPIPS ↓ | MSE ↓ | SSIM ↑ | CLIP-T ↑ | Time ↓ |
|---|---|---|---|---|---|---|---|
| Ours | 75.00 | 27.39 | 47.68 | 27.87 | 86.58 | 25.74 | 19.83 |
| Ours with TI | 72.20 | 28.16 | 45.91 | 24.20 | 86.76 | 24.57 | 19.92 |
  • The results show that our method integrates well with Textual Inversion, achieving similar performance in background consistency and inference time.
  • Due to the limited customization capabilities of Textual Inversion, its foreground consistency is not as high as with our original method.

[3] An image is worth one word: Personalizing text-to-image generation using textual inversion. In ICLR, 2023.

W3: More failure cases

We have presented failure cases in the limitations part of Appendix P. We have added more failure cases to the modified manuscript and conducted a detailed analysis. Specifically, due to the limited capability of the base diffusion model (e.g., Stable Diffusion), it is unable to perceive the relative sizes between specific concepts. We visualize this phenomenon in Fig. 25:

  • First row: an extremely large cup.

  • Second row: a toy duck the size of a little girl.

  • Third row: a pair of sunglasses nearly as large as the table.

Similar issues occur in all compared methods. Nevertheless, our method still achieves the best concept swapping results.

W4: How to expand the foreground region

  • The goal of Customized Concept Swapping is to smoothly replace the foreground object with the customized object. We care about (1) whether the foreground region is re-occupied with the customized object and (2) whether the transition between the foreground and background remains natural and smooth. As a 2D image editing task, whether the physical size of an object is consistent with the surrounding environment (highly dependent on the perspective angle) is not a focus in this field.

  • As we mentioned in the limitation part of Appendix P, perceiving the relative size of specific concepts is very challenging for the current base diffusion model. Previous image editing methods also struggle to effectively manage this issue.

  • If we really want to handle this scenario, we can expand the obtained bounding box by adaptively adjusting the threshold $\beta$ during the swapping process, making the bounding box larger and thereby expanding the foreground region (see the sketch below).
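
A minimal sketch of how the threshold controls the box size (we assume, as in the bbox extraction described above, that foreground pixels are those whose normalized attention exceeds $\beta$; names are illustrative):

```python
import numpy as np

def attention_to_bbox(attn: np.ndarray, beta: float):
    """Return the bounding box (x0, y0, x1, y1) of pixels whose normalized response > beta."""
    norm = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
    ys, xs = np.nonzero(norm > beta)
    if len(xs) == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Lowering beta activates more pixels, so the derived box can only grow (or stay the same).
attn = np.random.rand(64, 64)
print(attention_to_bbox(attn, beta=0.8))
print(attention_to_bbox(attn, beta=0.5))
```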

Comment

Dear Reviewer iqew:

Thank you once again for dedicating your valuable time to reviewing our paper and providing constructive comments! As the end of the discussion period approaches, we kindly ask if our responses have satisfactorily addressed your concerns. If you have any further inquiries about our method, or if you still have any concerns about our paper, please do not hesitate to inform us. We are more than willing to engage in timely discussions with you.

Sincerely,

The Authors

Comment

Dear Reviewer iqew:

We would like to thank you again for your valuable feedback on our paper.

As the period for the Author-Reviewer discussion is closing very soon, we would like to use this opportunity to kindly ask if our responses sufficiently clarify your concerns. We sincerely appreciate your time and consideration.

Best Regards,

The Authors

Comment

Thanks to the authors for the responses. My concerns have been addressed, so I keep my acceptance score.

Comment

Thanks for your response. We appreciate your engagement in our discussion and your recognition of our efforts!

Review (Rating: 6)

This paper proposes InstantSwap, a training-free framework for Customized Concept Swapping (CCS). CCS transfers a target concept, described by target images and a target prompt, to the location of the source concept in the source image. This paper utilizes the cross-attention and self-attention maps of the diffusion model's U-Net, computed for the source image and source prompt, to automatically extract the bounding box of the source concept. The bounding box is then applied to filter out background gradients from a refined SDS loss, yielding an optimization that preserves background information. To emphasize the concepts in the images, they also use the semantic information of the corresponding prompts and the estimated bbox to augment the representation of the concepts. Additionally, this paper presents a step-skipping gradient update strategy, which reuses previous gradients for the current iteration to increase inference speed. Experiments present the advantages of the proposed method over previous works.

Strengths

  1. The proposed method presents a complete pipeline for improving effectiveness and efficiency in the Customized Concept Swapping (CCS) task.
  2. The proposed method obtains state-of-the-art performance compared to previous works.
  3. This paper also contributes a benchmark for the Customized Concept Swapping (CCS) task.

Weaknesses

  1. A theoretical analysis of why a mask can be directly applied during gradient computation is missing. Masking will produce a distribution shift; why it can still converge to a reasonable solution requires some analysis.
  2. How combining self-attention and cross-attention for automatic bbox generation affects the performance seems not to be discussed.
  3. How the number of target images affects the performance is not mentioned.

Questions

  1. How does the proposed method work in the multi-object scenario?
Comment

We sincerely thank the reviewer for their insightful comments and recognition of our work, particularly for acknowledging the integrity, effectiveness, and efficiency of our method. We have refined the paper, incorporated additional experiments, and clarified the following points in the revised version. The changes relevant to your concerns are marked in red and we will remove these colors in the final version.

W1: Theoretical analysis of background gradient masking (BGM)

Thanks for your suggestion.

  • As shown in Fig. 3 of our paper, a complete optimization step of our method consists of two stages: forward pass and backward propagation. During the forward pass, we input the source and target images and compute the gradient $\nabla_{z} \mathcal{L}$ as $\nabla_z \mathcal{L}=w(t)\left(\epsilon_\phi\left(z_t, t, \tau\left(P_t\right)\right)-\hat{\epsilon}_\phi\left(\hat{z}_t, t, \tau\left(P_s\right)\right)\right)$ (Eq. 10 of the main paper). No mask is applied during gradient computation, allowing sufficient interaction between the foreground and background within the self-attention layers. This ensures that no distribution shift occurs between the foreground and background in the gradient $\nabla_{z} \mathcal{L}$.

  • Furthermore, the gradient $\nabla_{z} \mathcal{L}$ shares the same dimensions as the image latent $z$ (see more gradient visual results in Fig. 18 of the Appendix). The backward propagation of the gradient $\nabla_{z} \mathcal{L}$ is a pixel-wise update on $z$. Based on this, we directly apply the mask to the background pixels of $\nabla_{z} \mathcal{L}$ before backward propagation, resulting in our $\nabla_{z} \mathcal{L}_{BGM}$ (a minimal sketch of this masking step is given after this list).

  • The update our gradient applies to the foreground is the same as that of $\nabla_{z} \mathcal{L}$ on the foreground. Since there is no distribution shift in $\nabla_{z} \mathcal{L}$, our $\nabla_{z} \mathcal{L}_{BGM}$ similarly avoids introducing distribution shifts in the foreground update.

  • Besides, as shown in Fig. 23 in the Appendix, our method consistently generates results that are natural and coherent in both the foreground and background, further proving that there is no distribution shift between the foreground and background in our $\nabla_{z} \mathcal{L}_{BGM}$.

  • If any explanation remains unclear, please do not hesitate to reach out. We are very happy to clarify and address any misunderstandings.
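
To make the masking and step-skipping described above concrete, here is a minimal sketch of one optimization loop (the update rule, the interval `K`, and the callable names are illustrative assumptions, not the exact implementation):

```python
import torch

def swap_optimize(z: torch.Tensor, fg_mask: torch.Tensor, compute_grad,
                  steps: int = 500, K: int = 5, lr: float = 0.1) -> torch.Tensor:
    """z: image latent; fg_mask: 1 inside the bbox, 0 outside (same shape as z);
    compute_grad(z): returns the SDS/DDS-style gradient w(t) * (eps_target - eps_source)."""
    grad_bgm = torch.zeros_like(z)
    for step in range(steps):
        if step % K == 0:                 # recompute the gradient only every K steps
            grad = compute_grad(z)        # full forward pass through the diffusion model
            grad_bgm = grad * fg_mask     # BGM: zero out gradients outside the bounding box
        z = z - lr * grad_bgm             # background latents are never modified
    return z
```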

Comment

W2: Ablation study on the automatic bbox generation

We denote the element-wise exponents of the self-attention map $A^s$ and the cross-attention map $A^c$ in Eq. 8 as $\alpha_s$ and $\alpha_c$, respectively, and conduct an ablation study on the combination of self-attention and cross-attention (a small sketch of our reading of this combination is given after the table below). We have added the results in Appendix H and present our quantitative results here. For more qualitative results and analysis, please refer to our modified manuscript. Specifically, we divide their combination into the following scenarios:

  • $\alpha_s=0$, $\alpha_c=1, 2$: In this case, $A^s$ is an all-ones matrix, so each element in $\hat{A}^c$ has the same value, equal to the sum of all elements in $A^c$. After normalization and thresholding, the entire image is activated, so the bounding box covers the full image.

  • $\alpha_s=1$, $\alpha_c=0$: In this case, $A^c$ is a vector of all ones, so each item in $\hat{A}^c$ is the sum of the elements in the corresponding row of $A^s$. Since $A^s$ is the output of a softmax, each row of $A^s$ sums to 1. Therefore, $\hat{A}^c$ is also all ones, which ultimately results in the full image being activated as the bounding box.

  • $\alpha_s=1$, $\alpha_c=1$: In this case, the foreground region (FR) in the image cannot be highlighted, resulting in a larger bbox, which causes the background of the source image to be unnecessarily altered.

  • $\alpha_s=2$, $\alpha_c=0$: In this situation, using the self-attention map alone cannot effectively highlight the FR, resulting in a very imprecise bounding box.

  • $\alpha_s=2$, $\alpha_c=1, 2$: In this case, the self-attention map takes a leading role in the bbox generation process, producing a smaller bbox and thereby reducing foreground consistency.

  • $\alpha_s=1$, $\alpha_c=2$ (Ours): Our setting achieves a proper balance between foreground and background consistency, fully covering the source concept while minimizing background modification.

| $\alpha_s$, $\alpha_c$ | Bbox | CLIP-I ↑ | PSNR ↑ | LPIPS ↓ | MSE ↓ | SSIM ↑ | CLIP-T ↑ |
|---|---|---|---|---|---|---|---|
| 0, 1 | full image | 72.70 | 15.25 | 327.23 | 350.01 | 64.80 | 21.23 |
| 0, 2 | full image | 72.70 | 15.25 | 327.23 | 350.01 | 64.80 | 21.23 |
| 1, 0 | full image | 72.70 | 15.25 | 327.23 | 350.01 | 64.80 | 21.23 |
| 1, 1 | larger than FR | 74.18 | 23.83 | 93.12 | 82.44 | 82.95 | 25.19 |
| 1, 2 (Ours) | properly covers FR | 75.00 | 27.37 | 48.40 | 28.03 | 86.58 | 25.74 |
| 2, 0 | larger than FR | 73.66 | 26.49 | 65.56 | 56.81 | 84.81 | 24.97 |
| 2, 1 | smaller than FR | 73.67 | 30.08 | 37.34 | 19.15 | 87.30 | 25.12 |
| 2, 2 | smaller than FR | 72.66 | 30.86 | 34.38 | 15.38 | 87.54 | 24.78 |
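
For reference, a compact sketch of the combination being ablated above, under our reading of Eq. 8 (exponentiate the two maps element-wise and let the self-attention map refine the concept's cross-attention via a matrix-vector product; the paper's exact normalization may differ):

```python
import torch

def combined_map(A_s: torch.Tensor, A_c: torch.Tensor, alpha_s: float, alpha_c: float) -> torch.Tensor:
    """A_s: (HW, HW) self-attention (rows sum to 1); A_c: (HW,) cross-attention of the concept token."""
    A_hat = (A_s ** alpha_s) @ (A_c ** alpha_c)                          # refined concept map, (HW,)
    A_hat = (A_hat - A_hat.min()) / (A_hat.max() - A_hat.min() + 1e-8)   # normalize to [0, 1]
    return A_hat  # thresholding this map then yields the bounding box
```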

W3: Analysis of the number of target images

Before concept swapping, our method utilizes DreamBooth and a set of images (typically fewer than 5), $\mathcal{X}_t=\{x_i\}_{i=1}^M$, to customize certain concepts, where $M$ is the number of target images used in the customization process.

We have added a detailed analysis of the number of target images $M$ in Appendix I and present our quantitative results here. Please refer to our modified manuscript for more detailed qualitative results and analysis.

| | M = 1 | M = 2 | M = 3 | M = 4 |
|---|---|---|---|---|
| CLIP-I ↑ | 74.78 | 75.40 | 75.57 | 75.65 |
| CLIP-T ↑ | 28.94 | 28.93 | 29.13 | 29.44 |

Q1: How can InstantSwap handle multi-object scenario

  • As shown in Sec. 4.6 (Multi-concept swapping) of the main paper, our method can perform multi-object swapping by sequentially performing multiple single-concept swaps; a tiny schematic of this sequential strategy is sketched below. We also provide a visual example in Fig. 10.

  • Besides this simple strategy, we can employ multiple bounding boxes and implement our SECR strategy simultaneously on multiple objects within an image. Furthermore, combining our method with Gligen [1] can facilitate handling multi-object scenarios with implicit grounding conditions. We will explore this as a future work.

[1] Gligen: Open-set grounded text-to-image generation. In CVPR, 2023.
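
A tiny schematic of the sequential strategy mentioned above (the `instant_swap` callable is a hypothetical wrapper around the single-concept procedure, not an actual API):

```python
def swap_all(image, swaps, instant_swap):
    """swaps: list of (source_concept, target_concept) pairs, applied one after another."""
    for source_concept, target_concept in swaps:
        # Each call performs one single-concept swap on the current intermediate image.
        image = instant_swap(image, source_concept, target_concept)
    return image
```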

Comment

Dear Reviewer nPYy:

Thank you once again for dedicating your valuable time to reviewing our paper and providing constructive comments! As the end of the discussion period approaches, we kindly ask if our responses have satisfactorily addressed your concerns. If you have any further inquiries about our method, or if you still have any concerns about our paper, please do not hesitate to inform us. We are more than willing to engage in timely discussions with you.

Sincerely,

The Authors

Comment

I acknowledge that I have read the authors' responses and the reviews from the other reviewers. Most of my concerns are addressed. The theoretical analysis doesn't look convincing, but I understand it is not the main focus of this paper. After reading the other reviews, I have some new comments:

  1. For a training-free method, it is important to prove its generalizability across different architectures. Although the proposed method doesn't work for Stable Diffusion 3, it would be better to present some results for other Stable Diffusion models, such as Stable Diffusion 1.5 and Stable Diffusion XL.
  2. I don't think the failure cases and the limitations are fully discussed. According to Figure 28 and Figure 29, InstantSwap seems limited in handling concepts with complicated textures. For example, the details of the barn look different across images, and the identity of the cat is not preserved.
  3. The proposed benchmark ConceptBench only provides 10 concepts, and most of the concepts are covered by DreamBooth, which seems limited. Considering that the proposed method builds upon a pre-trained DreamBooth, more challenging unseen concepts might be necessary.
Comment

We appreciate your invaluable feedback! We shall exert our utmost efforts to address your inquiries:


Q1 Adapt InstantSwap to other Stable Diffusion models

  • We combine our method with Stable Diffusion 1.5 and conduct comprehensive experiments. We present the quantitative results in the table below. We also provide qualitative results in: https://anonymous.4open.science/r/ICLR_InstantSwap-3030/SD1.5.pdf

  • The results show that our method can integrate well with Stable Diffusion 1.5.

| Method | CLIP-I ↑ | PSNR ↑ | LPIPS ↓ | MSE ↓ | SSIM ↑ | CLIP-T ↑ | Inference Time (single A100) ↓ |
|---|---|---|---|---|---|---|---|
| Ours w/ SD1.5 | 74.27 | 27.98 | 47.04 | 25.24 | 86.01 | 26.10 | 15.09 s |
| Ours | 75.00 | 27.37 | 48.40 | 28.03 | 86.58 | 25.74 | 13.38 s |

Q2 Handle concepts with complicated textures

  • Generating high-fidelity concepts with complicated textures is indeed a challenging problem. Our analysis suggests that it is primarily caused by customization methods like DreamBooth, which struggle to consistently generate complex concepts across different images.

  • We provide an example in the anonymous link below, where we directly use DreamBooth with the target prompt to generate images. It is evident that concepts generated by DreamBooth are different for different images. https://anonymous.4open.science/r/ICLR_InstantSwap-3030/more_failure_case.pdf

  • However, even if DreamBooth cannot effectively customize these complex concepts, our method still completes the concept swapping precisely (see the qualitative results in Q3 below). Besides, our method can be integrated with customization methods other than DreamBooth, as discussed in W1 and W2 of Reviewer iqew.


Q3 More challenging concepts

  • During the experiment, we pair each concept from ConceptBench with each source image from SwapBench, resulting in a total of 1,600 images for evaluation. This is larger than the 700 images in PnPInv [11] and the 55 images in PnP [12].

  • Additionally, we include more uncommon challenging concepts from DreamBench++ [13]. Due to time constraints, we only conduct experiments on the first 50 images in SwapBench. We present the quantitative results in the table below. We also provide qualitative results in: https://anonymous.4open.science/r/ICLR_InstantSwap-3030/challenging_concept.pdf

  • The results show that these challenging concepts indeed reduce the foreground consistency of our method. Nevertheless, our approach still faithfully completes the concept swapping and surpasses all compared methods on all metrics. We will include more challenging concepts, conduct experiments on the whole SwapBench, and report the results in the final version.

| Method | CLIP-I ↑ | PSNR ↑ | LPIPS ↓ | MSE ↓ | SSIM ↑ | CLIP-T ↑ |
|---|---|---|---|---|---|---|
| SDS | 67.09 | 19.47 | 307.43 | 140.39 | 73.76 | 25.16 |
| DDS | 64.28 | 21.73 | 130.48 | 89.82 | 81.54 | 26.28 |
| CDS | 65.50 | 20.45 | 139.67 | 120.78 | 80.28 | 26.56 |
| PhotoSwap | 59.01 | 23.54 | 139.39 | 69.09 | 79.72 | 23.66 |
| PnPInv+P2P | 60.70 | 24.41 | 102.76 | 51.95 | 83.59 | 25.77 |
| Ours | 69.13 | 26.56 | 54.10 | 32.84 | 87.53 | 27.33 |

[11] Pnp inversion: Boosting diffusion-based editing with 3 lines of code. In ICLR, 2024.

[12] Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation. In CVPR, 2023.

[13] Dreambench++: A human-aligned benchmark for personalized image generation. arXiv, 2024.


Lastly, we thank you once again for your response. If you have any further questions, we would be delighted to continue the discussion with you.

Comment

Thanks to the authors for the responses. My concerns are well addressed. I would encourage the authors to consider including more challenging concepts to make the proposed benchmark compatible with more scenarios. I would like to keep my acceptance score.

Comment

Thanks for your response. We appreciate your engagement in our discussion and your recognition of our efforts!

Comment

We extend our sincere gratitude to all the reviewers (R1-nPYy, R2-iqew, and R3-VFJh) for their insightful and considerate reviews, which help us to emphasize the contributions of our approach.

We are very encouraged to hear that the reviewers recognized the clear motivation (R2) and effectiveness (R1, R2) of our method, our comprehensive and solid experiments (R2, R3), the advantageous results we presented (R1, R2), the benchmark we proposed (R1, R3), and the fluent writing of our paper (R2, R3).

We would also like to express our sincere gratitude to the reviewers for their insightful identification of areas where our manuscript could be strengthened. We have taken all the suggestions carefully and updated our previous version. In the revised manuscript, we have made the following changes and highlighted them in different colors for different reviewers (R1-red, R2-blue, R3-brown):

  1. Added the ablation study on automatic bounding box generation in Appendix H. (R1)
  2. Added the analysis of the number of target images in Appendix I. (R1)
  3. Added the discussion on the time cost of the customization process in Appendix J. (R2)
  4. Added the results of combining our method with Textual Inversion in Appendix K. (R2)
  5. Added more failure cases and analysis in Appendix P. (R2)
  6. Refined Section 3.2.3 and added a detailed introduction of the customization methods we used in Appendix L. (R3)
  7. Added a detailed analysis of how our SECR can be applied to DiT-based architecture (e.g., SD3) in Appendix M. (R3)
  8. Added more qualitative and quantitative results to demonstrate the naturalness of our resulting image in Appendix N. (R3)

We sincerely hope to engage in further discussion with the reviewers to ensure all concerns have been fully addressed. If any aspects of our work remain unclear, we welcome any further feedback to help improve our manuscript. Thank you very much again!

AC Meta-Review

The paper proposes a training-free framework for Customized Concept Swapping (CCS). It transfers a target concept, described by images and a prompt, to the location of a source concept in the source image.

The paper is well written, and the method achieves state-of-the-art performance across various thorough experiments. The work additionally contributes a benchmark for the Customized Concept Swapping (CCS) task.

After the post-rebuttal discussion, all reviewers unanimously agree that the paper should be accepted. The remaining minor concern is the inclusion of more challenging concepts in the proposed benchmark. We encourage the authors to address this before submitting the camera-ready version of the paper.

Overall, this is solid work that is relevant to the community and should be presented at ICLR.

Additional Comments on the Reviewer Discussion

All reviewers were responsive during the rebuttal. The authors formulated an extensive rebuttal that led to a unanimous acceptance post-rebuttal, with minor concerns remaining.

Final Decision

Accept (Poster)