Thank you very much for your constructive comments. We addressed you comments and questions and revised the paper. We summarize the answers below.

[W1] Input Constraint

We respect but disagree with your view regarding the proposed method's setup. Essentially, our method is not intended to enhance the effectiveness of general image editing tasks, but rather aims to accurately replicate the transformations between images pairs and apply them to new images. This setup has been thoroughly explored in the 'image analogy' task [1][2]. It uses pairs of images before-after transformation as conditions and provides an input image as a query to analogize to the result. Therefore, providing example image pairs to represent editing effects is necessary. Moreover, some existing works have explored representing transformations through example image pairs (image prompts) to guide image editing [3][4][5], thus the setup of this study has a solid theoretical background.

Moreover, because the example image pairs are responsible for demonstrating the transformation effects, they naturally contain similar parts. If the two images differ too much, it may be challenging to intuitively reflect the transformation concepts involved. Regarding the practicality of our method, we have tested various types of transformation effects. In addition to the editing covered in TOP-Bench, such as global editing, local editing, multi-object editing, shape editing, style editing, identity replacement, object replacement, and face attribute manipulation, we have also introduced image tone modification (Figure 7), hybrid editing (Figure 8), and real-world image editing (Figure 11). The results indicate that the method has strong generalization capabilities.

[W2] Redundancy of Exemplar Image Pairs

For the reviewer's concern about the necessity of using image pairs as input, we have already explained this in detail in [w1].

Furthermore, it is necessary for us to reiterate the implementation process of our method. The primary purpose of using image pairs as input is to serve as a supervisory signal, aiding the feature optimization process of the editing instructions. The role of using image pairs to acquire unique phrases is to introduce semantic information into our optimization process. Therefore, our method is essentially based on optimization to learn image pair-based editing, similar to the image analogy task [1][2]. This optimization approach can directly learn pixel-level or even mask-level transformations between image pairs, which are difficult for text-guided editing methods like GPT-4+IP2P to capture.

While it might be possible to introduce semantic priors into our optimization process through slight modifications to GPT-4, our unique phrase extraction method is more lightweight and efficient. In addition, it supports customization of the phrase dictionary and CLIP models to better suit downstream tasks.

[W3] Lack of Novelty

We fully understand the reviewer's emphasis on innovation. but as Reviewer 8Vpr stated in Strength 2, we emphasize that the innovation of our method does not lie in proposing a novel module. Instead, it delves into the extraction of editing concepts based on image prompts. We leverage the prior implementation of Instruction-based image editing models and innovatively introduce the KV token optimization in Attention for image prompts learning to represent the target editing instructions, enhancing the representational power of the concept. Furthermore, to further explore the impact of instructions on editing process, we introduce a time-aware method to optimize the instructions at different denoising times, comprehensively investigate the representation of editing instructions, and provide a clear solution for visual prompt editing.

[1] Hertzmann et al., Image analogies, SIGGRAPH 2001.

[2] Liao et al., Visual attribute transfer through deep image analogy, SIGGRAPH 2017.

[3] Nguyen et al., Visual instruction inversion: Image editing via image prompting, NIPS 2023.

[4] Yang et al., Imagebrush: Learning visual in-context instructions for exemplar-based image manipulation, NIPS 2023.

[5] Gu et al., Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model, SIGGRAPH 2024.