Localize, Understand, Collaborate: Semantic-Aware Dragging via Intention Reasoner
We introduce a methodology for semantic-aware image dragging with high image fidelity.
Abstract
Reviews and Discussion
This paper proposes a novel insight that transforms the "how to drag" issue into a two-step "what-then-how" process by introducing an intention reasoner and a collaborative guidance sampling mechanism. The authors also identify image quality issues and design a quality guidance to enhance performance. Experiments show the method's superiority in semantic-aware drag-based editing.
Strengths
- The issue of the "inherent ambiguity of semantic intention" seems crucial and interesting, and shifting the "how to drag" issue into a two-step "what-then-how" process can improve the controllability of the dragging operation.
- The paper is well-written and easy to follow.
Weaknesses
Major:
- How were the single experimental results (like Figure 4) selected from the diverse results conforming to the intention? If they were manually chosen, could this lead to unfair comparisons?
- Providing the inferred potential intentions, source, and target prompts corresponding to each generated example may help explain the reasons for superior performance. For instance, in the third row of Fig. 5 on the left side, considering that the hand and the dragging operation should not be related to the corresponding prompt, why do other methods struggle with handling the hand information, while the proposed method can manage it effectively?
Minor:
- Fig. 5 caption: misspelling of DragonDif(f)usion.
Questions
I believe that, fundamentally, the user's dragging operation has an intuitive single expectation, but the ambiguity of the operation itself leads to an ill-posed one-to-many mapping. Moreover, this mapping is difficult to traverse and may not even include the user's actual expectation among the diverse versions. Therefore, is it possible to further narrow down the range of expected dragging results, or allow the user to explicitly indicate their needs through additional operations? For other detailed questions and suggestions please refer to the major weakness.
Limitations
The authors have adequately addressed the limitations.
Thank you for your thoughtful and constructive feedback. We are encouraged that you find our ''what-then-how'' paradigm to be novel and effective. Here is our response to address your concerns.
W1: How were the single experimental results (like Figure 4) selected from the diverse results conforming to the intention? If they were manually chosen, could this lead to unfair comparisons?
As discussed in Section 3.1, the intentions of the single experimental results are selected based on confidence probabilities to make a fair comparison. We will include relevant explanations in the revised version to avoid misunderstandings.
W2: Providing the inferred potential intentions, source, and target prompts corresponding to each generated example may help explain the reasons for superior performance. For instance, in the third row of Fig. 5 on the left side, considering that the hand and the dragging operation should not be related to the corresponding prompt, why do other methods struggle with handling the hand information, while the proposed method can manage it effectively?
(1) Thank you for your suggestion. We present these prompts in Fig. 13 and Fig. 14 in the Rebuttal PDF and will incorporate the corresponding prompts in the updated version of our paper.
(2) The intention, source prompt and target prompt of the third row of Fig. 5 are "move the cup to the top right.", "a cup of coffee on the bottom left." and "a cup of coffee on the top right.", respectively.
- Semantic guidance focusing more on the cup means less variation in other areas (the hands). By using semantic-aware prompts, semantic guidance directs the editing process to focus on the cup area, thus avoiding changes to the hand.
- Quality guidance guarantees better image quality and avoids hand distortion. We propose quality guidance, which maintains image quality via a score-based classifier in the editing process. This approach improves overall image quality and prevents hand distortion. A minimal sketch of how such guidance terms can be combined during sampling is given below.
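To illustrate how several guidance signals could be combined at each denoising step, here is a minimal, generic sketch of energy-based guidance. The energy functions and weights are hypothetical placeholders supplied by the caller, not the paper's actual implementation.

```python
import torch

def guided_noise(z_t, t, unet, energies, weights):
    """Combine several guidance energies into one guided noise prediction.

    energies: callables (z_t, t) -> scalar tensor, e.g. a semantic energy,
    an editing energy, and a quality energy from a score-based classifier
    (all assumed here for illustration).
    """
    z_t = z_t.detach().requires_grad_(True)
    total = sum(w * e(z_t, t) for w, e in zip(weights, energies))
    grad = torch.autograd.grad(total, z_t)[0]   # gradient of the combined energy
    with torch.no_grad():
        eps = unet(z_t, t)                      # base noise prediction
    return eps + grad                           # guided prediction used by the sampler
```

In this view, the semantic term steers the cup toward the target description while the quality term penalizes distorted latents, which matches the explanation above.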
W3: Misspelling of DragonDif(f)usion in Fig. 5 caption.
Thank you for pointing out the misspelling error. We will correct it in the updated version.
Q1: The user's dragging operation has an intuitive single expectation, but the ambiguity of the operation itself leads to an ill-posed one-to-many mapping. Moreover, this mapping is difficult to traverse and may not even include the user's actual expectation among the diverse versions. Therefore, is it possible to further narrow down the range of expected dragging results, or allow the user to explicitly indicate their needs through additional operations?
(1) Covering the user's needs. Our approach explores diverse plausible user intentions through repeated sampling with LLMs, significantly increasing the likelihood of covering the user's request. For users without clear intentions, we can present a diverse set of plausible intentions for them to choose from, inspiring new ideas without requiring further input. For users with clear intentions, we consider various scenarios (rigid, non-rigid, rotating, etc.) and use repeated sampling with LLMs to explore diverse possibilities, as outlined by Brown et al. [r14].
(2) Allowing flexible interaction. The intention reasoner allows the users to further narrow down the range of expected dragging results. Specifically, the user can input external constraints to the LLM to limit the scope of generated intents. They can also select the desired intent from a variety of generated intents.
(3) Advantages of the intention reasoner. It offers significant benefits over requiring users to provide intentions themselves: it can handle vague requests, express complex needs, discover potential needs, and reduce cognitive load.
[r14] Brown B, et al. Large language monkeys: Scaling inference compute with repeated sampling. arXiv, 2024.
Thank you for your responses. After considering other reviews and feedback, I'm inclined to maintain my score of 5 and am leaning toward accepting the paper. However, it could still go either way.
This paper aims to address the limitation of current dragging-based image editing methods in understanding the intentions of users. To this end, the proposed method leverages the reasoning ability of LLMs to infer possible intentions, which are used to provide (asymmetric) semantic guidance in editing. Furthermore, a collaborative guidance sampling method that jointly exploits semantic guidance, quality guidance, and editing guidance is proposed. Among these three types of guidance, quality guidance is provided by a self-trained discriminator with an aesthetic score and images generated by a baseline. The experimental results on the DragBench dataset indicate that the proposed method outperforms several existing methods.
Strengths
- This paper is well-written, and the ill-posedness of dragging-based image editing is indeed a problem that should not be ignored.
- The proposed LLM-based reasoner and the collaborative guidance sampling strategy improve the baseline performance.
Weaknesses
- Despite the performance gain, there are observable artifacts in the synthesized images as follows:
- Page 7, Figure 4, the last row, windows are missing.
- Page 8, Figure 5, 1st row, right column, the shape and the texture of the mailbox are changed.
- Page 16, Figure 7, 2nd row, wheels are changed; the shape of the target in the second last row cannot be preserved either.
- Utilizing LLMs to infer intentions is interesting. However, it requires further clarification:
- How can we ensure that the predicted outputs of LLMs (even with the highest confidence) really coincide with the intentions of users?
- To the best of my knowledge, the selected baselines do not involve semantic priors from LLMs. Hence, the comparison might be unfair. To fully demonstrate the advantages of the proposed method, text/instruction-based methods should be included for comparison.
- Even with the provided visualization of the ablation study (Fig. 6), it is still unclear how the introduced components affect the results. Sometimes the full model generates artifacts that do not appear in the results of the variants of the proposed method (e.g., the round bottom of the spinning top in Fig. 6 is changed mistakenly by the full model).
Questions
Q1: In A.8, the limitations section, why is the proposed method described as training-free when it includes a trainable quality discriminator?
Q2: What is the computational complexity of the proposed method?
Q3: How do we select the optimal one from the sampled outputs from the LLM?
Q4: Except for the GScore metric proposed by GoodDrag, has any other commonly used quality metric (e.g., LPIPS) been used for evaluation?
Q5: What will the results look like if the predicted intentions and dragging movements are contradictory?
I might change my score depending on the responses to the above questions.
Limitations
Yes, the authors have discussed the limitations and the potential negative societal impact of the proposed method in the appendix.
We sincerely appreciate your valuable feedback. Below is our response.
W1: Despite the performance gain, there are observable artifacts.
We acknowledge these artifacts, a common challenge for drag editing [DragDiffusion, Shi et al. [2024]], but they do not diminish the overall effectiveness of our method. (1) Diverse generation capabilities: our "what-then-how" paradigm offers a novel insight, demonstrating significant improvements in the diversity and reasonability of generated intentions and in the interpretability of editing outcomes. (2) Comparative analysis: quantitatively, our method achieves superior editing accuracy and image quality compared to others. Qualitatively, as Fig. 4 shows, we successfully position all objects in the target area, enhancing overall harmony and reducing distortions. (3) Adaptive adjustment: our method navigates the trade-off between editing accuracy and image quality [DiffEdit, Couairon, et al. [2023]]. Users can utilize adaptive controls to prioritize editing precision or visual quality based on their preferences.
W2-1: How to ensure that the predicted outputs coincide with user intentions.
We use repeated sampling to explore various plausible user intentions, increasing the probability of covering their true intentions [r10]. For users with clear intentions, we utilize advanced intention reasoning techniques and repeated sampling to explore different scenarios (e.g., rigid, non-rigid, rotating). This repeated sampling boosts the probability of the LLM's outputs [r10] aligning with the user's true intentions. For users with unclear intentions, our system automatically generates plausible semantic intentions by analyzing context and inferring implicit needs, reducing the need for explicit user input and providing reasonable intention estimates. The interactive intention reasoner also allows users to adjust and refine their intentions in real time, helping align the generated results with user preferences and evolving needs, ensuring relevance and accuracy.
Through these integrated strategies, our method effectively covers and aligns with user intentions, delivering high-quality predictive results.
[r10] Brown B, et al. Large language monkeys: Scaling inference compute with repeated sampling. arXiv, 2024.
W2-2: Comparisons with text/instruction-based methods.
We present quantitative results (Table r3) and qualitative results (Fig. 13 in the Rebuttal PDF). The results show that our method outperforms them by a significant margin. This advantage may be attributed to the fact that drag editing primarily involves spatial manipulation, whereas the training data of existing text/instruction-based methods [r11, r12, r13] do not include samples with spatial position changes [Drag, Pan et al., 2023].
Table r3: Comparisons with text-based methods.
| | InstructPix2Pix [r11] | MagicBrush [r12] | ZONE [r13] | Ours |
|---|---|---|---|---|
| Mean Distance (↓) | 52.34 | 55.77 | 56.83 | 20.46 |
| GScore (↑) | 7.14 | 7.16 | 6.83 | 7.37 |
[r11] Brooks T, et al. Instructpix2pix: Learning to follow image editing instructions. In CVPR, 2023.
[r12] Zhang K, et al. Magicbrush: A manually annotated dataset for instruction-guided image editing. In NeurIPS, 2023.
[r13] Li S, et al. Zone: Zero-shot instruction-guided local editing. In CVPR, 2024.
W3: How the introduced components affect the results.
(1) Effects of the components. Removing the intention reasoner results in semantically incorrect results (e.g., the shape of the spinning top and the wheel). Removing the quality guidance reduces image quality (e.g., artifacts in the spinning top and the red frame of the front wheel).
(2) Reasonableness of the shape change of the spinning top. The source and target prompts are "a photo of a wide spinning top" and "a photo of a narrow spinning top." Since the red bottom is part of the spinning top and included in the editing area, narrowing it is semantically consistent. Moreover, the collaboration between the intention reasoner and quality guidance results in a more pronounced narrowing effect in the full implementation. If the user wants to preserve the bottom part, they can exclude it from the editing area (see Fig. 15 in the Rebuttal PDF).
Q1: Confusion about the expression "training-free".
Thanks. We will correct it in the revised version.
Q2: The computational complexity.
Our method has a relatively small inference time and comparable memory requirements.
Table r4: Computational complexity.
| | DragDiffusion | FreeDrag | DragonDiffusion | DiffEditor | Ours |
|---|---|---|---|---|---|
| Time (s) ↓ | 80 | 92 | 30 | 35 | 48 |
| Memory (GB) ↓ | 12.8 | 13.1 | 15.7 | 15.7 | 15.8 |
Q3: Details of how to select the optimal one from the LLM outputs.
We query the LLM N times. Each output contains text and a corresponding confidence probability that reflects the quality of the text. We then sample among the outputs based on these confidence probabilities, as sketched below. Details will be added to the revised version.
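A minimal sketch of such confidence-weighted sampling, assuming the candidate intentions and their confidence values have already been produced by the LLM (the exact confidence definition and sampling rule are not specified in this thread):

```python
import random

def sample_intention(candidates):
    """Pick one intention from N LLM outputs using their confidence scores.

    candidates: list of (text, confidence) pairs.
    """
    texts, confs = zip(*candidates)
    total = sum(confs)
    weights = [c / total for c in confs]          # normalize to probabilities
    return random.choices(texts, weights=weights, k=1)[0]

# Illustrative example with made-up intentions and confidences
intentions = [("move the cup to the top right", 0.62),
              ("tilt the cup to the right", 0.23),
              ("enlarge the cup", 0.15)]
print(sample_intention(intentions))
```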
Q4: Performance on LPIPS?
Our method achieves better or comparable performance. However, we would like to point out that as LPIPS is trained on limited images, it may not effectively differentiate the performance of different models [GoodDrag, Zhang et al. [2024b]].
Table r5: Quantitative comparisons.
| | DragDiffusion | FreeDrag | DragonDiffusion | DiffEditor | Ours |
|---|---|---|---|---|---|
| 1-LPIPS (↓) | 0.137 | 0.116 | 0.124 | 0.113 | 0.114 |
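For reference, LPIPS between a source and an edited image is typically computed with the public `lpips` package as shown below; this is a generic illustration, not necessarily the exact evaluation setup used by the authors.

```python
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net='alex')           # AlexNet backbone, the common default
img0 = torch.rand(1, 3, 256, 256) * 2 - 1   # inputs are expected in [-1, 1]
img1 = torch.rand(1, 3, 256, 256) * 2 - 1
print(loss_fn(img0, img1).item())           # lower = more perceptually similar
```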
Q5: What if the predicted intentions and dragging movements are contradictory?
We acknowledge that the semantic intention may have a side effect if it contradicts the dragging movement. Fortunately, no contradictions occurred in our experiments, which we attribute to the powerful reasoning ability of LLMs [r8]. Even if contradictions do arise, users can reply to the LLM, which will incorporate the additional information to satisfy their needs [r9].
I thank the authors for their efforts in addressing my concerns. I tend to maintain my score as borderline accept, but as mentioned by the other reviewer, it can go either way.
This paper presents a novel framework called LucidDrag for drag-based image editing. Compared to previous methods, LucidDrag first reasons about the intention of the dragging operation using an LLM (GPT 3.5) and provides semantic guidance for the following editing process. For better image fidelity, the authors also design a GAN-based discriminator as the quality guidance, obtaining improved results.
The authors perform sufficient comparison experiments and ablation studies to support their claims and verify the efficiency of their proposed components. The authors also promise to release the code.
Strengths
- The writing is good except for some minor issues.
- The authors note that drag-based editing is inherently ill-posed, because multiple editing results may correspond to the same input image and dragging conditions. Based on this observation, the authors attempt to introduce additional prompts to provide semantic guidance for better editing results.
- The comparisons and ablation studies are sufficient to support the claims.
Weaknesses
- It's a good idea to introduce additional prompts to help the editing process. However, why use an Intention Reasoner to "guess" multiple intentions and then sample among them? Perhaps a better way would be for the users to provide their true intention, with LLMs used only to normalize the prompts? So, I'm skeptical about the utility of the Intention Reasoner.
- Some points of confusion:
- Lines 162-163: how do you get reasonable intentions taking only the generated description of the object of interest 'O', the original image caption 'C', and the drag points 'P' as input? Take Fig. 2 as an example: 'O' is "The nose of a woman", 'C' is "A woman", and 'P' is the dragging points. There is no information about the original image, so how can LLMs output intentions like "A woman looking up"? Where does the "looking up" come from?
- Line 173: should it be [?], or should the [?] on the left in Fig. 2 be [?]? Maybe the notations in Fig. 2 should be modified.
- Fig. 4: LucidDrag needs additional generated prompts as inputs. You should also show these prompts to help us understand the editing process. Besides, since the generated intentions are sampled based on confidence probabilities (Line 169), the generated results of LucidDrag should vary across runs. Showing different results may also help us understand.
Questions
Refer to the weakness.
Limitations
Yes.
Thank you for your thoughtful and constructive feedback. We are encouraged by your recognition of the value of our work and your acknowledgment that our experiments sufficiently support our claims. Below is our response addressing your concerns.
W1: It's a good idea to introduce additional prompts to help the editing process. However, why use an Intention Reasoner to "guess" multiple intentions and then sample among them? Maybe a better way would be for the users to provide their true intention, with LLMs used only to normalize the prompts?
Thank you for your suggestion. While allowing users to directly provide their true intent and using LLMs to normalize prompts is indeed feasible, especially when users can clearly articulate their needs, the intention reasoner offers several significant advantages:
(1) Handling vague requests. Users' requests often lack specificity, such as "Drag the horse's head to the top right" in Fig. 1. The intention reasoner can analyze context data to infer multiple potential intentions ("long necks" or "heads up" or "closer"), allowing users to select the most appropriate one for execution.
(2) Expressing complex needs. Articulating complex manipulations involving multiple points or objects can be challenging for users. The intention reasoner can generate precise descriptions automatically, ensuring accurate and consistent adjustments, thereby simplifying the user's task.
(3) Discovering potential needs. As shown in Fig.1, the intention reasoner can present multiple possibilities, enabling users to discover and explore various editing choices they might find beneficial, thus enhancing the overall experience.
(4) Reducing cognitive load. Many users, particularly beginners, may struggle to provide their intentions precisely. The intention reasoner can infer the potential intentions, alleviating the need for detailed instructions and significantly improving operational efficiency.
Additionally, by leveraging the reasoning ability of LLMs, we can generate a variety of reasonable intentions and produce high-quality results. Generating diverse results is a challenging task [r7]. Our method can be used to construct an image editing dataset with diverse editing outcomes, which may inspire future work.
[r7] Corso G, et al. Particle guidance: Non-IID diverse sampling with diffusion models. In ICLR, 2024.
Confusion1: How do you get reasonable intentions only taking the generated description of the object of interest 'O', the original image caption 'C', and drag points 'P' as input?
The intention reasoner can produce reasonable results because large models possess in-context learning, space understanding, and reasoning abilities [r8]. Additionally, previous work has demonstrated that LLMs can generate reliable results through these abilities [r9].
For the case in Fig. 2, we first combine the inputs with detailed task descriptions and in-context examples to ensure the rationality of the output. Then, given the positions of the source point and target point, the LLM can comprehend the dragging direction. Finally, combined with the region of interest ("the nose of a woman") and the whole-image description ("a woman"), the LLM can deduce a reasonable intention, e.g., "looking up". This result is logical because dragging the face to the upper left may be an attempt to make the person look up. LLMs can easily accomplish this kind of logical reasoning [r8, r9].
[r8] Zhang Y, et al. LLM as a mastermind: A survey of strategic reasoning with large language models. arXiv, 2024.
[r9] Lian L, et al. LLM-grounded Video Diffusion Models. In ICLR 2024.
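To make the prompt-assembly step described above concrete, here is a minimal sketch of how O, C, and P could be combined into an LLM query. The function name, field wording, coordinates, and in-context example are hypothetical; the paper's actual task description and examples are not shown in this thread.

```python
def build_reasoner_prompt(obj_desc, caption, drag_points, examples):
    """Assemble an LLM query from O (object description), C (image caption),
    and P (drag points). Wording and structure are illustrative only."""
    points = "; ".join(f"source {s} -> target {t}" for s, t in drag_points)
    return (
        "Task: infer the most likely editing intention behind a drag operation, "
        "then write a source prompt and a target prompt.\n"
        f"{examples}\n"
        f"Object of interest (O): {obj_desc}\n"
        f"Image caption (C): {caption}\n"
        f"Drag points (P): {points}\n"
        "Answer with: intention, source prompt, target prompt."
    )

# Hypothetical example mirroring Fig. 2; coordinates are made up.
prompt = build_reasoner_prompt(
    obj_desc="the nose of a woman",
    caption="a woman",
    drag_points=[((120, 180), (110, 150))],
    examples="Example: dragging a horse's head upward -> 'the horse raises its head.'",
)
print(prompt)
```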
Confusion2: Notations of and in Fig. 2 and Line 173.
Thank you for highlighting the difficulty in interpreting the notations of and . To clarify the pipeline, the input image is inverted to and the generation process starts with . In terms of value, we initialize with , i.e., . We will improve the clarity of these notations in the updated version.
Confusion3: Should demonstrate generated prompts in Fig. 4. Suggest to show different results of various generated intentions.
(1) We present these prompts in Fig. 13 in Rebuttal PDF. We will incorporate the corresponding prompts in the updated version of our paper.
(2) In Fig. 3 of the main paper, we present the results of various intentions with different editing targets. We supplement the results of various intentions with the same editing target in Fig. 13 of the Rebuttal PDF. The results show that different text intentions generated by the intention reasoner for the same editing target produce similar results with some differences in details.
Dear Reviewer 7sLK,
We have tried our best to address all the concerns and provided as much evidence as possible. May we know if our rebuttals answer all your questions? We truly appreciate it.
Best regards,
Author #9041
Thank you again for reviewing our manuscript. We have tried our best to address your questions (see our rebuttal in the top-level comment and above). Please kindly let us know if you have any follow-up questions or areas needing further clarification. Your insights are valuable to us, and we stand ready to provide any additional information that could be helpful.
The paper introduces a novel framework for semantic-aware drag-based image editing. Specifically, to address the limitations in understanding semantic intentions and generating high-quality edited images, it utilizes an intention reasoner to deduce potential editing intentions and a collaborative guidance sampling mechanism that integrates semantic guidance and quality guidance. Experimental results validate the effectiveness of the proposed method in producing semantically coherent and diverse image editing outcomes.
Strengths
1. The paper is well-organized and easy to follow.
2. The proposed "what-then-how" paradigm for drag-based editing is novel and sound.
3. The proposed collaborative guidance is also interesting and has been shown to be effective.
4. The proposed method is shown to outperform the existing methods on various editing tasks.
Weaknesses
1. It seems that the used LVLM and LLM models are not finetuned, and I am wondering how different models perform on the task. Are the confidence probabilities reliable or meaningful without any fine-tuning?
2. The efficiency of different methods should be provided for comparison.
3. The limitations are recommended to be added to the main paper.
Questions
Please refer to the weakness part.
Limitations
The limitations are discussed in the supplementary material.
Thank you for your thoughtful and constructive feedback. We are encouraged that you find our ''what-then-how'' paradigm and collaborative guidance mechanism to be novel and effective. Below are our responses to your concerns.
W1: How different LVLMs and LLMs perform on the task? Are the confidence probabilities reliable or meaningful without any fine-tuning?
(1) We conduct experiments to examine the performance of different LVLMs and LLMs in the Intention Reasoner module. Specifically, we utilize Osprey [r1] and Ferret [r2] for the LVLM and Vicuna [r3], LLama3 [r4], and GPT 3.5 [r5] for the LLM. We test various combinations, with Osprey+GPT3.5 being the default setting in our paper. As shown in Table r1, all combinations outperform the experiment without the Intention Reasoner, confirming its reliability without fine-tuning. This reliability is two-fold: the LVLMs are trained with large-scale point-level labeled data and can easily achieve point-level understanding [r1], so they can understand the user-given points without further fine-tuning. For the LLMs, state-of-the-art LLMs have been proven to possess strong spatial reasoning abilities [r6], enabling them to deduce reasonable intentions without fine-tuning.
(2) The confidence probabilities are reliable and meaningful without fine-tuning. We present a qualitative analysis in Fig. 12 of the Rebuttal PDF. As discussed above, the LLM's powerful reasoning capabilities guarantee its reliability without fine-tuning. The confidence probability reflects the quality of the LLM's output text. As shown in Fig. 12, a higher confidence probability indicates that the output intention is more reasonable, leading to better editing results.
Table r1: Results with different LVLMs and LLMs
| | w/o Intention Reasoner | Ferret+Vicuna | Ferret+LLama3 | Ferret+GPT3.5 | Osprey+Vicuna | Osprey+LLama3 | Osprey+GPT3.5 (Ours) |
|---|---|---|---|---|---|---|---|
| Mean Distance (↓) | 23.66 | 22.49 | 21.96 | 20.65 | 20.84 | 20.48 | 20.46 |
| GScore (↑) | 6.76 | 7.12 | 7.11 | 7.35 | 7.27 | 7.13 | 7.37 |
[r1] Yuan Y, et al. Osprey: Pixel understanding with visual instruction tuning. In CVPR, 2024.
[r2] You H, et al. Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704, 2023.
[r3] Chiang WL, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. In URL: https://vicuna.lmsys.org, 2023.
[r4] Touvron H, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[r5] OpenAI. Chatgpt. In URL: https://openai.com/blog/chatgpt, 2022.
[r6] Gurnee W, et al. Language models represent space and time. In ICLR, 2024.
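As a side note on what such a confidence probability could look like in practice, one common heuristic is the geometric mean of the generated tokens' probabilities. This is an assumption for illustration only; the paper's exact definition may differ.

```python
import math

def sequence_confidence(token_logprobs):
    # Geometric mean of token probabilities = exp of the mean log-probability.
    # Hypothetical heuristic; not necessarily the confidence used in the paper.
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# e.g. token log-probabilities returned by an LLM API for one generated intention
print(round(sequence_confidence([-0.2, -0.05, -0.6, -0.1]), 2))  # 0.79
```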
W2: The efficiency of different methods.
We present the efficiency of different methods in Table r2. Our method has a relatively small inference time and comparable memory requirements.
Table r2: Efficiency of different methods.
| | DragDiffusion | FreeDrag | DragonDiffusion | DiffEditor | Ours |
|---|---|---|---|---|---|
| Time (s) ↓ | 80 | 92 | 30 | 35 | 48 |
| Memory (GB) ↓ | 12.8 | 13.1 | 15.7 | 15.7 | 15.8 |
W3: The limitations are recommended to be added to the main paper.
Thanks for your valuable suggestion. We will move the limitations section from the supplementary materials to the main paper in the updated version.
Thank you for addressing the initial concerns raised. I will keep my score unchanged.
We thank all the reviewers for their time and insightful comments, which have helped improve our paper. We are encouraged that the reviewers recognized the importance of addressing the ill-posedness of dragging-based image editing, which we aim to solve (Reviewers 17V7, 6UnY). We appreciate their positive feedback on the novelty of our "what-then-how" paradigm (Reviewers ND9N, 6UnY), the effectiveness of our collaborative guidance (Reviewers ND9N, 7sLK, 17V7, 6UnY), and the sufficiency of our experiments in supporting our claims (Reviewers ND9N, 7sLK, 6UnY). We have provided detailed answers to each question below. We hope that our responses address these concerns.
Thanks for your response. The prompts in Fig. 14 and Fig. 15 of the PDF seem to be duplicated.
Thank you for your thoughtful feedback and attention to detail regarding Fig. 14 and Fig. 15.
(1) We appreciate you pointing out the duplication issue in Fig. 14. The correct prompts are as follows, and we will include them in the revised paper.
- Source: "A photo of many cups on the table, with a cup on the right." → Target: "A photo of many cups on the table, with a cup on the left."
- Source: "An astronaut plays football. The football is on his left." → Target: "An astronaut plays football. The football is on his right."
- Source: "A cup of coffee on the bottom left." → Target: "A cup of coffee on the top right."
- Source: "A photo of a mailbox." → Target: "A photo of a mailbox on top left."
- Source: "A photo of four apples." → Target: "A photo of four apples with an apple on top right."
- Source: "A donut on a square board." → Target: "A donut on the left of a square board."
(2) Regarding Fig. 15, both cases indeed use the same prompt. This example demonstrates that even with the same prompt, users can control which parts are edited by adjusting the editing area. Our approach enables semantic-aware editing within the selected area. Specifically, in the left case, the editing area includes the red bottom part and the middle part, so both parts are narrowed in the outcome. In contrast, the right case selects only the middle part for editing, so only the middle part becomes narrow.
This paper introduces an intriguing framework called LucidDrag for semantic-aware drag-based image editing. LucidDrag comprises an Intention Reasoner and a Collaborative Guidance Sampling mechanism. The Intention Reasoner leverages both an LVLM and an LLM to infer semantic intentions, which are then used to facilitate semantic-aware editing. The paper received one weak accept (Reviewer ND9N), two borderline accepts (Reviewers 6UnY and 17V7), and one borderline reject (Reviewer 7sLK).
Most reviewers acknowledged that addressing the ill-posedness of drag-based image editing is crucial and should not be ignored. The main concerns raised by the reviewers relate to 1) whether using an LLM to infer intentions during editing is effective and whether it accurately integrates the user’s true intentions (as noted by Reviewers 7sLK, 17V7, and 6UnY), and 2) the experimental evaluations.
Many of the concerns raised by Reviewers ND9N, 17V7, and 6UnY were addressed during the rebuttal; unfortunately, Reviewer 7sLK did not have a chance to respond. After carefully reviewing the paper, reviews, and rebuttal, while the overall technical novelty is limited, as no novel core modules are introduced, the framework's use of an LVLM and an LLM to infer possible intentions for drag-based image editing is inspiring. Therefore, this paper is recommended for acceptance.