PaperHub
NeurIPS 2024 · Poster · 4 reviewers
Average rating: 5.5 / 10 (min 4, max 7, std 1.1)
Individual ratings: 7, 4, 6, 5
Confidence: 3.3 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 3.0

FastDrag: Manipulate Anything in One Step

OpenReview | PDF
Submitted: 2024-05-14 · Updated: 2024-11-06

Abstract

Drag-based image editing using generative models provides precise control over image contents, enabling users to manipulate anything in an image with a few clicks. However, prevailing methods typically adopt $n$-step iterations for latent semantic optimization to achieve drag-based image editing, which is time-consuming and limits practical applications. In this paper, we introduce a novel one-step drag-based image editing method, i.e., FastDrag, to accelerate the editing process. Central to our approach is a latent warpage function (LWF), which simulates the behavior of a stretched material to adjust the location of individual pixels within the latent space. This innovation achieves one-step latent semantic optimization and hence significantly promotes editing speeds. Meanwhile, null regions emerging after applying LWF are addressed by our proposed bilateral nearest neighbor interpolation (BNNI) strategy. This strategy interpolates these regions using similar features from neighboring areas, thus enhancing semantic integrity. Additionally, a consistency-preserving strategy is introduced to maintain the consistency between the edited and original images by adopting semantic information from the original image, saved as key and value pairs in self-attention module during diffusion inversion, to guide the diffusion sampling. Our FastDrag is validated on the DragBench dataset, demonstrating substantial improvements in processing time over existing methods, while achieving enhanced editing performance.
Keywords
Diffusion Model; Drag-based Image Editing

Reviews and Discussion

Review (Rating: 7)

This paper presents a method that enables fast drag-based image editing using diffusion models. The proposed method uses a latent warpage function to obtain the dragged latent representation. The additional nearest neighbor interpolation and content-preserving strategy further improve the result.

Strengths

The strengths of this paper are:

  • The proposed method seems straightforward, and the results are convincing.
  • A large speed gain compared to previous methods.
  • The evaluations support the paper’s claim.

Weaknesses

The weaknesses of this paper are:

  • I think the discussion of limitation is too shallow. I recommend providing more failing cases and discussing the potential directions to enhance the results.

Questions

  • I wonder why the interpolation is called “Bilateral” nearest neighbor interpolation. To me, the term “Bilateral” should be used for a scheme that also considers the values, whereas the proposed interpolation scheme just finds the nearest non-null value. This is not critical, but I think it is misleading.
  • It is unclear to me what the brighter region around the drag arrow means. Is that a mask? Is that region/mask included and used in the drag instructions?
  • I am confused about Figure 8. What is the desired effect and result of the drag edit? Semantically, the user is moving “the dog’s” hand, not “the cat’s”. However, the result seems to be a cat’s hand for all the different steps. Is this a limitation of all drag-based methods? If so, I think it is worth discussing.
  • I wonder if there are metrics or evaluation processes that can evaluate content preservation quantitatively. For image editing, I think this is a very important factor, and the current paper only provides a couple of images as evidence.

Limitations

As I mentioned above, I think the limitation discussion is not enough (I also raised another potential limitation). Moreover, I strongly recommend including the limitation discussion in the main paper, not supplemental material.

Author Response

We would like to thank you for the positive feedback, helpful comments, and the support of our work. Following are our responses to each individual comment (which are highlighted in italics).

Response for Weaknesses (RfW)

W1: I think the discussion of limitation is too shallow. I recommend providing more failing cases and discussing the potential directions to enhance the results.

RfW1: Thank you for the constructive suggestion. We have provided more failing cases to illustrate the limitations of our study, as shown in Fig. 1 and Fig. 2 of the PDF attached to our author rebuttal. We have also provided an in-depth discussion of these limitations and potential future directions; please see the detailed analysis in the ''Limitation'' section of the General Response to reviewers.

Response for Questions

Q1: I wonder why the interpolation is called ''Bilateral'' nearest neighbor interpolation. To me, the ''Bilateral'' term should be used for a scheme that also considers the values, whereas the proposed interpolation scheme just finds the nearest non-null value. This is not critical, but I think it is misleading.

We apologize for any confusion caused by our terminology. The term ''Bilateral'' was chosen to emphasize that the interpolation considers nearest neighbors in two directions (i.e., both the X and Y axes), highlighting that the interpolation process involves adjacent non-null data across two dimensions. We will explain this terminology in the camera-ready version for better clarity.
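For illustration, below is a minimal NumPy sketch of this two-axis idea: each null latent position is filled by averaging the nearest non-null neighbors found along the X and Y axes. The function name, tensor layout, and the simple averaging are assumptions for this sketch, not the paper's exact implementation.

```python
import numpy as np

def bnni_fill(latent, null_mask):
    """Fill null latent positions from the nearest non-null neighbors along both axes.

    latent:    (H, W, C) latent feature map
    null_mask: (H, W) boolean array, True where the feature is null
    Simplified sketch; the paper's BNNI may weight or select neighbors differently.
    """
    H, W, _ = latent.shape
    out = latent.copy()
    for y, x in zip(*np.where(null_mask)):
        candidates = []
        # nearest non-null neighbor to the left and right (X axis)
        for xs in (range(x - 1, -1, -1), range(x + 1, W)):
            for xi in xs:
                if not null_mask[y, xi]:
                    candidates.append(latent[y, xi])
                    break
        # nearest non-null neighbor above and below (Y axis)
        for ys in (range(y - 1, -1, -1), range(y + 1, H)):
            for yi in ys:
                if not null_mask[yi, x]:
                    candidates.append(latent[yi, x])
                    break
        if candidates:
            out[y, x] = np.mean(candidates, axis=0)  # average features from both dimensions
    return out
```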

Q2: It is unclear to me what the brighter region around the drag arrow means. Is that a mask? Are those region/mask included and used in drag instructions?

Yes, you are correct: the brighter region is a mask, indicating the image area to be edited. It is used as a drag instruction together with the drag arrows, which is widely adopted in drag-based editing [4, 18, 19, 27].

We will explain this in the camera-ready version for better clarity.

Q3: I am confused about Fig. 8. What is the desired effect and result of the drag edit? Semantically, the user is moving ''the dog's'' hand, not ''the cat's''. However, the result seems to be a cat's hand for all the different steps. Is this a limitation of all drag-based methods? If so, I think it is worth discussing.

We apologize for any confusion regarding Fig. 8. This case was randomly selected for an ablation study on the number of inversion steps used in diffusion inversion, in terms of drag effect. It is not intended to demonstrate a specific editing result. To avoid such confusion, we will replace Fig. 8 with a more appropriate case in the camera-ready version.

Regarding your question, the observed phenomenon is common for diffusion models. In this case, when ''the hand'' is moved onto ''the cat'', it semantically associates with ''the cat'', resulting in the generation of ''the cat's hand''. This is because the U-Net architecture used in diffusion models contains numerous CNN layers, which fuse features of the cat and the hand in the latent space when their positions are close together. For generative artificial intelligence, the ability to generate images with contextual semantic association is actually considered a strength of diffusion models, such as Stable Diffusion.

Q4: I wonder if there are metrics or evaluation processes that can evaluate content preservation quantitatively. For image editing, I think this is a very important factor, and the current paper only provides a couple of images as evidence.

Indeed, we have provided evaluation results with regard to content preservation in terms of the ''1-LPIPS'' metric, as shown in Table 1 of our paper. The ''1-LPIPS'' metric is widely used for image content consistency evaluation in drag editing [4, 14, 15, 27]. Our experiments are conducted on the DragBench dataset, and the results demonstrate the effectiveness of our method, as explained in the General Response (i.e., evaluation metrics and statistical rigor) to reviewers.

Response for Limitations

As I mentioned above, I think the limitation discussion is not enough (I also raised another potential limitation). Moreover, I strongly recommend including the limitation discussion in the main paper, not supplemental material.

Thank you for the constructive suggestions. Due to the page limit, we did not include the limitation discussion in the main paper of the submission. However, we acknowledge the importance of this discussion and will try our best to enhance it and include it in the main paper of the camera-ready version.

Comment

Thx authors for the detailed rebuttal. Most of my concerns are addressed. I am still in favor of accepting this paper, as long as the authors revise the paper according to their rebuttal.

Comment

Thank you for the constructive suggestions and supportive comments.

Review (Rating: 4)

The paper introduces "FastDrag," a novel one-step drag-based image editing method that significantly accelerates the editing process compared to existing n-step iterative approaches. The core of FastDrag is the Latent Warpage Function (LWF), which simulates the behavior of stretched material to adjust pixel locations within the latent space, enabling single-step latent semantic optimization. This results in substantially faster editing speeds while maintaining quality and precision.

The paper's contributions include:

  1. The proposal of FastDrag, a diffusion-based image editing approach that employs LWF for one-step semantic optimization, greatly enhancing editing efficiency.
  2. The introduction of Bilateral Nearest Neighbor Interpolation (BNNI), a new interpolation method that addresses null regions and enhances the semantic integrity of edited content.
  3. The incorporation of a consistency-preserving strategy that maintains image consistency during the editing process by using original image information to guide diffusion sampling.

Experiments on the DragBench dataset demonstrate FastDrag's superiority in processing time, being nearly 700% faster than the fastest existing method and 2800% faster than the typical baseline method, with comparable editing performance. The paper also includes rigorous ablation studies to validate the strategies used in FastDrag. The authors aim to continue refining and expanding their approach for enhanced capabilities and applications in the future.

Strengths

Originality: The paper introduces "FastDrag," a novel one-step drag-based image editing method that significantly accelerates the editing process compared to conventional n-step iterative methods. By proposing a Latent Warpage Function (LWF), the authors innovate by simulating the behavior of a stretched material to instantly adjust pixel locations within the latent space, marking a departure from traditional iterative latent optimization. Additionally, the Bilateral Nearest Neighbor Interpolation (BNNI) strategy for handling null regions and the consistency-preserving mechanism that utilizes self-attention key-value pairs add further novelty to the approach. The integration of these components into a diffusion model framework for image editing showcases a creative combination of existing techniques to solve a known problem more efficiently.

Quality: The quality of the work is demonstrated through rigorous experimentation on the DragBench dataset, which validates FastDrag's efficiency and performance. The paper shows that FastDrag is not only faster but also maintains editing quality, being nearly 700% faster than the fastest existing method and 2800% faster than typical baseline methods. A series of ablation studies further reinforces the effectiveness of the individual components of the proposed method. The computational efficiency and scalability implications are considered, even though specific computational requirements are not fully detailed in the abstracted content.

Clarity: The paper is structured coherently, presenting the FastDrag method systematically across its phases: diffusion-based editing, one-step warpage optimization, BNNI for semantic integrity, and the consistency-preserving strategy. The abstract and introduction effectively outline the contributions, setting clear expectations for the paper's content. Theoretical assumptions and limitations are openly discussed, complying with the NeurIPS Code of Ethics. The use of visual aids like Fig. 2 likely contributes to illustrating the method's workflow, although the figure itself is not accessible in the abstracted content.

Significance: The significance of FastDrag lies in its potential to democratize and streamline image editing tasks, enhancing productivity in various domains such as art, design, education, and training. By offering an intuitive and efficient tool, it empowers a broader range of users to engage in creative image manipulation without requiring extensive technical skills. The societal impact discussion highlights both the positive outcomes of fostering creativity and the potential negative implications like misuse for disinformation or privacy breaches, showing a balanced perspective. The authors' consideration of mitigation strategies underscores a proactive approach to responsible innovation.

Overall, FastDrag represents a marked advancement in image editing technology, combining originality in method design, high-quality empirical validation, clarity in presentation, and significant implications for both the research community and practical applications. Its contribution to reducing latency in image editing workflows can have far-reaching effects on digital content creation workflows and user experiences.

Weaknesses

The paper titled "FastDrag: Manipulate Anything in One Step" presents a novel one-step drag-based image editing method that aims to accelerate the editing process compared to existing n-step iterative optimization methods. While the paper makes significant contributions to the field of image editing, there are several areas where the work could be improved:

  1. Theoretical Depth: The paper does not appear to include theoretical results or proofs to support the claims made. While it is stated that the paper does not involve theoretical results, providing some theoretical grounding could strengthen the contribution, such as explaining the mathematical properties of the proposed Latent Warpage Function (LWF) or the bilateral nearest neighbor interpolation (BNNI) strategy.

  2. Experimental Design: While the paper includes qualitative and quantitative evaluations, there could be a more extensive range of experiments to further validate the robustness of the method. For instance, testing the method on a broader variety of image types and editing tasks could provide a more comprehensive understanding of its capabilities and limitations.

  3. Statistical Rigor: The paper does not report error bars or other statistical measures that would provide insight into the variability and reliability of the results. Including such measures would enhance the credibility of the experimental findings.

  4. Comparison with State-of-the-Art: The paper compares its method with existing techniques but could benefit from a more detailed analysis of where FastDrag stands in terms of the trade-offs between speed and quality. A deeper dive into how the method's performance compares across different types of image content and editing tasks would be valuable.

  5. User Study: The paper lacks a user study that could provide insights into the usability and user experience of the FastDrag method. Including feedback from potential end-users, such as artists or designers, could highlight practical aspects that may not be evident from quantitative metrics alone.

  6. Limitations Discussion: While the paper does discuss some limitations, it could be more explicit about the scenarios where the method may not perform as expected. Providing more concrete examples and discussing potential workarounds or future work to address these issues would be beneficial.

  7. Reproducibility: The paper claims that the results are reproducible and that the code is included in the supplementary materials. However, it would be helpful to provide more detailed instructions on how to set up the environment and run the experiments, ensuring that other researchers can easily replicate the results.

  8. Ethical Considerations: The paper touches on societal impacts but could expand on the ethical considerations of using such technology, especially regarding the potential for misuse in spreading misinformation or altering reality.

  9. Technical Details: The paper could benefit from a more thorough explanation of the technical details, such as the specific design choices in the U-Net architecture and how the latent consistency model (LCM) is integrated into the FastDrag framework.

  10. Long-Term Performance: The paper could address how the method performs over time, especially considering the potential for the model to degrade in quality as more edits are made to the same image.

By addressing these points, the authors could strengthen the paper's contributions and provide a more comprehensive understanding of the FastDrag method's capabilities and potential areas for future research.

Questions

Questions for the Authors:

  1. Latent Warpage Function (LWF) Details: Could you provide more technical details about the design and implementation of the Latent Warpage Function? Specifically, how is the stretching behavior modeled, and what criteria determine the degree and direction of pixel adjustments within the latent space?

  2. Bilateral Nearest Neighbor Interpolation (BNNI) Robustness: How does the BNNI strategy cope with complex scenes or objects with varied textures? Are there any specific cases where BNNI might struggle, and how do these instances affect the final image quality?

  3. Consistency-Preserving Strategy and Diffusion Sampling: Can you elaborate on the process of adopting semantic information from the original image during diffusion inversion, and how exactly these key-value pairs guide the sampling process to maintain consistency? Are there quantitative measures to evaluate the consistency preservation?

  4. Scalability with Image Complexity and Dataset Size: Your work demonstrates impressive speed improvements, but how does FastDrag's performance scale with the complexity of the image being edited or the size of the underlying dataset? Are there any benchmarks or theoretical analyses to predict computational demands for larger datasets or more intricate images?

  5. Addressing Negative Societal Implications: Given the potential misuse of FastDrag for generating disinformation or breaching privacy, have you explored or considered integrating any technical safeguards directly into the tool to mitigate these risks? For instance, adding watermarks, traceability features, or built-in detection mechanisms for manipulated content.

  6. Limitations and Future Work: In Appendix D, you discuss the limitations of your method. Could you elaborate on the most critical limitations that future research should address to improve FastDrag, and what are your thoughts on potential directions for overcoming these limitations?

  7. Error Analysis and Sensitivity to Initialization: How sensitive is FastDrag to the initial state of the latent space? Have you conducted experiments to analyze the variability in output quality based on different starting points, and if so, could you share insights on the stability of the method?

Suggestions for Improvement:

  1. Quantitative Evaluation of Semantic Integrity: Consider including a quantitative evaluation of the semantic integrity of edited images, perhaps through metrics that assess feature similarity or structure preservation, to complement qualitative assessments.

  2. User Study or Survey: Conduct a user study or survey to gather feedback from artists, designers, and other potential users to understand the real-world usability and satisfaction with FastDrag. This could validate the claimed benefits of intuitiveness and efficiency.

  3. Expand on Mitigation Strategies: Further elaborate on potential mitigation strategies for the negative societal impacts mentioned. Discuss how these strategies could be practically implemented, and if possible, provide examples of successful implementations in similar contexts.

  4. Code Availability and Reproducibility: Since the computational efficiency and resource requirements are crucial aspects of FastDrag, consider making the code publicly available with clear documentation to facilitate reproduction of your results and encourage further development by the community.

  5. Additional Experimentation on Diverse Datasets: Extend the evaluation to include a variety of datasets covering different image types and complexities to strengthen the generalizability of FastDrag's performance claims.

Limitations

The authors have addressed the limitations of their work in the "Limitations" section of the paper, which is in line with the NeurIPS Paper Checklist guidelines. They are encouraged to explicitly state any assumptions made and discuss how these assumptions could be violated in practical scenarios, as well as the potential implications of such violations. For instance, if the FastDrag method is highly dependent on certain image characteristics, such as resolution or lighting, the authors should clarify these dependencies and discuss the possible decrease in performance under varying conditions.

Regarding potential negative societal impacts, the authors should discuss if there are any direct paths from their research to unintended or malicious uses, as outlined in the guidelines. Given that FastDrag is an image editing tool that could potentially be misused for creating disinformation or deepfakes, it is essential for the authors to acknowledge this possibility and outline strategies to mitigate such risks. This could include discussing ethical release strategies for the model, such as gated access, user guidelines, or incorporating detection mechanisms to identify manipulated content.

If the authors have not done so already, they should also reflect on the broader ethical implications of their work, such as fairness considerations (ensuring the technology doesn't unfairly impact specific groups), privacy concerns (ensuring personal data is protected), and security considerations (preventing unauthorized use). The authors should also consider including a discussion on how the technology could be used as intended but still give incorrect results, or how it might be intentionally or unintentionally misused, along with possible mitigations.

In summary, if the authors have not adequately addressed these aspects, they should revise their paper to include a clear limitations section that outlines any assumptions, constraints, or areas where the method may fall short. Additionally, they should discuss potential negative societal impacts and propose strategies to mitigate these risks. The NeurIPS guidelines emphasize that authors should be transparent about limitations and negative impacts, and doing so will not negatively affect their review but instead demonstrates responsible research conduct.

Author Response

We would like to thank you for the positive feedback, helpful comments, and high praise and recognition of our work. Below are our responses to each individual comment (highlighted in italics).

Response for Weaknesses

W1: Theoretical Depth

Our one-step optimization strategy is developed by simulating strain patterns in stretched materials, inspired by [20]. The core of this strategy is the LWF, which is rigorously derived in Equations (3)-(10) of our paper.

Furthermore, our main effort in this study is to significantly reduce the editing time, and the experiments we conducted fully verify the contribution we claim. Specifically, the results in Table 1 of our paper show FastDrag's ultra-fast editing speed compared to SOTA methods.

Additionally, the principle of BNNI is based on the most fundamental interpolation principles in the field of image processing.

W2, 3 and 4: Experimental Design, Statistical Rigor and Comparison with State-of-the-Art

The evaluation methods on the DragBench [27] dataset are widely used in the field of drag-based editing, following the state-of-the-art methods [4, 14, 15, 27]. The results presented are averages across DragBench, which includes diverse image categories and drag tasks. Thus, the results are sufficient to demonstrate FastDrag's superiority and robustness.

Additionally, for Statistical Rigor, please refer to the ''Statistical Rigor'' of the General Response.

Regarding the trade-offs between speed and quality, please see the response RfW3 to Reviewer eL9M.

W5: User Study

Thank you for your valuable suggestions. We plan to conduct a user study through surveys in the released project demo to ensure the reliability of the results. The outcomes of this study will be published following the demo release.

W6: Limitations Discussion

Regarding Limitations, please refer to ''Limitation'' in General Response to reviewers.

W7: Detailed Code Instructions

Thanks for your valuable suggestion. We will improve it and provide more detailed tutorials in the released code.

W8: Ethical Considerations

It is a common concern with generative methods. We will enforce strict open-source licensing in the publicly released code to limit unethical use.

W9: Technical Details of U-Net and LCM

Our diffusion structure's implementation entirely follows the mainstream baseline DragDiffusion [27]. Specifically, the U-Net used in our model is widely used in image generation [4, 14, 15, 27, 31]. As stated in Appendix B, the U-Net structure is adapted with LCM-distilled weights from LDM (i.e., Stable Diffusion 1.5). We will further clarify this in the camera-ready version.
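For readers unfamiliar with this setup, below is a minimal diffusers sketch of one possible way to pair an SD 1.5 backbone with LCM-style weights, here via the publicly available LCM-LoRA checkpoint. The specific checkpoints, diffusers version, and loading path used by FastDrag are not stated in this rebuttal, so treat the identifiers below as assumptions for illustration only.

```python
import torch
from diffusers import StableDiffusionPipeline, LCMScheduler

# Assumed setup: SD 1.5 backbone + LCM scheduler + LCM-LoRA weights.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")  # LCM weights for SD 1.5

# With LCM weights, only a handful of sampling steps are needed.
image = pipe("a photo of a cat", num_inference_steps=4, guidance_scale=1.0).images[0]
```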

W10: Long-Term Performance

Thank you for your interesting idea. We conducted experiments and found that repeated editing using diffusion models can degrade image quality over time due to accumulated errors. This may be mitigated by refined training techniques or corrective algorithms. Due to PDF space constraints, we are unable to include the figure in this rebuttal but will include related discussions in the camera-ready version.

Response for Questions

Q1: Latent Warpage Function (LWF) Details

In Section 3.2.1, we provide a detailed explanation of the derivation process for the Latent Warpage Function (LWF).

  1. We model the stretching behavior by Equation (3), which normalizes and aggregates the ''component warpage vectors'' $\overrightarrow{p_j p_j^{i\ast}}$ caused by multiple drag instructions into a single warpage vector for subsequent latent optimization, as shown in Fig. 3. The $\overrightarrow{p_j p_j^{i\ast}}$ are modeled by Equation (5), with the stretch factor modeled by Equation (6), designed to simulate the behavior of stretched materials.

  2. The degree of pixel adjustments is inversely proportional to the distance of the pixel from the handle point $s_i$.

  3. The direction of pixel adjustments is given by the sum of the $\overrightarrow{p_j p_j^{i\ast}}$, each of which points in the same direction as the corresponding drag instruction $\overrightarrow{s_i e_i}$ (see the sketch below).
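A minimal Python sketch of this aggregation is given here for intuition. The simple inverse-distance stretch factor and the plain averaging are assumptions chosen to mirror points 2 and 3 above; the actual forms follow Equations (3)-(10) of the paper.

```python
import numpy as np

def warpage_vector(p, handles, targets, eps=1e-6):
    """Aggregate component warpage vectors for one latent pixel p.

    p:       (2,) pixel coordinates in latent space
    handles: (N, 2) handle points s_i
    targets: (N, 2) target points e_i
    Illustrative only: the paper's stretch factor and normalization may differ.
    """
    total = np.zeros(2)
    for s, e in zip(handles, targets):
        drag = e - s                     # drag direction s_i -> e_i
        dist = np.linalg.norm(p - s)     # distance of p from the handle point
        stretch = 1.0 / (dist + eps)     # magnitude decays with distance (assumed form)
        total += stretch * drag          # component warpage vector
    return total / max(len(handles), 1)  # simple average as the aggregation (assumed)

# Example: one pixel, two drag instructions
p = np.array([10.0, 12.0])
handles = np.array([[8.0, 12.0], [20.0, 5.0]])
targets = np.array([[14.0, 12.0], [22.0, 5.0]])
print(warpage_vector(p, handles, targets))
```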

Q2: BNNI in Complex Scenes

Thank you for your valuable opinion. BNNI may struggle with cases involving detailed textures, as discussed in our response RfW1 to Reviewer eL9M. However, BNNI effectively handles most cases. Results on the DragBench dataset demonstrate its comparability with SOTA methods. We will try to improve it in future work.

Q3: Consistency-Preserving Strategy Details

Key-value pairs capture semantic information through the attention mechanism during diffusion inversion, which has been demonstrated by many studies [2, 18, 19]. Via the cross-attention mechanism at each step, key-value pairs containing the original image's semantics guide the sampling process to maintain consistency. We use the ''1-LPIPS'' metric to evaluate consistency preservation. Please refer to ''Evaluation Metrics'' in the General Response.
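Conceptually, this can be pictured as caching the attention keys/values during inversion and injecting them during sampling. The sketch below is an illustrative simplification; how FastDrag hooks the U-Net attention layers and at which steps or layers it substitutes K/V is described in the paper, not here.

```python
import torch

class KVConsistency:
    """Cache attention keys/values during inversion and reuse them during sampling."""

    def __init__(self):
        self.cache = {}          # (layer_name, step) -> (key, value)
        self.mode = "inversion"  # switch to "sampling" after inversion

    def __call__(self, layer_name, step, query, key, value):
        if self.mode == "inversion":
            self.cache[(layer_name, step)] = (key.detach(), value.detach())
        elif (layer_name, step) in self.cache:
            # inject original-image semantics saved during inversion
            key, value = self.cache[(layer_name, step)]
        attn = torch.softmax(
            query @ key.transpose(-1, -2) / key.shape[-1] ** 0.5, dim=-1
        )
        return attn @ value
```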

Q4: Scalability with Image Complexity and Dataset Size

There is no existing drag-based image editing benchmark focused on intricate images or larger datasets. Although DragBench is not as large as some traditional datasets, it is the most complex dataset in its field, featuring more classes and a mix of real and generated images compared to the datasets used in [14, 21, 27], which contain fewer images and classes or lack key drag instructions, such as masks. It also supports our primary contribution of improving editing speed.

Q5: Negative Societal Implications

Thanks for your valuable suggestion. Regarding this issue, please see the response in RfW8.

Q6: Limitations and Future Work

Regarding Limitations and Future Work, please refer to ''Limitation'' in General Response to reviewers.

Q7: Sensitivity to Initialization

FastDrag uses the DDIM inversion strategy to obtain the initial state of the latent space, the same as other drag-based editing methods [14, 15, 31]. Since our latent optimization is decoupled from the diffusion inversion and sampling process, FastDrag is not sensitive to the initial state of the latent space.

Review (Rating: 6)

This paper presents a drag-based image editing method that uses the Latent Warpage Function to optimize pixel adjustments in a single step, which is an improvement over previous iterative methods. This approach simulates a stretched material in the latent space to allow for fast and accurate pixel adjustments. It also combines bilateral nearest-neighbor interpolation to handle null regions and a consistency-preserving strategy, involving key and value pairs in self-attention during the inversion process, to maintain semantic consistency and coherence.

Strengths

The proposed method reduces computational time by performing one-step optimization, which makes the process faster and more efficient. Additionally, it includes clever tricks using bilateral nearest neighbor interpolation to fix empty areas. Preserving keys and values during inversion also appears to help maintain semantic coherence.

Weaknesses

The paper needs to include more comprehensive evaluation metrics beyond visual inspection. The chosen metrics are unreliable, and the evaluation seems to be on a very small scale. Moreover, standard errors are missing, and it is unclear whether the results are cherry-picked or randomly chosen, as the numbers on the proposed metrics look very similar. To improve the overall evaluation of editing accuracy and quality, using interest points and keypoints such as SIFT, SuperPoint, or DUST3R to verify whether features at desired locations match the correct points in the edited image would be beneficial. Furthermore, there is a need to explore the effect of the edits on "other" parts of the image to understand potential unintended alterations due to the stochasticity in diffusion models. The edited images appear overly smooth and lose finer details, which may impact the overall realism.

Questions

  1. Does the reported computational time include the inversion process?
  2. What happens if you integrate DragDiffusion with a single-step diffusion model? How does it compare with FastDrag?
  3. Why does iterative drag diffusion suffer compared to the one-step approach?
  4. How does FastDrag affect other parts of the image during edits?
  5. Can you perform multi-point drag editing with FastDrag?

Limitations

Yes in the appendix

Author Response

We would like to thank you for the positive feedback and helpful comments. For conciseness, this rebuttal addresses your main comments and suggestions. Following are our responses to each individual comment (which are highlighted in italics).

Response for Weaknesses (RfW)

W1: ... needs to include more comprehensive evaluation metrics ...

RfW1: Regarding evaluation metrics, following the SOTA methods [14, 15, 27], we employ ''MD'' (Mean Distance [22]) and ''IF'' (i.e., ''1-LPIPS'' [10]) as evaluation metrics for a fair comparison. The metrics suggested by the Reviewer, such as SIFT, are primarily used for feature point detection and matching. These metrics may not be appropriate for drag-based editing due to semantic changes within the editing region, because drag editing requires measuring the similarity between the edited result and the desired semantics, rather than comparing a ground truth against the edited result.

Due to space limitations, please refer to ''Evaluation Metrics'' in the General Response to reviewers, where we provide a more detailed analysis and explanation.

W2: ... missing standard errors, and it is unclear if the results are ...

RfW2: Our method is evaluated on the widely used DragBench dataset [27] for drag-based editing, following the state-of-the-art methods [4, 14, 15, 27] for a fair comparison. The results presented are averages across DragBench, which includes diverse image categories and drag tasks; they are not cherry-picked or randomly chosen. Please refer to ''Evaluation Metrics'' in the General Response to reviewers for further explanation.

Regarding standard errors, to address your concern, we repeated our experiment 10 times under the same experimental settings. We observed that the variances of the performance metrics obtained from the 10 realizations of our FastDrag are MD (0.000404), 1-LPIPS (9.44E-11), and Time (0.018), all of which fall within a reasonable range.

W3: ... effect of the edits on "other" parts of the image ...

RfW3: There are almost no changes outside the masked editing region using our method, because our approach optimizes the latent strictly within the masked region, as described by Equations (8)-(9) in our paper. We also employ a consistency-preserving strategy to maintain the consistency of image content outside the mask region. Note that this is almost the same for other drag editing methods, as they also conduct latent optimization within the masked region and adopt different strategies to maintain consistency, such as LoRA [14, 15, 27, 31].
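The masked update can be summarized by a one-line blend; the sketch below is a simplified illustration (tensor shapes and function name are assumptions), while the exact formulation follows Equations (8)-(9) of the paper.

```python
import torch

def apply_edit_within_mask(original_latent, warped_latent, mask):
    """Keep the latent unchanged outside the user-provided mask.

    original_latent, warped_latent: (C, H, W) tensors
    mask: (1, H, W) tensor in {0, 1}, 1 inside the editable region
    """
    return mask * warped_latent + (1.0 - mask) * original_latent
```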

W4: ... smooth and lose finer details ...

RfW4: The limitation of smoothness and loss of finer details is common across diffusion-based drag methods and is inherent to the diffusion models employed, such as LDM and LCM. This issue arises from model approximations, randomness, and potential computational errors, which can result in imperfect symmetry between the inversion and sampling processes, so that the generated image may not be exactly the same as the original [28].

For instance, DragNoise [15] and DragDiffusion [27] introduce textures not present in the original images, affecting fidelity, as illustrated in Fig. 6 of the paper. Although our method may also lose some finer details, it outperforms others in overall task execution, especially in editing speed. We will discuss this limitation in the camera-ready version, and future research will be conducted to mitigate this effect and enhance the editing performance.

Response for Questions

Q1: Does the reported computational time include the inversion process?

Yes, the reported computational times include the inversion process, sampling process, and the time for latent optimization.

Q2: What happens if you integrate DragDiffusion with a single-step diffusion model? How does it compare with Fastdrag?

When integrating DragDiffusion with a single-step diffusion model, the editing time is still much longer than with FastDrag. For DragDiffusion and FastDrag under diffusion steps of 1, 20, and 50, we measured the time required for inversion, sampling, and latent optimization, respectively. The results in Fig. 3 of the attached PDF show that even with a single diffusion step (i.e., diffusion step set to 1), DragDiffusion still requires significantly more time (20.7 seconds) than FastDrag (2.88 seconds).

Q3: Why does iterative drag diffusion suffer compared to the one-step approach?

These methods require $n$-step iterations to achieve semantic optimization, with each step optimizing semantics within a small editing area of the image; thus they need $n$ small-scale, short-distance optimizations to achieve the overall latent optimization, and a large amount of time to perform the $n$ iterations. In contrast, our method only requires a single short computation on the latent to achieve the semantic optimization, thereby significantly reducing editing time.

Q4: How does FastDrag affect other parts of the image during edits?

During the editing process, other parts of the image outside the mask remain almost unchanged, as explained in RfW3 to the reviewer.

Q5: Can you perform multi-point drag editing with FastDrag?

Yes, our method can perform multi-point drag editing, which has been demonstrated in Fig. 6 (row 4) of our paper and Fig. 13 (row 6) in Appendix C.

We will provide further clarifications and explanations regarding RfW2 and Q3 in the camera-ready version of our paper.

Comment

Thank you for the thorough response. Many of my concerns have been addressed, but I'm still unsure if the metrics used for evaluation are suitable. Perhaps this is something to consider for future work. Overall, the results seem promising and would be of interest to the broader community. I am willing to change my rating to a weak accept.

Comment

Thank you for the supportive comments and the constructive suggestions for our future study.

Review (Rating: 5)

This paper introduces a new one-step drag-based image editing method that significantly accelerates the editing process using a latent warpage function (LWF). It also employs a BNNI strategy to handle null regions and a consistency-preserving strategy to maintain the integrity of the edited image. Experimental results demonstrate FastDrag's fast speed and strong performance compared to existing methods.

Strengths

  1. FastDrag is easy to use for editing and has a fast editing speed.
  2. BNNI strategy addresses the issue of null regions, maintaining semantic integrity and quality.
  3. It also provides spatial control over specific regions of the image, enabling detailed drag editing.

Weaknesses

  1. If the drag distance is long, will BNNI still succeed in maintaining high semantic quality? How about the editing speed and complexity for long-distance dragging and latent relocation?
  2. There should also be some failing examples to better illustrate the proposed method.
  3. I am curious whether using a better base model can achieve better editing results, and whether there is a trade-off between editing time and editing performance.
  4. More recent works [1] should be included for comparison.

[1] EasyDrag: Efficient Point-based Manipulation on Diffusion Models, CVPR 2024

Questions

See the strengths and weaknesses

Limitations

See the strengths and weaknesses

Author Response

We would like to thank the reviewer for the positive feedback and helpful comments. Following are our responses to each individual comment (which are highlighted in italics).

Responses for Weaknesses (RfW):

W1: If the drag distance is long, will BNNI still succeed in maintaining high semantic quality? How about the editing speed and complexity for long-distance dragging and latent relocation?

RfW1: For typical drag tasks, dragging is usually performed over short distances. For long-distance dragging, which is generally an object-moving task, as illustrated in Figure 13 (row 3) in the Appendix, we do not need to employ the BNNI strategy. Instead, we can fill the semantics around the target location, or from a manually designated area of the image, into the moved object's original location [18, 19], as described in Appendix B of our paper. We will further clarify this in the camera-ready version.

However, for extremely long-distance drag editing, our method may lose some details of the object; please refer to the limitation regarding ''Extremely Long-distance Drag Editing'' in our General Response to reviewers, where this is explained and discussed.

Nonetheless, long-distance editing does not increase the editing time or complexity of our method. Our method achieves latent optimization with one-step warpage optimization using Equations (3)-(10), which is not affected by the editing distance.

W2: There should also be some failed examples to better illustrate the proposed method.

RfW2: Thank you for your valuable suggestion. We have included failed examples to illustrate the limitations of our method for better understanding, as shown in Figures 1 and 2 of the attached PDF with our author rebuttal. Please refer to the limitation discussion in our general response to reviewers.

W3: I am curious whether using a better base model can achieve better editing results, and whether there is a trade-off between editing time and editing performance.

RfW3: Regarding a better base model, it may potentially enhance editing results. Since the compared state-of-the-art methods all utilize SD 1.5 for drag tasks, our experimental results are solely based on SD 1.5. Theoretically, our one-step optimization method is independent of the base model. Therefore, FastDrag can perform drag tasks based on more advanced models such as SDXL or SDXL Turbo, and its editing performance should vary with the base model used. However, integrating more advanced models like SDXL would require substantial code-base restructuring (we use a relatively early version of diffusers); due to time constraints in the rebuttal period, we were not able to implement this in time. We will support more base models in the future.

Regarding the trade-off between editing time and performance, there should be such a trade-off for $n$-step optimization methods, as they all require $n$-step iterative optimization to achieve the desired editing performance. Although these methods can reduce editing time by decreasing the number of iterations, this typically degrades performance, since fewer iterations lead to insufficient latent optimization. Thus there is a trade-off for $n$-step methods.

In contrast, FastDrag leverages one-step warpage optimization to achieve latent optimization via Equations (3)-(10), and the editing time is not influenced by the editing task or image. Therefore, our method does not have a trade-off between editing time and performance.

W4: More recent works [ref1] should be included for comparison.

RfW4: Thank you for the suggestion. The study [ref1] had not been published or posted as a preprint when we submitted FastDrag, so we did not compare against it. We will discuss and compare it in our camera-ready version.

[ref1] EasyDrag: Efficient Point-based Manipulation on Diffusion Models, CVPR 2024

Author Response

General Response to Reviewers

We would like to thank the reviewers for the positive feedback and valuable comments. We are delighted that the reviewers found our paper well written, the presentation clear, and the drag-based editing time ultra-short compared to state-of-the-art (SOTA) methods. For conciseness, this rebuttal addresses the reviewers' main concerns and suggestions; for more detailed questions and answers, please refer to the individual responses to each reviewer.

1. Evaluation Metrics (To Reviewers YBFQ, hGqW, Sik5)

In our study, we employ the performance metrics widely used in the field of drag-based editing, i.e., ''MD'' (Mean Distance [22]) and ''IF'' (i.e., ''1-LPIPS'' [10]), following the SOTA methods [4, 14, 15, 27]. MD assesses how well the approach moves the semantic contents to the target points. 1-LPIPS quantifies the similarity between the original and edited images (i.e., consistency), as detailed in Section 4.2 of our paper.
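For reference, the IF score can be computed with the lpips package; the minimal sketch below assumes a single image pair given as tensors in [-1, 1] with shape (1, 3, H, W), and the exact preprocessing used in our experiments may differ.

```python
import torch
import lpips

# "IF" consistency score used in the paper: 1 - LPIPS(original, edited).
loss_fn = lpips.LPIPS(net="alex")

def image_fidelity(original, edited):
    # original, edited: (1, 3, H, W) tensors scaled to [-1, 1]
    with torch.no_grad():
        return 1.0 - loss_fn(original, edited).item()
```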

The metrics that Reviewer YBFQ suggested (i.e., SIFT, SuperPoint and DUST3R) are primarily used for feature point detection and matching. These may not be appropriate for evaluating the performance of drag-based editing due to semantic changes within the editing region. For example, turning the head or thinning the face not only changes the image's content but also its semantics, which cannot be effectively measured by metrics such as SIFT, as illustrated in Figure 13 of our paper.

Besides, our main effort in this study is to significantly reduce the editing time for drag-based editing; thus we adopt the evaluation metrics used by these SOTA methods for a fair comparison, and the results effectively support our contributions, i.e., ultra-short editing time with competitive quantitative metrics.

However, as this is a novel task, the existing mainstream metrics (i.e., MD and IF) in this field may not be perfect for evaluating drag performance. We will make efforts to develop appropriate metrics for drag editing in our future study.

2. Statistical Rigor (To Reviewers YBFQ, hGqW, Sik5)

Our method is evaluated on the most widely used DragBench dataset [27] for drag-based editing, following the SOTA methods [4, 14, 15, 27] in this field, for a fair comparison. DragBench is a diverse compilation encompassing more than 10 types of images, including 205 images with 349 pairs of handle and target points.

The comparison results presented are averages across DragBench, which includes diverse image categories and drag tasks. Thus, the results in our study are sufficient to demonstrate the superiority and robustness of our method compared with the state of the art.

To address the reviewers' concerns, we repeated our experiment 10 times under the same experimental settings. We observed that the variances of the performance metrics obtained from the 10 realizations of our FastDrag are MD (0.000404), 1-LPIPS (9.44E-11), and Time (0.018), all of which fall within a reasonable range. These statistical results further demonstrate the effectiveness and stability of our method for drag editing. We will provide these statistical results in the camera-ready version for better clarity.

3. Limitation (To all Reviewers)

We will provide a more in-depth analysis of the limitations of our method in the camera-ready version from the following three aspects:

Overly Smooth and Finer Details Loss: This is a common issue across diffusion-based drag editing methods, inherent to the diffusion models employed, such as LDM and LCM. For instance, DragNoise [15] and DragDiffusion [27] introduce textures not present in the original images, affecting fidelity, as illustrated in Figure 6 of our paper.

Though our method may also result in some loss of finer details, it outperforms other state-of-the-art methods in overall task execution, particularly in editing speed.

Extremely Long-distance Drag Editing: When conducting extremely long-distance drag editing, our method may lose some details of the dragged object. As illustrated in Figure 1 of the attached PDF, although our method can achieve long-distance drag editing, some detail of the objects is missing, e.g., the window on the post box (row 1) and the window on the house (row 2). This is because our optimization is conducted in a lower-dimensional latent space. If the feature changes in the details within the latent space are too large (i.e., long-distance drag editing), the semantics of these details will be severely disrupted, making it harder to complete or maintain the details and resulting in missing details after drag editing.

However, our FastDrag can still achieve better editing performance than these SOTA methods, as illustrated in Figure 1 of the PDF. The reason these methods perform worse than our FastDrag is that they require $n$-step optimizations, with each step optimizing semantics within a small editing area of the image; thus they need $n$ small-scale, short-distance optimizations to achieve the overall latent optimization. When performing extremely long-distance drags, each step needs to optimize semantics within a much larger editing area, making the desired semantic optimization difficult and leading to inferior performance compared to our FastDrag.

Highly Relying on Precise Drag Instruction: It is worth noting that achieving precise performance relies on clear drag instructions. FastDrag optimizes the latent space based on these instructions, which is also common for the SOTA methods. Therefore, providing clear instructions is crucial for desired performance. As illustrated in Figure 2 of the attached PDF, when the goal is to "thin the hair while keeping the face size", it is best to exclude the face from the mask region (row 2 of Figure 2). Similarly, when the task is to "lengthen the beak", the handle point should ideally be placed where the 'beak' feature is more prominent.

Final Decision

The final ratings of this paper are Accept, Weak Accept, Borderline Accept, and Borderline Reject. The paper introduces a drag-based image editing method using the Latent Warpage Function (LWF) for single-step pixel adjustments, which significantly improves speed and accuracy over iterative methods. It combines bilateral nearest-neighbor interpolation (BNNI) and a consistency-preserving strategy to maintain semantic coherence. While the method is efficient and shows significant speed improvements, the reviews highlight concerns about the lack of comprehensive evaluation metrics, detailed analysis of unintended alterations, and the potential trade-off between editing time and performance. The reviews suggest including more recent works for comparison, providing more failing cases, and exploring the effects on other image parts. The authors' rebuttal has properly addressed these concerns, resulting in an increase in rating. During the reviewer discussion phase, the AC urged Reviewer hGqW to respond, but there was no reply. The AC has read the paper and concurs that this paper has made significant contributions in drag-based image editing, warranting acceptance at NeurIPS. Please incorporate the materials in the rebuttal into the final version of this paper.