PaperHub
4.8 / 10 · Poster · 4 reviewers
Ratings: 6, 6, 3, 4 (min 3, max 6, std 1.3)
Confidence: 3.5 · Correctness: 2.3 · Contribution: 2.3 · Presentation: 2.3
NeurIPS 2024

Gradient-free Decoder Inversion in Latent Diffusion Models

OpenReview · PDF
Submitted: 2024-05-09 · Updated: 2024-11-06
TL;DR

We propose an efficient gradient-free decoder inversion method for LDMs that makes latent diffusion models invertible and significantly reduces runtime and memory usage compared to gradient-based methods across various recent LDMs.

Abstract

Keywords
Latent diffusion model · Inversion · Gradient-free inversion · Resource-efficient inversion

Reviews and Discussion

Official Review
Rating: 6

The paper introduces a gradient-free method for LDM decoder inversion. Compared to traditional gradient-based methods, this method is computationally and memory efficient, which makes it suitable for large-scale tasks like video generation. They provide theoretical support for the method's convergence. Their empirical results show that their gradient-free method reduces computation time and memory usage in applications such as watermark classification using Stable Diffusion 2.1 and InstaFlow models. Using optimization strategies like the Adam optimizer and learning rate scheduling further boosts the efficiency.

Strengths

1- The authors introduce a new gradient-free method for decoder inversion, which is both faster and more memory-efficient than traditional gradient-based methods.
2- The paper provides a theoretical analysis of the proposed method, showing that it converges under reasonable conditions.
3- The method is shown to be effective in a practical application where precise decoder inversion is necessary, and the proposed method achieves comparable accuracy to gradient-based methods.

Weaknesses

1- The comparative experiments mainly focus on memory usage and runtime. There is limited evidence on how the accuracy of the inversion compares to gradient-based methods in various applications. Also, the authors do not present images for qualitative comparison.

2- Experiments validating the convergence assumption are limited to specific hyperparameters for Stable Diffusion, LaVie, and InstaFlow. It is not clear if this behavior generalizes to other settings.

Questions

1- How robust is the assumption $\mathcal{E} \circ \mathcal{D} \approx \mathcal{I}$? Can you provide empirical evidence for this assumption in the context of non-linear autoencoders?

2- Have you tested the proposed inversion method on video LDMs?

3- How does the accuracy of the proposed method compare to gradient-based methods in applications such as image editing?

4- How were these parameters chosen? Have you done an ablation study?

Limitations

Limitations have been briefly mentioned, yet there is not enough quantitative analysis of cases where the accuracy of gradient-based methods is significantly higher, and of whether this affects applications in both the image and video domains.

Author Response

Thank you for your review. We are encouraged that you found our method new [S1], faster and more memory-efficient than traditional grad-based methods [S1], supported by a theoretical analysis showing convergence under reasonable conditions [S2], and effective in a practical application [S3]. Here, we provide feedback on your reviews.

Weaknesses

[W1] "The comparative experiments mainly focus on memory usage and runtime. There is limited evidence on how the accuracy of the inversion compares to gradient-based methods in various applications. Also, authors do not present images for qualitative comparison."

We have already verified that, in addition to memory usage and runtime, our method also excels in terms of accuracy and precision versatility. To summarize the advantages of our method and their experimental evidence:

  • Fast: Our method needs shorter runtimes than the grad-based method to achieve the same NMSEs (up to 5X faster; in Fig. 3c and Tab. S1c, 1.89 s vs 9.51 s to achieve -16.4 dB).
  • Accurate: For the same runtime, our method achieves lower NMSEs than the grad-based method (up to 2.3 dB lower; in Fig. 3b and Tab. S1b, -21.37 dB vs -19.06 dB at 25.1 s).
  • Memory-efficient (significant): Our method consumes less GPU memory than the grad-based method (up to 89% of the memory can be saved; in Fig. 3b, 7.13 GB vs 64.7 GB).
  • Precision-flexible: The grad-based method requires a full-precision model that supports backpropagation. Our method, however, is flexible and can run on a model of any precision, even one that does not support backpropagation.

In the application, the accuracy was similar to that of the grad-based method only because the grad-based method was given up to 2.6 times longer runtime (Tab. 2). If the grad-based method were given a runtime as short as the grad-free method's, the grad-free method would likely be more accurate. To verify this, we additionally experimented on applications:

  • Background-preserving image editing: Figure R1 (in the rebuttal PDF) shows the qualitative results of applying our algorithm to an experiment, which investigates how exact inversion improves background-preserving image editing [32]. To compare accuracy at similar execution times, we adjusted the number of iterations to match the execution time. At comparable execution times, our grad-free method better preserves the background and achieves a lower NMSE.
  • Watermarking classification: Figure R2 (PDF) shows the qualitative results of applying our algorithm to the watermark classification [14] in Sec. 5. Our grad-free method either reduces the runtime compared to the grad-based or achieves better accuracy within the same runtime. We will add these additional qualitative results to the revision.

[W2] "Experiments validating the convergence assumption are limited to specific hyperparameters for Stable Diffusion, LaVie, and InstaFlow. It is not clear if this behavior generalizes to other settings."

Our method is generally applicable to other LDMs because the hyperparameters in our method consist solely of the learning rate and number of iterations, which work the same for any model. As proof, Table S1 shows that it works well on three different models with the same learning rate and number of iterations.

Questions

[Q1] "How robust is the assumption $\mathcal{E} \circ \mathcal{D} \approx \mathcal{I}$? Can you provide empirical evidence for this assumption in the context of non-linear autoencoders?"

As in Line 260, "using the encoder is just OK" means $\mathcal{E} \circ \mathcal{D} \simeq \mathcal{I}$. Many image editing works [31, 44] have been using $\mathcal{E}$ as an adequate left-inverse of $\mathcal{D}$, which supports $\mathcal{E} \circ \mathcal{D} \simeq \mathcal{I}$. As empirical evidence of our own, the 'Encoder' column in Table 2 shows reasonably good accuracies (186/300, 149/300).
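To illustrate how such an empirical check could be run, here is a minimal sketch (the `encode`/`decode` arguments are hypothetical wrappers around an LDM's VAE, and the latent batch `z` must be supplied by the user; this is not our actual evaluation code):

```python
import torch

@torch.no_grad()
def left_inverse_gap(encode, decode, z):
    """Mean relative error ||E(D(z)) - z|| / ||z|| over a batch of latents z.

    A small value supports the approximation E∘D ≈ I, i.e. the encoder acting
    as a left-inverse of the decoder.
    """
    z_rec = encode(decode(z))                    # E(D(z))
    err = (z_rec - z).flatten(1).norm(dim=1)     # per-sample error norm
    ref = z.flatten(1).norm(dim=1)               # per-sample reference norm
    return (err / ref).mean().item()
```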

[Q2] "Have you tested the proposed inversion method on video LDMs?"

Yes, we have. See Fig. 3b.

[Q3] "How does the accuracy of the proposed method compare to gradient-based methods in applications such as image editing?"

Our method is more accurate than the grad-based method. We addressed the same point in [W1], so please refer to [W1].

[Q4] "How were these parameters chosen? Have you done an ablation study?"

As we answered in [W2], our hyperparameters are model-independent and generally applicable to other LDMs. Nevertheless, we newly conducted ablation studies on the optimizer and learning rate scheduling. Due to the limited space, we provided the result in the rebuttal for reviewer 9ej4, [W1].

Limitations

[L1]: "Limitations have been briefly mentioned, yet there is not enough quantitative analysis for cases where the accuracy of gradient-based methods are significantly higher, and whether this affects applications in both image and video domain."

As we mentioned in Line 257, the grad-based method is more accurate than our method if sufficient runtime is available. Nevertheless, this is mostly not a big deal, because excessive accuracy is unnecessary for applications. In Tab. 2-InstaFlow, the number of iterations was 100 for the grad-based method. According to Table S1c, more than 100 iterations are needed to be more accurate than the grad-free method. However, in Tab. 2-InstaFlow, the accuracy is 227/300, which is the same as the grad-free method's. This shows that excessive accuracy is unnecessary for the watermark classification. We will add this discussion in the revision.

Additionally, we provide new qualitative results for various applications in Figs. R1 and R2 by running the grad-based method for both longer and shorter runtimes. This would help a more in-depth discussion on the accuracy trade-off. Thank you.

Comment

Thanks to the authors for providing additional qualitative results. I think the paper can still be improved with more extensive experiments on various LDMs.

Official Review
Rating: 6

The paper proposes a gradient-free method for decoder inversion in latent diffusion models (LDMs), which significantly reduces computational complexity and memory usage compared to traditional gradient-based methods. The approach is theoretically proven to converge and is efficient in experiments with various LDMs, achieving comparable accuracy. This method is useful for applications like noise-space watermarking, demonstrating its practical utility in scenarios requiring efficient decoder inversion.

Strengths

  • The paper is well-written, with clear and easy-to-follow explanations.
  • The proposed method replaces gradient operations with forward inference, significantly saving memory and improving computational efficiency. Theoretical analysis ensures the convergence of the method.
  • Extensive experimental results on various latent diffusion models (LDMs) demonstrate the superior efficiency and effectiveness of the proposed approach.
  • The method's practical utility is highlighted through its successful application in noise-space watermarking.

Weaknesses

  • The theoretical convergence of the forward step method and KM iterations is provided. However, the experiments employ the Adam optimizer. It appears that the term $E(D(z^k)) - E(x)$ in Eq. (4) is treated as the gradient in Adam. More details to clarify this could help avoid potential misunderstandings. Moreover, an additional ablation study on the impact of using Adam and cosine learning rate decay may enhance the comprehensiveness of the findings.
  • The paper acknowledges that gradient-based methods can achieve higher accuracy in certain applications. A more in-depth discussion on the accuracy trade-off and scenarios where this method might fall short would provide a balanced perspective.

Questions

In Figures 2 and 4, there are some failure cases depicted. Could you provide more details on the success rate of the proposed method and how it compares to the gradient-based method? Additionally, a detailed analysis of these failure cases and potential reasons for the discrepancies would be very helpful.

Limitations

Yes

Author Response

Thank you for your encouraging review. We are pleased that you found our work well-written, clear, and easy to follow [S1], our method significantly saves memory and improves computational efficiency [S2], our theoretical analysis ensures convergence [S2], and the extensive experiments demonstrate the superiority of our method [S3]. We are also glad that you consider our method practical and successful in application [S4]. Here, we have carefully considered the weaknesses and questions you raised.

Weaknesses

[W1] "The theoretical convergence of the forward step method and KM iterations is provided. However, the experiments employ the Adam optimizer. It appears that the term $E(D(z^k)) - E(x)$ in Eq. (4) is treated as the gradient in Adam. More details to clarify this could help avoid potential misunderstandings. Moreover, an additional ablation study on the impact of using Adam and cosine learning rate decay may enhance the comprehensiveness of the findings."

Gradient in Adam

Thank you for a great comment. As you said, $\mathcal{E}(\mathcal{D}(z)) - \mathcal{E}(x)$ is treated as the gradient in Adam. Actually, we are not the first to use Adam for non-gradient-based minimizations as if they were gradient-based. For example, in reinforcement learning, semi-gradient methods that use stop-grad operations are employed instead of gradients (they do not actually use gradients [A]), but Adam is still used. Additionally, zeroth-order optimization does not use gradients but still utilizes Adam [B]. Thanks for the interesting discussion topic; we will add this and clarify in the revision.

[A] Nota, Chris, and Philip S. Thomas. "Is the policy gradient a gradient?." arXiv preprint arXiv:1906.07073 (2019).

[B] Chen, Xiangyi, et al. "ZO-AdaMM: Zeroth-order adaptive momentum method for black-box optimization." NeurIPS 2019.
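To make the mechanics concrete, here is a minimal PyTorch sketch of this usage (the `encode`/`decode` helpers are hypothetical stand-ins for $\mathcal{E}$/$\mathcal{D}$, and the settings are illustrative rather than our exact configuration): the residual is computed under `torch.no_grad()` and written into the latent's `.grad` field, so Adam applies its moment estimates to it exactly as it would to a true gradient.

```python
import torch

def gradient_free_inversion(encode, decode, x, z0, n_iters=100, lr=1e-2):
    """Decoder inversion without backpropagation: the residual E(D(z)) - E(x)
    is handed to Adam as if it were the gradient of a loss."""
    z = z0.detach().clone().requires_grad_(True)  # Adam needs a parameter tensor
    opt = torch.optim.Adam([z], lr=lr)
    with torch.no_grad():
        e_x = encode(x)                           # encode the target image once
    for _ in range(n_iters):
        with torch.no_grad():                     # forward passes only; no graph
            residual = encode(decode(z)) - e_x
        z.grad = residual                         # treat residual as pseudo-gradient
        opt.step()
        opt.zero_grad()
    return z.detach()
```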

Ablation study on optimizer

Thank you for suggesting good ablation studies. We additionally conducted an ablation study on SD2.1 (32-bit), changing only the optimizer while keeping all other conditions the same. The table below shows the results (NMSE in dB):

| # iter. | 20 | 50 | 100 | 200 |
| --- | --- | --- | --- | --- |
| Vanilla | -16.87 ± 0.38 | -17.42 ± 0.41 | -18.21 ± 0.46 | -19.35 ± 0.54 |
| KM iterations | -18.99 ± 0.53 | -20.72 ± 0.73 | -21.46 ± 0.96 | -20.91 ± 1.20 |
| Adam (orig.) | -19.39 ± 0.54 | -20.84 ± 0.66 | -21.71 ± 0.77 | -21.85 ± 0.82 |

Ablation study on learning rate scheduling

Next, we experimented to see what happens when using a fixed learning rate instead of applying learning rate scheduling with Adam (again NMSE in dB):

| # iter. | 20 | 50 | 100 | 200 |
| --- | --- | --- | --- | --- |
| lr=0.01 (fixed) | -20.05 ± 0.58 | -21.07 ± 0.70 | -21.22 ± 0.74 | -20.61 ± 0.79 |
| lr=0.002 (fixed) | -17.85 ± 0.43 | -19.28 ± 0.53 | -20.59 ± 0.64 | -21.57 ± 0.74 |
| lr scheduled (orig.) | -19.39 ± 0.54 | -20.84 ± 0.66 | -21.71 ± 0.77 | -21.85 ± 0.82 |

When using a fixed learning rate, we found that with a large learning rate (0.01) the performance was poor when the number of iterations was high, and with a small learning rate (0.002) the performance was poor when the number of iterations was low. In contrast, the scheduled learning rate showed consistent performance regardless of the number of iterations. Again, thank you for the good suggestions. We will add these ablation studies to the revision.
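For completeness, the cosine schedule plugs into the same pseudo-gradient loop as follows (a minimal sketch; `residual_fn` is a hypothetical callable such as `lambda z: encode(decode(z)) - e_x`, and the base learning rate and schedule here are illustrative):

```python
import torch

def run_with_cosine_schedule(opt, z, residual_fn, n_iters):
    """Pseudo-gradient Adam loop with the learning rate annealed
    along a cosine curve over the inversion iterations."""
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=n_iters)
    for _ in range(n_iters):
        with torch.no_grad():
            z.grad = residual_fn(z)   # e.g. encode(decode(z)) - encode(x)
        opt.step()
        opt.zero_grad()
        scheduler.step()              # decay the learning rate each iteration
    return z.detach()
```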

[W2] "The paper acknowledges that gradient-based methods can achieve higher accuracy in certain applications. A more in-depth discussion on the accuracy trade-off and scenarios where this method might fall short would provide a balanced perspective."

As we mentioned in Line 257, the gradient-based method is more accurate than our method if sufficient runtime is available. Nevertheless, this is mostly not a big deal, because excessive accuracy is unnecessary for applications. In the experiments in Tab. 2-InstaFlow, the number of iterations was 100 for the gradient-based method. As shown in Tab. S1c, more than 100 iterations are needed to be more accurate than the gradient-free method. However, in Tab. 2-InstaFlow, the accuracy is 227/300, which is the same as the gradient-free method's. This shows that excessive accuracy is unnecessary for the watermark classification. We will add this discussion in the revision.

Additionally, we provide new qualitative results for various applications in Figs. R1 and R2 by running the gradient-based method for both longer and shorter durations. This would help a more in-depth discussion on the accuracy trade-off. Thank you.

Questions

[Q1] "In Figures 2 and 4, there are some failure cases depicted. Could you provide more details on the success rate of the proposed method and how it compares to the gradient-based method? Additionally, a detailed analysis of these failure cases and potential reasons for the discrepancies would be very helpful."

Great idea. In Fig. R3 (in the PDF in the common rebuttal), we display the instance-wise cocoercivity, convergence, and accuracy for the gradient-based method (similar to Figs. 2 and 4). Like the gradient-free method, the gradient-based method showed that most instances satisfied cocoercivity, and better convergence often led to higher accuracy. However, it was observed that cocoercivity and convergence are not significantly correlated. In other words, 'cocoercivity $\Rightarrow$ convergence' is not a general characteristic, but a unique feature we discovered in our gradient-free method.

As mentioned already in Line 177, we confirmed that the more cocoercivity is satisfied, the better the convergence. However, when examining the failure cases (i.e., instances that do not satisfy cocoercivity) directly, we could not identify any significant commonalities. Figure R4 in the common rebuttal PDF shows the 8 failure cases of Fig. 4a. We will add this discussion. Thank you.

Comment

The authors' reply addresses most of my issues. I appreciate the clarification made by the authors, and I have no other concerns. I will maintain my score as weak accept.

Official Review
Rating: 3

The paper introduces a zero-order (gradient-free) inversion optimization algorithm for encoder-decoder based generative models, particularly focusing on latent diffusion models (LDMs). The objective of the optimization problem is to find the latent vector $z$ for a given image $x$ such that $x = D(z)$, where $D$ is the decoder of the LDM. The proposed inversion algorithm relaxes the constraint of the objective function to finding the $z$ such that $E(x) = E(D(z))$ ($E$ is the encoder), and updates the latent vector at each iteration by taking the difference between the two push-forward maps until it converges to a fixed point. The authors demonstrate that their proposed method is technically motivated: they show their inversion algorithm converges to a fixed point under reasonable assumptions (Section 3.4), and this analysis is further extended to the momentum variant (also under reasonable assumptions in Section 3.4). Empirically, the authors show their method significantly decreases computational cost while maintaining performance similar to gradient-based inversion algorithms.
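In other words (my paraphrase of the plain forward-step form; the paper's Eqs. (3)-(4) may additionally include KM averaging and momentum), the update reads

$$z^{k+1} = z^k - \rho \big( E(D(z^k)) - E(x) \big),$$

so a fixed point $z^\star$ satisfies the relaxed constraint $E(D(z^\star)) = E(x)$.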

Strengths

To my knowledge, the proposed method is novel, effective, and straightforward to implement for any encoder-decoder based architectures.

The proposed method demonstrates relatively similar performance with significantly less computational time in the experiments provided.

The proposed algorithm is technically motivated, and the assumptions are verified computationally.

The authors provided code for reproducibility.

Weaknesses

Certain aspects of the writing need attention. For example, ρ is not defined in Equation 3. Additionally, contribution bullet point 3 ends with "and," while the rest of the contributions end with a comma, suggesting missing information. The contribution section itself reads like a run-on sentence separated by bullet points.

Figure 2: The scaling of the figure is confusing. It appears that NF is the ideal architecture because it has the lowest inversion error and the least computational runtime, which seems to argue against the proposed architecture LDM.

The limitations of the optimization algorithm are not clearly articulated in comparison to other inversion algorithms. The proposed method requires x (the image or signal) to be given, whereas other inversion algorithms require a set of measurements y, and a forward operator.

The results lack a visual verification to confirm the effectiveness of the proposed methodology. For instance, Section 5 replicates the experiments in [48], but does not reproduce the qualitative results to demonstrate the relationship between them.

Questions

1). Could the authors please provide qualitative results similar to experiments performed in [48]?

2). Could the author please comment on the scope of the inversion algorithm and what applications it is suitable for?

3). Encoder-decoder models are not bijective; therefore, $E(D(z))$ is an approximation of an invertible map, so there will be some information loss when approximating the image x. In certain subfields, this is described as "Representation Error". Could the authors please comment on how this phenomenon affects their analysis? Even though their method converges to a fixed point (possibly very close to z*), there will be some information loss due to the encoder-decoder structure of the model.

4). Could the authors explain the scaling of Figure 2? Please refer to the comment in the weakness sections.

Limitations

Please refer to the "weakness" portion of the review.

Author Response

Thanks for the review, and for finding [S1] our work novel, effective, and straightforward to implement for any encoder-decoder based architecture, [S2] that our method has significantly less runtime, [S3] that it is technically motivated with computationally verified assumptions, and [S4] that we provided code for reproducibility. However, it is puzzling that you have given such a low score despite finding so many advantages. We hope you will reconsider your rating after reading our rebuttal, which thoroughly addresses your feedback regarding the weaknesses and questions.

Weaknesses

[W1] "Certain aspects of the writing need attention. For example, $\rho$ is not defined in Equation 3. Additionally, contribution bullet point 3 ends with "and," while the rest of the contributions end with a comma, suggesting missing information. The contribution section itself reads like a run-on sentence separated by bullet points."

  • About $\rho$: Thanks for pointing this out. We will add a sentence stating that $\rho$ is the step size.
  • About contribution bullet point 3: This was intentional (i.e., the contribution section is deliberately written as a single run-on sentence separated by bullet points).

[W2] "Figure 2: The scaling of the figure is confusing. It appears that NF is the ideal architecture because it has the lowest inversion error and the least computational runtime, which seems to argue against the proposed architecture LDM."

It seems you are referring to Figure 1b. Figure 1b shows the accuracy and runtime of inversion, which are merely two of the many specifications of a generative model. As you said, normalizing flows (NFs) are good at inversion in terms of accuracy and runtime. However, our goal is not to find which generative model performs inversion best. LDMs efficiently generate high-quality and large-scale samples (Line 47), so we want to perform inversion well in LDMs; NFs are outside the scope of this work.

For the scaling of the Figure 1b, please see [Q4].

[W3] "The limitations of the optimization algorithm are not clearly articulated in comparison to other inversion algorithms. The proposed method requires x (the image or signal) to be given, whereas other inversion algorithms require a set of measurements y, and a forward operator."

There seems to be a misunderstanding here. It appears you are looking for a competitor to our algorithm among inverse problems in imaging, which can be represented as $y = Ax + n$. We solve a totally different problem, $x = \mathcal{D}(z)$ (Eq. 5). Before our work, including GAN-inversion works [49], $x = \mathcal{D}(z)$ could only be solved by gradient descent. Therefore, we compared our method with gradient descent in Sections 4 and 5.

[W4] "The results lack a visual verification to confirm the effectiveness of the proposed methodology. For instance, Section 5 replicates the experiments in [48], but does not reproduce the qualitative results to demonstrate the relationship between them."

As we mentioned in Lines 204-210, we did not perform the watermark detection experiment from [48], but rather the watermark classification experiment from [14]. Nevertheless, we will add qualitative results for the watermark classification experiment [14], as in Fig. R2 (see the rebuttal PDF in "Author Rebuttal by Authors").

Questions

[Q1] "Could the authors please provide qualitative results similar to experiments performed in [48]?"

Yes, see the response to [W4].

[Q2] "Could the author please comment on the scope of the inversion algorithm and what applications it is suitable for?"

Sure.

  • Scope of the inversion algorithm: As in [W3], it refers to algorithms which solve $\underset{z}{\textrm{find}}\; x = \mathcal{D}(z)$.
  • Applications: As in Line 41, "for seeking the true latent [14], for watermarking [48], and for background-preserving editing [32]".

[Q3] "Encoder-decoder models are not bijective; therefore, E(D(z)) is an approximation of an invertible map, so there will be some information loss when approximating the image x. In certain subfields, this is described as "Representation Error". Could the authors please comment on how this phenomenon affects their analysis? Even though their method converges to a fixed point (possibly very close to $z^\star$), there will be some information loss due to the encoder-decoder structure of the model."

Following the definition of the representation error [A], our representation error is 0, because we solve $x = \mathcal{D}(z)$ with $x \in \mathrm{Range}(\mathcal{D})$. There is no measurement error either, since we do not have a measurement operator $A$ (we are not solving $y = Ax + n$). All of the error therefore comes from the optimization error: "The optimization procedure did not find the best $z$" [A].

As you mentioned, $z^0 := \mathcal{E}(\mathcal{D}(z^\star))$, the initial point of the optimization, is different from $z^\star$. That is the problem that we solve in the paper: starting from $z^0$, seeking $z^\star$.

[A] Bora, Ashish, et al. "Compressed sensing using generative models." ICML 2017.

[Q4] "Could the authors explain the scaling of Figure 2? Please refer to the comment in the weakness sections."

If you mean Figure 1b instead of Figure 2, roughly speaking, the first dotted line represents about 1 second, and the second dotted line represents around 5 to 10 seconds. For the inversion of LDM, referring to Table 4 in [14], it ranges from 30 to 160 seconds. Note that these can vary in different settings.

Comment

I would like to thank the authors for taking the time to answer my questions.

Official Review
Rating: 4

This work provides a method for gradient-free decoder inversion in latent diffusion models that can reduce the amount of required GPU memory and lessen the computation time compared to gradient-based methods. The method focuses on providing better invertibility in LDMs and is based on theoretical assumptions that guarantee the convergence of the forward step and the inertial KM iterations to the ground truth. They further showcase the proposed method on the tree-ring watermark classification problem.

优点

  1. They propose a new method for gradient-free decoding for LDMs that has some advantages over gradient-based methods.
  2. The paper contains a detailed description of the method along with the reasoning behind the assumptions made.
  3. A detailed set of experiments is presented along with a thorough convergence analysis.
  4. Well-written in terms of method description and experiments.

Weaknesses

  1. The paper is hard to follow in the methodology description and has a lot of assumptions which are difficult to verify from the paper itself. For example, in lines 152 and 165, some assumptions are being made which are not backed by proofs.
  2. Are the findings enough to support the validity of the said assumptions? A question I find hard to answer based on the described experiments.
  3. The advantages of the given approach seems limited and not quite powerful. Being able to reduce some memory for decoding is not a big enough contribution in itself.
  4. Limited novelty -- similarity with prior works like [14] mentioned in the paper.
  5. Not enough quantitative experiments to establish the utility of the said method -- Comparing for memory consumption and runtime is not enough in my opinion.
  6. Not enough applications are provided to prove the merits of the proposed approach -- applying the current method to a diverse set of applications like image generation, editing, interpolation, etc., comparing it with gradient-based methods, and evaluating the results would help.

Questions

Please refer to the weaknesses section.

Limitations

Refer to the weaknesses section.

Author Response

Thank you for your valuable review. We are glad that you found our work new [S1], containing a detailed description of the method along with the reasoning behind the assumptions made [S2], the experiments detailed [S3], the convergence analysis thorough [S3], and the experiments and method well-written [S4]. Here, we carefully respond to your comments on weaknesses and questions.

Weaknesses

[W1]: "The paper is hard to follow in the methodology description and has a lot of assumptions which are difficult to verify from the paper itself. For example, in lines 152 and 165, some assumptions are being made which are not backed by proofs."

While we have a dedicated subsection for the validation of assumptions in Sec 3.4 and related discussions in Sec 4 (as you indicated in [S2]), we will further refine it if any unverified assumption exists. Regarding your examples, one is trivial [Line 152], and the other is verified empirically [Line 165].

  • [Line 152] $\mathcal{T}(\cdot) = \mathcal{E} \circ \mathcal{D}(\cdot) - \mathcal{E}(x)$ is continuous: Neural networks like $\mathcal{E}, \mathcal{D}$ are generally treated as continuous functions. Moreover, they are usually differentiable, in order to be trained.
  • [Line 165] $\mathcal{T}(\cdot) = \mathcal{E} \circ \mathcal{D}(\cdot) - \mathcal{E}(x)$ is $\beta$-cocoercive for the $(y^k, z^k)$ and $z^\star$: Since $\beta$-cocoercivity is hard to prove for all feasible $(y, z)$, we assumed $\beta$-cocoercivity only for $(y^k, z^k)$ and $z^\star$. Then, we empirically verified the cocoercivity for the actual $(y^k, z^k)$ and $z^\star$, as in Figure 2, which supports [Line 165].

We will add these to the revision.
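For reference, the standard definition underlying these statements (given here in its usual general form; in the paper it is assumed only along the iterates $(y^k, z^k)$ and $z^\star$): an operator $\mathcal{T}$ is $\beta$-cocoercive if

$$\langle \mathcal{T}(y) - \mathcal{T}(z),\, y - z \rangle \;\ge\; \beta\, \| \mathcal{T}(y) - \mathcal{T}(z) \|^2 \qquad \text{for all } y, z \text{ in its domain}.$$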

[W2]: "Are the findings enough to support the validity of the said assumptions?"

Yes, they are. We have found the following things in three different models:

| Model | Cocoercivity (L175) | Cocoercivity vs Convergence (L177) | Convergence vs Accuracy (L179) |
| --- | --- | --- | --- |
| SD 2.1 (Image LDM) | ✔ Completed | ✔ Completed | ✔ Completed |
| LaVie (Video LDM) | ✔ Completed | ✔ Completed | ✔ Completed |
| InstaFlow (Image Latent Rectified Flow) | ✔ Completed | ✔ Completed | ✔ Completed |

which can completely verify the following causality:

Most instances $\overset{\text{L175}}{\Longrightarrow}$ Cocoercivity $\overset{\text{L177}}{\Longrightarrow}$ Convergence $\overset{\text{L179}}{\Longrightarrow}$ Accuracy.

[W3] "The advantages of the given approach seems limited and not quite powerful. Being able to reduce some memory for decoding is not a big enough contribution in itself."

Our advantages are not limited to 'reducing some memory'. We organize the advantages here. Our method is:

  • Fast: Our method needs shorter runtimes than the grad-based method to achieve the same NMSEs (up to 5X faster; in Fig. 3c and Tab. S1c, 1.89 s vs 9.51 s to achieve -16.4 dB).
  • Accurate: For the same runtime, our method achieves lower NMSEs than the grad-based method (up to 2.3 dB lower; in Fig. 3b and Tab. S1b, -21.37 dB vs -19.06 dB at 25.1 s).
  • Memory-efficient (significant): Our method consumes less GPU memory than the grad-based method (up to 89% of the memory can be saved; in Fig. 3b, 7.13 GB vs 64.7 GB).
  • Precision-flexible: The grad-based method requires a full-precision model that supports backpropagation. Our method, however, is flexible and can run on a model of any precision, even one that does not support backpropagation. Our method can be immediately applied to many LDMs distributed in half-precision.

[W4] "Limited novelty -- similarity with prior works like [14] mentioned in the paper."

Our paper is novel, as it is the first to propose gradient-free decoder inversion in LDMs. The following table shows how [14] differs from ours.

| Work | Problem | Comparison | Model |
| --- | --- | --- | --- |
| Ours | Inversion of decoders | Gradient descent (similar to GAN inversion) | SD2.1, LaVie, InstaFlow |
| [14] | Inversion of denoising diffusion processes | Naive DDIM inversion | Pixel-space DM, SD2.1 |

One similarity to [14] is that we also employ the forward step method, but that is just a very widely known optimization algorithm [39].

[W5] "Not enough quantitative experiments to establish the utility of the said method -- Comparing for memory consumption and runtime is not enough in my opinion."

We sufficiently verified four distinct advantages of our method. As mentioned in [W3], we compared not only memory consumption and runtime, but also accuracy and floating-point precision flexibility.

[W6] "Not enough applications are provided to prove the merits of the proposed approach -- applying the current method to a diverse set of applications like image generation, editing, interpolation, etc., comparing it with gradient-based methods, and evaluating the results would help."

Thanks for a good suggestion. We additionally conducted background-preserving image editing [14, 32]. Figure R1 (see the rebuttal pdf in "Author Rebuttal by Authors") shows the qualitative results of applying our algorithm to the experiment in [14], which investigates how exact inversion improves the background-preserving image editing [32]. To compare accuracy at similar execution times, we adjusted the number of iterations to match the execution time. At comparable execution times, our grad-free method better preserves the background and achieves a lower NMSE. We will add this to the revision.

Comment

I appreciate the effort put in by the authors but I stand by my rating.

Author Response

Dear Reviewers,

Thank you for taking the time to provide such valuable feedback. We are delighted to learn that you found many strengths in our paper. All reviewers noted that our research offers advantages over existing gradient-based methods, particularly in terms of speed and memory. All reviewers also acknowledged that our reasonable assumptions were verified through experiments and that the paper is well-written. [DG1n] found our method novel, and [2mTK, 9ej4] found our method can be successfully applied to gain benefits in practical applications. [9ej4] recognized that our extensive results demonstrate the superiority of our method. [DG1n] found our method can be easily extended to any encoder-decoder structure, and [DG1n] appreciated that we have provided the code to ensure reproducibility.

While many strengths of our manuscript were identified, here we would like to further clarify the following key advantages / contributions of our work:

1. Our method is fast, accurate, memory-efficient, and precision-flexible.

We have already verified that, in addition to memory usage and runtime, our method also excels in terms of accuracy and precision versatility. To summarize the advantages of our method and their experimental evidence, our method is:

  • Fast: Our method needs shorter runtimes than the grad-based method to achieve the same NMSEs (up to 5X faster; in Fig. 3c and Tab. S1c, 1.89 s vs 9.51 s to achieve -16.4 dB).
  • Accurate: For the same runtime, our method achieves lower NMSEs than the grad-based method (up to 2.3 dB lower; in Fig. 3b and Tab. S1b, -21.37 dB vs -19.06 dB at 25.1 s).
  • Memory-efficient: Our method consumes less GPU memory than the grad-based method, which can be a significant advantage for large-scale LDMs on a GPU with limited memory (up to 89% of the memory can be saved; in Fig. 3b, 7.13 GB vs 64.7 GB).
  • Precision-flexible: The grad-based method requires a full-precision model that supports backpropagation. Our method, however, is flexible and can run on a model of any precision, even one that does not support backpropagation. Our method is immediately applicable to many existing LDMs distributed in half-precision.

2. Our algorithm provably converges with rigorous proofs.

We believe and would like to emphasize that our theorems with rigorous proofs are important contributions of this work. We proved that our novel algorithm converges through Theorems 1 and 2, with rigorous proofs included in the supplementary material. Notably, proving convergence with momentum (Theorem 2) is significantly more challenging than without it.
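For context, the generic inertial KM template for a fixed-point operator $\mathcal{S}$ (one may think of $\mathcal{S} = \mathcal{I} - \rho\,\mathcal{T}$ here; the exact coefficients and parameterization used in Theorem 2 may differ) reads

$$y^k = z^k + \alpha_k\,(z^k - z^{k-1}), \qquad z^{k+1} = (1 - \lambda_k)\, y^k + \lambda_k\, \mathcal{S}(y^k),$$

where $\alpha_k$ is the inertia (momentum) coefficient and $\lambda_k$ the averaging weight; Theorem 2 establishes convergence in the presence of this momentum term.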

We carefully responded to all reviewers' comments / concerns and will incorporate all the feedback in the revision. To see Figures R1-R4, please download the PDF below.

Final Decision

In this paper, the authors present a gradient-free approach for inverting decoders in latent diffusion models. The approach has the strengths of reducing computational complexity and memory usage compared to prior methods, along with permitting support for half-precision LDMs. The authors additionally establish a proof of convergence and demonstrate practical utility of the approach in noise-space watermarking. The paper would be strengthened if it provided additional applications in which comparable empirical performance gains were replicated. The strengths outweigh the weaknesses, and the paper is recommended for acceptance.