Ensemble Kalman Diffusion Guidance: A Derivative-free Method for Inverse Problems
Abstract
Reviews and Discussion
The authors address inverse problems using diffusion models as priors within a restrictive setup where the forward model is accessible only point-wise. They employ a Predictor-Corrector framework, where in the prediction stage samples are drawn from the prior using the ODE describing the diffusion model. In the correction stage, a MAP problem involving the forward model at the current diffusion step is solved through gradient-based updates. Unlike previous approaches, the authors approximate the intractable term in this MAP formulation by evaluating it on samples generated via the ODE. Since only point-wise access to the forward model is available, the authors estimate the gradient through statistical linearization and ensemble Kalman methods. This approach maintains a set of particles and uses them, along with their centroid, to approximate the gradient. The authors validate their algorithm on three different inverse problems.
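For orientation, here is a minimal runnable sketch of this predictor-corrector loop. All ingredients are toy stand-ins, not the paper's code: `ode_step` (a mild contraction) replaces one reverse step of the pre-trained diffusion ODE, and a `tanh` map plays the black-box forward model `G`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, J = 16, 4, 64                              # state dim, observation dim, #particles
A = rng.normal(size=(m, d))
G = lambda x: np.tanh(A @ x)                     # black-box forward model: point-wise evals only
y_obs = G(rng.normal(size=d))                    # synthetic observation
ode_step = lambda x, t: 0.99 * x                 # placeholder for one reverse-ODE (prior) step

particles = rng.normal(size=(J, d))
for t in range(100):
    particles = np.stack([ode_step(x, t) for x in particles])   # prediction: sample the prior
    g = np.stack([G(x) for x in particles])                     # (J, m) forward evaluations
    Xc, Gc = particles - particles.mean(0), g - g.mean(0)       # centered ensembles
    C_xg = Xc.T @ Gc / J                                        # (d, m) empirical cross-covariance
    # Correction: statistically linearized gradient step on the data fidelity; no dG needed.
    particles = particles - 0.5 * (g - y_obs) @ C_xg.T
print(float(np.linalg.norm(G(particles.mean(0)) - y_obs)))      # data misfit after guidance
```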
Strengths
Introduction of an algorithm that solves inverse problems with a diffusion model prior while requiring only point-wise access to the forward model.
Weaknesses
Methodological concerns
- The authors' motivation for their "Derivative-free correction step" in Lines 246–263 is insufficient. The covariance matrix appears without adequate justification, and in Equation (12) the authors invert it even though it is singular, as the number of samples used to compute it is smaller than the dimensionality of the problem. Consequently, it cannot be treated as a preconditioning matrix in this context.
- In the appendix, the authors base their proofs on Equation (20), which is introduced without sufficient explanation. This equation appears to correspond to the ensemble update, which assumes an estimation of the gradient, the very objective of the lemmas and propositions that follow. This circular reasoning raises concerns about the validity of the proofs.
Technical concerns
- In Lines 125–127 (paragraph following Equation (4)), the authors suggest that the guidance term depends on the noise scheduler. However, this dependence results from their specific formulation. As verification, the authors can review Equation (5) in [1] or write DPS's algorithm in terms of the score. The guidance term is not scaled by the noise scheduler.
- The statement in Lines 194–195 is misplaced. Specifically, the objective is a composition of the simulated ODE and the forward model, making it highly non-convex. Therefore, the hypothesis of convexity is unrealistic. Besides, this term varies at each diffusion step, as it is composed with the ODE at different time steps, which diverges from the requirements outlined in [2], Chapter 4. Hence, the iterative updates may not converge to a true MAP estimate within this setup.
Errors and clarifications
- Equations (12)–(14) lack clarity. One variable is undefined, and while the argmin is specified with respect to a particular variable, that variable does not appear in the equations.
- In Lines 897–903, the stated gradients are missing a logarithmic term.
- The first part of Assumption (3) is redundant. Since the matrix in question is positive semi-definite, its trace is non-negative, being the sum of non-negative eigenvalues. Therefore, if its trace is zero, the matrix must be the zero matrix.
- In Lines 316–317, "DPG" should replace "DPS," as DPS is not included in the experiments.
- The term "guidance" is repeated twice in Line 52.
.. [1] Chung, Hyungjin, et al. "Diffusion posterior sampling for general noisy inverse problems." arXiv preprint arXiv:2209.14687 (2022).
.. [2] Parikh, Neal, and Stephen Boyd. "Proximal algorithms." Foundations and trends® in Optimization 1.3 (2014): 127-239.
Questions
- Plot (b) in Figure 3 is difficult to interpret. Could you clarify what the values 0.2, 0.4, ..., 0.8 represent? Additionally, what do "Seq # DM" and "Seq # DM grad" refer to in this context?
- Plot (c) reports the runtime of the proposed algorithm, with an average of approximately 140 minutes. This is a considerable computational cost (about 2 hours) for solving one inverse problem.
- How does this runtime compare to other algorithms (aside from EKI)?
- Can you comment on the practical applicability of the method given this runtime?
- Considering the high computational cost, how does the algorithm perform relative to methods that fine-tune or train smaller network components for the guidance term, as seen in [1, 2, 3, 4]?
.. [1] Black, Kevin, et al. "Training diffusion models with reinforcement learning." arXiv preprint arXiv:2305.13301 (2023).
.. [2] Uehara, M., Zhao, Y., Black, K., Hajiramezanali, E., Scalia, G., Diamant, N. L., ... & Levine, S. (2024). Fine-tuning of continuous-time diffusion models as entropy-regularized control. arXiv preprint arXiv:2402.15194.
.. [3] Fan, Ying, et al. "Reinforcement learning for fine-tuning text-to-image diffusion models." Advances in Neural Information Processing Systems 36 (2024).
.. [4] Denker, Alexander, et al. "DEFT: Efficient Finetuning of Conditional Diffusion Models by Learning the Generalised -transform." arXiv preprint arXiv:2406.01781 (2024).
Q1: We apologize for the confusion in Figure 3(b). We clarify as follows:
- Explain metrics: Figure 3(b) compares the computational complexity of different algorithms. As defined in the "evaluation metrics" paragraph in Section 4.2, `Seq # DM` is the number of sequential diffusion model evaluations, i.e., the count of diffusion model evaluations that must be performed in sequence, where each query depends on the previous ones. `Seq # DM grad` is similar to `Seq # DM` but counts gradient evaluations of the diffusion model.
- Why do we measure the number of sequential evaluations? Because sequential evaluations determine the minimum runtime irrespective of available computational resources. By isolating them, we can better assess how much of the algorithm can potentially benefit from parallelization, similar to the concept of the "critical path" in algorithm analysis in the computer science literature [5].
- "What does 0.2,..,0.8 mean?": to compare the relative computation complexity of different algorithms, each axis is normalized by diving by the maximum across all the algorithms. For example, a value of 0.8 indicates that the corresponding algorithm uses 80% of the computational resources compared to the most resource-intensive algorithm for that metric.
Q2: Runtime comparison: To provide a clearer perspective, we report the runtime of different algorithms on the Navier-Stokes equation, tested on a single A100 GPU. These results are now incorporated into Table 2 of the revised manuscript. As shown below, EnKG achieves the best performance while maintaining a runtime comparable to or even shorter than other algorithms.
| | EKI | SCG | DPG | Forward/Central-GSG | EnKG |
|---|---|---|---|---|---|
| Runtime (minute) | 121 | 119 | 228 | 105 | 124 |
All the algorithms take a while to finish. We believe this reflects the intrinsic difficulty of the problem rather than a limitation of our algorithm. Solving inverse problems where the forward model involves a numerical PDE solver is computationally expensive across all methods. For instance, Figure 13 in [14] reports runtimes ranging from 3–16 hours for a 2D spatio-temporal PDE problem. It is normal for these types of PDE-based problems to take hours or even days.
Practical Applicability
Given its inherently parallel nature, EnKG can take advantage of recent advancements in large-scale distributed computing infrastructure to mitigate these challenges. With appropriate optimization and resource allocation, we believe that EnKG remains scalable even for high-dimensional problems, especially as computational hardware and distributed frameworks continue to improve. Its derivative-free nature and ability to operate with black-box forward models make it a practical option for challenging scientific problems where gradient-based approaches may be infeasible, such as numerical weather model calibration [16,17].
Comparison to conditional finetuning: thanks for the references and the interesting question. To investigate how conditional diffusion models perform here, we implemented the idea of the provided reference DEFT [4] on our Navier-Stokes equation problem. Note that the original DEFT requires gradient information as input (Eq. (12) in their paper). Since our focus is on the black-box forward model setting, where gradients are unavailable, we replace the gradient input with the measurement y. The table below compares the relative L2 error of DEFT and EnKG.
| Method | | | |
|---|---|---|---|
| DEFT (w/o gradient) | 1.041 (0.1282) | 1.042 (0.1308) | 1.038 (0.1330) |
| EnKG (ours) | 0.120 (0.085) | 0.191 (0.057) | 0.294 (0.064) |
As shown in the table, DEFT is not effective in the black-box setting. We hypothesize that this is because DEFT, as a hybrid method, relies critically on gradient information to guide the generation process. Without access to gradients, its performance degrades significantly. In contrast, EnKG is designed to operate without gradient information, making it well-suited for problems where the forward model is a black box. As the results indicate, EnKG achieves significantly better performance than DEFT in this scenario.
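For reference, a minimal sketch of the relative L2 error reported in these tables, assuming the standard definition (the manuscript's exact averaging over test instances may differ):

```python
import numpy as np

def relative_l2_error(x_hat: np.ndarray, x_true: np.ndarray) -> float:
    """||x_hat - x_true||_2 / ||x_true||_2 for a single reconstruction."""
    return float(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```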
[5]: Kohler, Walter H. "A preliminary evaluation of the critical path method for scheduling tasks on multiprocessor systems." IEEE Transactions on Computers 100.12 (1975): 1235-1238.
[6]: Booton, Richard C. "Nonlinear control systems with random inputs." IRE Transactions on Circuit Theory 1.1 (1954): 9-18.
[7]: Kazakov IE (1954) Approximate method of statistical investigations of nonlinear systems. In: Proceedings, Voenno-Vozdushnaya Akademia imeni N.I. Zhukuvskogo, vol 394, pp 1–52
[8]: Evensen, Geir. "The ensemble Kalman filter: Theoretical formulation and practical implementation." Ocean dynamics 53 (2003): 343-367.
[9]: Iglesias, Marco A., Kody JH Law, and Andrew M. Stuart. "Ensemble Kalman methods for inverse problems." Inverse Problems 29.4 (2013): 045001.
[10]: Garbuno-Inigo, Alfredo, et al. "Interacting Langevin diffusions: Gradient structure and ensemble Kalman sampler." SIAM Journal on Applied Dynamical Systems 19.1 (2020): 412-441.
[11]: Calvello, Edoardo, Sebastian Reich, and Andrew M. Stuart. "Ensemble Kalman methods: a mean field perspective." arXiv preprint arXiv:2209.11371 (2022).
[12]: Okulicka-Dłużewska, Felicja. "Pseudoinverse preconditioner for saddle point problem." AIP Conference Proceedings. Vol. 1504. No. 1. American Institute of Physics, 2012.
[13]: Cahueñas, Oskar, Luis M. Hernández-Ramos, and Marcos Raydan. "Pseudoinverse preconditioners and iterative methods for large dense linear least-squares problems." Bull. Comput. Appl. Math 1 (2013): 69-91.
[14]: Gaggioli, E. L., and Oscar P. Bruno. "Parallel inverse-problem solver for time-domain optical tomography with perfect parallel scaling." Journal of Quantitative Spectroscopy and Radiative Transfer 290 (2022): 108300.
[15]: Karras, Tero, et al. "Elucidating the design space of diffusion-based generative models." Advances in neural information processing systems 35 (2022): 26565-26577.
[16]: Buehner, Mark, Ron McTaggart-Cowan, and Sylvain Heilliette. "An ensemble Kalman filter for numerical weather prediction based on variational data assimilation: VarEnKF." Monthly Weather Review 145.2 (2017): 617-635.
[17]: Miyoshi, Takemasa, and Masaru Kunii. "The local ensemble transform Kalman filter with the Weather Research and Forecasting model: Experiments with real observations." Pure and applied geophysics 169.3 (2012): 321-333.
Painful review of the revised version of the paper
The authors introduced several changes, particularly in the paragraphs "Guidance Correction Step" and "Derivative-Free Steps", the Experiments section, and the appendix. However, these modifications were not clearly marked or highlighted to distinguish them from the initial version, except for updates in Figure 3.
This lack of clear differentiation hinders the review process.
Misuse of pre-conditioning
Pre-conditioning in optimization involves transforming the problem using an invertible matrix, as described in [1], particularly in Chapters 5, 6, and 7. This transformation ensures the optimization is carried out in an alternative space through a well-defined change of variables, which inherently requires the preconditioning matrix to be invertible.
In this context, the authors' characterization of the covariance matrix as a pre-conditioning matrix is misplaced and introduces confusion. Since the matrix is not invertible (it is computed from a limited number of samples, making it rank-deficient), it cannot fulfill the role of a true preconditioning matrix. Hence, using this term in the paper is misleading.
Flaws in Proposition 1
After revision, several concerns about Proposition 1 and its proof remain unresolved:
- Issue with Equation (18): this equation implies a nonsensical identity. The problem arises because, on the right-hand side of Equation (18), the summation involves the index k only in the first term. Consequently, applying the relevant definition yields a result that does not make sense. The same issue appears in the appendix, specifically on Line 916.
- Incorrect development in the appendix, Lines 801–804: the development is flawed, as the outer product between the involved quantities does not necessarily commute. The claimed equality holds if and only if an additional condition is satisfied, which is not necessarily the case here.
.. [1] Nocedal, Jorge, and Stephen J. Wright, eds. Numerical optimization. New York, NY: Springer New York, 1999.
Dear Reviewer Fiay,
Thank you for providing further comments on our manuscript. We have carefully addressed them in our follow-up response and would like to kindly check if our reply has resolved your concerns.
As the discussion period is ending soon, we would greatly appreciate it if you could update the score if the concerns have been addressed or share any additional feedback if there are remaining issues.
Your input has been invaluable, and we are happy to provide further clarification if needed.
Best regards
We truly appreciate the reviewer’s detailed suggestions and careful reading of our manuscript. We address the reviewer’s concerns as follows:
- We would like to clarify that the first part of Assumption 3 is not redundant. Specifically, the matrix in question is the empirical covariance of the observations. In general, the observation covariance being zero does not imply that the particle covariance is zero: for example, if the forward map sends every input to a constant, the observation covariance vanishes while the particle covariance does not. Assumption 3 basically ensures that the forward model does not map all the distinct particles to identical observations.
- We have corrected the typos mentioned by the reviewer in the revised version of the manuscript.
We thank the reviewer for the detailed comments and constructive suggestions, and we apologize for any confusion caused by the presentation. We address the reviewer's concerns and questions below.
Methodological concerns
W1: Regarding the covariance matrix, we first note that our algorithm does not explicitly use it; it appears only in the derivation. When the matrix is singular, the correction step can be interpreted as a projected gradient step onto the subspace spanned by the particles. Below, we address the reviewer's concerns in detail.
- Role of the covariance matrix: intuitively, the covariance matrix is an important component of the derivation that facilitates interaction among particles in our framework. As we show in Proposition 1, introducing it enables us to derive an update rule for the particles that solves the inverse problem without explicit derivatives. The covariance matrix naturally appears in the expression of statistical linearization [6,7]. The combination of the covariance matrix and statistical linearization is a well-established idea and a central feature of the family of ensemble Kalman methods [8,9,10], dating back to the 1990s and Evensen et al. [8]. To provide more context for this step, we have added an additional explanation after Eq. (12) in the revised version.
- A singular covariance matrix is valid and does not affect the derivation or proof: we appreciate the reviewer's observation regarding the singularity of the covariance matrix when the number of particles is less than the problem dimension. While we agree that this case requires further discussion, we emphasize that a singular covariance matrix does not affect the derivation or proof, as neither relies on the matrix being full-rank. Because the covariance matrix is always positive semi-definite (PSD), its inversion in Eq. (12) can be generalized to the Moore–Penrose inverse (a.k.a. pseudoinverse). The use of a singular PSD matrix as a preconditioner is valid and commonly adopted in numerical iterative algorithms [12,13].
- The correction step remains effective even with a singular covariance matrix: in our case, when the covariance is singular, Eq. (14) effectively becomes a gradient step projected onto the subspace spanned by the particles (see the numerical illustration after this list). This is effective when the intrinsic problem dimension is smaller than the ambient space dimension. As shown in our experiments, 1–2k particles are already sufficient to solve various inverse problems effectively. As further studied in Figure 6, increasing the number of particles beyond this yields diminishing returns, supporting the practical utility of our approach even in high-dimensional problems.
- We agree this is an important case to consider and have added the discussion to lines 264-269 to clarify.
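As a numerical illustration of the projection argument above (our own toy check, not the manuscript's code): with J < d particles the empirical covariance is rank-deficient, and applying it to any vector produces a step inside the span of the centered particles.

```python
import numpy as np

rng = np.random.default_rng(0)
d, J = 50, 10                                     # ambient dimension > #particles
X = rng.normal(size=(J, d))
Xc = X - X.mean(axis=0)                           # centered particles
C = Xc.T @ Xc / J                                 # (d, d) empirical covariance
print(np.linalg.matrix_rank(C))                   # 9 = J - 1: singular, as the reviewer notes

v = rng.normal(size=d)                            # stand-in for a gradient direction
step = C @ v                                      # what the correction step actually applies
coeffs, *_ = np.linalg.lstsq(Xc.T, step, rcond=None)
print(np.allclose(Xc.T @ coeffs, step))           # True: the step lies in span{x_j - x_bar}
```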
W2: Thanks for the constructive feedback. We apologize for any confusion caused by the current presentation. To address the reviewer’s concerns and ensure clarity, we have revised Proposition 1 and Lemma 1 in the updated version. The reasoning path is as follows:
- Starting point: we begin with the ensemble update rule as defined in Eq. (16) (revised version).
- Convergence result: based on this update rule, we establish the convergence results presented in Lemma 1.
- Link from Lemma 1 to Proposition 1: Lemma 1 is then used to prove Proposition 1.
- Proposition 1 connects the update rule to Eq. (14): Proposition 1 shows that our ensemble update rule given by Eq. (16) approximates the gradient correction step defined in Eq. (12) without explicit derivatives.
We hope this resolves the reviewer’s concerns, and we are happy to provide further clarifications if needed.
Technical concerns

W3: Thanks for sharing your comment. What we intended to convey is that the guidance term should theoretically depend on the noise schedule in the reverse process if the goal is posterior sampling. To see this, Eq. (4) in the DPS paper expresses the reverse SDE for posterior sampling, where the coefficient of the guidance term is directly tied to the noise schedule. However, in DPS's algorithm (Algorithm 1 in their paper), the coefficient is implemented as a heuristic that is unrelated to the noise schedule (see footnote 5 on page 6 of the DPS paper). This suggests that the posterior sampling interpretation does not align with how DPS actually works in practice. This is one reason why we propose the Prediction-Correction framework as an alternative interpretation.
W4: Thank you for pointing out this issue. We agree that the underlying optimization problem is generally non-convex, which means the algorithm may converge to a local maximum rather than a global one. To address this, we have revised the statement to make this explicit.
Highlighting the modifications in the revision: we apologize for any inconvenience caused by the lack of highlighting of the revised text. To facilitate the reviewer's evaluation, we have highlighted all revisions in the updated manuscript. Additionally, we provide a summary of major updates in the meta response for ease of reference.
Misuse of the terminology "pre-conditioner": we thank the reviewer for the reference and the detailed explanation of the concern. We agree that describing the matrix as a "pre-conditioner" could be misleading, as the term traditionally applies to invertible matrices. To fully address the reviewer's concern, we have removed the term "pre-conditioner" and revised the discussion to focus on the practical role of the matrix, which is to project the gradient onto the subspace spanned by the particles. This change is reflected in lines 264–267 of the updated manuscript.
Flaws in Proposition 1
Issues with Eq. (18): we thank the reviewer for pointing out the issue and apologize for the confusion. There was indeed a typo in the superscript on the right-hand side of Eq. (18). In the corrected right-hand side, the index k appears in both terms.
We appreciate the reviewer’s careful review and have corrected this typo in the updated manuscript. Additionally, we have reviewed the entire manuscript to ensure these indexing typos have been fixed.
Concerns about a derivation step in Lemma 1: we appreciate the reviewer's feedback and agree that this step requires explicit justification. To clarify, we have added a detailed explanation that explicitly shows the equality. We first recall the relevant definition:
Using this definition, we rewrite one of the cross terms as:
Similarly, the other cross term can be simplified in the same way. Therefore, the sum of the cross terms simplifies as stated. The full derivation is provided in lines 758-776 of the updated manuscript.
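The exact cross terms depend on the manuscript's notation, but the mechanism that collapses them is the elementary mean-centering identity, which the following check illustrates:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))                  # 8 particles in R^5
dev = X - X.mean(axis=0)                     # deviations from the empirical mean
print(np.allclose(dev.sum(axis=0), 0.0))     # True: deviations sum to zero
c = rng.normal(size=5)                       # any vector independent of the index j
print(np.isclose((dev @ c).sum(), 0.0))      # True: sum_j <x_j - x_bar, c> = 0
```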
We hope these clarifications address the reviewer’s concerns, and we are happy to provide further elaboration if needed.
Diffusion models have been used to address inverse problems, with numerous diffusion-based solvers that avoid retraining existing diffusion models. These approaches typically rely on pseudo-inverses or derivatives tied to the forward model. This paper introduces a novel diffusion-based inverse solver designed for cases where the forward model is unknown.
The authors propose Ensemble Kalman Diffusion Guidance (EnKG), a derivative-free method that utilizes only forward model evaluations along with a pre-trained diffusion model. In the proposed method, the guidance term is computed as follows:
- Particles are initialized so that covariances can be computed in the following steps.
- Along the diffusion trajectory, the particles are pushed forward by the ODE solver.
- The synthesized samples are then passed through the forward model, and their covariance is computed.
- The diffusion trajectory is updated by the formula given by EnKG. The proposed method replaces the derivative of the forward model with covariance computations and ODE solving.
The empirical results demonstrate the effectiveness of EnKG across diverse inverse problems, including cases where the measurement operator is treated as a black box. Specifically, the method is applied to image inversion problems with explicit forward models, Navier-Stokes inverse problems where the forward model is computed by solving PDEs, and black hole imaging, where the forward model is a black box.
Strengths
- Novel Approach: The paper introduces statistical linearization within ensemble Kalman methods to diffusion-based inverse problems, a novel concept in this context.
- Innovative Guidance Term Formulation: The authors present a unique formulation for the guidance term, with a clever trick that replaces the derivative of the forward model with covariance from forward evaluations.
- Comprehensive Validation: The effectiveness of EnKG is demonstrated across three different scenarios: (1) cases where the forward model is known and differentiable, (2) cases where the forward model is known but differentiating it is impractical (e.g., PDE-based models), and (3) cases where the forward model is a black box, with observations as the only available information.
Weaknesses
- Limited Discussion of Related Work: The paper lacks depth in interpreting and explaining related works.
- The motivation behind the weighting matrix is unclear. Could you provide further explanation on the intuition and reasoning behind the choice of weighting matrix?
- Although the Kalman method is a core component, the paper does not provide a thorough explanation of its role and mechanics in this context. Specifically, which parts of the method are directly applied from the existing literature, and which represent novel contributions of this paper? For example, the introduction of the weighting matrix, the derivation using local linearity of the operator in the proof, and the convergence claim. Related to the above questions, please clarify which intuition and motivation come from the literature and which are introduced by the authors.
- Overstated Contribution: The claimed contributions seem somewhat overstated. The concept of Predictor Corrector interpretation in guidance-based methods is not entirely new.
Questions
- Line 316: Is the DPS baseline included in the table?
- Black-Hole Imaging Problem: in the black-hole imaging problem, how is the forward evaluation computed if the forward model is unknown and its output differs from the observed data? Could you provide a step-by-step explanation of how the black-box forward model is handled in this case? Additionally, please clarify whether any assumptions are made about the forward model or the generated samples in this scenario.
- Proof of Lemma 1: the proof shows a monotonic decrease of a certain quantity. Why does this quantity converge to zero? From another perspective, a vanishing covariance implies that the trajectories converge to a single point. Does this implication contradict the ill-posed nature of inverse problems, which typically have many possible solutions?
Possible Errata
- Lines 225 and 284: Remove \Gamma .
- Line 731: Remove the extraneous ‘(‘.
- Line 847: Replace with .
Q1: Thanks for pointing it out. We have removed that sentence from the main text to avoid confusion in the revised version. For a more comprehensive evaluation, we have added a comparison against a few gradient-based methods including DPS as additional reference points in Appendix A.8.
Q2:
Thanks for the question. We assume G is a black-box function, meaning that for any input x, we can evaluate G(x) without having access to the internal structure or parameters of G - similar to scenarios where software provides an API to evaluate G but does not expose its source code. If this does not align with the reviewer's interpretation of "unknown," we would appreciate clarification. In the black hole imaging experiment, we evaluate G with the inImage.observe forward model from the ehtim library, which is a function of x. As defined in Eq. (1), x denotes the input to the forward model and y denotes the observed data.
Q3: Thanks for the questions.
- Proof clarification: we agree that the last step in the proof could benefit from additional clarification. In response, we have revised it as follows: by Assumptions 1 and 2, both relevant quantities are upper bounded. Further, by Assumption 3, the remaining term is lower bounded unless all the particles collapse to a single point. Therefore, there exists a positive constant such that the required inequality holds; plugging this into Eq. (23) shows that the quantity monotonically decreases to zero.
- Ill-posedness
- Different initializations may lead to different solutions: the convergence result indicates that the algorithm returns a point estimate (an ensemble of particles that are close to each other) for each run. However, the randomness in the initialization (prior step) can still lead to different convergence points across runs, just like other optimization methods. Thus, convergence does not imply that the algorithm cannot find different solutions.
- Clarification on ill-posedness: ill-posedness suggests that potentially many candidate solutions satisfy the measurement constraint. However, when equipped with a proper prior, the potential solution space is largely constrained, which is the motivation for using a strong diffusion prior. In such cases, the set of possible solutions may reduce to a unimodal distribution.
Possible Errata:
- We apologize for the confusion. The inner product with a matrix subscript denotes the weighted Euclidean inner product, where the subscript matrix is the covariance of the Gaussian noise in the observation, as defined in Eq. (1). We have added this notation and its definition to Table 4 for clarity; see the definition displayed after this list.
- Other issues have been fixed in the revised version. Thanks again for the detailed review.
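For reference, the standard convention for this weighted inner product (Table 4 of the manuscript gives the authoritative definition; this is a hedged restatement):

```latex
\langle a, b \rangle_{\Sigma} = a^{\top} \Sigma^{-1} b,
\qquad
\|a\|_{\Sigma}^{2} = a^{\top} \Sigma^{-1} a,
```

with \Sigma the covariance of the Gaussian observation noise in Eq. (1).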
[1]: Booton, Richard C. "Nonlinear control systems with random inputs." IRE Transactions on Circuit Theory 1.1 (1954): 9-18.
[2]: Kazakov IE (1954) Approximate method of statistical investigations of nonlinear systems. In: Proceedings, Voenno-Vozdushnaya Akademia imeni N.I. Zhukuvskogo, vol 394, pp 1–52
[3]: Evensen, Geir. "The ensemble Kalman filter: Theoretical formulation and practical implementation." Ocean dynamics 53 (2003): 343-367.
[4]: Iglesias, Marco A., Kody JH Law, and Andrew M. Stuart. "Ensemble Kalman methods for inverse problems." Inverse Problems 29.4 (2013): 045001.
[5]: Garbuno-Inigo, Alfredo, et al. "Interacting Langevin diffusions: Gradient structure and ensemble Kalman sampler." SIAM Journal on Applied Dynamical Systems 19.1 (2020): 412-441.
[6]: Calvello, Edoardo, Sebastian Reich, and Andrew M. Stuart. "Ensemble Kalman methods: a mean field perspective." arXiv preprint arXiv:2209.11371 (2022).
[7]: Chada, Neil K., Yuming Chen, and Daniel Sanz-Alonso. "Iterative ensemble Kalman methods: A unified perspective with some new variants." Foundations of Data Science (2021).
[8]: Kovachki, Nikola B., and Andrew M. Stuart. "Ensemble Kalman inversion: a derivative-free technique for machine learning tasks." Inverse Problems 35.9 (2019): 095005.
[9]: Bergou, El Houcine, Serge Gratton, and Jan Mandel. "On the convergence of a non-linear ensemble Kalman smoother." Applied Numerical Mathematics 137 (2019): 151-168.
[10]: Chada, Neil, and Xin Tong. "Convergence acceleration of ensemble Kalman inversion in nonlinear settings." Mathematics of Computation 91.335 (2022): 1247-1280.
[11]: Classifier-Free Guidance is a Predictor-Corrector. (Submission to ICLR2025).
We thank the reviewer for the constructive feedback and detailed review, and we apologize for any confusion caused by the presentation. In response, we address the reviewer's concerns and questions below.
W1: Role of the covariance matrix: intuitively, the covariance matrix is an important component of the derivation that facilitates interaction among particles in our framework. As we show in Proposition 1, introducing it enables us to derive an update rule for the particles that solves the inverse problem without explicit derivatives. The covariance matrix naturally appears in the expression of statistical linearization [1,2]. The combination of the covariance matrix and statistical linearization [1,2] is a well-established idea and a central feature of the family of ensemble Kalman methods [3,4,5,6], dating back to the 1990s and Evensen et al. [3]. To provide more context for this step, we have added an additional explanation after Eq. (12) in the revised version.
Clarification on contribution
- In a nutshell, the proposed EnKG is a novel algorithm that combines the expressive capabilities of data-driven priors with the ability to solve inverse problems using only black-box access to the forward model. EnKG introduces the core idea of statistical linearization from ensemble Kalman methods [3,4,5] into the Prediction-Correction framework for guided diffusion. At a high level, EnKG can also be viewed as an ensemble Kalman inversion method regularized by the diffusion model prior.
- Relation to existing literature
- The use of the covariance matrix stems from statistical linearization. This idea originates from [1,2] and is the fundamental building block of the family of ensemble Kalman methods (see [6,7] for an overview). However, its integration into the proximal operator defined in Eq. (12) is proposed by us.
- The convergence result in Lemma 1 is also developed by us. While prior works [9,10] establish convergence results for other ensemble Kalman methods in the non-linear setting, their proofs do not directly apply to our case due to differences in the update rule. That said, they share similar intuition and assumptions (e.g., regularity assumptions and bounded particles).
- The use of the local linearity of the operator is mentioned in [8,10]. However, [8] does not provide a formal proof and focuses on a different objective function and a momentum-based variant, while [10] examines a different variant of EKI with Tikhonov regularization and covariance inflation.
- To address the reviewer's concerns and improve clarity, we have also updated the Related Work section to expand the discussion of the literature.
W2: Thank you for your feedback.
- To the best of our knowledge, no prior work explicitly provides the Predictor-Corrector interpretation for guidance-based methods as presented in our paper. We did find a concurrent ICLR 2025 submission [11] that proposes a similar concept but focuses on classifier-free guidance.
- The primary contribution of our work is the development of the EnKG algorithm, which integrates the Predictor-Corrector framework with ensemble Kalman methods. We are open to revising our contribution statement and softening the wording around the Predictor-Corrector interpretation if the reviewer can provide specific references to prior work that aligns with our approach.
Dear Reviewer G5qz,
We want to thank you for your careful review and valuable feedback. We have carefully revised the manuscript based on your feedback and have provided further clarifications in our rebuttal. As the discussion period concludes today, we would like to check if our responses have addressed your concerns. If so, we would greatly appreciate it if you could update your score to reflect this.
We sincerely value your input and are happy to provide further clarifications if needed.
Best regards
Apologies for the delayed response.
I have reflected further on the feasibility of the proposed theory.
Since the bound is finite, it does not guarantee that the quantity converges to zero. Other types of inequalities would be required.
Unfortunately, I do not believe such an inequality is achievable under the given assumptions.
Additionally, I would like to highlight some points that were missed during the previous review process: If a matrix , then .
The assumptions imply a condition that is rather strong, especially when the forward map is ill-posed.
I recommend that the authors provide empirical evidence to demonstrate that the proposed derivative-free correction adequately approximates the actual correction term.
We thank the reviewer for the response.
EnKG does not rely on the convergence limit We appreciate the reviewer's reflection and would like to clarify that Lemma 1 is a standard convergence result in theoretical analyses (https://en.wikipedia.org/wiki/Convergent_series), where the average distance between particles approaches zero in the limit of infinite steps. Importantly, Proposition 1 does not rely on this exact limit. Instead, the convergence result is used to bound the Taylor approximation error introduced in Proposition 1, ensuring that the error is well-controlled.
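The standard fact being invoked here (a hedged restatement, since the manuscript's exact statement may differ) is that a summable nonnegative sequence must vanish:

```latex
a_k \ge 0 \ \text{for all } k
\quad \text{and} \quad
\sum_{k=0}^{\infty} a_k < \infty
\;\Longrightarrow\;
\lim_{k \to \infty} a_k = 0 .
```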
To further address this point, we empirically evaluate particle convergence in Figure 5, where we plot the average distance between particles at each update step for different problems. The results show rapid convergence, with the average distance nearing zero after 80 steps. This indicates that the Taylor approximation error remains well-controlled within a finite number of iterations, supporting the practical feasibility of our approach.
Clarifying the assumption: if the reviewer is referring to Assumption 3, we clarify that it is relatively loose. The degenerate condition only holds when all particles produce identical responses, i.e., when the forward model maps every particle to the same observation. This scenario occurs either when the algorithm has already converged or when the problem is so ill-posed that meaningful optimization is infeasible. Such a condition is highly unlikely, particularly when hundreds or thousands of particles are involved. Empirically, Figure 5 shows that the particles converge in a well-controlled manner across various experiments, supporting the validity of Assumption 3.
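For concreteness, a small helper (our illustrative sketch, not the manuscript's code) for the quantity plotted in Figure 5, the average pairwise distance between particles at each update step:

```python
import numpy as np

def avg_pairwise_distance(particles: np.ndarray) -> float:
    """Mean of ||x_i - x_j|| over all pairs i != j; particles has shape (J, d)."""
    diffs = particles[:, None, :] - particles[None, :, :]   # (J, J, d) pairwise differences
    dists = np.linalg.norm(diffs, axis=-1)                  # (J, J) pairwise distances
    J = particles.shape[0]
    return float(dists.sum() / (J * (J - 1)))               # diagonal is zero, so exclude it
```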
We hope this explanation provides further clarity on the assumptions and their practical implications. We kindly request the reviewer to reconsider the score in light of the key contributions and innovations of our work highlighted in our meta response.
The paper presents Ensemble Kalman Diffusion Guidance (EnKG), a novel derivative-free method for solving inverse problems using pre-trained diffusion models. Traditional approaches often require detailed knowledge of the forward model, such as derivatives, which limits their applicability. EnKG overcomes this by relying solely on black-box evaluations of the forward model and a pre-trained diffusion model. Key contributions are twofold: 1) EnKG operates solely with black-box access to forward model evaluations and a pre-trained diffusion model, making it particularly useful in scenarios where derivative information is inaccessible. 2) The authors introduce a prediction-correction (PC) framework that utilizes the empirical covariance matrix of ensemble particles during the correction step. This innovation allows EnKG to effectively bypass reliance on gradients, enhancing its applicability in non-linear inverse problems. The paper demonstrates the effectiveness of EnKG across various inverse problems, including applications in fluid dynamics. These examples highlight the method's capability to handle complex, non-linear scenarios that are common in scientific research. In summary, this work expands the toolkit for addressing inverse problems in machine learning by introducing a flexible and robust approach that maintains the generative power of diffusion models.
Strengths
- EnKG operates without gradients, needing only black-box access to forward models, making it highly applicable to complex inverse problems with unknown or undefined derivatives. The proposed PC framework generalizes existing methods, enabling adaptability across various inverse problems without retraining.
- The current work demonstrates strong performance, notably on complex tasks like the Navier-Stokes equation, outperforming gradient-based solutions.
- Provides deeper understanding and new interpretations of diffusion-based approaches, contributing to the field of inverse problem-solving.
Weaknesses
- The method may face challenges when scaling to very large models or high-dimensional data, as ensemble-based approaches can become computationally expensive. A further insight along this line would be useful. Also, it relies on pre-trained diffusion models, which might limit effectiveness if high-quality models are not available for certain tasks.
- The empirical validation focuses on specific problem sets; broader testing across diverse applications would strengthen the generalizability claim.
- A further analysis on the algorithm complexity would be beneficial as the combination of the prediction and correction steps might introduce additional computational and implementation complexity due to ensemble covariance estimation.
Questions
- How does EnKG perform when applied to larger-scale problems or high-dimensional data? Are there specific limitations to its computational efficiency?
- Is there any dependence of the performance on the pre-trained model? How sensitive is the method to the quality and type of pre-trained diffusion model used? What are the implications if a suitable model is not available?
Q1: Thanks for the good question.
- As discussed in our response to W3, the proposed EnKG is well-suited to large-scale problems when the forward model is the major cost and can be evaluated in parallel. In such cases, EnKG's reliance on parallelizable updates ensures scalability when equipped with modern distributed computing infrastructure. However, if the forward model is inherently slow and cannot be parallelized, EnKG's runtime may become a limiting factor.
- Exploring even higher-dimensional problems to benchmark EnKG and other derivative-free methods is an exciting direction for future research. However, such an effort is beyond the scope of this paper, which primarily focuses on introducing and analyzing the new algorithmic design.
Q2: This is an interesting and insightful question.
Additional controlled experiment: to investigate the dependence on the quality of the pre-trained diffusion model, we conducted a controlled experiment on the Navier-Stokes equation. Specifically, we trained a diffusion model prior using only 1/10 of the original training set and limited the training to 15k steps to simulate a lower-quality model. We evaluated the top two algorithms, EnKG and DPG, and report the relative L2 error below.
| Algorithm | Original model trained with full data | New model trained with only 1/10 of the data |
|---|---|---|
| DPG | 0.325 (0.188) | 0.394 (0.178) |
| EnKG (Ours) | 0.120 (0.085) | 0.169 (0.117) |
Robust performance We observe that while both algorithms experienced a performance drop due to the reduced quality of the diffusion model, the decline was relatively small compared to the significant reduction in training data. Notably, our EnKG demonstrated greater robustness, with a smaller performance drop than the best baseline method, DPG. These results indicate that while EnKG benefits from high-quality diffusion models, it is not overly sensitive to their quality. It maintains strong performance even with reduced model capabilities. We have included this interesting experiment and results in Appendix A.9.
[1]: Kohler, Walter H. "A preliminary evaluation of the critical path method for scheduling tasks on multiprocessor systems." IEEE Transactions on Computers 100.12 (1975): 1235-1238.
Thank you for the detailed responses and the additional results provided in the rebuttal. I apologize for my late reply due to unforeseen circumstances. After reviewing the comments and the paper again, I would like to summarize my thoughts again.
The authors' point about leveraging advancements in distributed computing to mitigate computational challenges is well taken. However, it remains essential to clarify how these optimizations will be implemented in practice, especially as computational demands increase. The clarification that EnKG does not explicitly calculate or store the covariance matrix is helpful. The focus on vector inner products for ensemble updates is an interesting approach that could lead to efficiency gains. The metrics provided for evaluating computational complexity are also valuable, particularly the distinction between sequential and total model queries, which aids in understanding the algorithm's scalability.
The authors have effectively communicated that EnKG is designed for large-scale problems where forward model evaluations can be parallelized. However, they also note that if the forward model is slow and cannot be parallelized, runtime may be a limiting factor. The controlled experiment investigating the robustness of EnKG against variations in pre-trained model quality is particularly insightful. The results indicate that while performance does decline with reduced data quality, EnKG maintains a stronger performance relative to DPG. This finding supports the claim that EnKG is robust and not overly sensitive to model quality.
While I acknowledge the improvements and clarifications made by the authors, I have some reservations regarding points highlighted by other reviewers, especially the reviewer Fiay. Therefore, I will maintain my initial score.
We sincerely thank the reviewer for the thoughtful feedback, positive comments, and careful review of our manuscript. We appreciate your recognition of the clarifications and improvements made during the discussion period.
Regarding the concerns raised by other reviewers, we have provided detailed responses and addressed these points comprehensively in our rebuttal. A summary of these discussions is included in our General Response to Summarize Discussion for ease of reference. We believe that the remaining concerns are relatively minor and have been adequately addressed.
Thank you again for your constructive engagement and valuable insights throughout the review process.
W1: Thank you for pointing out this important consideration. We agree that EnKG may face high overall computational costs when scaling to very high-dimensional problems. However, due to its inherently parallel nature, EnKG can take advantage of recent advancements in large-scale distributed computing infrastructure to mitigate these challenges. With appropriate optimization and resource allocation, we believe that EnKG remains scalable even for high-dimensional problems, especially as computational hardware and distributed frameworks continue to improve. Regarding the reliance on the pretrained model, we address the concern in the response to Q2 below.
W2: We thank the reviewer for the comment on the scope of our experiments. Establishing a proper set of benchmarks for this style of problem is a major project in its own right, and we are excited to pursue it in the future.
W3: We would like to first clarify that EnKG does not explicitly calculate or store the covariance matrix, as shown in Algorithm 2. Instead, the ensemble updates are performed using vector inner products, which are computationally efficient (see the illustrative sketch at the end of this response). Consequently, the primary computational cost arises from the forward model queries and diffusion model queries, which are reflected in the metrics for computational complexity defined in the "Evaluation Metrics" section of Sec. 4.2. For the reviewer's convenience, we explain these metrics below.
Metrics for evaluating computational complexity
- `Total # Fwd` is the total number of forward model queries.
- `Seq # Fwd` is the number of sequential forward model queries, representing the count of forward model queries that must be performed in sequence, where each query depends on the previous ones.
- `Total # DM` and `Seq # DM`: similar to `Total # Fwd` and `Seq # Fwd`, but for diffusion model evaluations.
- `Total # DM grad` and `Seq # DM grad`: similar to `Total # Fwd` and `Seq # Fwd`, but for gradient evaluations of the diffusion model.
The sequential metrics are important because they determine the minimum runtime irrespective of available computational resources. By isolating the sequential evaluations, we can better assess how much of the algorithm can potentially benefit from parallelization, similar to the concept of the "critical path" in algorithm analysis in the computer science literature [1].
Analysis of computational complexity: Figure 3(b) compares the different algorithms through the lens of the above metrics. As shown, EnKG requires the fewest sequential forward model calls, which indicates that EnKG can scale up to large-scale problems when the forward model is the major cost and can be evaluated in parallel.
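To illustrate why no covariance is ever materialized, here is a hedged sketch of one such update written purely with inner products (shapes and the step size are illustrative; the actual algorithm is Algorithm 2 in the paper):

```python
import numpy as np

def ensemble_update(particles: np.ndarray, g: np.ndarray, y: np.ndarray, lr: float = 0.5):
    """One derivative-free correction; particles: (J, d), g = G(particles): (J, m), y: (m,)."""
    J = particles.shape[0]
    Xc = particles - particles.mean(axis=0)   # (J, d) centered particles
    Gc = g - g.mean(axis=0)                   # (J, m) centered forward evaluations
    D = Gc @ (g - y).T                        # (J, J): only observation-space inner products
    # Mathematically equal to applying the (d, m) empirical cross-covariance to each
    # residual, but memory is O(J*d + J*m + J^2); the (d, d) covariance never appears.
    return particles - (lr / J) * D.T @ Xc
```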
This paper proposes a new approach to solve inverse problems using a derivative-free optimization method based on the ensemble Kalman filter. The core idea is to approximate the gradient of the data-fidelity term with a statistical linearization from ensemble Kalman methods. The method is applied to three types of inverse problems: computational imaging problems, the Navier-Stokes equation, and the black-hole imaging problem. The method is compared to other derivative-free baselines.
Strengths
- Interesting approach to solve inverse problems.
- Derivative-free approaches can be useful in many cases and have received much less attention than gradient-based methods.
Weaknesses
- Positioning relative to the state-of-the-art is not clear, in particular with respect to the proposed framework, which seems to be a variant of the existing ones (see Q2).
- The evaluation is not completely satisfactory (see Q3, Q4).
Questions
- Q1: Use of a `pretrained diffusion model` is mentioned. I can see this type of model for images, but for complex data with varying domains of definition, as in many scientific applications, doesn't the diffusion model need to be retrained for each type of data? It is not completely clear how reusable these models are outside computational imaging.
- Q2: The prediction-correction scheme strongly relates to the usual Plug-and-Play methods classically used in inverse problem resolution. The proposed scheme is related to the Forward-Backward Splitting (FBS) method, which is typically used in DiffPIR [B]. From what I understand, equation (9) in this paper corresponds to equation (13) in [A]. Could you elaborate on the main differences between the proposed method and the DiffPIR framework (aside from the derivative-free aspect)? Other missing references are [A], a survey on using diffusion models for inverse problems, and [C], which also considers a diffusion prior for inverse problems, with a gradient-based method applied to the black-hole imaging problem. Adding a comparison with their method would be very useful.
- Q3: To make it possible to evaluate how good the results are compared to methods that can access the gradient of the forward operator, it is necessary to add a few methods that have access to the forward operator's gradients. Indeed, even though ODE solvers are not always natively differentiable, more and more works consider making them differentiable, for instance using `jax` for Navier-Stokes with a pseudo-spectral solver. The interesting question is: should we spend time making them differentiable, or do we not gain much by doing so? Therefore, quantifying how much is lost in simple cases such as the ones presented here is necessary to make the case for derivative-free methods. In particular, adding the results of the DPS and DiffPIR methods would be very useful, at least for the imaging task. Note that these models are both implemented in the `deepinv` library. Also, adding the DPS baseline and PnP-DM from [C] for black-hole imaging would better illustrate how much we lose by not considering the gradient of this differentiable operator. Adding them for Navier-Stokes would also be very interesting but probably more challenging.
- Q4: The values of the two hyperparameters used in the experiments are not reported. Could the authors provide them? From equation (16), I understand that the forward operator must be computed on the order of the ensemble size at each iteration. From the Navier-Stokes experiment, assuming the procedure is run for 1k steps, what were the corresponding values? How were these values chosen, and what is their impact on the results? Also, would having access to the gradient mean going roughly 100 times faster than EnKG? (The computational cost of computing the gradient through autodiff is approximately 2-3 times the cost of evaluating the forward operator.) Note that the chosen metrics (# of evaluations of Fwd/DM, Seq) are not clear; a better definition of what they represent would be useful. Adding the total runtime of the method would help a lot in assessing the computational cost.
- Q5: In the black-hole experiment, how many simulations are used to train the diffusion model?
Minor comments, nitpicks and typos
- Missing ref:
- l.066: "One more challenging" -> "On"
- l.069: "More computationally efficient" -> than what?
- l.078: "often Gaussian" -> They are almost never Gaussian, but they are modeled as such and this gives reasonable results.
- l.141: "and Nelder-Mead simplex methods" -> extra `,`.
- l.198: The drop of the subscript for the gradient is confusing.
- Eq (7): The notation is not defined. As it is not used anywhere else, I would recommend using a different notation instead.
- Eq (12): A superscript is missing, and the notation is not completely consistent.
- l.302: "instead"
- l.316: `our approach outperforms the standard strong DPS baseline` -> The DPS results are not present in the table, so I think they are missing?
- l.335: What is the `EDM` framework?
- l.847: The change from "l" to "j" should be made explicit, as it is not immediately clear and can appear as a typo.
References
[A] : Daras, Giannis, et al. A survey on diffusion models for inverse problems. arXiv preprint, 2024.
[B] : Zhu, Yuanzhi, et al. Denoising diffusion models for plug-and-play image restoration. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[C] : Wu, Zihui, et al. Principled Probabilistic Imaging using Diffusion Models as Plug-and-Play Priors. arXiv preprint, 2024.
[1]: Deng, Chengyuan, et al. "OpenFWI: Large-scale multi-structural benchmark datasets for full waveform inversion." Advances in Neural Information Processing Systems 35 (2022): 6007-6020.
[2]: Allgower, Eugene L., and Kurt Georg. Numerical continuation methods: an introduction. Vol. 13. Springer Science & Business Media, 2012.
[3]: Kochkov, Dmitrii, et al. "Machine learning–accelerated computational fluid dynamics." Proceedings of the National Academy of Sciences 118.21 (2021): e2101784118.
[4]: Chen, Ricky TQ, et al. "Neural ordinary differential equations." Advances in neural information processing systems 31 (2018).
[5]: Gaggioli, E. L., and Oscar P. Bruno. "Parallel inverse-problem solver for time-domain optical tomography with perfect parallel scaling." Journal of Quantitative Spectroscopy and Radiative Transfer 290 (2022): 108300.
[6]: Buehner, Mark, Ron McTaggart-Cowan, and Sylvain Heilliette. "An ensemble Kalman filter for numerical weather prediction based on variational data assimilation: VarEnKF." Monthly Weather Review 145.2 (2017): 617-635.
[7]: Miyoshi, Takemasa, and Masaru Kunii. "The local ensemble transform Kalman filter with the Weather Research and Forecasting model: Experiments with real observations." Pure and applied geophysics 169.3 (2012): 321-333.
[8]: Karras, Tero, et al. "Elucidating the design space of diffusion-based generative models." Advances in neural information processing systems 35 (2022): 26565-26577.
[9]: Kohler, Walter H. "A preliminary evaluation of the critical path method for scheduling tasks on multiprocessor systems." IEEE Transactions on Computers 100.12 (1975): 1235-1238.
[10]: Chung, Hyungjin, et al. "Diffusion Posterior Sampling for General Noisy Inverse Problems." The Eleventh International Conference on Learning Representations.
[11]: Peng, Xinyu, et al. "Improving Diffusion Models for Inverse Problems Using Optimal Posterior Covariance." Forty-first International Conference on Machine Learning. 2024.
[12]: Tang, Haoyue, et al. "Solving General Noisy Inverse Problem via Posterior Sampling: A Policy Gradient Viewpoint." International Conference on Artificial Intelligence and Statistics. PMLR, 2024.
Sorry for the delayed answer.
I thank the authors for their clarification. I appreciate that the authors took my comments into consideration and added new comparisons with gradient-based methods. I recommend adding them to the main paper (at least Tables 1 and 2) to provide a comprehensive overview of the results. In most cases, the gradients could be computed but would require a significant amount of work, so it is important to know how much is lost by not using them. There is a clear computational tradeoff between running many forward operations and spending time building a stable method to compute the gradient.
Adding results with an adjoint method on the Navier-Stokes problem would also be very interesting, to see how much is lost by not using the gradient information of the forward operator. Note that I am not asking for this extra result for the discussion period; I recommend working on it for either the final version of the paper or a resubmission.
One thing that I don't understand is why the gradient-based method does not always outperform the gradient-free method for the linear image reconstruction problem. This is surprising, and I would appreciate some insights into why this is the case.
Also, I am surprised that the time gain of DPS is only 4x, as it should only require 1k forward evaluations and 1k diffusion model evaluations, so the orders of magnitude seem quite different when looking at Table 2.
Overall, I agree with Reviewer G5qZ that the paper oversells the method in its current form. It does not give a comprehensive overview of the trade-offs between gradient-free and gradient-based methods. The addition of the new experiments improves this, but I think the message is not clear enough.
I am ready to discuss this with the other reviewers if they think this is not the case. I will increase my score to 5 as the paper improved compared to my initial evaluation.
We thank the reviewer for increasing the score and additional comments.
The trade-offs between gradient-free and gradient-based methods: as mentioned in the reviewer's response, the tradeoff between derivative-free and gradient-based methods must account for the effort spent implementing reliable and accurate gradient estimation for the problem. However, quantifying this effort is inherently challenging, particularly for PDEs. For example, developing adjoint methods for different classes of Navier-Stokes equations is still an open research problem [4]. Consequently, we believe there is no single answer to this trade-off, and the answer will change over time as techniques evolve and as we tackle harder non-differentiable problems.
If the reviewer refers to computational complexity comparisons between the proposed EnKG and other gradient-based methods, assuming the gradient is already implemented, please see our clarification below. However, it is important to note that such comparisons inherently ignore the significant effort required to implement gradient computation.
**Gradient-based methods are not always better than derivative-free methods** This is a good question. There are two potential reasons related to the gradient of the forward model:
- Ill-posedness: even in linear problems, gradient-based methods may get stuck in local optima when the problem is ill-posed.
- Measurement noise: even with the exact gradient, overfitting to the noisy measurement does not yield the best solutions.
In contrast, derivative-free methods offer implicit regularization by smoothing gradients (Central-GSG, DPG) or projecting updates onto a subspace (EnKG); a minimal sketch of such a smoothed-gradient estimator is given below.
Subtle performance differences may also arise from interactions with MMSE estimates and algorithmic implementation nuances (e.g., DPS and DiffPIR). While this remains an open question, we hope these insights are helpful.
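For intuition, here is a minimal sketch of a central Gaussian smoothed-gradient (GSG) estimator in the spirit of our Central-GSG baseline; the function name, defaults, and toy objective are our illustrative assumptions, not the exact baseline implementation:

```python
import numpy as np

def central_gsg(loss, x, n_queries=32, sigma=0.1, rng=None):
    """Central Gaussian smoothed-gradient estimate of grad loss(x).

    Averages central finite differences along random Gaussian directions;
    the smoothing radius sigma acts as implicit regularization on noisy
    or ill-posed objectives, which is the effect discussed above.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(x)
    for _ in range(n_queries):
        u = rng.standard_normal(x.shape[0])
        grad += (loss(x + sigma * u) - loss(x - sigma * u)) / (2 * sigma) * u
    return grad / n_queries

# Toy check on a quadratic (true gradient is x itself); the error shrinks
# as n_queries grows.
x = np.ones(16)
g = central_gsg(lambda z: 0.5 * np.sum(z**2), x, n_queries=512)
print(np.linalg.norm(g - x))
```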
Clarifying the computational complexity
The runtime advantage of DPS being only about 4x over EnKG reflects EnKG's parallelization capability. As shown in Table 2, the Seq # Fwd of EnKG is just 0.14k compared to DPS's 1k, resulting in a shorter critical path. With full parallelization of the particle computations, EnKG could outperform DPS in runtime. However, full parallelization is not feasible on a single A100 GPU, which places the observed runtime ratio of 4 (EnKG/DPS) between 0.14 (ideal parallelization, from Seq # Fwd) and 295 (no parallelization, from Total # Fwd).
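For concreteness, here is the arithmetic behind these bounds (a worked sketch; $t_{\mathrm{fwd}}$ is our shorthand for the cost of one forward-model query, and we assume EnKG issues one query per particle per step):

```latex
% Runtime is bounded below by the sequential critical path, no matter
% how many devices are available:
T_{\min} \;\ge\; (\mathrm{Seq~\#~Fwd}) \cdot t_{\mathrm{fwd}}.
% EnKG/DPS ratios implied by Table 2, assuming one forward query per
% particle per step (2048 particles x 144 steps ~ 295k for EnKG):
\frac{0.14\mathrm{k}}{1\mathrm{k}} = 0.14
\;\le\; \underbrace{4}_{\substack{\text{measured}\\ \text{(single A100)}}}
\;\le\; \frac{2048 \times 144}{1\mathrm{k}} \approx 295.
```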
**Overselling the method** We apologize for any aspects of the paper that may have given the impression of overselling. We will certainly revise the text if the reviewer can point to specific instances where the tone seems overstated. Nonetheless, we feel that the study of derivative-free methods for diffusion guidance is a sufficiently novel intellectual contribution in its own right and worthy of publication at ICLR.
Intellectual merit of studying derivative-free methods
Last, we would like to emphasize the intellectual merit of studying derivative-free methods in a broad context when solving inverse problems.
- Applicability to non-differentiable problems (e.g., symbolic music generation [1], stochastic weather models [2]).
- Handling scenarios where the forward model's internal structure is unknown, such as black-box APIs.
- High parallelizability that allows them to benefit from the recent advancements in large-scale distributed computing infrastructure, as discussed in [3].
- Allowing for more flexible forward modeling without requiring structural constraints for gradient computation.
**Additional experiment on the Navier-Stokes equation** We agree this would be an interesting experiment. However, it is highly non-trivial due to the nonlinear term in the Navier-Stokes equation. We are not aware of any public implementation of gradient estimation (adjoint methods) for the incompressible Navier-Stokes equation with a pseudospectral primal solver. We would greatly appreciate any references the reviewer could share.
We hope these responses address your concerns, and we welcome any additional feedback or suggestions. Thanks again for the constructive engagement.
[1]: Huang, Yujia, Adishree Ghatare, Yuanzhe Liu, Ziniu Hu, Qinsheng Zhang, Chandramouli Shama Sastry, Siddharth Gururani, Sageev Oore, and Yisong Yue. "Symbolic Music Generation with Non-Differentiable Rule Guided Diffusion." In Forty-first International Conference on Machine Learning.
[2]: Miyoshi, Takemasa, and Masaru Kunii. "The local ensemble transform Kalman filter with the Weather Research and Forecasting model: Experiments with real observations." Pure and applied geophysics 169.3 (2012)
[3]: Salimans, Tim, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. "Evolution strategies as a scalable alternative to reinforcement learning." arXiv preprint arXiv:1703.03864 (2017).
[4]: Fang, Lean, and Ping He. "A duality-preserving adjoint method for segregated Navier–Stokes solvers." Journal of Computational Physics 503 (2024): 112860.
Q4: Parameter choices: We apologize for the confusion and clarify the parameter choices as follows. First, the number of forward model queries per step for our baselines (Forward-GSG and Central-GSG) is chosen so that their overall runtime cost is similar to that of the proposed EnKG. For the number of particles, we conducted an ablation study on the Navier-Stokes equation, as shown in Figures 6 and 8. The performance improves as we increase the number of particles, with the values in the table below offering the best trade-off between accuracy and runtime. For the number of steps, we fixed it to 144, following the EDM framework [8]; Figure 2 in [8] suggests that the EDM sampler yields marginal improvement beyond this point. The parameter choices of EnKG for each type of problem are specified below.
| Problem | Number of particles | Number of correction steps |
|---|---|---|
| FFHQ256 | 1024 | 144 |
| Navier-Stokes | 2048 | 144 |
| Black hole | 1024 | 144 |
**Runtime comparison** We also report the runtime for different algorithms on the Navier-Stokes equation, measured on a single A100 GPU. Table 2 has been updated to include these runtime results. For PnP-DM, we explored all the hyperparameter combinations in their paper, but all of them crashed with numerical instability in the solver. We found that the Langevin Monte Carlo generates noisy vorticity fields that violate the stability condition of the PDE solver, so we had to use a very small time step for stability, which significantly increased runtime (estimated to exceed 100 hours). Thus, we marked these results as "timeout" below. On the other hand, DPS is about 3.5x-4x faster but produces worse results than EnKG, as reported in our earlier comparisons.
| EKI | SCG | DPG | Forward/Central-GSG | EnKG | DPS | PnP-DM | |
|---|---|---|---|---|---|---|---|
| Runtime (minutes) | 121 | 119 | 228 | 105 | 124 | 35 | Crashed/timeout |
Elaboration on the computational complexity metrics
- We identify that the primary computational resources required by inverse problem solvers are forward model queries and diffusion model queries. Our chosen metrics aim to comprehensively represent these resource demands. More specifically:
  - `Total # Fwd` is the total number of forward model queries.
  - `Seq # Fwd` is the number of sequential forward model queries, i.e., the number of forward model queries that must be performed in sequence because each query depends on the previous ones.
  - `Total # DM` and `Seq # DM`: analogous to `Total # Fwd` and `Seq # Fwd`, but for evaluations of the diffusion model.
  - `Total # DM grad` and `Seq # DM grad`: analogous, but for gradient evaluations of the diffusion model.
- **Why do we measure the number of sequential evaluations?** Sequential evaluations determine the minimum runtime irrespective of the available computational resources. By isolating them, we can better assess how much of an algorithm can benefit from parallelization, similar to the concept of the "critical path" in algorithm analysis in the computer science literature [9]. A toy accounting sketch is given below.
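As an illustration, the two forward-model metrics follow directly from an algorithm's loop structure (a hypothetical accounting sketch, not part of our released code; the numbers come from Table 2 and the particle counts above):

```python
# An ensemble method runs `steps` sequential rounds; within a round all
# `particles` forward queries are independent, so they share one slot on
# the critical path.

def fwd_query_counts(steps, particles):
    total_fwd = steps * particles  # Total # Fwd: every query issued
    seq_fwd = steps                # Seq # Fwd: length of the critical path
    return total_fwd, seq_fwd

# EnKG on Navier-Stokes (2048 particles, 144 steps) vs. DPS (1 query/step):
print(fwd_query_counts(144, 2048))  # (294912, 144)  ~ 295k total, 0.14k seq
print(fwd_query_counts(1000, 1))    # (1000, 1000)   ~ 1k total, 1k seq
```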
Q5: 50k black-hole GRMHD simulations are used for training the diffusion model.
Minor comments, nitpicks and typos: thanks for the reviewer's detailed comments. We address your concerns below:
- We have updated "more computationally efficient than other derivative-free methods" for clarity.
- EDM is proposed by [8] to express the diffusion models in a unified and clean framework (introduced in Line 96). We have added the citation to the word EDM to avoid confusion.
- We have fixed the other issues in the revised version.
Q3: We appreciate the insightful question and constructive suggestions.
Part 1: additional comparison against gradient-based methods as reference
- For a more comprehensive evaluation, we have added results for gradient-based methods in Table 8, Table 9, and Table 10 in Appendix A.8 (last page). We also present them right below for the reviewer's convenience. Specifically, we have added DPS and DiffPIR for the imaging tasks, and DPS and PnP-DM for the Navier-Stokes and black-hole experiments, as requested by the reviewer.
- The results show that EnKG clearly outperforms gradient-based methods like DPS and PnP-DM on the Navier-Stokes equation and achieves comparable results on image restoration and black-hole imaging. Details of the comparison are added to Appendix A.8.
FFHQ256 image restoration tasks:
| Method | Inpainting (PSNR/SSIM/LPIPS) | SR (x4)(PSNR/SSIM/LPIPS) | Deblur (Gaussian) (PSNR/SSIM/LPIPS) | Phase Retrieval (PSNR/SSIM/LPIPS) |
|---|---|---|---|---|
| DPS | 21.77/0.767/0.213 | 24.90/0.710/0.265 | 25.46/0.708/0.212 | 14.14/0.401/0.486 |
| DiffPIR | 22.87/0.653/0.268 | 26.475/0.744/0.220 | 24.870/0.690/0.251 | 22.204/0.733/0.270 |
| EnKG | 21.70/0.727/0.286 | 27.17/0.773/0.237 | 26.13/0.723/0.224 | 20.06/0.584/0.393 |
Relative L2 error on Navier-Stokes equation:
| Methods | |||
|---|---|---|---|
| DPS | 0.308 (0.214) | 0.349 (0.246) | 0.382 (0.228) |
| PnP-DM | Crashed/timeout | Crashed/timeout | Crashed/timeout |
| EnKG | 0.120 (0.085) | 0.191 (0.057) | 0.294 (0.064) |
Black hole imaging:
| Method | PSNR | Blur PSNR | ||
|---|---|---|---|---|
| PnP-DM | 28.211 | 32.499 | 1.120 | 1.224 |
| DPS | 23.984 | 26.220 | 1.212 | 1.079 |
| EnKG | 29.093 | 32.803 | 1.426 | 1.270 |
Part 2: response to the question on differentiable solvers
- We acknowledge the reviewer’s point about differentiable solvers. While differentiable solvers are valuable in certain contexts, we argue that developing derivative-free methods remains a valuable direction for the following reasons:
- **Autograd is not always efficient or reliable** Automatic differentiation frameworks like JAX or PyTorch provide a general and straightforward way to estimate gradients. However, they often become computationally prohibitive for PDEs with high iteration counts, due to the huge computation graph produced by unrolling the iterative numerical solver. For example, in our Navier-Stokes problem, autograd encounters out-of-memory issues when the pseudospectral solver unrolls for more than around 6k steps (on an A100-40GB GPU); see the toy sketch after this list. Additionally, the paper [3] associated with the aforementioned `jax-cfd` package explicitly mentions this issue in the first paragraph of page 4. Therefore, autodiff is only practical for a limited range of problems. Moreover, autodiff can introduce additional numerical error, as evidenced in the neural ODE work [4], which prompted those authors to derive an adjoint method to control the error.
- **Efficient and robust adjoint methods are not always easy to design** While adjoint methods exploit PDE structure for efficient gradient estimation, they often require problem-specific derivations. Robust and efficient designs for adjoint methods remain an active area of research [5].
- **Gradients can be inaccessible for many other reasons** For instance, the forward model could be stochastic, partially unknown, or implemented as black-box software, making gradient computation infeasible. We would be happy to elaborate further if the reviewer wishes.
- Given these challenges, we believe developing derivative-free methods is also a valuable direction. For example, derivative-free methods have been widely used in operational weather forecasting centers [6,7].
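To make the unrolling issue concrete, here is a toy PyTorch sketch (the update rule is a hypothetical stand-in for our pseudospectral solver): reverse-mode autodiff keeps every intermediate state alive for the backward pass, so memory grows linearly in the number of solver steps.

```python
import torch

def unrolled_solver(u0, n_steps, dt=1e-3):
    """Toy explicit time-stepper standing in for an iterative PDE solver.

    Under reverse-mode autodiff, all n_steps intermediate states stay
    alive for the backward pass, so peak memory is O(n_steps * state_size)
    -- the failure mode we hit at ~6k unrolled steps on an A100-40GB.
    """
    u = u0
    for _ in range(n_steps):
        u = u + dt * (-u + torch.tanh(u))  # hypothetical update rule
    return u

u0 = torch.randn(4096, requires_grad=True)
loss = unrolled_solver(u0, n_steps=1000).pow(2).sum()
loss.backward()  # traverses the full 1000-step graph
print(u0.grad.shape)  # torch.Size([4096])
```

Gradient checkpointing (e.g., `torch.utils.checkpoint`) trades recomputation for memory, but it only postpones the issue for very long unrolls and still requires the solver to be written in a differentiable framework.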
We appreciate the constructive suggestions and detailed review, and we apologize for any confusion caused by our presentation. We address the reviewer's questions and concerns below.
Q1: Thanks for the good question. The prior diffusion model typically needs to be trained on domain-specific data to fully leverage domain characteristics, as we did with the Navier-Stokes and black-hole data. Once trained, it can be reused for different problems (various forward models) within that domain. For example, the same black-hole prior can be reused for different telescope configurations, such as varying the number of telescopes, their positions, or the level of measurement noise. Similarly, this adaptability extends to other domains, such as seismology and ocean dynamics, where the same prior can support different problem configurations, which is practically useful.
Exploring diffusion priors that generalize across multiple domains is an exciting direction for future research. However, such an effort would require studying the interplay between prior transferability and the underlying problem characteristics, which extends beyond the scope of this paper.
Q2: We thank the reviewer for the references and questions, and we apologize for any confusion caused by our presentation.
- **Correction step is general and requires algorithmic design** We would like to first clarify that Eq.(9) is a general expression showing how existing guidance-based methods fit within our Prediction-Correction framework. This feature is shared by many other algorithms [10,11,12], including DiffPIR. The specific design of the correction step is an important algorithmic contribution and differentiates the various methods.
- **EnKG vs DiffPIR** The correction step of EnKG is conceptualized in Eq.(14) and implemented as Eq.(16) or Eq.(17). If the reviewer is referring to a comparison between Eq.(14) or Eq.(16) in our paper and Eq.(13) in DiffPIR, we elaborate on the major differences below:
- **Optimization space** In Eq.(13) of DiffPIR, the optimization is over the clean data estimate (implemented with Tweedie's formula in their paper). In contrast, as shown in Eq.(12), EnKG tackles the subproblem directly in the noisy iterate, i.e., it operates in the diffusion state space rather than the clean data space.
- **Update rule** EnKG solves the problem in Eq.(12) in a completely different way: it runs an ensemble of interacting particles without gradient information, whereas DiffPIR performs a standard gradient step. These distinct approaches yield entirely different algorithms, as evident from comparing Algorithm 2 of EnKG with Algorithm 1 in the DiffPIR paper (a generic sketch of such an ensemble update is given after this list).
- **Derivative-free optimization** The differences in the algorithms lead to a key distinguishing property: the derivative-free nature of EnKG. EnKG can solve problems that DiffPIR cannot handle, as the gradient is not always available. We discuss the broader value of this property further in our response to Q3.
- **Frameworks for interpreting guidance** Our prediction-correction (PC) scheme also offers a different interpretation of guidance-based methods, distinct from DiffPIR's HQS framework. Inspired by numerical continuation [2], the PC scheme progresses along a prior-defined path while optimizing for high-likelihood regions, which enables a clean development of our EnKG method. In contrast, DiffPIR views generic guidance-based methods through the lens of a series of MAP estimation problems (Appendix A.1 in their paper).
- We have incorporated the reference [A] at line 103 and added the results of PnP-DM as requested. These details are further elaborated in our response to Q3.
- We hope this comparison clarifies the distinctions between EnKG and DiffPIR. Please let us know if further elaboration is needed.
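For concreteness, the flavor of update EnKG builds on can be sketched with a generic ensemble Kalman inversion (EKI) step. This is a textbook-style sketch with hypothetical names and a toy linear problem, not a verbatim transcription of our Eq.(16):

```python
import numpy as np

def eki_step(X, y, G, gamma=1e-2, lr=1.0):
    """One generic ensemble Kalman inversion (EKI) update; derivative-free.

    X : (J, d) ensemble of particles
    y : (m,)   measurement
    G : callable (d,) -> (m,); only point-wise forward-model access is used
    """
    J = X.shape[0]
    GX = np.stack([G(x) for x in X])           # J queries, all parallelizable
    Xc, Gc = X - X.mean(0), GX - GX.mean(0)    # centered particles / outputs
    # Statistical linearization: the cross-covariance C_xG plays the role of
    # a gradient, and each particle's update lies in span{x_j - x_bar}, i.e.,
    # the ensemble subspace. No derivatives of G are needed.
    C_xG = Xc.T @ Gc / J                       # (d, m)
    C_GG = Gc.T @ Gc / J + gamma * np.eye(GX.shape[1])
    return X + lr * (y - GX) @ np.linalg.solve(C_GG, C_xG.T)

# Toy usage: recover x_true from noiseless linear data y = A @ x_true.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 16))
x_true = rng.standard_normal(16)
X = rng.standard_normal((64, 16))
for _ in range(50):
    X = eki_step(X, A @ x_true, lambda x: A @ x)
print(np.linalg.norm(A @ X.mean(0) - A @ x_true))  # data misfit shrinks
```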
We sincerely thank the reviewers for their constructive feedback, insightful questions, and recognition of our contributions. We are pleased that the reviewers highlight the proposed PC framework as “providing deeper understanding” (x3Ty) of existing guidance-based algorithms and recognize EnKG as an "interesting" (8WU4) and "novel" (G5qZ) approach to solve inverse problems with only black-box access. The reviewers appreciate the "strong performance" (x3Ty) of EnKG across various problems as well as its "comprehensive validation" (G5qZ). We address the reviewers’ questions, concerns, and suggestions in our individual responses.
Main text
- Lines 141-143: expanded the discussion of related work. (G5qZ)
- Lines 198-200: revised the discussion to focus on the practical setting of non-convex optimization. (Fiay)
- Lines 264-269: revised the discussion about the role of the covariance matrix. (Fiay)
- Lines 287-299: revised Proposition 1 for clarity and corrected typos. (Fiay)
- Table 2: added a column to report runtime for different algorithms. (Fiay)
- Lines 430-458: revised the explanation of evaluation metrics. (8WU4, Fiay, x3Ty)
- Figure 3: updated the caption for clarity. (Fiay)
Appendix
- Table 4: included the definition of weighted Euclidean inner product. (G5qZ)
- Lines 731-742: revised the statement of Lemma 1 for clarity. (Fiay)
- Lines 758-776: added an intermediate reasoning step in the proof for better readability. (Fiay)
- Lines 791-799: revised the last step for clarity. (G5qZ)
- Appendix A.8: added additional comparisons to gradient-based methods for reference. (8WU4)
- Appendix A.9: additional experiment on the dependence on the quality of pre-trained diffusion models. (x3Ty)
We sincerely thank the reviewers for their constructive feedback and engagement during the discussion. As the discussion period has now closed, we would like to summarize the key points of discussion where reviewers have not responded or confirmed that their concerns were addressed:
- Derivative-free vs gradient-based (8WU4) The reviewer raised concerns regarding the trade-offs between derivative-free and gradient-based methods. We argue that this trade-off is inherently challenging to quantify, especially for PDEs, as developing reliable gradient estimation methods remains an open research question [1,2]. We also emphasized the intellectual merit of studying derivative-free methods for inverse problems, which includes:
- Applicability to non-differentiable problems (e.g., symbolic music generation [3], stochastic weather models [4]).
- Handling scenarios where the forward model’s internal structure is unknown, such as black-box APIs.
- High parallelizability that allows them to benefit from the recent advancements in large-scale distributed computing infrastructure, as discussed in [5].
- Allowing for more flexible forward modeling without requiring structural constraints for gradient computation.
- **Question about Assumption 3 (G5qZ)** We have provided further clarification on Assumption 3 and its practical implication. Empirical evidence is also provided in Figure 5.
- **Misuse of terminology "pre-conditioner" (Fiay)** We have removed the term "pre-conditioner" and revised the discussion to focus on the practical role of the covariance matrix, which is to project the gradient onto the subspace spanned by the particles.
- Typos in the superscript (Fiay) We appreciate the reviewer’s careful attention to detail and have corrected this typo in the updated manuscript.
- Question about Lemma 1 proof step (Fiay) We have provided a detailed explanation of the proof step both in rebuttal and the revised manuscript (lines 758-776) to address the reviewer’s concerns.
We hope this summary demonstrates our efforts to address all potential concerns. We kindly ask the reviewer to reassess the score in light of the key contributions and value of our work, as highlighted in our meta response. We feel that our study of derivative-free methods for diffusion guidance is a sufficiently novel intellectual contribution in its own right and worthy of publication at ICLR.
[1]: Gaggioli, E. L., and Oscar P. Bruno. "Parallel inverse-problem solver for time-domain optical tomography with perfect parallel scaling." Journal of Quantitative Spectroscopy and Radiative Transfer 290 (2022): 108300.
[2]: Fang, Lean, and Ping He. "A duality-preserving adjoint method for segregated Navier–Stokes solvers." Journal of Computational Physics 503 (2024): 112860.
[3]: Huang, Yujia, Adishree Ghatare, Yuanzhe Liu, Ziniu Hu, Qinsheng Zhang, Chandramouli Shama Sastry, Siddharth Gururani, Sageev Oore, and Yisong Yue. "Symbolic Music Generation with Non-Differentiable Rule Guided Diffusion." In Forty-first International Conference on Machine Learning.
[4]: Miyoshi, Takemasa, and Masaru Kunii. "The local ensemble transform Kalman filter with the Weather Research and Forecasting model: Experiments with real observations." Pure and applied geophysics 169.3 (2012)
[5]: Salimans, Tim, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. "Evolution strategies as a scalable alternative to reinforcement learning." arXiv preprint arXiv:1703.03864 (2017).
The paper proposes a method for solving inverse problems with diffusion models that only requires forward model evaluations as opposed to being able to differentiate through the forward model. The paper approximates the gradient corresponding to a data consistency term.
The reviewers and the AC agree that the problem considered is interesting and important and the approach promising.
However, the majority of the reviewers recommend rejecting the paper. The key issues that remain after the rebuttal and discussion are:
- Reviewer Fiay identified methodological and technical issues; as Reviewer Fiay pointed out, some of the technical issues remain in the revision, although several have been fixed.
- Reviewers 8WU4 and G5qZ both find that the positioning relative to gradient-based methods, and the associated trade-offs, is insufficient, and it remains questionable how gradient-free methods can outperform gradient-based ones. Moreover, the reviewers find that the paper oversells its contributions.
The paper has potential, and the problem and approach are interesting, but given the two remaining key issues, I recommend rejecting the paper.
Additional comments from the reviewer discussion
please see meta review
Reject