PaperHub

Score: 6.4 / 10
Poster · 4 reviewers
Ratings: 4, 4, 3, 5 (min 3, max 5, std 0.7)
Confidence: 3.5
Novelty: 2.8 · Quality: 2.5 · Clarity: 2.5 · Significance: 2.3

NeurIPS 2025

Improving Diffusion-based Inverse Algorithms under Few-Step Constraint via Linear Extrapolation

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords
Inverse problem, diffusion model, few-step, high order solver

Reviews and Discussion

Review
Rating: 4

This paper proposes learnable linear extrapolation (LLE), designed to improve the performance of any diffusion-based inverse algorithm that follows the steps: (i) predict the clean signal, (ii) enforce measurement consistency, and (iii) map back to the noisy manifold to continue the reverse sampling process. The main insight of this paper, as far as I can tell, comes from Corollary 2, where they show that the estimate of the clean sample conditioned on the measurements, denoted $\mathbf{x}_{0}(t, \mathbf{y})$, is a linear combination of the estimate at the current timestep along with all previous timesteps. Motivated by this, they propose solving an optimization problem to learn such linear coefficients in order to obtain an accurate estimate of the clean signal, provided that a reference dataset exists.
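For reference, a minimal sketch of what this extrapolation step amounts to at inference time (the function name, tensor shapes, and coefficient layout here are illustrative assumptions, not taken from the paper):

```python
import torch

def lle_refine(x0_history, coeffs):
    """Refine the clean-signal estimate as a learned linear combination of the
    current and all previous estimates collected along the sampling trajectory.

    x0_history: list of i tensors of shape (B, C, H, W), the x0 predictions so far.
    coeffs:     1-D tensor of length i, one learned row of the coefficient matrix.
    """
    stacked = torch.stack(x0_history, dim=0)   # (i, B, C, H, W)
    weights = coeffs.view(-1, 1, 1, 1, 1)      # broadcast over batch and image dims
    return (weights * stacked).sum(dim=0)      # refined x0 estimate for this step
```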

Strengths and Weaknesses

I would say that the main strength of this paper is that they can improve upon existing diffusion-based inverse problem solvers in a lightweight fashion. Table 7 shows that LLE does not induce a large overhead, but can improve upon the existing methods to some extent.

For weaknesses, I am not sure how effective this method would actually be, considering that the improvements are not great. I would suggest the authors include more qualitative results so that the performance differences can be examined more closely. On top of that, this method assumes that there exists a reference dataset to perform the optimization, which I am not sure holds in practice. How were these reference points chosen for the experiments? I perhaps missed this in the main text.

I also think that the presentation of this paper could be improved; I found the presentation of the methodology a bit convoluted. Perhaps there is a way to condense the main text to bring the algorithm box up to the main part.

Questions

  • Why are the results for DMPS in Table 1 so poor? I suggest the authors look into this to provide a fairer comparison.
  • How precisely do Equations (19–20) need to be optimized? Did the authors find that the loss associated with this problem is easy to optimize? In cases where the optimization falls into a local minimum, did the authors find that LLE still outperforms the original method? I believe these questions are fundamental to understanding the effectiveness of LLE, if they have not already been addressed.

Limitations

I have stated the limitations under weaknesses above. I would be willing to raise my score if my questions can be effectively addressed.

Final Justification

The authors have addressed my points and hence I have raised my score. I do still have some uncertainty about the actual practicality of the method, so I am on the fence.

Formatting Concerns

There are no formatting concerns as far as I can tell.

Author Response

We thank the reviewer for recognizing the lightweight advantage and general applicability of our method. Below, we provide point-by-point responses to the concerns and questions raised. All additional experimental results are reported in terms of PSNR ↑ / SSIM ↑ / LPIPS ↓.

1. "I am not sure how effective this method would actually be, considering that the improvements are not great. I would suggest the authors include more qualitative results."

We sincerely thank the reviewer for the valuable comments and helpful suggestion. We would like to clarify that in many cases, LLE improves the PSNR of the original algorithm by close to or more than 1 dB, which is generally regarded as a considerable improvement in the context of image restoration. Moreover, we consistently observe improvements across all three metrics—PSNR, SSIM, and LPIPS—indicating that LLE not only reduces pixel-wise reconstruction error but also enhances perceptual quality.

Regarding the qualitative results, we appreciate the suggestion and will include more visualizations in the camera ready version. In the meantime, we refer the reviewer to Appendix I (pages 35–39) for additional qualitative examples beyond those presented in the main paper. We apologize for the inconvenience that, per this year's policy, visual results cannot be included during the rebuttal phase. Thank you again for this thoughtful feedback.

2. "This method assumes that there exists a reference dataset to perform the optimization, which I am not sure holds in practice. How were these reference points chosen for experiments?"

In all of our experiments, we use 999-step DDIM to sample 50 synthetic images from the diffusion model as the reference dataset. This design ensures that LLE remains practical in real-world scenarios: since diffusion-based inverse algorithms assume access to a pre-trained diffusion model, it is straightforward to generate the reference dataset using the same model without requiring any real images.

While this setup is mentioned at several points in the paper (e.g., line 52 on page 2, line 223 on page 7, and line 722 on page 25), we acknowledge that the current presentation may not be sufficiently clear or direct, and could lead to misunderstanding. We will revise the Experimental Setup section to explicitly describe the construction of the reference dataset used by LLE. We thank the reviewer again for this helpful question.
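For concreteness, here is a hedged sketch of how such a synthetic reference set could be generated with an off-the-shelf DDIM pipeline; the diffusers pipeline, checkpoint, and batch sizes below are illustrative assumptions rather than the authors' exact setup:

```python
# Hypothetical sketch: build a 50-image synthetic reference set by 999-step DDIM
# sampling from a pre-trained diffusion model. Checkpoint and batching are illustrative.
import torch
from diffusers import DDIMPipeline

pipe = DDIMPipeline.from_pretrained("google/ddpm-celebahq-256").to("cuda")

reference_images = []
with torch.no_grad():
    for _ in range(10):                                    # 10 batches of 5 images
        out = pipe(batch_size=5, num_inference_steps=999)  # 999-step DDIM sampling
        reference_images.extend(out.images)                # list of PIL images
```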

3. "I found the presentation of the methodology a bit convoluted. Perhaps there is a way to condense the main text to bring the algorithm box up to the main part."

We appreciate the reviewer's suggestion. We will move the algorithm box into Section 4 of the main text in the camera ready version to improve the overall flow and coherence. We will also refine the presentation of the methodology section to make the description clearer and more concise.

4. "Why are the results for DMPS in Table 1 so poor? I suggest the authors look into this to provide a fairer comparison."

We thank the reviewer for pointing out this issue. We have double-checked our implementation of DMPS and confirmed that it is consistent with the original paper. We also verified our implementation under 100 and 1000 sampling steps, and the results closely match those reported in the original work. In fact, we observe that our implementation achieves comparable performance with the original paper using as few as 15 steps (see Table 18 and Table 19 on page 30), which supports the correctness of our implementation.

Upon further investigation into the hyperparameter settings of DMPS, we find that the primary issue lies in the scaling parameter $\lambda$. While the original paper suggests setting $\lambda \geq 1$, we find this choice to be suboptimal in low-step regimes, especially when the number of steps is below 10. To address this, we conduct a step-by-step and task-by-task hyperparameter search for DMPS on the CelebA-HQ dataset with 3, 4, 5, and 7 sampling steps. The table below shows the optimal values of $\lambda$ we identified.

| Steps | Deblur | Inp | SR | CS |
|---|---|---|---|---|
| 3 | 0.2 | 0.2 | 0.3 | 0.2 |
| 4 | 0.3 | 0.3 | 0.4 | 0.3 |
| 5 | 0.4 | 0.4 | 0.5 | 0.4 |
| 7 | 0.5 | 0.6 | 0.6 | 0.5 |

Using these revised hyperparameters, we re-evaluate both the original DMPS and the LLE-enhanced version on noisy tasks on CelebA-HQ. The updated results are shown in the table below. We believe these results provide a fairer and more accurate comparison for DMPS. We will incorporate them into the camera ready version, along with revised results on noiseless tasks on CelebA-HQ and on FFHQ. We sincerely thank the reviewer for helping us improve the fairness and credibility of our experimental evaluation.

| Steps | Algo | Deblur (noisy) | Inp (noisy) | SR (noisy) | CS (noisy) |
|---|---|---|---|---|---|
| 3 | DMPS | 14.02 / 0.174 / 0.747 | 11.91 / 0.100 / 0.802 | 15.53 / 0.235 / 0.763 | 12.45 / 0.102 / 0.791 |
| 3 | DMPS-LLE | 14.10 / 0.177 / 0.744 | 11.97 / 0.104 / 0.800 | 18.76 / 0.253 / 0.776 | 12.52 / 0.105 / 0.789 |
| 4 | DMPS | 17.95 / 0.322 / 0.619 | 13.41 / 0.144 / 0.762 | 19.33 / 0.365 / 0.645 | 14.09 / 0.151 / 0.741 |
| 4 | DMPS-LLE | 18.23 / 0.325 / 0.617 | 13.55 / 0.149 / 0.759 | 23.07 / 0.511 / 0.544 | 14.19 / 0.155 / 0.738 |
| 5 | DMPS | 21.20 / 0.404 / 0.562 | 14.71 / 0.164 / 0.738 | 22.95 / 0.498 / 0.541 | 14.85 / 0.165 / 0.719 |
| 5 | DMPS-LLE | 21.63 / 0.406 / 0.558 | 16.45 / 0.186 / 0.690 | 24.66 / 0.625 / 0.437 | 15.57 / 0.180 / 0.699 |
| 7 | DMPS | 25.12 / 0.559 / 0.446 | 15.87 / 0.187 / 0.716 | 25.96 / 0.669 / 0.396 | 16.04 / 0.213 / 0.679 |
| 7 | DMPS-LLE | 25.84 / 0.576 / 0.436 | 20.12 / 0.337 / 0.562 | 25.82 / 0.707 / 0.355 | 17.36 / 0.303 / 0.594 |

5. "How precisely do Equations (19–20) need to be optimized? Did the authors find that the loss associated with this problem is easy to optimize? In cases where the optimization falls into a local minimum, did the authors find that LLE still outperforms the original method?"

We thank the reviewer for the insightful question. We find that the optimization of the objective function (19–20) requires relatively few steps, is easy to optimize, and is not prone to getting stuck in local minima.

To better understand the impact of optimization iterations on LLE's performance, we conduct an additional experiment evaluating LLE trained for {25, 50, 100, 150, 200} iterations. The experiment is performed using 5-step DDNM on the FFHQ noiseless tasks. The results are summarized in the table below. We observe that LLE achieves comparable and strong performance when trained for iterations between 50 and 200. This suggests that LLE converges quickly with relatively few iterations and does not show obvious signs of overfitting.

Theoretically, this is reasonable since LLE learns interpolation coefficients by minimizing a weighted sum of MSE loss and LPIPS loss. When the LPIPS weight $\omega$ is zero, the problem reduces to a least squares problem, which is convex and free of local minima. Although the LPIPS loss is non-convex, our choice of a relatively small weight ($\omega = 0.1$) ensures that the overall loss landscape does not contain many undesirable local minima, making optimization easier in practice. We sincerely thank the reviewer for this insightful question and will include these additional experiments and discussions in the final version of the paper.

| Train steps | Deblur (noiseless) | Inp (noiseless) | SR (noiseless) | CS (noiseless) |
|---|---|---|---|---|
| - | 39.82 / 0.964 / 0.087 | 26.34 / 0.719 / 0.380 | 29.99 / 0.865 / 0.211 | 18.29 / 0.591 / 0.487 |
| 25 | 39.93 / 0.965 / 0.086 | 28.21 / 0.821 / 0.316 | 29.93 / 0.863 / 0.207 | 18.87 / 0.597 / 0.476 |
| 50 | 40.12 / 0.967 / 0.085 | 28.61 / 0.834 / 0.303 | 29.95 / 0.863 / 0.208 | 19.13 / 0.602 / 0.468 |
| 100 | 40.08 / 0.966 / 0.084 | 28.71 / 0.838 / 0.293 | 29.94 / 0.863 / 0.206 | 19.21 / 0.605 / 0.463 |
| 150 | 40.10 / 0.966 / 0.084 | 28.77 / 0.840 / 0.290 | 29.91 / 0.862 / 0.204 | 19.23 / 0.603 / 0.461 |
| 200 | 40.14 / 0.967 / 0.084 | 28.82 / 0.842 / 0.288 | 29.93 / 0.863 / 0.204 | 19.25 / 0.604 / 0.462 |
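To make the optimization in Equations (19–20) more concrete, here is a hedged, simplified sketch of learning the extrapolation coefficients by minimizing an MSE plus $\omega$-weighted LPIPS objective over a small reference set. It assumes the per-step clean-signal estimates have already been collected by running the base algorithm, ignores the feedback of the refined estimate into later steps, and uses illustrative names and an identity initialization; `lpips` refers to the standard perceptual-similarity package.

```python
# Simplified sketch: learn lower-triangular extrapolation coefficients by minimizing
# MSE + omega * LPIPS against reference images. Assumes per-step x0 estimates are
# precomputed; the coupling of refined estimates into later steps is ignored here.
import torch
import lpips  # pip install lpips

def train_lle_coeffs(x0_estimates, x0_reference, omega=0.1, iters=100, lr=1e-2):
    """x0_estimates: (N, S, C, H, W) clean-signal predictions at each of S steps
                     for N reference images.
       x0_reference: (N, C, H, W) reference images scaled to [-1, 1]."""
    S = x0_estimates.shape[1]
    coeffs = torch.eye(S, requires_grad=True)   # start from the original trajectory
    perceptual = lpips.LPIPS(net="vgg")
    optimizer = torch.optim.Adam([coeffs], lr=lr)

    for _ in range(iters):
        optimizer.zero_grad()
        loss = 0.0
        for i in range(S):
            w = coeffs[i, : i + 1].view(1, i + 1, 1, 1, 1)
            refined = (w * x0_estimates[:, : i + 1]).sum(dim=1)  # (N, C, H, W)
            mse = torch.mean((refined - x0_reference) ** 2)
            lp = perceptual(refined, x0_reference).mean()
            loss = loss + mse + omega * lp
        loss.backward()
        optimizer.step()
    return coeffs.detach()
```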

We would like to express our gratitude once again for the reviewer's comments and suggestions for our work. Should there be any further questions, we welcome continued discussion.

Comment

Thank you for the thoughtful rebuttal. I will raise my score to a 4 in light of the response.

Comment

We would like to express our heartfelt gratitude for your recognition and raising the score! Thank you once again for your kind consideration!

Comment

Dear Reviewer 7BP1,

Thank you once again for your helpful and constructive review.

As the Discussion phase is approaching its end in less than three days, we would like to kindly remind you to take a look at our rebuttal. We have added substantial experiments and clarifications in response to your valuable comments and concerns.

If there are any points we may have not fully addressed, we would be grateful if you could let us know.

Looking forward to your feedback.

Best regards,

The Authors.

Review
Rating: 4

This paper tackles the high computational cost of diffusion-based inverse problem solvers, particularly in few-step scenarios. First, it proposes a "canonical form" that deconstructs diffusion-based inverse algorithms into three distinct modules: a Sampler, a Corrector, and a Noiser. This framework is designed to unify a wide range of existing methods. Second, based on this canonical form and an analysis of ODE solvers, the paper introduces Learnable Linear Extrapolation (LLE), a lightweight method that improves few-step performance by learning to linearly combine the current estimate with previous ones. The authors demonstrate that LLE consistently improves results across nine reformulated diffusion algorithms on several inverse problem tasks, such as deblurring and inpainting on facial datasets.
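Read schematically, the described decomposition amounts to a loop of the following shape (a rough sketch with placeholder function names and signatures, not the paper's actual interface):

```python
# Schematic sketch of the canonical form: each reverse step runs a Sampler
# (predict the clean signal), a Corrector (enforce measurement consistency),
# and a Noiser (map back to the noisy manifold). Signatures are placeholders.
def solve_inverse(x_T, y, timesteps, sampler, corrector, noiser):
    x_t = x_T
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        x0_hat = sampler(x_t, t)            # e.g., Tweedie-style clean-signal estimate
        x0_hat = corrector(x0_hat, y)       # e.g., data-consistency projection or gradient step
        x_t = noiser(x0_hat, x_t, t_prev)   # re-noise to continue the reverse process
    return x0_hat
```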

Strengths and Weaknesses

Strengths

  1. The LLE method is a lightweight and intuitive approach, well-motivated by an analysis of high-order ODE solvers.
  2. The proposed canonical form is an insightful contribution that provides a structured, unified view of diverse diffusion-based inverse algorithms.

Weaknesses

  1. The claim of enhancing performance lacks validation against the true state-of-the-art. The baseline set, while broad, omits recent, competitive methods like LGD (Song et al., 2023), DCDP (Li et al., 2024), MPGD (He et al., 2024), FPS-SMC (Dou and Song, 2024), and SGS-EDM (Wu et al., 2024).
  2. The over-reliance on facial datasets (CelebA-HQ, FFHQ) means the evaluation is not representative of general-purpose restoration; validation on a diverse benchmark like ImageNet is needed to prove general applicability.
  3. The paper's complete focus on pixel-space models is a critical weakness. As LDMs are the de facto standard for high-resolution restoration, a method that does not demonstrate a clear path for applicability to them has severely limited practical significance.
  4. The paper fails to discuss or compare its approach with consistency models (e.g., Consistency Posterior Sampling). This is a major concurrent research direction for accelerating inverse problem solvers, and its omission makes the literature review and experimental comparison incomplete.

Questions

  1. How is LLE expected to perform against the baselines mentioned in the weakness section?

  2. What are the specific technical challenges in applying the canonical form and LLE to Latent Diffusion Models? Does linear extrapolation hold in the latent space, and how does the VAE's compression affect the Corrector module?

  3. How can the generalization of results from facial datasets to more complex domains like ImageNet be justified? The distinct priors of faces versus general scenes could affect the validity of the learned extrapolations.

  4. Please elaborate on the conceptual differences and key trade-offs (e.g., performance vs. training cost) between the LLE method and consistency-based approaches like Consistency Posterior Sampling.

Limitations

A more thorough discussion of the challenges involved with LDMs and a justification for not addressing them is warranted.

Final Justification

I thank the authors for their thorough rebuttal. In light of their response, I have raised my score to Borderline Accept.

Formatting Concerns

None

Author Response

We sincerely thank the reviewer for acknowledging the motivation and general applicability of our work. Below, we provide detailed responses to the concerns and questions raised. For clarity, we organize our response as follows:

(1) We first present additional experiments on ImageNet to further validate the generality of LLE in more challenging settings.

(2) We then report the performance of additional baselines on the FFHQ dataset as suggested by the reviewer, including five pixel-space methods, three latent-space methods, and one method based on a consistency model.

All results are reported using PSNR ↑ / SSIM ↑ / LPIPS ↓ for consistency.

1. "Validation on a diverse benchmark like ImageNet is needed to prove general applicability."

We supplement the results of $\Pi$GDM, RED-diff, DPS, and ReSample on ImageNet with 4 steps and 10 steps in the following table. Following [12, 14] in the main paper, we sample 100 images from the ImageNet test set for evaluation. For the noisy tasks, we add Gaussian noise with $\sigma = 0.1$ with respect to the data range [-1, 1] (the same applies in the following experiments). We use the ADM pre-trained checkpoint ([4] in the main paper).

The results demonstrate that LLE consistently improves performance across all methods in the few-step setting on ImageNet. This confirms that LLE is effective and broadly applicable on more complex datasets.

In addition, we conduct a cross-domain experiment between FFHQ and ImageNet. Due to space limitations, we kindly refer the reviewer to our response to Reviewer enUs (Point 4) for detailed results. These results further support LLE's out-of-domain robustness.

| Steps | Algo | Deblur (noiseless) | Inp (noisy) | SR (noiseless) | CS (noisy) |
|---|---|---|---|---|---|
| 4 | $\Pi$GDM | 23.56/0.727/0.294 | 18.75/0.408/0.641 | 17.41/0.373/0.557 | 14.60/0.217/0.746 |
| 4 | $\Pi$GDM-LLE | 30.14/0.875/0.196 | 19.41/0.435/0.619 | 20.53/0.524/0.531 | 16.73/0.359/0.650 |
| 4 | RED-diff | 22.57/0.589/0.490 | 18.00/0.357/0.592 | 24.81/0.719/0.419 | 17.10/0.409/0.574 |
| 4 | RED-diff-LLE | 22.7/0.602/0.486 | 21.69/0.524/0.492 | 25.19/0.726/0.359 | 17.77/0.468/0.531 |
| 4 | DPS | 15.90/0.268/0.681 | 22.92/0.672/0.421 | 17.10/0.358/0.567 | 15.30/0.300/0.633 |
| 4 | DPS-LLE | 18.42/0.388/0.622 | 24.41/0.690/0.410 | 20.33/0.517/0.533 | 18.03/0.474/0.556 |
| 4 | ReSample | 25.89/0.746/0.355 | 20.49/0.440/0.521 | 23.18/0.651/0.417 | 17.50/0.439/0.540 |
| 4 | ReSample-LLE | 26.18/0.757/0.340 | 23.28/0.561/0.443 | 24.98/0.712/0.369 | 17.80/0.454/0.529 |
| 10 | $\Pi$GDM | 26.22/0.789/0.289 | 21.99/0.609/0.487 | 21.49/0.532/0.503 | 16.21/0.360/0.681 |
| 10 | $\Pi$GDM-LLE | 29.92/0.856/0.216 | 23.69/0.611/0.471 | 22.76/0.596/0.450 | 18.61/0.509/0.540 |
| 10 | RED-diff | 23.41/0.620/0.458 | 22.95/0.539/0.449 | 25.75/0.749/0.349 | 18.49/0.466/0.517 |
| 10 | RED-diff-LLE | 23.42/0.623/0.462 | 25.07/0.669/0.386 | 25.92/0.754/0.309 | 19.22/0.513/0.487 |
| 10 | DPS | 19.55/0.426/0.608 | 24.65/0.711/0.364 | 21.28/0.520/0.512 | 16.61/0.413/0.612 |
| 10 | DPS-LLE | 20.33/0.463/0.555 | 27.17/0.780/0.317 | 22.42/0.585/0.460 | 19.52/0.590/0.463 |
| 10 | ReSample | 26.49/0.765/0.331 | 25.26/0.625/0.367 | 25.27/0.721/0.353 | 19.60/0.519/0.449 |
| 10 | ReSample-LLE | 26.72/0.772/0.317 | 25.78/0.651/0.349 | 25.61/0.736/0.329 | 20.02/0.541/0.434 |

2. Including more baseline methods like LGD, DCDP, MPGD, FPS-SMC, and SGS-EDM.

We supplement experiments with LGD, DCDP, MPGD, FPS-SMC, and SGS-EDM on noisy tasks on FFHQ with 4 steps. The results are reported in the Pixel space part of the table below. The results show that LLE consistently improves the performance of all five baselines in the few-step regime. This further validates the generality and broad applicability of LLE.

3. "What are the specific technical challenges in applying the canonical form and LLE to Latent Diffusion Models?"

There are no difficulties in extending the canonical form and LLE from pixel space to latent space. The core theoretical analysis of linear extrapolation still holds. The observation function with LDMs is $\mathbf{y} = \mathcal{A}(\mathcal{D}(\mathbf{z}_0)) + \sigma \mathbf{n}$, where $\mathcal{D}(\cdot)$ denotes the decoder of the VAE, $\mathcal{A}(\cdot)$ is the observation operator, and $\mathbf{z}_0$ is the latent. This formulation corresponds to a nonlinear inverse problem with respect to $\mathbf{z}_0$, since the composed operator $\mathcal{A} \circ \mathcal{D}$ is highly nonlinear. Our theoretical motivation remains valid, as it does not rely on any specific assumptions about the form of the observation function.

To solve inverse problems with LDMs, the Corrector (or the inverse algorithm) must be capable of handling nonlinear observation functions. Moreover, as long as the inverse algorithm is applicable with LDM, LLE can still serve as an effective enhancement strategy.

To validate LLE with LDMs, we include results of Latent-DPS (introduced in [12] of the main paper), PSLD [R.1], and STSL [R.2] on the FFHQ dataset. The results are reported in the Latent space part of the table below. LLE consistently improves performance for these LDM baselines. The only adjustment required for training LLE in the latent space is a minor change to the loss in Equation (20), which we modify to $\mathcal{L}(\mathcal{D}(\tilde{\mathbf{z}}_{0, t_i}^{(n)}), \mathcal{D}(\mathbf{z}_0^{(n)}))$, since LPIPS is computed in the pixel space. These results further confirm the generality and effectiveness of LLE when applied to LDMs.
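For illustration, a small sketch of this adjusted objective (function and argument names are assumptions; `perceptual` stands for an LPIPS module operating on images in $[-1, 1]$):

```python
# Sketch of the latent-space objective: decode to pixel space before computing
# MSE + omega * LPIPS, since LPIPS is defined on images. Names are illustrative.
def latent_lle_loss(z_refined, z_reference, decoder, perceptual, omega=0.1):
    x_refined = decoder(z_refined)       # D(z~_0), decoded refined estimate
    x_reference = decoder(z_reference)   # D(z_0), decoded reference latent
    mse = ((x_refined - x_reference) ** 2).mean()
    return mse + omega * perceptual(x_refined, x_reference).mean()
```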

4. The conceptual differences and key trade-offs between the LLE method and consistency-based approaches like Consistency Posterior Sampling.

We would like to clarify that LLE and Consistency Posterior Sampling (referred to as CPS below) [R.3] aim to improve performance from different but complementary perspectives. Specifically, LLE enhances efficiency by learning a better solver over the original trajectory, while CPS leverages the strong single-/few-step performance of consistency models to improve reconstruction quality.

Moreover, we note that CPS also fits into our proposed canonical form, and LLE can be applied to further enhance CPS's performance in the few-step regime. More specifically, CPS uses a consistency model as the Sampler, rather than a traditional diffusion model, but it still generates a sequence of $\mathbf{x}_0$ estimates, forming a valid trajectory. This trajectory can be directly used with LLE's linear extrapolation framework.

We supplement experiments with CPS on the ImageNet-64 dataset. The results are shown in the CM part of the table below. We use the openai/diffusers-cd_imagenet64_l2 checkpoint available on Hugging Face, and test on 100 randomly sampled images from the ImageNet test set.

LLE consistently improves CPS in the few-step setting, demonstrating that LLE and CPS are not exclusive, but rather complementary approaches. This highlights LLE's broad applicability and its ability to boost performance when applied to advanced sampling frameworks with consistency models.

| Steps | Type | Algo | Deblur (noisy) | Inp (noisy) | SR (noisy) | CS (noisy) |
|---|---|---|---|---|---|---|
| 4 | Pixel space | DCDP | 26.00/0.619/0.428 | 24.75/0.558/0.436 | 24.16/0.585/0.578 | 17.70/0.461/0.535 |
| 4 | Pixel space | DCDP-LLE | 26.12/0.681/0.378 | 25.54/0.596/0.411 | 25.24/0.707/0.350 | 18.16/0.471/0.524 |
| 4 | Pixel space | LGD | 22.81/0.645/0.433 | 21.77/0.548/0.513 | 24.95/0.680/0.439 | 15.84/0.451/0.513 |
| 4 | Pixel space | LGD-LLE | 24.59/0.695/0.376 | 28.25/0.768/0.339 | 25.55/0.705/0.402 | 16.52/0.503/0.499 |
| 4 | Pixel space | MPGD | 25.29/0.705/0.380 | 27.33/0.753/0.309 | 24.93/0.562/0.579 | 17.36/0.437/0.565 |
| 4 | Pixel space | MPGD-LLE | 25.43/0.704/0.366 | 27.76/0.768/0.298 | 25.42/0.703/0.360 | 18.02/0.446/0.543 |
| 4 | Pixel space | SGS-EDM | 27.72/0.776/0.308 | 19.71/0.416/0.538 | 26.53/0.738/0.337 | 17.72/0.482/0.530 |
| 4 | Pixel space | SGS-EDM-LLE | 27.67/0.782/0.307 | 24.44/0.511/0.453 | 26.95/0.762/0.309 | 18.40/0.516/0.486 |
| 4 | Pixel space | FPS-SMC | 25.11/0.517/0.475 | 17.52/0.188/0.686 | 24.62/0.524/0.474 | 16.58/0.224/0.666 |
| 4 | Pixel space | FPS-SMC-LLE | 26.26/0.617/0.428 | 18.63/0.229/0.643 | 26.14/0.672/0.400 | 17.66/0.384/0.569 |
| 4 | Latent space | Latent-DPS | 16.42/0.428/0.636 | 15.43/0.426/0.628 | 15.30/0.369/0.682 | 14.51/0.364/0.665 |
| 4 | Latent space | Latent-DPS-LLE | 17.08/0.441/0.609 | 17.83/0.492/0.652 | 15.49/0.399/0.648 | 14.70/0.366/0.646 |
| 4 | Latent space | PSLD | 17.55/0.468/0.630 | 17.71/0.474/0.627 | 17.72/0.471/0.630 | 15.06/0.438/0.647 |
| 4 | Latent space | PSLD-LLE | 18.10/0.477/0.607 | 18.55/0.490/0.599 | 18.57/0.489/0.597 | 15.24/0.433/0.644 |
| 4 | Latent space | STSL | 16.08/0.378/0.655 | 16.87/0.404/0.639 | 15.70/0.368/0.663 | 15.31/0.383/0.629 |
| 4 | Latent space | STSL-LLE | 16.97/0.392/0.616 | 17.50/0.407/0.598 | 16.27/0.360/0.628 | 15.72/0.384/0.621 |
| 4 | CM | CPS | 12.69/0.204/0.613 | 12.71/0.269/0.649 | 14.65/0.317/0.559 | 14.80/0.242/0.619 |
| 4 | CM | CPS-LLE | 16.50/0.283/0.522 | 17.64/0.383/0.566 | 18.57/0.410/0.471 | 16.58/0.268/0.580 |

5. Ethical Concern

It was unexpected for us to receive an Ethical Concern flag, as all datasets and algorithms used in our work are publicly available and widely adopted. We'd appreciate it if the reviewer could elaborate on the specific ethical concerns in detail.

We would like to express our gratitude once again for the reviewer's comments and suggestions for our work. All additional experiments and comparisons, including those on ImageNet and nine baseline methods on FFHQ, will be included in the camera ready version to further support the generality and validity of LLE. Should there be any further questions, we welcome continued discussion.

[R.1] Rout, Litu, et al. "Solving linear inverse problems provably via posterior sampling with latent diffusion models." NeurIPS, 2023.

[R.2] Rout, Litu, et al. "Beyond first-order tweedie: Solving inverse problems using latent diffusion." CVPR, 2024.

[R.3] Xu, Tongda, et al. "Consistency model is an effective posterior sample approximation for diffusion inverse solvers." ICLR, 2025.

Comment

We appreciate your highlighting the imbalances in ethnicity, age, and gender within the datasets we used, as well as the potential social impacts of our method. We sincerely apologize for not addressing these ethical concerns in our previous rebuttal. After carefully reviewing your comments and those from the three Ethical Reviewers, we will acknowledge the potential issues associated with the datasets in both the Limitations section and the checklist. The modifications are as follows:

In the Limitations section, we will add the following statement:

  • We note that the CelebA-HQ and FFHQ datasets used in this paper have demographic imbalances, which may cause our results to fail to generalize to underrepresented groups. A more diverse and balanced dataset is needed in the field to ensure fairness across factors such as ethnicity, gender, and age. A more representative diffusion model trained on such data is also preferred to better support real-world applications.

In the checklist, we will revise our response to Question (10) ("Broader impacts") to "Yes" and add the following statement:

  • Facial datasets such as CelebA-HQ and FFHQ are used in our experiments. While our method is general-purpose, we acknowledge that related technologies may potentially be misused, for example in DeepFake scenarios. Improved techniques such as digital watermarking and DeepFake detection may need to be further developed to mitigate the potential societal risks posed by such misuse.

Thank you again for highlighting the ethical issues with the datasets we used. We believe that these new additions can address the ethical concerns in our paper.

Comment

I thank the authors for their thorough rebuttal. Most of my concerns are addressed.

Comment

Thank you again for reviewing our work and for your positive feedback on our rebuttal. We will incorporate all the additional experiments and discussions into the camera-ready version of the paper. If you have any remaining concerns, please don’t hesitate to let us know and we would be happy to address them.

Comment

Dear Reviewer bt2H,

Thank you once again for your helpful and constructive review.

As the Discussion phase is approaching its end in less than three days, we would like to kindly remind you to take a look at our rebuttal. We have added substantial experiments and clarifications in response to your valuable comments and concerns.

If there are any points we may have not fully addressed, we would be grateful if you could let us know.

Looking forward to your feedback.

Best regards,

The Authors.

Review
Rating: 3

This paper introduces "Learnable Linear Extrapolation" (LLE), a method to improve the performance of diffusion-based inverse algorithms with few sampling steps by leveraging previous estimates. The authors propose a canonical form that unifies several diffusion-based inverse algorithms by decomposing them into three modules: Sampler, Corrector, and Noiser. The method operates by learning optimal linear combination coefficients to refine the current prediction using previous estimates. The coefficients are pre-learned by minimizing a reconstruction loss plus $w$ times a perceptual loss with respect to a reference dataset. The authors validate the method through extensive experiments across multiple algorithms and tasks.

Strengths and Weaknesses

Strengths:

  • A rich framework that encompasses various diffusion posterior samplers
  • A novel method that improves performance with reduced required number of sampling steps

Weaknesses:

Theoretical Justification:

  • Ambiguity in the justification for $\hat{\mathbf{x}}_0$ being a linear combination of previous estimates, particularly in Equation (10); it is unclear why it holds, and a reference would be great. In fact, empirical observations of the extrapolation coefficients in Figures 6–8 suggest minimal dependence of the current iterate on previous estimates: the matrix of weights is dominated by the diagonal or first off-diagonal elements, while the rest are almost zero.
  • The choice and size of the ground-truth datasets are not adequately addressed, nor is the robustness of the method to out-of-distribution samples. While the authors assess robustness across FFHQ and CelebA-HQ, this is of little relevance as both datasets represent human faces. A more relevant discussion would be on the ImageNet dataset, for instance.

Practical Considerations:

  • Computational Overhead: Essential computational aspects, including the training time required to obtain the extrapolation coefficients and the memory requirements of the method during inference (as it requires storing previous estimates), are not discussed.

Inconsistencies:

  • Error in equation (2): There is a missing logarithm in Equation (2)
  • Figure 4 Clarity: Figure 4 is unclear and lacks essential information. Specifically, the values of $w$ are missing, and while the axes are labeled (LPIPS and PSNR), the context or meaning of $w$ in relation to these metrics is not provided.

Questions

None

Limitations

  • Memory Footprint: The method necessitates storing previous estimates, which could lead to considerable memory overhead
  • Out-of-Distribution Performance: the method might suffer on out-of-distribution samples, as the extrapolation coefficients are computed beforehand on a selected dataset.
  • Algorithm Dependence: the method is dependent on the specific diffusion algorithm employed; changing the diffusion sampler requires re-training to obtain the extrapolation coefficients.

Final Justification

To sum up,

  • the proposed idea is original
  • my concerns were addressed during the rebuttal
  • the authors committed to fixing the highlighted inconsistencies and typos in the manuscript

On the other hand, the work has a few limitations:

  • the method is bound to a specific algorithm (requires re-training for each algorithm)
  • minor improvements in the out-of-distribution setup (e.g., using the method trained on ImageNet for FFHQ), as shown in the authors' response
  • weak dependencies between iterates (in the matrix of weights), which may compromise the effectiveness of the extrapolation

In conclusion, I retain my rating.

Formatting Concerns

None

Author Response

We sincerely thank the reviewer for recognizing the novelty and generality of our proposed method. Below, we provide point-by-point responses to the raised concerns and questions. All additional experimental results are reported using the metrics PSNR ↑ / SSIM ↑ / LPIPS ↓.

1. "Ambiguity in the justification for $\hat{\mathbf{x}}_0$ being a linear combination of previous estimates; particularly in Equation (10), it is unclear why it holds, and a reference would be great."

Equation (10) employs a forward difference approximation of high-order derivatives. The related derivation can be found in [R.1] and in the book [R.2] (Chapter 6.5). We will add these citations in the revised version to make the presentation clearer. We sincerely thank the reviewer for the helpful suggestion.
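For readers less familiar with this construction, here is a small generic illustration (not the paper's exact Equation (10)) of how such coefficients arise from Taylor matching: to approximate the $k$-th derivative at $x_0$ from function values at nearby points, one solves a linear system enforcing that the combination reproduces each Taylor term.

```python
# Illustrative only: derive coefficients c_i so that sum_i c_i f(x_i) approximates
# the k-th derivative of f at x0, by Taylor matching (the generic construction found
# in Fornberg 1988 and Süli & Mayers, not the paper's exact Equation (10)).
import numpy as np
from math import factorial

def fd_coefficients(x0, xs, k):
    xs = np.asarray(xs, dtype=float)
    n = len(xs)
    # Row j enforces: sum_i c_i * (x_i - x0)^j / j! = (1 if j == k else 0), j = 0..n-1.
    A = np.array([[(x - x0) ** j / factorial(j) for x in xs] for j in range(n)])
    b = np.zeros(n)
    b[k] = 1.0
    return np.linalg.solve(A, b)

# Example: the first derivative at x0 = 0 from the trailing points {0, -1, -2}
# gives the one-sided second-order stencil [1.5, -2.0, 0.5].
print(fd_coefficients(0.0, [0.0, -1.0, -2.0], k=1))
```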

2. "The matrix of weights in Figures 6–8 is dominated by the diagonal or first off-diagonal elements, while the rest are almost zero."

Yes, and this is consistent with our expectations. According to Equation (10), elements farther from the diagonal correspond to higher-order derivatives of the sampling trajectory. Since most sampling trajectories are expected to be smooth, the high-order derivatives decay rapidly to zero, which results in the weight matrix being dominated by the diagonal and first off-diagonal elements. In a few cases, we observed non-zero elements farther from the diagonal, indicating possible non-smooth regions in the trajectory with larger high-order derivatives. We will include this discussion in Appendix C.1. We thank the reviewer for this insightful comment.

3. "The choice and size of the ground-truth datasets are not adequately addressed."

Regarding the training set for LLE, we used 999-step DDIM to sample 50 synthetic images from the diffusion model as the reference set, as stated in the paper (line 52 on page 2, line 223 on page 7, and line 722 on page 25). Using only synthetic images makes LLE more practical in real-world applications, since diffusion-based inverse algorithms assume access to a pre-trained diffusion model, thus making it straightforward to construct the reference set required by LLE.

Regarding the test set in our experiments, we follow common practice (e.g., references [5, 6, 12, 14] in the main paper) to sample test examples, as described in the experiment section. Specifically, we use 1000 test samples for CelebA-HQ and 100 samples for FFHQ (and ImageNet in the additional experiments).

We will further add a paragraph at the beginning of the Experiment section to clarify these settings. We sincerely thank the reviewer for helping us improve the clarity of our paper.

4. A more relevant discussion of cross-domain performance on ImageNet

Here we supplement cross-domain experiments of LLE between the FFHQ and ImageNet datasets. We test DDNM with 5 steps, and the results are shown in the table below ("–" denotes results without using LLE). LLE demonstrates generalization ability across FFHQ and ImageNet. The coefficients trained on one dataset can still improve performance on the other in most cases, and sometimes even outperform those trained in-domain. This further indicates the out-of-domain robustness of LLE, which is critical for ensuring its general applicability.

Moreover, we have conducted broader evaluations of LLE with multiple inverse algorithms on the ImageNet test set; due to space limitations, we kindly refer the reviewer to our response to Reviewer bt2H (Point 1) for details.

We will include these new experimental results in the camera ready version. We sincerely thank the reviewer for the valuable suggestion.

| Test set | Train set | Deblur (noiseless) | Inp (noiseless) | SR (noiseless) | CS (noiseless) |
|---|---|---|---|---|---|
| FFHQ | - | 39.82 / 0.964 / 0.087 | 26.34 / 0.719 / 0.380 | 29.99 / 0.865 / 0.210 | 18.29 / 0.591 / 0.488 |
| FFHQ | FFHQ | 40.08 / 0.966 / 0.084 | 28.71 / 0.838 / 0.292 | 29.94 / 0.863 / 0.206 | 19.21 / 0.605 / 0.463 |
| FFHQ | ImageNet | 39.94 / 0.965 / 0.086 | 28.30 / 0.818 / 0.317 | 29.94 / 0.863 / 0.208 | 19.01 / 0.608 / 0.470 |
| ImageNet | - | 36.48 / 0.949 / 0.099 | 23.35 / 0.628 / 0.416 | 25.74 / 0.754 / 0.302 | 17.86 / 0.535 / 0.486 |
| ImageNet | ImageNet | 36.52 / 0.949 / 0.098 | 24.87 / 0.726 / 0.364 | 25.76 / 0.752 / 0.297 | 18.35 / 0.541 / 0.478 |
| ImageNet | FFHQ | 36.53 / 0.949 / 0.099 | 25.14 / 0.759 / 0.341 | 25.70 / 0.750 / 0.296 | 18.20 / 0.527 / 0.485 |

5. Training time required to obtain the extrapolation coefficients and the memory requirements during inference.

Regarding the training time, we reported the training time of LLE with different inverse algorithms in Table 6 of Appendix C.4 (page 19). Training LLE takes only 2 to 20 minutes, thanks to its lightweight design and the use of a small reference set.

Regarding the memory footprint, here we further report the GPU memory usage (in MB) of DDNM and DPS with and without LLE in the table below. We measure the CUDA peak allocated memory, and we account only for the memory used by model parameters, data, and additional storage required by the algorithm. The results show that LLE introduces only a few to several dozen MB of additional memory overhead, even under the 15-step setting. This is consistent with our expectations: during inference, LLE only needs to store a few 256$\times$256 RGB images from previous steps, typically around 0.7 MB per image, resulting in at most a few dozen MB of additional cost. This is negligible compared to model weights or other algorithm-specific overhead (e.g., back-propagation in DPS).
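As a quick sanity check of the per-image figure (assuming the stored estimates are kept in float32, which is our assumption here):

$$256 \times 256 \times 3 \times 4~\text{bytes} = 786{,}432~\text{bytes} \approx 0.75~\text{MB},$$

which is consistent with the roughly 0.7 MB per stored estimate quoted above.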

In summary, both the training and inference overhead of LLE are minimal, which further supports the lightweight advantage of LLE. We will include these additional results and discussions in the camera ready version. We sincerely thank the reviewer for the helpful suggestion.

| Algo | 3 steps | 4 steps | 5 steps | 7 steps | 10 steps | 15 steps |
|---|---|---|---|---|---|---|
| DDNM | 590.25 | 590.25 | 590.25 | 590.25 | 590.25 | 590.25 |
| DDNM-LLE | 597.50 | 599.37 | 601.50 | 608.25 | 612.75 | 626.25 |
| DPS | 2388.78 | 2388.78 | 2388.78 | 2388.78 | 2388.78 | 2388.78 |
| DPS-LLE | 2396.41 | 2398.41 | 2400.41 | 2407.29 | 2411.79 | 2429.04 |

6. Missing logarithm in Equation (2).

We thank the reviewer for pointing out the error, and we will correct it accordingly. The correct form is: $\epsilon_{\theta}(\mathbf{x}_t, t) = -\sqrt{1-\overline{\alpha}_t}\, \nabla_{\mathbf{x}_t}\log q_t(\mathbf{x}_t)$.

7. The values and meaning of $\omega$ in Figure 4.

In Figure 4, the ablated parameter $\omega$ refers to the weight of the LPIPS loss in Equation (20). We tested $\omega \in \{0.0, 0.025, 0.05, 0.075, 0.1, 0.2, 0.3, 0.4, 0.5\}$, which correspond, in this exact order, to the points from the top-right to the bottom-left in Figure 4. This result suggests that smaller values of $\omega$ encourage LLE to achieve higher PSNR, while larger values lead to better LPIPS. This provides a direct and practical way to control the trade-off between PSNR and LPIPS, depending on application-specific preferences.

We thank the reviewer for the helpful comments. We will annotate the corresponding $\omega$ values in Figure 4 and add this explanation and discussion to the figure caption in the camera ready version. Sorry for the inconvenience as we are unable to include any updated figures during the rebuttal phase.

8. Limitations-"The method is dependent on the specific diffusion algorithm employed."

We agree that LLE requires training for each specific inverse algorithm. This is because different algorithms vary in their design principles, formulations, and posterior score estimations, which lead to different sampling trajectories. Consequently, the high-order approximation coefficients of these trajectories differ, necessitating separate training for each algorithm. Designing a more general approach, including aligning sampling trajectories across algorithms or developing unified high-order approximation parameters, is a promising direction for future work. We thank the reviewer for this comment.

We would like to express our gratitude once again for the reviewer's comments and suggestions for our work. Should there be any further questions, we welcome continued discussion.

[R.1] Fornberg, Bengt. "Generation of finite difference formulas on arbitrarily spaced grids." Mathematics of computation 51.184 (1988): 699-706.

[R.2] Süli, Endre, and David F. Mayers. "An introduction to numerical analysis". Cambridge University Press, 2003.

Comment

I thank the authors and acknowledge reading their response. Most of my initial concerns have been addressed.

Regarding point 2, I have a remaining question: If the off-diagonal elements of the matrix are expected to be zero, wouldn't this defy the purpose of extrapolation? Most coefficients would be zero, rendering the extrapolation useless. Could you please clarify this point further?

Comment

Thanks for your response and sorry for the misunderstanding. As you observed, there exist a lot of "zero" values in the third- and higher-order components in Figure 7 and 8. However, this doesn't mean all higher-order terms are exactly zero. Actually, they just decay to zero as the order grows, and these seemingly "zero" entries are in fact small but non-zero due to display precision, which also influence the performance. Evaluating these components requires nearly no extra computation. Hence, we choose to keep them for better performance. We will include these discussions in the camera-ready version of the paper. Thank you again for your valuable feedback, and please feel free to let us know if you have any further questions.

Comment

Dear Reviewer enUs,

Thank you once again for your helpful and constructive review.

As the Discussion phase is approaching its end in less than three days, we would like to kindly remind you to take a look at our rebuttal. We have added substantial experiments and clarifications in response to your valuable comments and concerns.

If there are any points we may have not fully addressed, we would be grateful if you could let us know.

Looking forward to your feedback.

Best regards,

The Authors.

Review
Rating: 5

The paper proposes Learnable Linear Extrapolation (LLE), which learns coefficients to combine previous estimates and the current prediction for improved inference in solving inverse problems using diffusion models with few steps. The authors rewrite several diffusion-based algorithms for inverse problems in a canonical form which can be combined with LLE. They show consistent improvements for the different algorithms on several standard reconstruction problems.

Strengths and Weaknesses

Strengths:

  • LLE shows consistent improvements for a wide range of different algorithms
  • The canonical form combines several existing algorithms in a unified framework, which is useful for practitioners
  • The paper is well written and easy to follow

Weaknesses:

  • LLE requires additional finetuning and cannot be used out of the box
  • Corollary 2, which motivates LLE, is based on ODEs, while the canonical form is non-deterministic due to the additional noiser step. I saw in the appendix that there is an extension to the more general SDE case; this should be highlighted more in the main paper
  • Improvements are only shown in the few-step regime (up to 14 steps); how large are the gaps to methods taking significantly more steps?
  • I would appreciate if the experiment section included problems where the diversity of solutions/full posterior is evaluated.

Questions

  • Using the history of estimates combined with learned coefficients has also been used in Bespoke Non-Stationary Solvers (https://arxiv.org/pdf/2403.01329). Can the LLE method be seen as learning a type of non-stationary solver as defined in the paper?
  • How are the time steps chosen? Are they just uniformly spaced? Are there methods to choose them in a more optimal way?
  • What happens if we choose a large value for the number of steps $S$ and LLE? Do we still see improvements?

Limitations

yes

Final Justification

I thank the authors for the thorough rebuttal. I will keep my positive rating.

Formatting Concerns

no formatting concerns

Author Response

We sincerely thank the reviewer for acknowledging the effectiveness and generality of our method. Below, we provide responses to the concerns and questions raised. All additional experimental results are reported in terms of PSNR ↑ / SSIM ↑ / LPIPS ↓.

1. "LLE requires additional finetuning and cannot be used out of the box."

We appreciate the reviewer's comment. We'd like to highlight that the training process of LLE is highly efficient, and LLE demonstrates cross-domain generalization ability. First of all, as shown in Table 6 on page 19, the training phase of LLE takes only a few minutes, making it practical and feasible in real-world scenarios. Moreover, as presented in Table 4 on page 18, LLE exhibits a certain degree of cross-domain generalization. We further validate this in a more challenging ImageNet-to-FFHQ cross-domain setting, where LLE still achieves good performance (see the table below, where we test DDNM with 5 steps). This indicates that domain shift does not significantly undermine the effectiveness of LLE. Lastly, we acknowledge that LLE currently requires retraining for different inverse problems. Given the minimal training cost, we consider this an acceptable trade-off. Designing more efficient ways to enhance LLE's performance on cross-task generalization is a promising direction for future work.

| Test set | Train set | Deblur (noiseless) | Inp (noiseless) | SR (noiseless) | CS (noiseless) |
|---|---|---|---|---|---|
| FFHQ | - | 39.82 / 0.964 / 0.087 | 26.34 / 0.719 / 0.380 | 29.99 / 0.865 / 0.210 | 18.29 / 0.591 / 0.488 |
| FFHQ | FFHQ | 40.08 / 0.966 / 0.084 | 28.71 / 0.838 / 0.292 | 29.94 / 0.863 / 0.206 | 19.21 / 0.605 / 0.463 |
| FFHQ | ImageNet | 39.94 / 0.965 / 0.086 | 28.30 / 0.818 / 0.317 | 29.94 / 0.863 / 0.208 | 19.01 / 0.608 / 0.470 |
| ImageNet | - | 36.48 / 0.949 / 0.099 | 23.35 / 0.628 / 0.416 | 25.74 / 0.754 / 0.302 | 17.86 / 0.535 / 0.486 |
| ImageNet | ImageNet | 36.52 / 0.949 / 0.098 | 24.87 / 0.726 / 0.364 | 25.76 / 0.752 / 0.297 | 18.35 / 0.541 / 0.478 |
| ImageNet | FFHQ | 36.53 / 0.949 / 0.099 | 25.14 / 0.759 / 0.341 | 25.70 / 0.750 / 0.296 | 18.20 / 0.527 / 0.485 |

2. "The SDE version of the derivation should be highlighted more in the main paper."

We appreciate the reviewer's valuable suggestion. As the reviewer pointed out, most diffusion-based inverse algorithms are formulated in the SDE framework, and the injected random noise often plays a significant role in improving solution quality. Thus, the SDE-based derivation provided in Appendix B.3 offers a more complete explanation of the motivation and formulation of LLE. We will highlight the SDE formulation in the main text and explicitly clarify its connection to the motivation behind LLE in the camera ready version.

3. "How large are the gaps to methods taking significantly more steps?"

Regarding the comparison between the few-step LLE-enhanced method and the original algorithm with significantly more steps, we kindly refer the reviewer to Table 8 in Appendix C.5 (page 20), where we compare the performance of DDNM with LLE within 15 steps and the default DDNM with 100 steps. The results show that in many cases, LLE-enhanced DDNM with only 15 steps achieves performance comparable to the 100-step baseline.

Regarding the comparison between the LLE-enhanced method and the original algorithm both under significantly more steps, we supplement an additional experiment comparing DDNM with and without LLE under a larger number of sampling steps (25, 50, and 100) on the FFHQ dataset, as presented in the table below. To avoid the need to store an excessive number of past results when the number of steps is large, we limit the lookback window to 10 steps. The results show that as the number of steps increases, the performance gap between methods with and without LLE becomes smaller. This is consistent with our expectation as LLE searches for an improved sampling trajectory within the linear subspace spanned by previous samples. When the original algorithm takes more steps, the discretization error is reduced, and its trajectory already closely approximates the optimal one. In such cases, LLE tends to follow the original trajectory more closely, resulting in smaller gaps.

Therefore, LLE brings more significant improvements in few-step regimes, which aligns with our primary objective of enhancing the performance of inverse algorithms under a limited number of steps. We will include these results and the accompanying discussions in the appendix as an additional ablation study. We appreciate the reviewer's thoughtful question.

| Steps | Method | Deblur (noiseless) | Inp (noiseless) | SR (noiseless) | CS (noiseless) |
|---|---|---|---|---|---|
| 25 | - | 41.14 / 0.972 / 0.064 | 34.33 / 0.941 / 0.109 | 30.92 / 0.885 / 0.146 | 25.01 / 0.837 / 0.220 |
| 25 | LLE | 41.35 / 0.974 / 0.061 | 34.23 / 0.941 / 0.110 | 30.93 / 0.885 / 0.147 | 25.61 / 0.841 / 0.215 |
| 50 | - | 41.57 / 0.974 / 0.056 | 35.48 / 0.953 / 0.077 | 31.14 / 0.889 / 0.142 | 28.94 / 0.897 / 0.158 |
| 50 | LLE | 41.71 / 0.976 / 0.055 | 35.33 / 0.952 / 0.079 | 31.15 / 0.889 / 0.142 | 28.81 / 0.897 / 0.158 |
| 100 | - | 41.70 / 0.974 / 0.056 | 36.29 / 0.960 / 0.058 | 31.41 / 0.883 / 0.140 | 32.15 / 0.927 / 0.122 |
| 100 | LLE | 41.86 / 0.976 / 0.054 | 36.18 / 0.959 / 0.060 | 31.37 / 0.883 / 0.139 | 31.56 / 0.927 / 0.122 |

4. "Evaluating the diversity of solutions/full posterior."

Regarding the diversity on the full test set, we have included results in Appendix H that report FID scores, which serve as a quantitative measure of diversity across the dataset. Regarding the diversity of individual posteriors, we would like to clarify that the image restoration tasks considered in this paper typically have highly concentrated posteriors. For instance, in the 4$\times$ super-resolution task on 256$\times$256 images, the solution is generally close to deterministic and does not deviate significantly from the ground truth. As a result, a common practice in such settings is to evaluate performance using reconstruction metrics such as PSNR, SSIM, and LPIPS. We will include visualizations of solutions corresponding to different initial noises in the camera ready version to better illustrate the diversity of solutions. Sorry for the inconvenience that we are unable to include any visualizations during the rebuttal phase due to submission constraints. We thank the reviewer for this insightful suggestion.

5. "Can the LLE method be seen as learning a type of non-stationary solver?"

Yes, LLE can be viewed as learning a broad class of non-stationary solvers as defined in the provided paper. Specifically, the non-stationary solver in the paper is formulated as a linear combination of all previous steps' estimates and velocity fields (or scores equivalently). Accordingly, LLE can be interpreted as a data-driven approach to learn the optimal coefficients of this linear combination. We will include this paper as a reference and discuss the connection between LLE and these non-stationary solvers in the camera ready version. We thank the reviewer for the valuable comment!

6. "How are the time steps chosen? Are they just uniformly spaced? Are there methods to choose them in a more optimal way?"

Yes, in this paper, the time steps are uniformly spaced, as stated in Appendix F.2 (line 716 on page 25). We will add this clarification in the main text to make this setup clearer.

Adaptive selection of time steps may indeed further improve the performance of inverse algorithms. For example, by leveraging LD3 [R.1], the time steps of inverse algorithms could be parameterized and optimized in a data-driven manner to find more optimal schedules. Combining such adaptive time step schedules with LLE is a promising direction for future work. We appreciate the reviewer's suggestion and will include this discussion in the camera ready version.

7. "What happens if we choose a large value for the number of steps for LLE?"

See our response to point 3 above.

We would like to express our gratitude once again for the reviewer's comments and suggestions for our work. Should there be any further questions, we welcome continued discussion.

[R.1] Tong, Vinh, et al. "Learning to Discretize Denoising Diffusion ODEs." ICLR, 2025.

Comment

I thank the authors for answering my questions and appreciate the additional results from the rebuttal. In an updated version, the authors should highlight the connection between LLE and non-stationary solvers. I will keep my current rating.

Comment

We sincerely appreciate your insightful review and kind consideration. In the camera-ready version, we will explicitly highlight the connection between LLE and non-stationary solvers. Thanks for your valuable suggestion!

Comment

Dear Reviewer hxDr,

Thank you once again for your helpful and constructive review.

As the Discussion phase is approaching its end in less than three days, we would like to kindly remind you to take a look at our rebuttal. We have added substantial experiments and clarifications in response to your valuable comments and concerns.

If there are any points we may have not fully addressed, we would be grateful if you could let us know.

Looking forward to your feedback.

Best regards,

The Authors.

Final Decision

The paper examines the computational challenges of diffusion-based inverse algorithms, which achieve strong performance but typically require many denoising steps. The authors analyze ODE solvers in the inverse setting and identify a linear combination structure underlying the approximations used to trace inverse trajectories. From this perspective, they propose a canonical form that unifies various diffusion-based inverse methods. Building on this framework, they introduce Learnable Linear Extrapolation (LLE), a lightweight technique that refines predictions by optimizing linear combination coefficients of current and past estimates, thereby mitigating the instability of analytical solvers. The authors also carry out experiments for various algorithms and tasks to show that LLE enhances efficiency and performance.

Reviewers highlighted multiple strengths: the framework provides a clear and structured perspective on diverse diffusion methods, the idea of learning linear combination coefficients is interesting, and the method consistently yields performance improvements across a broad set of inverse problems. The paper is also praised for its minimal training overhead and thorough ablations. Some weaknesses were raised, including overlap with prior work on non-stationary solvers, the reliance on facial datasets in the initial submission, and missing comparisons to certain SOTA baselines and consistency models. However, the rebuttal effectively addressed many of these issues by adding diverse experiments, clarifying theoretical connections, and discussing extensions to broader settings. Ethical concerns regarding dataset bias and potential misuse were acknowledged and will be addressed in the final version. While one reviewer was not fully satisfied with the paper and the response of the authors, overall I think the paper merits acceptance.