PaperHub
Rating: 5.0/10
Rejected, 4 reviewers
Lowest: 5, Highest: 5, Standard deviation: 0.0
Individual ratings: 5, 5, 5, 5
Confidence: 3.8
ICLR 2024

Task-Guided Biased Diffusion Models for Point Localization

Submitted: 2023-09-20, Updated: 2024-02-11
TL;DR

A task-guided biased diffusion model is proposed for point localization tasks including crowd localization, human pose estimation, and cell localization.

Abstract

Keywords
Diffusion models; crowd localization; cell localization; human pose estimation

Reviews and Discussion

Review
Rating: 5

This paper uses diffusion models to refine the point-localization results of classical methods. To address the slow inference and stochastic sampling of diffusion models, it introduces task-biased noise in the forward diffusion process for inference efficiency and a task-guided loss to mitigate stochasticity. Experiments demonstrate the effectiveness of the proposed method.

Strengths

  1. The paper replaces the zero-mean Gaussian in the forward process with task-biased noise, supported by a reasonable derivation, to accelerate reverse sampling.
  2. Experiments on several point localization tasks demonstrate the effectiveness of the proposed method.

Weaknesses

  1. The necessity of applying diffusion models to point localization tasks is not demonstrated. In other words, the beneficial properties of diffusion models do not seem to be exploited by this approach. Could the diffusion model be replaced with a simpler network that serves the same purpose, in which case the two problems the paper raises would not arise?
  2. The proposed method is not specific to point localization, since the problems it targets are general to diffusion models. Why, then, were point localization tasks chosen?
  3. Visualizations are missing for two of the three comparison tasks (they are not in the appendix either).
  4. Introducing task-biased noise in the forward diffusion process has already been proposed in the literature [1][2]. The comparison baselines should include these methods rather than only non-diffusion methods, and a discussion is recommended.
  5. Please carefully check the full manuscript for typos.

[1] Accelerating Diffusion Models for Inverse Problems through Shortcut Sampling

[2] Image Restoration with Mean-Reverting Stochastic Differential Equations

Questions

Please refer to the paper weaknesses.

Comment

Thank you for the thoughtful comments. We carefully addressed each issue and revised the manuscript accordingly.

Q1: illustration

[Reply]: We employ diffusion models for their ability to predict sparse outputs, which is beneficial for localization tasks. A comparative illustration between our method and a basic network has been added in Figure 6.

Q2: localization tasks

[Reply]: Please refer to Q1.

Q3: visualization

[Reply]: The visualization of the other two tasks has been added in Figure 6.

Q4: diffusion methods comparison

[Reply]: Thanks for the suggestion. We have added more discussion in the revised version.

Q5: writing typos

[Reply]: We have carefully corrected the typos.

Review
Rating: 5

The authors hypothesize that diffusion models can be used to enhance the performance of deep learning methods for predictive tasks involving sparse outputs, such as point-localization tasks. They introduce two approaches: one to reduce the number of steps required to make a prediction, and one to reduce the variability across predictions that results from feeding different Gaussian noise samples to the reverse diffusion process. The first modifies the conditional forward diffusion process so that each conditional distribution terminates at a non-standard Gaussian whose mean is a function of a density map predicted by a given baseline method. The second modifies the loss function: the proposed task-guided loss decreases the average distance between predictions in the experiments the authors conducted.
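
To make the first approach concrete, the sketch below shows one way a forward process can terminate at a Gaussian centred on a baseline prediction instead of at zero. The interpolation form, the function name `biased_forward_sample`, and the constant stand-in for the baseline density map are all assumptions for illustration; the paper's exact parameterization is not given in this thread and may differ.

```python
import numpy as np

def biased_forward_sample(x0, bias, alpha_bar_t, rng):
    """Draw x_t from an ASSUMED biased forward process (not the paper's formula).

    Assumed marginal:
        x_t = sqrt(abar_t) * x0 + (1 - sqrt(abar_t)) * bias + sqrt(1 - abar_t) * eps
    The mean moves from x0 (abar_t = 1) toward `bias` (abar_t -> 0).
    """
    eps = rng.standard_normal(x0.shape)
    mean = np.sqrt(alpha_bar_t) * x0 + (1.0 - np.sqrt(alpha_bar_t)) * bias
    return mean + np.sqrt(1.0 - alpha_bar_t) * eps

rng = np.random.default_rng(0)

# Sparse ground-truth dot map with two annotated points.
x0 = np.zeros((64, 64))
x0[10, 20] = 1.0
x0[40, 5] = 1.0

# Hypothetical stand-in for the density map predicted by a baseline method.
bias = np.full((64, 64), 0.3)

# Near the end of the chain (abar_t close to 0) the sample is centred on the bias.
x_T = biased_forward_sample(x0, bias, alpha_bar_t=1e-4, rng=rng)
print(float(x_T.mean()))
```

Under this assumed form, reverse sampling can start from noise centred on the baseline prediction, which is the intuition behind needing fewer steps.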

Strengths

  • The idea of using non-standard Gaussians as base distributions in conditional diffusion models is interesting. In particular, if the generated dimensions are not very different from the conditioned dimensions, this could simplify the learning task.
  • The empirical results across different datasets and tasks are encouraging.

Weaknesses

  • While the biased forward diffusion process is defined and derived, the inverse $q(x_{t-1} \mid x_t, x_0, \tilde{x})$ is not. It would normally allow the definition of the estimated generative process, which is not included in the paper. Furthermore, the authors do not provide the KL divergence between $p(x_{t-1} \mid x_t, \tilde{x})$ and $q(x_{t-1} \mid x_t, x_0, \tilde{x})$, which would allow the construction of the loss function for the biased conditional diffusion. Only the loss for a vanilla diffusion model is given in Equation (2) (also, in $L_n$, $n$ is not defined; should it be $L_{\text{simple}}$?). The score estimator $\epsilon_\theta$ in the loss function in (2) does not admit any dependent variables to condition on. To summarise, no details are provided with regard to the generative process.
  • The task-guided loss is not motivated and explained properly. In addition, the expression of the loss $L_t$ is not provided explicitly anywhere in the paper. Furthermore, the ground truth image cannot be reconstructed with expression (9), and one should notice that the score here is not defined to depend on any conditioned variables either. When $L_t$ is the MSE between $\tilde{x}_0$ and $x_0$, then $L_t$ forces the model to simply try to predict a scaled version of the data point $x_0$. How do the authors justify training the same network to predict both the score and the data point? Training a network to learn one of these enables us to define a function predicting the other (they both enable us to predict the mean of $q(x_{t-1} \mid x_t, x_0)$, with some modifications), but training the same network to perform both tasks is counter-intuitive to me. In [1] (Equation 16) a similar combination of losses is used, but there one network focuses on modelling the mean and the other the covariance matrix of $p(x_{t-1} \mid x_t)$.
  • The authors appear to use the classical DDPM from Ho et al 2020, and the classical generative procedure as a baseline, that is, GL+DDPM (and probably in their method, even though that is not described). Considering that the suboptimal classical DDPM generative procedure is used for the baseline, it would be interesting to see the comparison between GL+EDM (which is more efficient [2]) and the model the authors introduce.
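
For reference, the unbiased DDPM of Ho et al. (2020) gives the posterior in closed form, and the per-step KL term reduces to a weighted squared distance between means. The block below restates those standard results. The symbols $\beta_t$, $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$, $\mu_\theta$ and $\sigma_t^2$ are the usual DDPM notation and are assumed here rather than taken from the manuscript. The biased, $\tilde{x}$-conditioned analogue that the review asks for is not derived here.

```latex
\begin{align}
q(x_{t-1} \mid x_t, x_0) &= \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right), \\
\tilde{\mu}_t(x_t, x_0) &= \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0
  + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t,
\qquad
\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t, \\
L_{t-1} &= \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\,
  \bigl\lVert \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \bigr\rVert^2\right] + C.
\end{align}
```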

Minor Weaknesses:

  • In Equation (1), it should be $\sqrt{1-\bar{\alpha}_t}$ in front of the noise instead of $1-\sqrt{\bar{\alpha}_t}$. Also, in Equation (9), $\sqrt{1-\bar{\alpha}_t}$ and $\sqrt{\bar{\alpha}_t}$ should switch places.
  • Some parts of the text can be expressed better (example: third line of the abstract).
  • The structure is not ideal. The details of diffusion models should be given in Section 2 under a subsection titled 'background' and not in Section 3. In addition, the positions of tables and figures relative to where they are referenced should be improved. Figure 3 is referenced in the introduction but is located on page 5, while Figure 1 is located on page 3 but is referenced for the first time at the end of page 4. Similarly, Table 1 is placed on page 6 but is referenced for the first time on page 8, while Table 2 is introduced and referenced on page 7. This makes reading the paper quite confusing.
  • A table with all the steps performed should be given as comparing them just from written text in the current form is difficult.
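
The coefficient corrections in the first minor point can be checked numerically. The sketch below assumes that Equation (1) is the standard DDPM forward marginal and that Equation (9) is the usual estimate of $x_0$ from a noise prediction; the function names are illustrative and not from the paper.

```python
import numpy as np

def forward_noise(x0, eps, alpha_bar_t):
    """Standard DDPM marginal: sqrt(abar_t) scales the data, sqrt(1 - abar_t) the noise."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def reconstruct_x0(x_t, eps_hat, alpha_bar_t):
    """Invert the marginal: subtract the scaled noise, then divide by sqrt(abar_t)."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

rng = np.random.default_rng(0)
x0 = np.zeros((8, 8))
x0[2, 5] = 1.0                      # a single annotated point
eps = rng.standard_normal(x0.shape)
alpha_bar_t = 0.3

x_t = forward_noise(x0, eps, alpha_bar_t)
x0_hat = reconstruct_x0(x_t, eps, alpha_bar_t)   # oracle noise, so recovery is exact

print(np.allclose(x0_hat, x0))
```

With these coefficients the squared weights sum to one, so unit-variance data stays unit-variance. With $1-\sqrt{\bar{\alpha}_t}$ in front of the noise that property is lost, and with the two factors swapped in the inversion the round trip no longer returns $x_0$.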

Questions

  • Why does the AUC drop in Figure 3b at step 10?
  • Why was a discrete setting implemented instead of an SDE diffusion model [3]? Furthermore, more modern conditioning strategies could be used [4].

[1] Nichol et al. Improved Denoising Diffusion Probabilistic Models

https://arxiv.org/pdf/2102.09672.pdf

[2] Karras et al. Elucidating the Design Space of Diffusion-Based Generative Models

https://arxiv.org/abs/2206.00364

[3] Song et al. Score-Based Generative Modeling through Stochastic Differential Equations

https://arxiv.org/abs/2011.13456

[4] Ramesh et al. Hierarchical text-conditional image generation with clip latents

https://arxiv.org/abs/2204.06125

Details of Ethics Concerns

none

Comment

We sincerely thank you for the constructive feedback. We have addressed each issue and revised the manuscript accordingly.

Q1: inverse diffusion

[Reply]: Equation (2) provides a basic overview of standard diffusion models. Therefore, we didn't include the condition in the equation.

Q2: loss function

[Reply]: For a fair comparison, we simply use the same task-guided loss functions as baseline methods.

Q3: GL + EDM

[Reply]: Thanks for the suggestion. This work primarily explores the application of diffusion models to point localization tasks, focusing solely on evaluating DDPM. Investigation of other advanced diffusion models will be a part of future research.

Q4: reorganization

[Reply]: We have reorganized the paper for better understanding. Please refer to the revised paper.

Q5: AUC at step 10

[Reply]: The AUC drops at step 10 because 10 steps alone cannot produce high-quality images in diffusion models.

Q6: SDE diffusion model

[Reply]: This work mainly focuses on how to apply diffusion models to point localization tasks. Therefore, we only evaluate DDPM; other improved diffusion models will be investigated in future work.

Review
Rating: 5

This paper proposes a new method for point localization. The proposed method combines diffusion models with existing point-localization models. It refines the density map generated by the existing methods and adopts a task-related loss to form the final predictions. The proposed method achieves state-of-the-art performance on some benchmark datasets.

Strengths

The proposed method is simple yet effective. It is a general framework that yields performance gains on multiple tasks.

Weaknesses

However, I still have some concerns about this paper:

  1. The proposed method lacks novelty. In general, the authors just combine existing methods with a "diffusion model" to build the final model predictions. To me, it is like a second-stage or post-processing step of the model. The proposed method is also rather simple: the task loss is not a new component, since the authors just apply it in the training process of the diffusion model. The bias term is also not a brand new idea in the vision community (see, for example, non-Gaussian diffusion models).

  2. The "biased" DDPM needs careful design: as I said before, the proposed biased diffusion model is just a diffusion model with a prior. The computer vision community has multiple ways to inject priors into diffusion models, such as conditions, texture inversions, etc. The authors could provide more choices in the ablation study.

  3. The experiments part needs to be revised. From my side, the author could have a brief recap of diffusion models in 3.1, then give more space to provide ablations, and more detailed methods (like the proposed framework on different methods in the same task), FPS, training times, etc.

Questions

Please mainly see the weaknesses section for details.

Comment

Thank you for the thoughtful comments. We carefully addressed each issue and revised the manuscript accordingly.

Q1: Novelty.

[Reply]: The primary contribution of our paper is to apply diffusion models to localization tasks since the diffusion models are capable of producing sharper outputs. We have added a comparison between the predictions from the proposed method and the baseline in Figure 6. The result shows that the proposed method produces sharper outputs.

Q2: Bias

[Reply]: In the proposed method, as illustrated in Figure 1, the condition is employed to inject prior information. It's important to note that the input of this condition is essential, as it provides information about the localization of persons.

Q3: Reorganize paper

[Reply]: According to the suggestion, we have reorganized the paper by reducing the content in Section 3.1 and adding more analysis in Section 4.3.

Review
Rating: 5

This paper proposes a method to use diffusion models for the task of point localization. Specifically, the model introduces a task-guided bias in the diffusion modeling to improve inference efficiency. Besides, it also uses a task-guided loss to decrease the variance of diffusion model predictions in order to increase the accuracy.
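
The variance of predictions across noise samples mentioned in this summary can be quantified, for example, as the mean pairwise distance between outputs obtained from different seeds. The sketch below is one such measurement; `toy_sampler` is a hypothetical stand-in for a diffusion sampler, and its `noise_scale` parameter only mimics more or less stochastic models.

```python
import numpy as np
from itertools import combinations

def mean_pairwise_distance(predictions):
    """Average Euclidean distance over all unordered pairs of predictions."""
    pairs = list(combinations(predictions, 2))
    return float(np.mean([np.linalg.norm(a - b) for a, b in pairs]))

def toy_sampler(seed, noise_scale):
    """Hypothetical sampler: a fixed dot map plus seed-dependent residual noise."""
    rng = np.random.default_rng(seed)
    target = np.zeros((32, 32))
    target[8, 8] = 1.0
    return target + noise_scale * rng.standard_normal(target.shape)

# Eight runs per setting, each with a different seed.
high = mean_pairwise_distance([toy_sampler(s, 0.5) for s in range(8)])
low = mean_pairwise_distance([toy_sampler(s, 0.05) for s in range(8)])

print(low < high)
```

A lower value means the predictions agree more across seeds. It says nothing by itself about whether they agree with the ground truth, which is the distinction raised later in this review.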

Strengths

  • The proposed method may be interesting for the community of specific vision applications such as point localization.
  • The paper conducts extensive experiments on three tasks including crowd localization, pose estimation, and cell localization. The comparison with previous results seems to show the promising results of the proposed method in these benchmarks.

Weaknesses

  • The motivation and hypothesis of the paper may be questionable: At the beginning of the abstract, the authors “hypothesize that diffusion models can be used to enhance the performance of deep learning methods for predictive tasks involving sparse outputs, such as point-localization tasks.” Where does this hypothesis or intuition come from? What kind of data distribution is learned by the diffusion model in this formulation? Why should the incorporation of a diffusion model be helpful for sparse outputs? Although Sec. 3.2.1 describes some motivation, it is not convincing or clear why the diffusion model is inherently helpful in this case.
  • About the novelty of the task-guided bias: In order to reduce sampling steps, the method uses the baseline prediction as the initial prediction. This looks very similar to a number of previous works on conditional diffusion models, such as [1], yet the paper does not seem to mention any previous work in this line or compare against it. From Sec. 3.2.2, it looks like the forward and backward diffusion processes both need a biased noise distribution. Does this mean that training the diffusion model also requires paired data (initial prediction and ground truth)? This may indicate that a new diffusion model must be trained for each different dataset, task, or baseline used to obtain the initial prediction. What kind of data distribution priors are learned in the process?
  • About the purpose of the task-guided loss: According to the paper, this loss is proposed to reduce the variance of predictions from different noise samples so as to improve accuracy. First, it is not clear to me why minimizing the stochasticity of the reconstructed samples helps accuracy. Second, with such a constraint, it seems that the prior Gaussian noise distribution is collapsed or converted to a delta data distribution. What does this mean in the predictive task? If the stochasticity of the reconstructed samples is not desired, why not just use a fixed input or a deterministic sampling process?
  • Experiments and results: 1) The proposed method seems to be quite sensitive to the baseline method used to get the initial prediction, as shown in Table 4. 2) The gain and difference between the proposed method compared with corresponding baseline models seem to be relatively small, as shown in Table 4. 3) In Table 1, the precision of the baseline method achieves a much higher score than all other DDPM incorporated methods. Again, what prior has been learned by such a diffusion model and why it should benefit the predictive results. The motivation is not very clear.
  • One of the many contribution claims of the paper is that task-guided bias can improve sampling efficiency. What is the sampling time of the proposed method? How does it compare with the model without task-guided bias? How does it compare with the baseline models in those tables achieving comparable results?
  • No discussion about the method limitation in the paper.
  • The figure demonstration such as Figure 1 could be further improved.

[1] I2SB: Image-to-Image Schrödinger Bridge, ICML 2023.
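
As an illustration of the deterministic alternative raised in the task-guided-loss point, the sketch below implements a single DDIM-style update with no injected noise (the $\eta = 0$ case of Song et al.'s Denoising Diffusion Implicit Models). This is a swapped-in technique, not something the paper is known to use; the function name is illustrative, and the oracle noise stands in for a trained noise predictor.

```python
import numpy as np

def ddim_step(x_t, eps_hat, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0): no fresh noise is sampled."""
    # Estimate the clean signal from the current state and predicted noise.
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)
    # Re-noise the estimate to the previous (less noisy) level with the same noise.
    return np.sqrt(alpha_bar_prev) * x0_hat + np.sqrt(1.0 - alpha_bar_prev) * eps_hat

rng = np.random.default_rng(0)
x0 = np.zeros((16, 16))
x0[3, 7] = 1.0
eps = rng.standard_normal(x0.shape)
abar_t, abar_prev = 0.2, 0.6

x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
x_prev = ddim_step(x_t, eps, abar_t, abar_prev)   # oracle eps replaces the network

# Repeating the step from the same state gives the same result.
print(np.array_equal(x_prev, ddim_step(x_t, eps, abar_t, abar_prev)))
```

With such a sampler, the only randomness left is the initial state, so fixing that input makes the whole prediction reproducible.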

Questions

Please see the weaknesses section for the detailed questions.

Comment

Thank you for the insightful comments. We carefully addressed each issue and revised the manuscript accordingly.

Q1: intuition

[Reply]: Diffusion models excel in modeling data distribution. In our study, we trained these models with sparse dot maps, enabling them to predict sparse outputs effectively. This is validated through the visualization of the predicted maps in Figure 6.

Q2: model training

[Reply]: Thanks for pointing out the work [1]; additional discussion of it has been added in Section 2. Indeed, the diffusion model needs to be trained for each new dataset or task, which is common practice.

Q3: task-guided loss

[Reply]: We didn't claim that accuracy is improved because we minimized the stochasticity. Since we apply diffusion models to discriminative tasks, the task-guided loss explicitly supervises the model to predict the correct location. At the same time, it reduces the stochasticity of the prediction.

Q4: motivation

[Reply]: The paper is motivated by the ability of diffusion models to predict sharp images, a feature standard networks lack. This makes them a natural choice for localization tasks. We have added a visualization in Figure 6. Please refer to the revised version.

Q5: efficiency

[Reply]: Our comparison focuses solely on the number of sampling steps, since the time per step is identical to that of standard diffusion models because the biases are precomputed. Compared to baseline methods, our approach incurs a slight increase in computational time, as it requires additional diffusion steps.

Q6: limitation

[Reply]: The primary limitation of our method is its increased computational demand compared to baseline models. This issue is discussed in Section 5.

Q7: figure 1 demonstration

[Reply]: We have revised the figure in the revised version.

Comment

Thanks to the authors for responding to my questions. After reading the response and the other reviewers' comments, I will keep my original rating.

AC Meta-Review

The paper proposes a diffusion model for point localization. The reviewers unanimously vote for rejection, citing a lack of general novelty, and the narrow applicability. They suggest that the paper might be more appropriate for a specialized venue.

Why Not a Higher Score

Lack of general novelty, and the narrow applicability.

Why Not a Lower Score

N/A

Final Decision

Reject