PaperHub

Average rating: 4.8/10 · Withdrawn · 4 reviewers
Ratings: 6 / 3 / 5 / 5 (min 3, max 6, std 1.1)
Confidence: 3.8 · Correctness: 2.5 · Contribution: 2.8 · Presentation: 2.3

ICLR 2025

Bayesian Enhancement Models for One-to-Many Mapping in Image Enhancement

OpenReview · PDF
Submitted: 2024-09-18 · Updated: 2025-01-30

Abstract

Keywords
Image Enhancement

Reviews & Discussion

Review
Rating: 6

One of the key issues worth considering in image enhancement is determining the appropriate level of enhancement, which is referred to as the "one-to-many" problem in the paper. In this paper, the authors attempt to address this issue using Bayesian Neural Networks (BNNs). This is the most significant contribution of the paper and also the aspect that I find most interesting. Specifically, this paper introduces Momentum Prior to mitigate the convergence difficulties of Bayesian-based Enhancement Models and proposes a two-stage approach to reduce the complexity of Bayesian-based Enhancement Models.

Strengths

1. This is the first method based on Bayesian Neural Networks (BNNs) for image enhancement.

2. The approach demonstrates impressive results in both low-light image enhancement (LLIE) and underwater image enhancement (UIE) tasks.

Weaknesses

The weaknesses summarize the questions that follow; please refer to the questions for more details.

1. Ground Truth Leakage in the Prediction Method (Q1)

2. Slight Lack of Novelty in the Momentum Prior (Q2.1)

3. Insufficient Evidence to Support the Motivation of the Momentum Prior (Q2.2) and the TWO-STAGE APPROACH (Q3)

4. The coherence and readability of the paper could be improved. For instance, neither the abstract nor the introduction addresses the motivation of the TWO-STAGE APPROACH; it then appears suddenly in the contributions and main text that a "TWO-STAGE APPROACH is introduced to reduce the complexity of BEM in modeling high-dimensional image data".

Questions

I have previously considered the one-to-many problem mentioned by the authors and even experimented with BNNs, but eventually abandoned the approach. Therefore, I am pleasantly surprised to see this paper and sincerely hope it can be accepted. However, some issues prevent me from giving a higher score, leading to a borderline reject. I have carefully considered the issues I raise, and while they might be somewhat pointed, I really have not found clear answers to them. If the authors address some of the concerns I raise, I would be willing to increase my score!

Q1. The most critical issue is in Section 3.3, "PREDICTIONS UNDER UNCERTAINTY", where the prediction method seems to rely on ground truth (GT) information.

  • Lines 225–230: when GT is available, the mean squared error (MSE) or another perceptual metric is computed, and the image with the best score is selected as the output. Since MSE is a step in calculating PSNR, and the perceptual metric used in this paper (LPIPS) is highly correlated with PSNR, this approach essentially runs multiple inferences and chooses the result closest to the GT. This appears to be a form of GT leakage. The network is expected to determine the degree of enhancement autonomously; however, this method relies on GT information to determine the enhancement level, which is akin to an advanced "GT mean" trick.

The paper also lacks an ablation study comparing with and without the proposed prediction method. The authors can consider adding such ablations and evaluating the model using no-reference image quality metrics or exploring alternative fusion methods.
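For concreteness, the selection procedure I am describing would look roughly like the following sketch (my own paraphrase for illustration, not the authors' actual code; `model`, `x`, and `gt` are placeholder tensors):

    import torch

    def select_best(model, x, gt, k=100):
        """Run k stochastic forward passes and keep the sample closest to GT.

        The full-reference score (here MSE) uses the ground truth, so the
        chosen output is by construction the candidate nearest to GT.
        """
        best, best_mse = None, float("inf")
        for _ in range(k):
            pred = model(x)  # each pass samples a new set of BNN weights
            mse = torch.mean((pred - gt) ** 2).item()
            if mse < best_mse:
                best, best_mse = pred, mse
        return best  # metrics computed on this output are maxima over the k samples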

Q2. Novelty and motivation of the Momentum Prior method:

  • Q2.1 Lack of innovation in the Momentum Prior: The Momentum Prior is essentially a combination of BNN and Exponential Moving Average (EMA). EMA is a well-established and commonly used technique, so simply combining BNN and EMA may lack sufficient novelty.
  • Q2.2 Insufficient evidence to support the motivation of Momentum Prior: The Momentum Prior is introduced to address issues of underfitting or even non-convergence. However, the paper does not cite any works that discuss these problems in BNN training, and there is insufficient experimental evidence to support the claim of non-convergence in BNNs. In Section 5.3, lines 499-520, there is an ablation study on different priors, and Figure 8 shows PSNR values over iterations. Yet, from Figure 8, it appears that other priors also converge, albeit to poorer results, which does not strongly support the claim of underfitting or non-convergence.

I have personally thought about this problem for some time, and I understand the difficulty in proving it. Some other methods present both PSNR and loss training curves, which the authors may want to consider.

Q3. Insufficient evidence to support the motivation of TWO-STAGE APPROACH:

  • The TWO-STAGE APPROACH is proposed to address imprecision due to the complexity of high-dimensional images. However, there are no relevant references cited, and the ablation study only demonstrates that the two-stage approach performs better than a single-stage approach. I still do not know the exact reasons why the two-stage approach is superior.

I remain skeptical about whether the issue is truly due to dimensionality. For instance, on a low-dimensional dataset, would using only a Bayesian Neural Network (BNN) suffice? The authors could consider testing the one-stage and two-stage variants on lower-dimensional tasks, providing a theoretical analysis of why the two-stage approach helps with high-dimensional data, or citing relevant references on the claim that "the complexity of high-dimensional images reduces the performance of BNNs".

Comment

Thank you for your valuable review. We feel grateful that the reviewer has previously considered BNNs for the image enhancement problem. Since BNNs are not a mainstream approach, our exploration involved numerous challenges and failures, but we have seen some promising results in this direction, and we have continuously refined our method to address them.

To ensure the reproducibility of all results presented in the paper, we have provided the reviewer with a fully anonymous GitHub repository. The repository includes the code (more functions are coming) and GIF-based visualisations.

Below, we address the reviewer's comments in detail.

Q1. (1) The most critical issue is in Section 3.3, "PREDICTIONS UNDER UNCERTAINTY", where the prediction method seems to rely on ground truth (GT) information. (2) The paper also lacks an ablation study comparing with and without the proposed prediction method. The authors can consider adding such ablations and evaluating the model using no-reference image quality metrics or exploring alternative fusion methods.

A1: (1) Since this was a common concern, please refer to Q1 in the “Shared Comments”. (2) In our code, we also implemented a Monte Carlo approach to obtain fused predictions. However, since the fusion results obtained via Monte Carlo were not significantly different from those of BEM_determ, we did not include this in the main paper.

In the newly added Appendix H, we provide results obtained using a no-reference metric and compare them visually with those obtained using a full-reference metric. As shown in Figure 22 and Figure 23, the enhanced images generated using the no-reference metric are comparable to, and in some cases even better than, those generated with the full-reference metric.
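For reference, the Monte Carlo fusion mentioned in (2) can be sketched as follows (a minimal illustration assuming a stochastic `model` whose weights are resampled on every forward pass; this is not the exact implementation in our repository):

    import torch

    @torch.no_grad()
    def mc_fuse(model, x, k=100):
        """Fuse k stochastic BNN predictions by averaging.

        Each forward pass draws fresh weight samples, so the mean
        approximates the posterior predictive expectation; no ground
        truth is involved at any point.
        """
        preds = torch.stack([model(x) for _ in range(k)], dim=0)
        return preds.mean(dim=0)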

Comment

By the way, from my previous experiments, I do not recall encountering the noise in the (a) setting shown in Figure 15, even in image denoising tasks. This discrepancy seems somewhat unusual.

Additionally, it appears that “Label Diversity Augmentation” introduces excessive randomness, leading to significant brightness variations during inference. In my previous experiments, trained BNNs generally produced more plausible and consistent results, with less extreme variation. This high randomness and fluctuation in image quality could be why fusion across 100 images yields suboptimal results.

Comment

Thank you. I have also replied in the "Shared Comments".

Comment

Q3. Insufficient evidence to support the motivation of TWO-STAGE APPROACH: The TWO-STAGE APPROACH is proposed to address imprecision due to the complexity of high-dimensional images. However, there are no relevant references cited, and the ablation study only demonstrates that the two-stage approach performs better than a single-stage approach. I still do not know the exact reasons why the two-stage approach is superior.

A3: To address this, we have added Appendix F to our revised manuscript, specifically prepared for the reviewer, and we sincerely hope you will take the time to review it.

Some works have utilised BNNs for dense prediction tasks, such as semantic segmentation [1]. These tasks can tolerate less accurate pixel-level predictions because they do not involve perceptual quality. By contrast, image enhancement is highly sensitive to perception. For instance, in semantic segmentation, prediction values of 0.6 and 0.9 might represent the probability that a pixel belongs to the background, which is tolerable. However, in an image with pixel values ranging from 0 to 1, 0.6 and 0.9 appear vastly different to the human eye.

In high-dimensional image data, BNNs introduce uncertainty in the prediction of each pixel. This pixel-level uncertainty often manifests as noise in the output image, negatively affecting both visual perception and certain image quality metrics (see Figure 15 in Appendix F). We believe this challenge might explain why BNNs have not been widely adopted for image enhancement in prior work.

Initially, we did not aim to create a two-stage BNN-DNN model. In fact, we did not choose a two-stage design willingly; it was the practical solution we could find to address the noise issue after many failed attempts. Before adopting the two-stage framework, we attempted to reduce noise by replacing the BNN's final projection layer with a deterministic layer. However, these attempts did not yield satisfactory results.

To intuitively compare the advantages of the two-stage BNN-DNN framework, we have included comparisons with several alternative frameworks we experimented with. The results are illustrated in the table below.

Framework           Downscale (Stage-I)   UIEB-R90 (PSNR↑)   UIEB-R90 (SSIM↑)   LOL-v1 (PSNR↑)   LOL-v1 (SSIM↑)
(a) BNN             N/A                   21.72              0.885              22.74            0.818
(b) BNN-v2          N/A                   23.71              0.899              24.78            0.852
(c) DNN             N/A                   20.83              0.864              23.76            0.842
(d) BNN-DNN                               25.62              0.940              26.83            0.877
(e) DNN_down-DNN                          20.68              0.812              22.85            0.823
(f) Cascaded DNNs                         20.95              0.873              23.98            0.827
(g) BNN-DNN                               17.78              0.689              19.26            0.798

With the adoption of the two-stage BNN-DNN framework, we were able to make meaningful progress. We acknowledge that the two-stage approach is not the most elegant solution, and we hope future work can explore more refined alternatives. However, for the time being, the two-stage framework is a practical and effective implementation.

[1] Kendall, Alex, Vijay Badrinarayanan, and Roberto Cipolla. "Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding." arXiv preprint arXiv:1511.02680 (2015).
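As a rough illustration of the two-stage inference described above (a sketch under our own naming assumptions: `bnn_stage1`, `dnn_stage2`, `scorer`, the downscale factor, and the stage-II input signature are all illustrative, not the exact interface in our code):

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def two_stage_enhance(bnn_stage1, dnn_stage2, x, k=100, scale=0.125, scorer=None):
        """Stage I: a BNN predicts k coarse enhancements at low resolution,
        where the one-to-many uncertainty (lighting, colour) lives.
        Stage II: a deterministic DNN refines one selected coarse output at
        full resolution, suppressing the BNN's pixel-level noise."""
        x_low = F.interpolate(x, scale_factor=scale, mode="bilinear",
                              align_corners=False)
        coarse = torch.stack([bnn_stage1(x_low) for _ in range(k)])  # k samples
        idx = scorer(coarse).argmax() if scorer is not None else 0   # choose one
        guide = F.interpolate(coarse[idx], size=x.shape[-2:],
                              mode="bilinear", align_corners=False)
        return dnn_stage2(x, guide)   # deterministic, noise-free refinement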

Q4: The coherence and readability of the paper could be improved. For instance, neither the abstract nor the introduction addresses the motivation of the TWO-STAGE APPROACH; it then appears suddenly in the contributions and main text that a "TWO-STAGE APPROACH is introduced to reduce the complexity of BEM in modeling high-dimensional image data".

We thank the reviewer for pointing out structural issues in our paper. In the revised manuscript, we added the following sentence to the abstract to clarify the motivation for our two-stage approach:

To address the noise in predictions of Bayesian Neural Networks (BNNs) for high-dimensional images, we propose a two-stage approach.

Additionally, we refined lines 75–78 to further emphasise the motivation for the two-stage approach.

Comment

Q2. Novelty and motivation of the Momentum Prior method:

  • Q2.1: Lack of innovation in the Momentum Prior: The Momentum Prior is essentially a combination of BNN and Exponential Moving Average (EMA). EMA is a well-established and commonly used technique, so simply combining BNN and EMA may lack sufficient novelty.

  • A2.1: We respectfully highlight that, to the best of our knowledge, this work is the first to apply the momentum concept to implement a dynamic prior for BNNs in low-level vision tasks. As the reviewer rightly pointed out, EMA is indeed a widely used and versatile technique. Our approach builds on this simple yet effective idea to address the longstanding challenge of low performance associated with the fixed prior in BNNs.

  • Q2.2: Insufficient evidence to support the motivation of Momentum Prior. The Momentum Prior is introduced to address issues of underfitting or even non-convergence. However, the paper does not cite any works that discuss these problems in BNN training, and there is insufficient experimental evidence to support the claim of non-convergence in BNNs. From Figure 8, it appears that other priors also converge, albeit to poorer results, which does not strongly support the claim of underfitting or non-convergence. I have personally thought about this problem for some time, and I understand the difficulty in proving it. Some other methods present both PSNR and loss training curves, which the authors may want to consider.

  • A2.2: Based on the reviewer’s feedback, we have expanded our explanation of the motivation and inspiration for the momentum prior in lines 205–215 of the revised manuscript and cited relevant previous works.

    We agree that the absence of the training curve in Figure 8 makes it difficult to assess potential underfitting or non-convergence issues. Unfortunately, due to an oversight, we did not compute the PSNR curve on the training set. We commit to rerunning the experiments to obtain and include the training curve.

    That being said, Figure 8 still demonstrates that the Momentum Prior outperforms both the empirical Bayes prior and the Gaussian prior. Furthermore, given that BNNs with fixed priors perform significantly worse than DNNs with the same architecture on the same dataset, we can reasonably infer from empirical observations that this discrepancy likely arises from non-convergence.

    For reference, the performance (in terms of PSNR) of the single-stage BNN in deterministic mode, using different priors, on the LOL training and test sets is as follows:

    Dataset        Fixed Gaussian   Empirical Bayes Prior   Momentum Prior
    Training Set   12.36            18.63                   25.08
    Test Set       11.84            18.04                   22.56

    Additionally, we observed that BNNs with an Empirical Bayes Prior tend to exhibit very large gradients, with the average gradient norm reaching as high as 500 during training. After applying gradient clipping, the parameters of the model with an Empirical Bayes Prior tend to stagnate after a certain number of iterations, oscillating without significant updates. These observations lead us to hypothesise that the failure of these models is due to underfitting.
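    For context, the gradient-norm monitoring and clipping mentioned above follow the standard PyTorch pattern (a sketch; `model`, `loss`, `optimizer`, and the clipping threshold are illustrative placeholders):

        import torch

        loss.backward()
        # clip_grad_norm_ returns the total norm computed *before* clipping,
        # i.e. the quantity we logged (reaching ~500 with the Empirical
        # Bayes prior); max_norm here is an illustrative threshold.
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()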

    To support the claim that BNNs with a fixed prior are prone to underfitting, we have cited relevant prior works [1][2] in our revised manuscript.

    [1] Dusenberry, Michael, et al. "Efficient and scalable bayesian neural nets with rank-1 factors." International conference on machine learning. PMLR, 2020.

    [2] Tomczak, Marcin, et al. "Collapsed variational bounds for Bayesian neural networks." Advances in Neural Information Processing Systems 34 (2021): 25412-25426.

Comment

To A2:

In about 2022, I spent a long time researching and experimenting with BNNs and LLIE, so many parts of this paper leave me puzzled.

First, "Momentum Prior" and "Gaussian prior" represent the same underlying prior, the difference is the optimization. This paper proposes "Momentum Prior" as a replacement for the Gaussian prior, as the authors suggest that "relying on a fixed, simple prior can restrict the network’s capacity to fit the data effectively." However, without the Gaussian prior (i.e., weights following a Gaussian distribution), the KL divergence cannot be simplified to the form in Eq. (8). In traditional BNN learning, the mean and variance of each layer’s parameters are learnable. This paper applies EMA to the learnable mean and variance, so the parameters produced by BNN still follow a Gaussian distribution.

Second, in fact, I have encountered similar ideas in low-level papers, such as [1].

Third, I am skeptical about EMA's effectiveness. EMA's capability is not as extensive as suggested. It can improve good parameters through ensembling, but if the initial parameters are poor, ensembling alone cannot yield such large improvements. When poor parameters appear, EMA’s mechanism requires numerous iterations to dilute their effect. While EMA results should theoretically outperform the Gaussian prior to some degree, I am surprised to see such a substantial improvement here. In the table, does "fixed Gaussian" mean that the variance of the Gaussian prior is fixed?

[1] Pang, T., Quan, Y., Ji, H. (2020). Self-supervised Bayesian Deep Learning for Image Recovery with Applications to Compressive Sensing. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol 12356. Springer, Cham. https://doi.org/10.1007/978-3-030-58621-8_28

To A3:

I also feel that the two-stage approach is not an elegant solution and seems more like a compromise.

Beyond the two-stage training, the "MAMBA Backbone," "Controllable Local Enhancement," and "Label Diversity Augmentation" are not true contributions and instead make the approach overly complex. This added complexity detracts from the method's elegance and extensibility, potentially hindering comparisons in future work.

For a paper pioneering BNNs, especially one submitted to ICLR, metrics like PSNR may be of limited relevance. In fact, the PSNR results could discourage other researchers from following up on this work. In future similar studies, it might be better to remove unnecessary components and focus solely on polishing the main contribution and motivation. For instance, the authors themselves emphasize in the supplementary material (Figures 22 and 23) that the image with the highest PSNR is not necessarily the best. Since the goal of the paper is to address the one-to-many problem, it would be more effective to introduce alternative quantitative comparison formats that showcase diversity. However, the authors have instead highlighted a particularly high PSNR as support, which may not add much value.

Comment

W2- Response to Comment: To A2:

(1) We apologise for any confusion caused by the term Gaussian Prior. For clarity, the terms "simple," "naïve," and "fixed Gaussian Prior" in our paper all refer to a Zero-Mean Isotropic Gaussian distribution, such as $\mathcal{N}(0, 0.1I)$, where the mean is 0 and the variance is equal across all dimensions. We have clarified this in the revised version of our manuscript.

Momentum Prior is still a Gaussian distribution, so Eq. (8) remains valid. In our paper, the definition of Momentum Prior focuses on the update rules for the mean and variance. Its Gaussian nature is evident from Eq. (7), which explicitly defines it as such. This clarification has been incorporated into the revised manuscript.
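Concretely, for a factorised Gaussian posterior $q(w) = \mathcal{N}(\mu_q, \sigma_q^2)$ and any Gaussian prior $p(w) = \mathcal{N}(\mu_p, \sigma_p^2)$, fixed or Momentum, the KL term keeps the standard per-weight closed form (we restate it here for the reviewer's convenience; it is summed over weights in practice):

$$\mathrm{KL}(q \,\|\, p) = \log\frac{\sigma_p}{\sigma_q} + \frac{\sigma_q^2 + (\mu_q - \mu_p)^2}{2\sigma_p^2} - \frac{1}{2}.$$

Since the Momentum Prior only changes how $\mu_p$ and $\sigma_p$ are updated (via EMA), not the Gaussian form itself, this expression, and hence Eq. (8), remains valid.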


(2) We were not aware of the paper pointed out by the reviewer earlier, and we thank the reviewer for bringing it to our attention. In the revised version of our manuscript, we have added a citation to this work. However, it is worth noting that in [1], BNNs did not outperform DNNs, whereas in our work, we have successfully achieved this.


(3) Reason A: The effectiveness of EMA depends heavily on the domain it is applied to and how it is used. For tasks like supervised image recognition, the improvements brought by EMA models are indeed limited. However, in self-supervised learning, e.g., the widely recognised MoCo [3], the EMA teacher serves as a highly effective supervisory signal. In previous works on Bayesian deep learning such as [4], similar ensemble-output techniques have been applied to approximate Bayesian estimation by training multiple DNNs. These methods aim to ensure that the model considers a wide range of possibilities. Similarly, when used as a prior, EMA summarises information from multiple checkpoints during training, providing an effective guide for optimising BNN parameters.

Reason B: Regardless of the presence of poor parameters in the Momentum Prior, these represent an important signal of predictive uncertainty. This conveys to the posterior that previous variations in predictions should not be entirely disregarded. In the optimisation objective of the BNN (Eq. 8), the posterior updates are influenced not only by the constraints imposed by the prior but also by the supervisory signals provided by the target. As a result, even if poor parameters appear in the Momentum Prior, they do not play a decisive role. Similarly, the EMA model in MoCo may occasionally learn erroneous information; however, through continuous iteration, such errors are gradually mitigated. The success of MoCo is largely due to the iterative updates of the EMA model’s parameters, which refine it into an effective supervisory signal over time.

Reason C: Another motivation for using the Momentum Prior is our observation that during DNN training (e.g., over 150K iterations), the results start oscillating around ~80K iterations. These oscillations occasionally produce good results but often revert, failing to stabilise. The Momentum Prior effectively captures and summarises these late-stage oscillations (which can be considered a form of predictive uncertainty), thereby better guiding the updates to the BNN's posterior parameters. Moreover, a $\mathcal{N}(0, 0.1I)$ prior essentially acts as weight regularisation for BNNs. This regularisation is overly restrictive, preventing the BNN from fitting a complex distribution. The assumption behind such a fixed Gaussian prior is that image predictions do not require overly complex weight distributions. In fact, even simply removing the prior entirely and directly optimising the BNN by minimising the MSE between the BNN's output and the target yields better performance than the fixed Gaussian prior. However, this approach no longer aligns with standard Bayesian estimation.

Therefore, it is not that EMA is too strong, but rather that the Zero-Mean Isotropic Gaussian prior is ineffective. For instance, when no prior is used, the BNN achieves a PSNR of 20.32 dB on the LOL dataset. While this still lags behind the EMA prior, it is significantly better than the naïve Gaussian prior, which only achieves 11.84 dB.

Regarding empirical Bayes prior, we hold an optimistic view. When the prior is derived from different datasets, it can act as a form of knowledge distillation, as demonstrated in previous work [2].

[1] Pang, T., Quan, Y., Ji, H. (2020). Self-supervised Bayesian Deep Learning for Image Recovery with Applications to Compressive Sensing. ECCV 2020.

[2] Pan, Xingang, et al. "Exploiting deep generative prior for versatile image restoration and manipulation." TPAMI.

[3] He, Kaiming, et al. "Momentum contrast for unsupervised visual representation learning." CVPR 2020.

[4] Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. "Simple and scalable predictive uncertainty estimation using deep ensembles." Advances in neural information processing systems 30 (2017).

Comment

Thank you very much for your reply and the valuable comment!

(a) We kindly ask how you implemented your BNN, as this may affect the noise pattern in the outputs. In our experiments, we tried three implementations, including Variational Bayesian Last Layers (VBLL) [1], Dropout, and our current implementation based on Bayesian-Torch. While the Dropout implementation did not produce such obvious noise in the outputs as shown in Fig. 15(a), we observed noise patterns with the other two implementations.

The reason for this noise could be that, in a Bayesian layer, there is an inherent random noise term (e.g., weight = self.mu_weight + self.sigma_weight * self.eps_weight.data.normal_()) added to the weights. Below, we provide the forward code for the BNN linear layers we used. Another possible reason we speculate is that low-light images inherently contain read noise due to low ISO settings. The stochastic process in BNNs might amplify such read noise unintentionally, leading to noisy output. In Figure 24, we show the impact of read noise on a one-stage BNN's prediction.

Also, if you fuse multiple predictions to produce a more robust output, this would likely reduce the noise due to averaging. In our experiments, while the Dropout implementation did not exhibit such obvious noise, it resulted in lower performance, leading us to abandon it in favour of the others.

    # Assumes the module's usual imports: torch, and torch.nn.functional as F.
    def _forward_uncertain(self, input):
        if self.training:
            # Momentum Prior update: the prior parameters track an EMA of the
            # posterior parameters; wrapped in no_grad() because the prior is
            # updated, not learned.
            with torch.no_grad():
                # Warm-up schedule: the decay ramps up from ~0.1 towards self.decay.
                _decay = min(self.decay, (1 + self.step) / (10 + self.step))
                # _decay = self.decay
                self.prior_mu_weight = _decay * self.prior_mu_weight + (1 - _decay) * self.mu_weight
                self.prior_rho_weight = _decay * self.prior_rho_weight + (1 - _decay) * self.rho_weight
                # softplus maps the unconstrained rho to a positive sigma
                self.prior_sigma_weight = torch.log1p(torch.exp(self.prior_rho_weight))

                if self.bias:
                    self.prior_mu_bias = _decay * self.prior_mu_bias + (1 - _decay) * self.mu_bias
                    self.prior_rho_bias = _decay * self.prior_rho_bias + (1 - _decay) * self.rho_bias
                    self.prior_sigma_bias = torch.log1p(torch.exp(self.prior_rho_bias))
            self.step += 1

        # Reparameterization trick: w = mu + sigma * eps, with eps ~ N(0, I).
        # This per-forward noise term is the source of the stochastic outputs.
        self.sigma_weight = torch.log1p(torch.exp(self.rho_weight))
        weight = self.mu_weight + self.sigma_weight * self.eps_weight.data.normal_()

        if self.bias:
            self.sigma_bias = torch.log1p(torch.exp(self.rho_bias))
            bias = self.mu_bias + self.sigma_bias * self.eps_bias.data.normal_()
        else:
            bias = None

        out = F.linear(input, weight, bias)

        return out

(b) Label Diversity Augmentation has not been included in the experiments outside Appendix D, as we stated in our original submission:

For consistency, these augmentation strategies were not applied in other experiments.

The sub-optimality of prediction fusion is an inherent characteristic of BNNs. For example, in autonomous driving, prediction fusion is often used to reduce a model's overconfidence. However, such robust predictions tend to deviate from both the optimal and worst-case solutions, making them inherently sub-optimal.

[1] Harrison, James, John Willes, and Jasper Snoek. "Variational bayesian last layers." ICLR (2024).

Comment

W3- Response to Comment: To A3: We adopted Mamba as our backbone because it has gained significant popularity in low-level vision tasks over the past year. While exploring BNNs, we also aimed to adapt them to more advanced backbones, and we have conducted some experiments with a Transformer backbone. We appreciate the reviewer pointing this out and will provide implementation details for the Transformer backbone in future updates.

Regarding "Controllable Local Enhancement" and "Label Diversity Augmentation," these concepts were not discussed in the main text but were included in the Appendices as addtional information for readers interested in further exploring the potential of BNNs about the attempts we made and possible future directions. Additionally, on our anonymous GitHub, we have experimented with approaches like Masked Image Modeling, which are prevalent in Deep Learning, as potential future avenues for BNNs. These additional explorations demonstrate our extensive efforts to advance the field of BNNs beyond theoretical discussions.

However, as the reviewer rightly pointed out, these topics are not the primary focus of the paper. If our paper is accepted, we will remove these contents from the appendix to ensure a sharper focus.

We agree with the reviewer that PSNR is not an ideal metric. As one of the few works in this field leveraging BNNs, we initially followed the convention of using PSNR as it has been the standard in DNN-based works. Moving forward, we will reduce the emphasis on PSNR in the main text and focus on more meaningful metrics.

Thank you for your valuable feedback, which has greatly helped us refine our work.

Comment

I was too busy yesterday, but today I spent considerable time re-reviewing this paper, the supplementary materials, other reviewers' comments, and the authors' responses.

Reply to “Response to Additional Comments”

Thank you for the detailed response. These issues are not directly related to the paper's main contributions; I just intend to provide the authors and other readers with some potential directions for solving these problems.

Additionally, here are a few points worth noting:

  1. Image denoising and super-resolution are also one-to-many problems. A single noisy image can correspond to different clean images because noise follows a certain probability distribution. Different combinations of clean images and noise can result in the same noisy image. The one-to-many issue in image denoising may be relatively minor, but it is significantly more pronounced in super-resolution [1].

[1] PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models

  2. In low-light conditions, shot noise tends to be more significant than read noise.

Reply to W2

  1. It’s okay if you are unfamiliar with this paper [2]. Whether it is cited or not doesn’t really matter. When I was initially working on BNN, I also didn’t come across this paper and only discovered it while studying self-supervised learning. However, it’s worth noting that the BNN results in this paper [2] also outperform those of DNN.

[2] Self-supervised Bayesian Deep Learning for Image Recovery with Applications to Compressive Sensing

  2. In traditional BNNs, the variance of the Gaussian distribution is learnable, whereas the authors set it to a fixed value. This is likely the key reason why using EMA resulted in such a significant improvement. In reality, “fixed variance” < “learnable variance” < “learnable variance + EMA”. However, the authors seem to have highlighted the benefits of EMA by only comparing the results of “fixed variance” and “learnable variance + EMA.” Even under this somewhat unfair comparison, the “Empirical Bayes Prior” still achieved decent results.

Some additional issues to note:

  1. It would be better to unify the terminology for “simple,” “naïve,” or “fixed Gaussian Prior” into a single consistent term.

  2. In Figure 7, initializing σ=0.05 produced the best results. This initialization should be kept consistent when compared with other methods to ensure fairness. Moreover, the authors not only set σ as non-learnable but also appear to have chosen a suboptimal value for σ.

Reply to W3

I believe the authors could be more honest and conduct fairer comparisons. During the rebuttal period, I noticed additional issues that seem to have been overlooked or intentionally omitted. Supplementary materials should include details that could not be added to the main paper due to space limitations.

For instance, in the ablation study (Table 4), the "Single-stage" setup replaces the final BNN layer with a DNN layer. I suspect this was because the single-stage setup was not tested 100 times, while the two-stage setup was. Why not compare a pure BNN, a pure DNN, and the two-stage approach instead? In Table 6, the DNN solution is presented with average performance, but it seems this cannot be directly compared with the other setups.

Summary

While the authors have been very responsive and provided detailed answers, I seem to have uncovered more issues that require further thought. I may need more time to decide on the final score.

Comment

We sincerely appreciate your thoughtful review of our responses, especially during such a busy period.

We addressed all the points raised by the reviewer in the responses below.

H1-Reply to “Response to Additional Comments”:

(1) We acknowledge the use of uncertainty modelling in such tasks [1]. Typically, super-resolution datasets are created by downsampling high-quality images, without intentionally including multiple labels for the same input. This lack of label diversity may limit BNNs' effectiveness in super-resolution, though this is only a hypothesis, as we have not conducted experiments. For very large-scale datasets, a single low-quality image might correspond to multiple high-quality images, posing a one-to-many challenge. We appreciate the potential of applying BNNs beyond the enhancement domain, and we agree that the uncertainty arising from downscaling in [2] is intuitive.

(2) Whether it is shot noise or read noise, the single-stage BNN performs poorly when dealing with certain noise patterns.

[1] Ning, Qian, et al. "Uncertainty-driven loss for single image super-resolution." Advances in Neural Information Processing Systems 34 (2021): 16398-16409.

[2] PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models


H2-"Reply to W2":

(1) In traditional BNNs, the variance of the Gaussian distribution is learnable, whereas the authors set it to a fixed value. This is likely the key reason why using EMA resulted in such a significant improvement. In reality, “fixed variance” < “learnable variance” < “learnable variance + EMA”. However, the authors seem to have highlighted the benefits of EMA by only comparing the results of “fixed variance” and “learnable variance + EMA.” Even under this somewhat unfair comparison, the “Empirical Bayes Prior” still achieved decent results.

It seems there might be a misunderstanding regarding the variances. Could the reviewer have mistaken the posterior's variance for the prior's variance? In all our experiments, the posterior's variance in BNNs is always learnable and not fixed.

In the ablation study titled "Impact of Different Priors", all BNNs—whether using a fixed Gaussian prior or an empirical Bayes prior—employ learnable variance for the posterior, aligning with traditional BNN practices.

The conclusion that “fixed variance” < “learnable variance” < “learnable variance + EMA” is not accurate, as there is no fixed variance in our experiments. We believe the comparison is fair and consistent throughout.

The Empirical Bayes Prior did not achieve satisfactory results because it tends to take shortcuts during the optimisation of posterior parameters. For further clarification, please refer to Section 3.3 of the paper "Weight Uncertainty in Neural Networks." This highlights the inherent limitations of the Empirical Bayes Prior, which we believe contributes to its underperformance.


(2) We have standardised the terminology by unifying "simple" and "naive" as "fixed." We sometimes use the adjective "fixed" to emphasise the inherently fixed nature of the naive Gaussian prior, which differs from other dynamic priors.

(3) In Figure 7, initialising σ=0.05 produced the best results. This initialization should be kept consistent when compared with other methods to ensure fairness. Moreover, the authors not only set σ as non-learnable but also appear to have chosen a suboptimal value for σ.

We would like to clarify that we are not setting $\sigma = 0.05$, but rather $\sigma^o = 0.05$, where $\sigma^o$ is defined in Eq. (7) specifically for the Momentum Prior. If you are suggesting setting $\sigma = 0.05$ for the naive Gaussian Prior, this would lead to worse training outcomes, as it overly restricts all posterior weights to have absolute values less than 0.05. Setting $\sigma = 0.1$ for the naive Gaussian Prior yields better results.


H3-Reply to W3:

In the ablation study (Table 4), the results for both models were tested 100 times. In Figure 8, it can also be observed that the PSNR of the single-stage BNN averages around ~22.5 dB for a single prediction and only reaches 24.78 dB when $K=100$.

As we have mentioned, pure BNNs produce noisy outputs, leading to lower PSNR. This is why we replaced the single-stage BNN's last layer with a deterministic layer. We have already compared our approach with pure BNNs in Figure 15 and Table 8. We believe that comparing our two-stage model with the best single-stage BNN (with a deterministic last layer) is more appropriate than comparing it with the less effective pure BNN.

Table 6 simply compares the backbone performance under different configurations. These results do not require comparisons with other models and are solely intended to help select the optimal backbone.

Comment

Regarding the prior:


The Gaussian prior is expressed as

$$p(w) = \mathcal{N}(w; \mu, \sigma^2 I),$$

where $\mu$ is the mean and $\sigma$ is the standard deviation. In BNNs, both the mean and variance are learnable.

In line 185, the authors introduce the Momentum Prior:

$$p(w) = \mathcal{N}(w; \mu_t^{ema}, {\sigma_t^{ema}}^2 I).$$

However, the fixed Gaussian prior used for comparison is defined as

$$p(w) = \mathcal{N}(w; 0, 0.1 I).$$

Regarding the posterior, it can be written using the reparameterization trick as

$$w = \mu + \sigma \odot \epsilon,$$

where $\mu$ and $\sigma$ correspond to the $\mu$ and $\sigma$ in the prior distribution.


Summary

I don't think I misunderstood.

Regarding the prior, the consensus is that both the mean and variance are learnable. For instance, the paper referenced for the Momentum Prior explicitly states (page 2, upper right corner) that the mean and variance of the Gaussian prior are learnable:
"parameters $\mu$ and $\sigma$ are learnt while optimizing the cost function ELBO with the stochastic gradient steps."

Regarding the posterior, the authors also made the same clarification in lines 181-182.

Comment

Thank you for your patient response and valuable suggestions.

We are committed to refining our paper, regardless of whether it is accepted or not.

We understand that the reviewer remains concerned about the Two-stage BNN-DNN approach. Allow us to illustrate it from a different perspective: the two-stage BNN-DNN framework can be viewed as a coarse-to-fine approach. In this framework, the BNN processes coarse-grained information, while the DNN handles fine-grained information. Why use a DNN for fine-grained information? Because fine-grained information is typically more certain, whereas coarse information, such as lighting and colour, involves greater uncertainty.

Additionally, it is worth noting that the one-stage BNN already outperforms the one-stage DNN. The Two-stage BNN-DNN approach offers two key advantages:
(a) More accurate predictions by reducing noise through the division of tasks between stages.
(b) Improved inference speed. When performing 100 inferences, placing all these inferences in the first stage of the two-stage framework significantly reduces latency.

We appreciate the reviewer’s suggestion to test diverse backbones. In fact, we have tested several different backbones with BEM. Allow us to list the backbones we tested:

  • restormer_arch.py
  • simple_arch.py
  • UNet_arch.py
  • SHVM24_arch.py (an hourglass architecture)
  • SimpleD_arch.py
  • RetinexMamab_arch.py
  • MambaVision_arch.py
  • RetinexFormer_arch.py

Among these, SHVM24_arch.py represents an hourglass architecture, and simple_arch.py features a non-hierarchical plain structure. We apologise for the poorly organised namespace of these backbone names. We chose the Mamba-based UNet because it is a purer architecture without any specialised modules (e.g., RetinexFormer includes an illumination module, which we believe is overly specific to the LLIE task). More results using different backbones will be provided.

In terms of evaluation using PSNR, we were more concerned that readers would ask why the PSNR is not provided. Therefore, here it is!

We sincerely appreciate the reviewer's insights and hope this explanation provides further clarity on the benefits of our approach. Thank you for raising the score, which means a lot to us.

Comment

Dear Reviewer g1ij,

We kindly wish to clarify that the update rule for the Momentum Prior is not a learnable process; it is updated via EMA. If the parameters of a prior were learnable, it would not function effectively as a loss term, since it would itself require gradients. While it is common for BNNs to use a fixed prior, some works explore dynamic priors by defining a distribution function for their priors.

For instance, we observed that BayesianTorch adopts this default setting ($P(\mathbf{W}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$). Since we are using BayesianTorch in our implementation, we chose to compare our Momentum Prior with its default prior configuration for consistency and fairness in evaluation.

We provide the relevant code of BayesianTorch below:

class LinearReparameterization(BaseVariationalLayer_):
    def __init__(self,
                 in_features,
                 out_features,
                 prior_mean=0,
                 prior_variance=1,
                 posterior_mu_init=0,
                 posterior_rho_init=-3.0,
                 bias=True):
        """
        Implements Linear layer with reparameterization trick.

        Inherits from bayesian_torch.layers.BaseVariationalLayer_

        Parameters:
            in_features: int -> size of each input sample,
            out_features: int -> size of each output sample,
            prior_mean: float -> mean of the prior arbitrary distribution to be used on the complexity cost,
            prior_variance: float -> variance of the prior arbitrary distribution to be used on the complexity cost,
            posterior_mu_init: float -> init trainable mu parameter representing mean of the approximate posterior,
            posterior_rho_init: float -> init trainable rho parameter representing the sigma of the approximate posterior through softplus function,
            bias: bool -> if set to False, the layer will not learn an additive bias. Default: True,
        """ 

The reviewer defines $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}; \mathbf{\mu}, \mathbf{\sigma}^2 \mathbf{I})$, which is incorrect. $\mathcal{N}(\mathbf{w}; \mathbf{\mu}, \mathbf{\sigma}^2 \mathbf{I})$ represents the variational posterior $q(\mathbf{w}|\mathbf{\theta})$, where $\mathbf{\theta} = (\mathbf{\mu}, \mathbf{\sigma})$. The posterior's $\mathbf{\mu}$ and $\mathbf{\sigma}$ are always learnable. That is why we said the reviewer's conclusion that “fixed variance” < “learnable variance” < “learnable variance + EMA” is not accurate: we never fix the posterior's variance, and the prior's variance cannot be learnable.

Comment

After examining the code, I have re-read lines 183-190 of the paper. I apologize for the previous misunderstanding.

I suggest the authors include the corresponding symbolic expressions whenever mentioning the posterior or the prior, and use consistent notation. For example, either consistently use "the prior $P(w)$", as in line 185, or "the prior distribution $P(w)$", as in lines 170-171. In addition, it is best to avoid referring to it simply as "the prior," as in lines 192-193, or "the posterior," as in lines 179-180.

Comment

Dear Reviewer g1ij,

We sincerely apologise for the confusion caused by the vague expressions in our writing. To address the reviewer’s concerns, we made some changes to the manuscript to resolve inconsistencies in notation. However, the OpenReview system denied these updates due to the submission deadline.

Thank you once again for your efforts in helping us refine our paper!

Comment

I thank the authors for the detailed and patient responses. The authors have addressed some of the concerns I raised and clarified my misunderstandings.

I feel that this paper may be marginally above the acceptance threshold, and I have raised my score. After the discussions, the paper still has the following issues.

First, the motivation behind the two methods is not clear and explicit enough. While the authors do provide explanations, these explanations are too weak to be convincing. The issues with BNNs mentioned by the authors are not intuitively and clearly conveyed in the main body of the paper.

Second, the solutions are not elegant enough. In particular, the “BNN+DNN” approach feels like too much of a compromise. I am inclined to believe that the paper does not identify a genuinely existing problem or the true reason why the solution works. At the same time, the solution does not offer much insight.

The first and second issues together give the impression that a problem is created out of thin air, and then solved in the same manner.

Lastly, the most important issue is the evaluation. This paper aims to address the one-to-many problem, and the authors point out that, even if the images are aesthetically pleasing, PSNR and SSIM might still remain low. However, the authors chose to present a high PSNR, which somewhat contradicts this starting point. I would recommend using quantitative metrics related to diversity. Furthermore, as the authors propose a "BNN+DNN" framework, I would recommend changing the backbone network in testing to increase generality, for example, "Retinexformer + BEM" or "RetinexMamba + BEM".

Apart from the issues mentioned, I recommend that the authors carefully review the paper to unify the various expressions and correct the remaining typos before acceptance. During the rebuttal phase, I already pointed out three instances of inconsistent wording, and there are still other inconsistencies and typos. For example, Eq. (3) is missing a comma “,”, and in line 135, “By” should be changed to “by”.

In conclusion, I feel that this paper may be marginally above the acceptance threshold, but I'm not certain, and I could also be convinced by other reviewers.

Once again, thanks for the detailed responses.

Review
Rating: 3
  • This paper raises the issue that image enhancement processes, such as LLIE and UIE, involve a one-to-many problem, and introduces BNNs to address it.
  • To ensure stable training of the BNN, the paper introduces the momentum prior.

Strengths

  • This paper introduces a two-stage design that leverages the strengths of both BNNs and DNNs.
  • The mathematical formulas are clear and easy to understand.

Weaknesses

  • The Bayesian prediction process is impractical. For instance, the K=100 setting used in the paper requires 100 inferences, leading to high computational costs. While the paper draws comparisons to diffusion models, diffusion models provide methods to streamline the inference process, whereas the approach proposed in the paper does not seem to offer such possibilities.
  • Additionally, Algorithm 1 presents both cases with and without ground truth. In the paper, Table 1 seems to show results for a full-reference case, where ground truth is available during inference, and the predicted image can be evaluated based on its similarity to the ground truth. However, this approach is highly impractical and does not appear to offer a fair comparison with other models.
  • While the paper suggests using CLIP text feature cosine similarity as a non-reference approach, a detailed explanation is missing from both the main text and the appendix. In my opinion, the results based on the non-reference case using CLIP text features would be more reasonable, and these results should be presented in the paper as a table.

Questions

  • While the content is clear and easy to understand, I have some questions about the inference setup. Below are a few concerns regarding the weaknesses:
    • In the case of K=100, as used in the paper, what are the statistics for the predicted enhanced images? For example, I am curious about the minimum, maximum, median, and mean values of PSNR/SSIM.
    • Could you add the results using CLIP features to Table 1 and Table 2? I am also interested in the statistics for those results.
    • When applying Bayesian models, are there any commonly expected generalization advantages? For instance, how does the model trained on LOL-v1 perform on LSRW[1*]? It would be helpful if results using both full-reference metrics and CLIP features were provided.
  • If the non-reference method using CLIP features can still demonstrate outstanding performance, I am willing to raise the rating. However, based on my experience, the method using CLIP features presented in the paper has not shown consistent performance. Could you provide additional explanations regarding this issue?

[1*] Hai, Jiang, et al. "R2rnet: Low-light image enhancement via real-low to real-normal network." Journal of Visual Communication and Image Representation 90 (2023): 103712.

Comment

Thank you for your valuable review. Please see our responses below, which clarify the main concerns.


Q1: The Bayesian prediction process is impractical. For instance, the K=100 setting used in the paper requires 100 inferences, leading to high computational costs. While the paper draws comparisons to diffusion models, diffusion models provide methods to streamline the inference process, whereas the approach proposed in the paper does not seem to offer such possibilities.

A1: In Section 4.2, we provide an accelerated inference strategy and would like to supplement additional information to clarify its use:

  • When the inference metric $D$ is CLIP, the input resolution must be greater than 512×512 for the accelerated strategy to be applicable. After extensive testing, we found that when the resolution exceeds 1024×1024, the quality of the enhanced images generated using the accelerated method is nearly identical to those produced without acceleration, achieving what can be considered virtually lossless acceleration. As shown in Fig. 3, the speed of accelerated inference remains nearly at the same level as the unaccelerated method (see the sketch after this list).
  • We can provide a simple calculation to illustrate the efficiency: suppose the input resolution is 1920×1080, which is downsampled to 120×68. Performing 100 inferences in parallel on the downsampled images is approximately equivalent to processing a single image with a resolution of 1200×680. In practice, the process is even faster since the CLIP model itself adopts various acceleration strategies. In our experiments, we did not observe noticeable delays during inference.
  • Diffusion-based methods provide samplers designed for acceleration. However, the overall diffusion process cannot be parallelised, whereas our K-times inference is fully parallelisable, resulting in significantly faster performance.
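As a rough sketch of this accelerated strategy (our own illustrative names: `bnn` and `scorer`; the target size and batching details are simplified, not our exact implementation):

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def accelerated_stage1(bnn, x, k=100, down=(68, 120), scorer=None):
        """Run the k stochastic stage-I passes on a heavily downsampled
        input, then pick the best candidate with a batched no-reference
        scorer such as CLIP-IQA. Each pass is cheap at this resolution;
        scoring the k candidates is done in a single parallel call."""
        x_low = F.interpolate(x, size=down, mode="bilinear", align_corners=False)
        candidates = torch.cat([bnn(x_low) for _ in range(k)], dim=0)
        scores = scorer(candidates)          # shape (k,), higher is better
        return candidates[scores.argmax()].unsqueeze(0)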

Regarding the sampling number $K$, we provide an analysis in Appendix G.2. The required sampling count depends largely on the quality of GT images in the training dataset. For datasets where GT images have fewer quality issues, fewer inferences are sufficient to achieve satisfactory results. However, for low-light and underwater enhancement datasets, where GT images frequently have quality problems, a higher number of samples is needed to produce high-quality results.


Q2: Additionally, Algorithm 1 presents both cases with and without ground truth. In the paper, Table 1 seems to show results for a full-reference case, where ground truth is available during inference, and the predicted image can be evaluated based on its similarity to the ground truth. However, this approach is highly impractical and does not appear to offer a fair comparison with other models.

A2: Please refer to our response to the "Shared Comment".


Q3: While the paper suggests using CLIP text feature cosine similarity as a non-reference approach, a detailed explanation is missing from both the main text and the appendix. In my opinion, the results based on the non-reference case using CLIP text features would be more reasonable, and these results should be presented in the paper as a table.

A3: The reviewer’s comment has been highly enlightening for us. Initially, we were more inclined to use NIQE and UIQM as no-reference IQA metrics. However, when designing the accelerated inference algorithm, we found it challenging to perform parallel computations with metrics like NIQE. This limitation led us to consider CLIP-IQA due to its excellent parallelisation capabilities.

That said, we did not fully appreciate the value of CLIP-IQA until receiving the reviewer’s insightful comment. In response, we have added Appendix H to our revised manuscript, where we further explain why CLIP-IQA, as a non-reference metric, holds more practical value compared to full-reference metrics such as PSNR.

We sincerely encourage the reviewer to view the demo we prepared in the Supplementary Materials, which showcases how BEM, combined with CLIP-IQA, can select the highest-quality enhanced image in practical applications. For comparison, we also provide predictions of RetinexFormer, trained on different datasets, for the same input image.

Comment

Q4: While the content is clear and easy to understand, I have some questions about the inference setup. Below are a few concerns regarding the weaknesses:

Q4.1: In the case of K=100, as used in the paper, what are the statistics for the predicted enhanced images? For example, I am curious about the minimum, maximum, median, and mean values of PSNR/SSIM.

A4.1: We sincerely appreciate the reviewer’s valuable feedback regarding the collection of statistical data, which has provided us with an opportunity to further refine our paper. In response, we have provided a statistical analysis in Appendix G of the revised manuscript. This includes dataset-level statistical data, prediction distribution plots for randomly selected images, and the corresponding uncertainty maps.

Q4.2: Could you add the results using CLIP features to Table 1 and Table 2? I am also interested in the statistics for those results.

A4.2: Following your suggestion, we have added results based on CLIP (BEM_clip) to Table 1 and Table 2 in our revised manuscript.

We provide a portion of the statistical data in the table below. Please refer to Appendix G for the complete statistical analysis.

Metric                       Maximum   Mean     Median   Minimum   Std. dev.
PSNR                         26.89     22.87    22.97    17.90     1.911
SSIM                         0.876     0.855    0.856    0.819     0.013
CLIP-IQA (Brightness) ×100   93.62     89.63    89.71    84.20     1.689
CLIP-IQA (Quality) ×100      64.34     59.13    59.08    54.22     1.825
CLIP-IQA (Noisiness) ×100    36.17     30.06    30.02    25.08     1.942
Negative NIQE                -4.647    -4.808   -4.806   -4.971    0.059

We experimented with CLIP-IQA using the prompts ("natural," "brightness," "warm") and reduced the weight of the brightness prompt (as we observed that low-quality GT images generally have lower brightness), achieving competitive performance.

Since CLIP with the prompt ("Quality") selects the best-quality prediction, but some GT images in the test sets for LLIE and UIE are low-quality, it is not feasible to use CLIP (Quality) to match GT images at the pixel level.

Method     LOL-v1 (PSNR↑ / SSIM↑ / LPIPS↓)   LOL-v2-real (PSNR↑ / SSIM↑ / LPIPS↓)   LOL-v2-syn (PSNR↑ / SSIM↑ / LPIPS↓)
BEM_clip   28.43 / 0.882 / 0.071             30.01 / 0.910 / 0.076                  31.51 / 0.961 / 0.030

Method     UIEB-R90 (PSNR↑ / SSIM↑)
BEM_clip   24.36 / 0.921

Comparisons with other methods can be found in our response to the "Shared Comments".

Q4.3:When applying Bayesian models, are there any commonly expected generalization advantages? For instance, how does the model trained on LOL-v1 perform on LSRW[1]? It would be helpful if results using both full-reference metrics and CLIP features were provided

A4.3: We observed that the enhanced results produced by our Bayesian NN method consistently outperform others across various unpaired test sets and real low-light images captured by different types of cameras (please refer to the demo in the Supplementary Materials). The table below presents the evaluation results on the LSRW dataset, where our method, trained on LOL, achieves performance comparable to RetinexFormer [1] trained on LSRW, demonstrating the generalisation capability of our method.

Training Set   Model                  PSNR    SSIM
LOL-v1         BEM (full-reference)   19.51   0.550
LOL-v1         BEM (CLIP)             18.32   0.539
LOL-v1         BEM_determ             17.53   0.532
LOL-v1         RetinexFormer          17.78   0.518
LSRW           RetinexFormer          19.57   0.578
LOL-v2-real    BEM (full-reference)   20.82   0.566
LOL-v2-real    BEM_determ             17.63   0.541
LOL-v2-real    RetinexFormer          17.19   0.509

In Appendix H, we present visualisations of BEM on LSRW. The results were shown to non-expert participants, who demonstrated a clear preference for the results selected by the CLIP-based method compared to those based on full-reference metrics.

[1] Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement. ICCV, 2023.

Comment

Q5: If the non-reference method using CLIP features can still demonstrate outstanding performance, I am willing to raise the rating. However, based on my experience, the method using CLIP features presented in the paper has not shown consistent performance. Could you provide additional explanations regarding this issue?

A5: Please refer to our response in A4.2 and our response to the "Shared Comments," where we demonstrate the superiority of our method when using CLIP as suggested by the reviewer. We truly appreciate this insightful suggestion from the reviewer.

We also observed that using CLIP with the prompt ("Quality") as a selector leads to relatively lower PSNR. (Note that the BEM_clip results we report avoid the "Quality" prompt.) This discrepancy arises because CLIP features reflect perceptual quality at the feature level rather than the pixel level. PSNR, on the other hand, is a pixel-level metric that evaluates the average similarity between the GT and the prediction on a per-pixel basis. Therefore, when the GT is of high quality, CLIP-IQA and PSNR exhibit a positive correlation; when the GT is of low quality, they demonstrate an inverse relationship.

Upon analysing the test sets, we found that in the LOL-v1 test set, 40% of GT images have a CLIP score > 0.8, whereas in the LOL-v2-real test set, only 20% of GT images exceed this threshold. This explains why omitting "Quality" from the CLIP-IQA prompts yields higher PSNR. Instead, we employed prompts such as "natural", "brightness", and "warm", which better capture the perceptual aspects relevant to the GT images in these test sets. Nevertheless, by using CLIP-IQA as a selector, we still achieved SOTA performance.

Comment

Thank you for your response. The authors have addressed all the concerns I raised.

However, after reviewing the responses, my decision is to maintain my current score.

First, the difference between the min and max values is too large, which still leaves me questioning its practicality. While the authors may argue that using the proposed deterministic model would address this concern, I feel that this response contradicts the contribution claimed in the paper.

Additionally, while the experiments using CLIP are intriguing, they seem to involve too many additional elements, contrary to the claims made in the original paper. Evaluating this would require looking at the paper from a completely new perspective, and the current rebuttal version does not sufficiently address this issue.

If this paper aims to address the problem from a one-to-many perspective, I believe it should be written more like a generative-model paper. As the authors pointed out, even if the results are aesthetically pleasing, PSNR and SSIM might still remain low. To emphasize the contribution of this paper more effectively, it should primarily focus on experiments and comparisons based on non-reference metrics (from an aesthetic perspective).

Once again, thank you for your detailed responses.

Comment

Thank you for your detailed feedback.

Firstly, if the minimum and maximum values of the predictions were expected to be equal, there would be no uncertainty, which would contradict the purpose of our method. Our approach is specifically designed to model uncertainty in the data, and the large gap between the minimum and maximum values is evidence of its effectiveness in uncertainty modelling. In Appendix G, we provide comprehensive analyses showing that the lower bound of the predictions is determined by the quality of the training data, rather than being an issue with our model.

Throughout the paper, we emphasise that deterministic models produce sub-optimal predictions. The deterministic version of BEM is included solely to enable direct comparisons with prior DNN-based methods. However, our BNN can use various IQA methods to identify the best solution in real-world applications, rather than using the deterministic model. Furthermore, we have consistently highlighted the ability of our method to achieve diverse predictions, which is one of its key strengths.

Secondly, in response to the reviewer’s suggestion, we conducted additional experiments to explore the potential of CLIP. This included extensive analyses and supplementary experiments. Using CLIP is not complex; we analysed its IQA capabilities and simply adjusted the prompts (e.g., removing the "Quality" prompt) to better match low-quality GT images. Doesn't this further validate that CLIP can serve as an effective IQA method to assist BNN in selecting higher-quality predictions?

As the first work to explore BNNs in the enhancement domain, we followed the conventional metrics, such as PSNR and SSIM, which have been widely used in this field. However, in the main text, only Table 2 exclusively focuses on PSNR and SSIM. Moreover, as pioneers of BNN in this direction, we took a bold step to question the validity of metrics like PSNR, which demonstrates our forward-looking perspective for the field. We hope the reviewer can be more accommodating of our attempt to introduce BNNs in enhancement tasks, recognising that many limitations can be addressed in follow-up work.

As the reviewer noted that non-reference metrics and evaluations from an aesthetic perspective are more valuable, we have already dedicated significant portions of the paper to experiments using these metrics and evaluations. For example, Tables 2, 3 and Figures 5, 6, 13, 14, 18, 22, 23 analyse the model’s performance from an aesthetic perspective. The experiments for non-reference metrics and aesthetic evaluations constitute the majority of the paper.
We are willing to remove the full-reference evaluation in Section 5.1 if that would help the reviewer reconsider the contributions of our work.

We sincerely hope the reviewer can re-evaluate our paper’s contributions as the first to introduce BNNs into the enhancement domain and allow some tolerance for innovation. If there are concerns about the practicality of our method, we have provided an anonymous open-source repository (https://github.com/Anonymous1563/Bayesian-Enhancement-Model), including pre-trained models and Demo in real-world applications. We are happy to provide further explanations or address any experimental doubts.

Thank you again for reviewing our work.

Review
5

This paper presents a novel two-stage framework to address the one-to-many mapping problem in image restoration. In the first stage, the authors employ a Bayesian Neural Network (BNN) to capture inherent uncertainty and accommodate diverse outputs in low-dimensional image representations. In the second stage, a Deterministic Neural Network (DNN) is used to refine the output from the first stage. Additionally, the authors introduce a momentum prior to accelerate the convergence of the BNN. The experimental results demonstrate that the proposed method achieves superior performance in low-light enhancement and underwater image enhancement tasks.

Strengths

The method presented in the article is interesting, particularly in its use of Bayesian Neural Networks (BNNs) to address the one-to-many mapping problem. The experimental results indicate that BNNs can effectively generate multiple clear images. Moreover, the authors achieve state-of-the-art (SOTA) performance, demonstrating the effectiveness of their proposed approach. Additionally, the qualitative visual results show a significant improvement in visual quality with the proposed method.

Weaknesses

In the low-light enhancement task, the method presented in [1], published in ECCV 2024, outperforms the proposed approach. For instance, the PSNR and SSIM values for [1] on the LoL-v1 dataset are 27.35 and 0.883, respectively, while the proposed method achieves 26.83 and 0.877. The performance of [1] is calculated on results without the GT mean.

Furthermore, in the context of underwater image enhancement, the authors only utilized an LR-GT paired dataset to demonstrate the superiority of their method, which is inadequate. Recent papers on underwater image enhancement, such as [2] and [3], provide LPIPS and FID metrics for comparison. Therefore, the authors should include a more comprehensive comparison to strengthen their claims.

The framework of the method is unclear. The authors should provide a detailed structure of the DNN and BNN in the main paper or in the supplementary material, such as the model shape, the number of levels, the main modules, and how many main modules each level contains.

[1] GLARE: Low Light Image Enhancement via Generative Latent Feature based Codebook Retrieval. ECCV, 2024.

[2] Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration. CVPR, 2024.

[3] Learning A Physical-aware Diffusion Model Based on Transformer for Underwater Image Enhancement. ACM MM, 2024.

Questions

The authors propose a reduction function to compress high-dimensional image data. However, for image restoration tasks, compression is a risky operation since it may lose image information. How does this paper compensate for this lost information?

In Section 3.3, the authors mention using reference or no-reference indicators to select the top k candidates. However, it is unclear how the final result is chosen from these candidates when calculating the metrics.

There are some doubts about the experimental results. Firstly, for paired datasets, why did the authors use GT images to find better results? Using GT to find a better result may be inappropriate, since GT should only be used to evaluate the effectiveness of the method; if results are selected using GT, the performance will naturally look better. Meanwhile, for the underwater image enhancement task, why did the authors rely on UIQM and UCIQE to get better results? Since these metrics are used to evaluate the performance of the proposed method, using them to select results may inevitably lead to restoration results that perform well on UIQM and UCIQE.

As mentioned in the paper, the authors train multiple sets of network weights or even multiple networks, where each set is capable of predicting one of the potential targets. This leads to a linear increase in the number of parameters, computational complexity, and running time. Therefore, the authors should add comparative experiments on the number of parameters, computational complexity, and running time.

Comment

Q6: For paired datasets, why did the authors use GT images to find better results? For the underwater image enhancement task, why did the authors rely on UIQM and UCIQE to get better results? Since these metrics are used to evaluate the performance of the proposed method, using them to select results may inevitably lead to restoration results that perform well on UIQM and UCIQE.

A6: Regarding the use of GT to find better results, please refer to our response to the “Shared Comments.”

In Table 2 of our revised manuscript, we present the results obtained by BEM_clip, which selects predictions without relying on UIQM or UCIQE. Notably, BEM_clip achieves competitive performance in terms of UIQM and UCIQE.

Since UIQM and UCIQE are effective metrics for evaluating the quality of underwater images, we used these metrics to identify the best predictions from multiple outputs of BEM. We find that the enhanced images selected using UIQM and UCIQE (averaging both metrics) consistently achieve good results across other metrics. For instance, on the UIEB dataset, the enhanced outputs selected by UIQM and UCIQE achieve an LPIPS↓ score of 0.1102, outperforming WFI2-Net (0.1248) [2] and PA-Diff (0.1328).

Additionally, in Fig. 5 and Fig. 14, the enhanced images from unpaired data selected using UIQM and UCIQE exhibit perceptual quality superior to other methods. Therefore, selecting enhanced images using UIQM and UCIQE contributes not only to improvements in specific metrics but also to better perceptual quality and overall performance.


Q7: As mentioned in the paper, the authors train multiple sets of network weights or even multiple networks, where each set is capable of predicting one of the potential targets. This leads to a linear increase in the number of parameters, computational complexity, and running time. Therefore, the authors should add comparative experiments on the number of parameters, computational complexity, and running time.

A7: We sincerely apologise that the explanation of our method has led to some misunderstanding. To clarify, we did not train multiple sets of networks. Instead, our approach employs a Bayesian NN in which each weight incorporates a noise term. For each inference, this noise term is sampled from a normal distribution $\mathcal{N}(0, 1)$, which is akin to generating a different set of network weights. This effectively simulates having multiple networks, although, in reality, our model consists of only a single set of weights plus a noise term (e.g., `weight = self.mu_weight + self.sigma_weight * self.eps_weight.data.normal_()`).
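To make this concrete, a minimal sketch of such a layer is shown below (names and initial values are illustrative, not our exact implementation; the `deterministic` flag corresponds to the BEM_determ mode discussed elsewhere in this thread):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Sketch of a Bayesian layer: one (mu, sigma) pair per weight tensor.

    Each stochastic forward pass draws fresh eps ~ N(0, 1), so a single set of
    parameters behaves like an ensemble of networks.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.mu_weight = nn.Parameter(0.1 * torch.randn(out_features, in_features))
        self.sigma_weight = nn.Parameter(torch.full((out_features, in_features), 0.01))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor, deterministic: bool = False) -> torch.Tensor:
        if deterministic:
            weight = self.mu_weight  # use the mean of the weight distribution
        else:
            eps = torch.randn_like(self.mu_weight)             # resampled on every call
            weight = self.mu_weight + self.sigma_weight * eps  # reparameterisation trick
        return F.linear(x, weight, self.bias)
```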

The FLOPs of our model are detailed in Table 4, and further information on the backbone's inference speed, FLOPs, and parameter count can be found in Table 6 in Appendix B.

We also agree with the reviewer’s valuable suggestion that computational complexity comparisons across models would be beneficial. Unfortunately, many related works, including those referenced by the reviewer ([1][2][3]), do not provide their computational complexity data, and some do not release their code, making direct comparisons difficult. Nevertheless, we have included detailed computational complexity metrics for our method in the revised manuscript, hoping that this will facilitate future comparisons and contribute to the field.

We greatly appreciate your constructive feedback and remain open to further suggestions for improving the clarity and thoroughness of our work.

[1] GLARE: Low Light Image Enhancement via Generative Latent Feature based Codebook Retrieval ECCV2024

[2] Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration CVPR2023

[3] Learning A Physical-aware Diffusion Model Based on Transformer for Underwater Image Enhancement MM2024

Comment

Thank you for your valuable review. Please see our responses below to clarify the main concerns.


Q1: In the low-light enhancement task, the method presented in [1], published in ECCV 2024, outperforms the proposed approach. For instance, the PSNR and SSIM values for [1] on the LoL-v1 dataset are 27.35 and 0.883, respectively, while the proposed method achieves 26.83 and 0.877. The performance of [1] is calculated on results without the GT mean.

A1: We kindly wish to point out that the GLARE method used the GT mean for their LOL-v1 results, although this was not explicitly mentioned in their paper. Please refer to the discussion on their GitHub repository: https://github.com/LowLevelAI/GLARE/issues/9 and the leaderboard https://paperswithcode.com/sota/low-light-image-enhancement-on-lol


Q2: Furthermore, in the context of underwater image enhancement, the authors only utilized a LR-GT paired dataset to demonstrate the superiority of their method, which is inadequate. Recent papers on underwater image enhancement, such as [2] and [3], provide LPIPS and FID metrics for comparison. Therefore, the authors should include a more comprehensive comparison to strengthen their claims.

A2: We appreciate the reviewer’s valuable feedback. In response, we have included the LPIPS and FID results of our method on the validation set of UIEB in the table below.

| Method | LPIPS ↓ | FID ↓ |
|---|---|---|
| BEM | 0.1019 | 26.11 |
| WFI2-net [1] | 0.1248 | 27.85 |
| PA-Diff [2] | 0.1328 | 28.74 |

A full table has been added to Appendix I in our revised manuscript.

[1] Zhao, Chen, et al. "Wavelet-based fourier information interaction with frequency diffusion adjustment for underwater image restoration." CVPR. 2024.

[2] Zhao, Chen, Chenyu Dong, and Weiling Cai. "Learning a physical-aware diffusion model based on transformer for underwater image enhancement." arXiv preprint arXiv:2403.01497 (2024).


Q3: The framework of the method is unclear. The authors should provide a detailed structure of DNN and BNN in the main paper or in the supplementary material, such as model shape, the level of model, the main modules, how many main modules each level contains, etc.

A3: We appreciate the reviewer’s valuable feedback and have refined our paper accordingly. In our revised manuscript, we have included a diagram of our backbone architecture in Figure 11. For the BNN component, we introduce a noise term to each weight of the backbone, as described in Eq. (6). Detailed configurations of the backbone are provided in Table 6 in Appendix B, which also includes the results of ablation studies conducted on the backbone.


Q4: The authors propose a reduction function to compress high-dimensional image data. However, for image restoration tasks, compression is a risky operation since it may lose image information. How does this paper compensate for this lost information?

A4: The information loss introduced by the reduction function is compensated in the second-stage model, which additionally incorporates the original high-resolution, low-quality input (see Eq. 10). Our two-stage framework follows a coarse-to-fine approach. In the first stage, the BNN captures coarse-grained features, which is akin to illumination estimation for LLIE; notably, under low-resolution conditions, it can predict this coarse-grained information more effectively. In the second stage, the DNN's input contains both the high-resolution, low-quality input and the coarse prediction from the first stage (see Eq. 10). This design enables the model to restore fine details while simultaneously adjusting the image's illumination based on the coarse information from the first stage. Similarly, for the UIE task, the reduced-dimensional prediction from the BNN is analogous to estimating the medium transmittance in Koschmieder's optical model. Further details on the design principles and comparisons with one-stage methods are provided in Appendix F.
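As a minimal illustration of this coarse-to-fine design (the reduced resolution, the bilinear resampling, and the channel-concatenation conditioning are assumptions made for the sketch, not our exact architecture):

```python
import torch
import torch.nn.functional as F

def two_stage_enhance(x_hr: torch.Tensor, bnn, dnn, low_res=(64, 64)) -> torch.Tensor:
    """Coarse-to-fine sketch: stage-1 BNN on a reduced input, stage-2 DNN refinement."""
    x_lr = F.interpolate(x_hr, size=low_res, mode="bilinear", align_corners=False)
    coarse = bnn(x_lr)  # stochastic coarse prediction (e.g., illumination-like map)
    coarse_up = F.interpolate(coarse, size=x_hr.shape[-2:], mode="bilinear",
                              align_corners=False)
    # Stage 2 sees both the coarse estimate and the full-resolution low-quality
    # input, so fine details lost by the reduction can be recovered (cf. Eq. 10).
    return dnn(torch.cat([x_hr, coarse_up], dim=1))
```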


Q5: In Section 3.3, the authors mention using reference or no-reference indicators to select the top k candidates. However, it is unclear how the final result is chosen from these candidates when calculating the metrics.

A5: We kindly inform the reviewer that in Algorithm 1 (Section 3.3), we generate K candidates rather than selecting the top-K candidates. Subsequently, we use a reference or no-reference image quality assessment (IQA) metric to select the best one.
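In code form, the inference procedure of Algorithm 1 is essentially the following sketch, where `iqa_score` stands for any full-reference or no-reference quality function (higher assumed better):

```python
def bem_predict(x, model, iqa_score, K: int = 100):
    """Draw K stochastic candidates and keep the best-scoring one (Algorithm 1 sketch)."""
    candidates = [model(x) for _ in range(K)]  # each call resamples the BNN's weight noise
    return max(candidates, key=iqa_score)
```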

Review
5

This paper presents a Bayesian enhancement model designed to address uncertainty and provide a range of solutions for image enhancement tasks. The method begins by utilizing a Bayesian neural network to model image representations in a reduced-dimensional space, followed by a deterministic network for further refinement. Additionally, the authors introduce a dynamic Momentum Prior to mitigate convergence challenges. Experiments are conducted on tasks involving low-light and underwater image enhancement.

Strengths

  1. Image restoration is an ill-posed problem, and modeling the inherent uncertainty while accommodating diverse outputs is an intriguing and valuable challenge.
  2. It seems reasonable to use BNN to tackle this issue.
  3. The paper provides a clear and detailed description of the proposed method, and the experiments conducted are thorough and well-structured.

Weaknesses

  1. Why choose low-light image enhancement and underwater image enhancement? Denoising and super-resolution seem to be two more typical tasks. How does the proposed method perform on these two tasks? Besides, how does the proposed method perform in image dehazing, another typical image enhancement task?
  2. What is the reason for the missing data in Table I, II and III, and how were the results of the comparison methods obtained?
  3. There is a lack of quantitative analysis of the predicted uncertainty.

Questions

  1. My main question is why were LLIE and UIE selected as the two tasks? As far as I know, currently there are no comprehensive datasets available for these tasks.
  2. The reference images for LLIE and UIE may be inaccurate; how does the proposed method tackle this issue?
Comment

Thank you for your valuable review. Please see our responses below to clarify the main concerns.


Q1: Why choose low-light image enhancement and underwater image enhancement? Denoising and super-resolution seem to be two more typical tasks. How does the proposed method perform on these two tasks? Besides, how does the proposed method perform in image dehazing, another typical image enhancement task?

A1: Denoising and super-resolution are restoration tasks with well-defined one-to-one input-output mappings, making deterministic NNs sufficient for achieving good results. In contrast, enhancement tasks like low-light and underwater image enhancement often involve one-to-many mappings due to the variability or subjectivity in ground truth. For example, in low-light datasets, an input can correspond to multiple ground-truth images (see Fig. 1 and Fig. 19), where BNNs excel by modelling uncertainty.

Furthermore, the ground truth in enhancement tasks can be less ideal. In underwater images, high-quality ground truth is difficult to obtain due to water's colour absorption and scattering. Similarly, in low-light enhancement, "ideal" ground truth is subjective, depending on the context (e.g., filmmaking may prioritise low noise, while documentaries favour sharp details). Our BEM model, by fixing random seeds and employing CLIP feature matching, is capable of generating predictions tailored to different preferences.
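As a small illustration, a preferred output can be regenerated by fixing the seed before sampling (a sketch; the function name is ours):

```python
import torch

def reproducible_prediction(x, model, seed: int):
    """Fixing the RNG seed pins down one member of the predictive distribution,
    so a preferred enhancement style can be reproduced on demand."""
    torch.manual_seed(seed)  # the same seed yields the same weight-noise draws
    return model(x)
```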

Based on your suggestions, we conducted additional image dehazing experiments but observed no clear advantage of BEM. Analysis revealed that in datasets like ITS [1], hazy inputs are synthetically generated with a one-to-one correspondence to ground truth, avoiding the one-to-many mapping problem. Consequently, BEM's uncertainty modelling offers limited benefits in this context.

Overall, our method is most effective where ground-truth images are low-quality or subjective, resulting in one-to-many mappings. As our paper title suggests, it is designed specifically for such cases in image enhancement.


Q2: What is the reason for the missing data in Table I, II and III, and how were the results of the comparison methods obtained?

A2: The missing data is due to the original authors not providing these results, and we were unable to reproduce them. For instance, as noted in Table 3, the GLARE model could not process 2K-resolution images in the VV dataset, as we encountered CUDA out-of-memory (OOM) issues on an A100 GPU while running their code. All results for the compared methods are either directly taken from their original papers or reproduced by us only when the original papers did not report the results.


Q3: There is a lack of quantitative analysis of the predicted uncertainty.

A3: We appreciate your valuable feedback. In response, we have included an uncertainty analysis in Appendix G of our revised manuscript. We sincerely hope you will review Appendix G, which was prepared based on your suggestion.

In Appendix G, we provide an uncertainty map of an example image to illustrate the magnitude of predictive uncertainty. Additionally, statistical data are summarised for analysis, and violin plots are included to visualise the prediction distribution.

In summary, we found that the predicted uncertainty correlates with the quality of the ground truth in the training data. For datasets with lower-quality ground truth, BEM requires more sampling iterations to achieve satisfactory predictions. For individual images, uncertainty exhibits a structured pattern: shadowed regions tend to have lower uncertainty, while illuminated areas show higher uncertainty.
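For reference, the uncertainty maps and statistics in Appendix G can be reproduced along these lines (a sketch; `model` is assumed to resample its weight noise on every forward call):

```python
import torch

@torch.no_grad()
def mean_and_uncertainty(x: torch.Tensor, model, K: int = 100):
    """Mean prediction and pixel-wise predictive uncertainty over K stochastic passes."""
    preds = torch.stack([model(x) for _ in range(K)])  # (K, B, C, H, W)
    return preds.mean(dim=0), preds.std(dim=0)         # the std map is the uncertainty map
```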


Q4: The reference images for LLIE and UIE may be inaccurate; how does the proposed method tackle this issue?

A4: We appreciate the opportunity to clarify this point. It is precisely because the reference images are not always accurate that BNNs are valuable for modelling uncertainty. For these types of tasks, we believe it is less meaningful to focus on making predictions closely match the reference images. Instead, we propose using no-reference metrics to identify outputs that go beyond the limitations of the reference. For example, the CLIP metric plays a crucial role in selecting the best-quality image from BEM's output candidates.

We invite you to watch the demo of the no-reference inference process provided in the Supplementary Materials. This demo demonstrates how CLIP is utilised to select predictions without relying on imperfect reference images. Also, in Figures 22 and 23 of Appendix H, we present examples where CLIP successfully identifies the best-quality image, often surpassing the quality of the reference image itself.

Thank you again for your valuable feedback. If you have any additional questions or suggestions, we would be more than happy to address them.

[1] B. Li et al., "Benchmarking Single-Image Dehazing and Beyond," IEEE TIP, 2019

Comment

Thanks for your detailed responses. I like the idea of approaching the image enhancement task from the perspective of uncertainty. However, the experiments presented in the paper are not convincing enough. On one hand, there is a need for more reliable no-reference metrics to validate the method’s superiority (NIQE, UIQM, and UCIQE are not robust enough). On the other hand, additional visualizations, especially those that highlight uncertainty, are essential to reinforce the findings. Therefore, I maintain my original score.

Comment

Thanks for your responses.

Our method achieves state-of-the-art results across 11 datasets. Without additional details on which experiments are considered unconvincing and the reasons behind this assessment, we are unable to provide further clarification.

NIQE, UIQM and UCIQE are the most commonly used no-reference metrics for low-light image enhancement and underwater image enhancement.

We further provide our results using BRISQUE (lower is better) on the five test sets below:

| Method | DICM | LIME | MEF | NPE | VV |
|---|---|---|---|---|---|
| LLFlow [1] | 26.36 | 27.06 | 30.27 | 28.86 | 31.67 |
| CIDNet [2] | 21.47 | 16.25 | 13.77 | 18.92 | 30.63 |
| BEM | 11.55 | 15.43 | 9.58 | 10.95 | 15.02 |

As you can observe, our method is also effective in terms of BRISQUE. The results of other methods are from [2].

In Fig. 4, Fig. 18, Fig. 22, and Fig. 23, we have highlighted the importance of uncertainty through visual analysis. These visualisations of predictive uncertainty are sufficient to reinforce our findings.

[1] Wang, Yufei, et al. "Low-light image enhancement with normalizing flow." Proceedings of the AAAI conference on artificial intelligence. Vol. 36. No. 3. 2022.

[2] Yan, Qingsen, et al. "You only need one color space: An efficient network for low-light image enhancement." arXiv preprint arXiv:2402.05809 (2024).

Comment

Q1: A shared comment was raised by Reviewer 5Lg5, Reviewer rtBX, and Reviewer g1ij regarding the reasonableness of using ground-truth (GT) images in the full-reference inference process described in Algorithm 1, and whether this might lead to unfair comparisons with other methods.

A1: The motivation behind our full-reference evaluation was to verify whether the BNN can generate an image that is sufficiently close to the GT image. This serves as evidence of the diversity in BEM’s outputs. However, as highlighted in lines 337–340 of the paper, we also acknowledge that full-reference inference has limited practical value in real-world applications.

We acknowledge that in real-world scenarios, GT is not available, so we also discuss the non-reference case in the paper. Here, we provide additional results of BEM in deterministic mode (BEM_determ). BEM_determ generates a single deterministic prediction by using the mean of the weight distribution (i.e., μ in Eq. 6), and its inference process does not rely on Algorithm 1. The results generated by BEM_determ provide a fair basis for comparison with previous methods. Additionally, based on the valuable feedback from Reviewer rtBX, we explored the use of a no-reference image quality metric, CLIP-IQA, as a selector to identify the best-enhanced images from BEM's multiple predictions. This version of the BEM model, which employs CLIP-IQA for selection, is denoted as BEM_clip. Since BEM_clip does not rely on ground-truth images for ranking, it allows for a fair comparison with other DNN-based methods. Fair comparisons with other methods are provided in the tables below.

| Method | LOL-v1 (PSNR ↑) | LOL-v1 (SSIM ↑) | LOL-v1 (LPIPS ↓) | LOL-v2-real (PSNR ↑) | LOL-v2-real (SSIM ↑) | LOL-v2-real (LPIPS ↓) | LOL-v2-syn (PSNR ↑) | LOL-v2-syn (SSIM ↑) | LOL-v2-syn (LPIPS ↓) |
|---|---|---|---|---|---|---|---|---|---|
| GlobalDiff [1] | 27.84 | 0.877 | 0.091 | 28.82 | 0.895 | 0.095 | 28.67 | 0.944 | 0.047 |
| GLARE [2] | 27.35 | 0.883 | 0.083 | 28.98 | 0.905 | 0.097 | 29.84 | 0.958 | - |
| BEM_determ | 28.30 | 0.881 | 0.072 | 31.41 | 0.912 | 0.064 | 30.58 | 0.958 | 0.033 |
| BEM_clip | 28.43 | 0.882 | 0.071 | 30.01 | 0.910 | 0.076 | 31.51 | 0.961 | 0.030 |

| Method | UIEB-R90 (PSNR ↑) | UIEB-R90 (SSIM ↑) |
|---|---|---|
| WFI2-Net [3] | 23.86 | 0.873 |
| BEM_determ | 22.35 | 0.913 |
| BEM_clip | 24.36 | 0.921 |

Since reference images are inherently imperfect and often subjective, a model that closely matches reference images is not necessarily the best. Such a model merely demonstrates its ability to learn a good mapping from the input to the reference image. Therefore, achieving higher PSNR/SSIM in a paired test does not necessarily reflect a model’s generalisation ability. This is precisely why, among the 11 datasets we evaluated, 8 are unpaired datasets. By leveraging no-reference metrics, our BEM achieves state-of-the-art performance on these 8 unpaired datasets, demonstrating its robustness and generalisation in real-world applications.

To highlight the practical significance of BEM in no-reference scenarios, we included visualisation results in Fig. 5 and Fig. 6. These figures showcase visual improvements on unpaired data, illustrating the perceptual enhancements achieved by our method. Furthermore, in Appendix H of the revised manuscript, we provide visual comparisons between full-reference and no-reference evaluations. These comparisons further demonstrate that full-reference metrics, such as PSNR, do not always correlate well with perceptual quality, particularly in low-light image and underwater enhancement tasks.

[1] Hou, Jinhui, et al. "Global structure-aware diffusion process for low-light image enhancement." Advances in Neural Information Processing Systems 36 (2024).

[2] Zhou, Han, et al. "Glare: Low light image enhancement via generative latent feature based codebook retrieval." ECCV (2024).

[3] Zhao, Chen, et al. "Wavelet-based fourier information interaction with frequency diffusion adjustment for underwater image restoration." (CVPR) 2024.

Comment

In Table 3, NIQE was used to select the best results, as the comparison metric is also NIQE. This table should also use either BEM_determ or BEM_clip. In fact, I recommend the simpler BEM_determ to avoid the high computation cost from multiple inferences.

Comment

Thank you for the valuable advice.

We have updated Table 3 in our revised manuscript based on your suggestions. For your reference, the newly added NIQE results are provided below.

| Method | DICM | LIME | MEF | NPE | VV |
|---|---|---|---|---|---|
| BEM_determ | 3.77 | 3.94 | 3.22 | 3.85 | 2.95 |
Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.