Fully Heteroscedastic Count Regression with Deep Double Poisson Networks
A state-of-the-art method to quantify aleatoric and epistemic uncertainty on count regression tasks with fully heteroscedastic predictive variance and ensembles.
Abstract
Reviews and Discussion
In this paper, the authors consider the problem of estimating heteroscedastic uncertainty within the context of counting tasks, where the final outputs represent non-negative integers. While many successful solutions have been proposed for heteroscedastic uncertainty in general (real-valued) regression tasks, this is not the case for counting, as it requires a different parametrization of the output distribution. Earlier solutions for the counting setting, such as the Poisson distribution, suffer from restricted heteroscedastic variance, meaning that the parameter defining the mean of the distribution significantly restricts the possible predicted variance. In this paper, the authors propose using the Double Poisson distribution for counting tasks and prove that it resolves the issue of the earlier methods, namely, it has unrestricted variance. Additionally, they demonstrate that the proposed loss has the property of adaptive loss attenuation, which lowers the impact of outlier points during training. Finally, they propose a way to make this attenuation controllable through β-DDPN. The effectiveness of the proposed method is demonstrated on several datasets from different domains, showing that the proposed parametrization outperforms other parametrizations for counting tasks in terms of uncertainty quality (calibration) and accuracy.
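The restricted-variance point can be illustrated numerically (a sketch using Efron's approximate Double Poisson moments; all variable names here are ours, not the paper's):

```python
# Under a Poisson likelihood the predicted variance is pinned to the mean,
# while the Double Poisson decouples them through the inverse-dispersion
# parameter gamma (Efron's approximations: E[y] ~= mu, Var[y] ~= mu / gamma).
mu = 10.0
poisson_var = mu        # Poisson: Var[y] = E[y], no freedom
gamma = 0.25            # Double Poisson inverse dispersion
dp_var = mu / gamma     # Double Poisson: any positive variance at fixed mean
```

With the same mean of 10, the Double Poisson can predict a variance of 40 (or any other positive value) simply by adjusting gamma.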
Questions for Authors
N/A
Claims and Evidence
The authors present their claims and contributions in a clear manner while also supporting them with both theoretical (for example, proving that the proposed DDPN regressors are fully heteroscedastic) and experimental results.
Methods and Evaluation Criteria
The authors primarily compare against other loss-based heteroscedastic approaches on various counting tasks, clearly demonstrating the effectiveness of the proposed approaches in the discussed counting setups.
Theoretical Claims
The main theoretical contributions of the paper could be considered Propositions 2.3 and 3.1, which prove that previously proposed parameterizations, such as Poisson and Negative Binomial, are not fully heteroscedastic, while DDPN is. The proposed proof appears to be correct and valid, with no observable issues.
Experimental Design and Analysis
The experimental design and analysis are adequate and rigorous, with no issues.
Supplementary Material
Additional experiments and the full proofs of the main propositions are provided in the supplementary material, both serving as a valuable extension of the results discussed in the main body.
Relation to Existing Literature
The paper clearly positions itself within the existing literature by thoroughly discussing prior work on heteroscedastic uncertainty estimation, particularly in regression and counting tasks. It provides sufficient detail on previous parameterizations, such as Poisson and Negative Binomial, highlighting their limitations and demonstrating how the proposed DDPN framework overcomes these constraints.
Essential References Not Discussed
No, there are no critical references missing in the paper.
Other Strengths and Weaknesses
In short, the major strengths of the paper are:
- The paper clearly discusses the problem of heteroscedastic uncertainty estimation in the counting context, its existing problems, and solutions.
- The proposed method is clear, easy to implement, and demonstrates good performance on a number of different tasks.
- In contrast to many other uncertainty methods, it does not require a significant increase in computation/memory during inference while still producing high-quality uncertainty estimations.
A potential weakness:
- The paper mostly compares the method against other loss-based uncertainty methods. Introducing additional uncertainty approaches, such as ensembling methods (Deep Ensembles, Batch Ensembles, etc.), could be beneficial.
Other Comments or Suggestions
N/A
We appreciate the reviewer’s thoughtful comments.
Introducing additional uncertainty approaches, such as ensembling methods (Deep Ensembles, Batch Ensembles, etc.), could be beneficial.
An important aspect of our work is the interplay between DDPN and Deep Ensembles. We demonstrate this connection throughout the paper. In Section 3.4 we show how individual DDPNs can be combined as ensembles. Then, in the bottom half of Table 2 we present results with ensemble DDPNs. Table 2 demonstrates that learning ensembles of DDPNs improves accuracy and the quality of predictive uncertainty.
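For concreteness, pooling an ensemble of DDPN members can be sketched as moment matching a uniform mixture (a sketch under Efron's approximations; `pool_ddpn_ensemble` is our illustrative helper, not necessarily the exact rule in Section 3.4):

```python
import numpy as np

def pool_ddpn_ensemble(mus, gammas):
    """Moment-match a uniform mixture of M Double Poisson ensemble members.

    Sketch only: uses Efron's approximations E[y] ~= mu and Var[y] ~= mu/gamma,
    mirroring the standard deep-ensemble pooling rule.
    """
    mus, gammas = np.asarray(mus, float), np.asarray(gammas, float)
    mu_star = mus.mean()                       # mixture mean
    aleatoric = (mus / gammas).mean()          # average member variance
    epistemic = ((mus - mu_star) ** 2).mean()  # spread of member means
    return mu_star, aleatoric + epistemic      # law of total variance
```

Two identical members contribute no epistemic variance; disagreeing members inflate the pooled variance.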
We suspect that similar results will hold for BatchEnsembles, as this type of ensemble just changes how the weights of each member are derived: combining slow shared weights and fast, independent rank one weights. The principles we propose in this paper could easily be applied to other types of ensembles. We leave the validation of this hypothesis to future work.
Finally, in the discussion with Reviewer 68NH, we discuss how DDPN can be combined with other UQ methods such as Bayesian Neural Nets, Evidential Methods, Laplace Approximation and Monte Carlo Dropout.
The work introduces Deep Double Poisson Networks for the count regression problem. The proposed approach can quantify both aleatoric and epistemic uncertainty with ensembles. Also, the Double Poisson network allows unrestricted variance to model discrete count data, and shows robustness to outliers. The authors carry out experiments where the approach performs better in terms of calibration, out-of-distribution detection, and accuracy.
Questions for Authors
- How does the work perform compared to Natural Posterior Networks and the presented baselines on the benchmark Bike Sharing dataset (Natural Posterior Network: Deep Bayesian Uncertainty for Exponential Family Distributions [Bertrand Charpentier, Oliver Borchert, Daniel Zügner, Simon Geisler, Stephan Günnemann])?
Claims and Evidence
- The authors claim that the proposed Deep Double Poisson Network can perform well on the count regression task. The claims are empirically validated through experiments on benchmark datasets against several baseline methods.
Methods and Evaluation Criteria
The methods and evaluation criteria look reasonable. The authors consider the discrete count regression problem and look at different metrics, evaluating the method along different dimensions including OOD detection, calibration, and accuracy.
Theoretical Claims
- The authors present some theoretical claims, but these seem to be derived from standard properties of the Double Poisson distribution.
Experimental Design and Analysis
The experimental design looks sound.
Supplementary Material
The supplementary material shows the loss function for the Double Poisson networks, the role of the hyperparameter beta, experimental details, evaluation metrics used, training details, and some additional details and results. I briefly went through the supplementary materials.
Relation to Existing Literature
The work is likely to have limited impact, confined to a narrow subfield of the scientific community.
Essential References Not Discussed
The work Natural Posterior Network: Deep Bayesian Uncertainty for Exponential Family Distributions [Bertrand Charpentier, Oliver Borchert, Daniel Zügner, Simon Geisler, Stephan Günnemann] introduces a general evidential approach that can be effective for a wide range of problems including uncertainty-aware classification, uncertainty-aware regression, and uncertainty-aware count regression. Discussion and comparison with that work could be beneficial.
Other Strengths and Weaknesses
Strengths
- The paper is easy to follow and I found it to be a pleasant read.
- The work introduces the Deep Double Poisson Network, which seems to be effective in discrete count regression based on the experimental results. The authors show the robustness to outliers of the proposed approach. Also, the approach performs well in terms of accuracy, calibration, and OOD detection.
Weaknesses:
- Beyond ensembling, there are other approaches (e.g., Bayesian neural networks, evidential approaches, dropout-based uncertainties). While the authors compare with standard Poisson, NB, and Gaussian-based heteroscedastic networks, a thorough comparison can help better illustrate the effectiveness of the approach.
Other Comments or Suggestions
- Figures, labels and captions can be better presented. Many labels/legend texts are too small and not clearly legible. Also, the captions are too long and could be shortened for a better read.
We appreciate the reviewer’s thoughtful comments and helpful feedback.
Comparison to the Natural Posterior Network
We followed the official repository to download the bike-sharing dataset file and pre-processed it exactly as in the paper the reviewer mentioned. For training, that paper performs a grid search over the space [1e−2, 5e−4]; since the search step is not specified, we used the following log-scale grid: [0.01, 0.005623, 0.003162, 0.001778, 0.001, 0.000708, 0.0005], and found lr = 0.003162 for our model's configuration. Following the exact same settings, we conducted 5 rounds of training and report the mean and standard deviation of our model's RMSE. The results are:
| Method | RMSE |
|---|---|
| Dropout-N | 70.20 ± 1.30 |
| Ensemble-N | 48.02 ± 2.78 |
| EvReg-N | 49.58 ± 1.51 |
| NatPN-N | 49.85 ± 1.38 |
| Dropout-Poi | 66.57 ± 4.61 |
| Ensemble-Poi | 48.22 ± 2.06 |
| NatPN-Poi | 51.79 ± 0.78 |
| DDPN (ours) | 47.87 ± 0.42 |
How does DDPN compare to other uncertainty methods?
In short, the objective function proposed in Equation 1 enables the network to capture aleatoric uncertainty over count data. We show how this can be combined with Deep Ensembles to better capture epistemic uncertainty. Table 2 shows this combination is effective. DDPN is presented in our paper in terms of maximum likelihood + ensembles for 1) simplicity, 2) effectiveness, and 3) likelihood of community adoption. DDPN can easily be combined with other UQ methods.
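As a sketch of what an Equation 1-style objective looks like in code (assuming the common c(μ, γ) ≈ 1 simplification from Efron and a log link; the function name and signature are ours, and terms constant in the parameters are dropped):

```python
import numpy as np

def ddpn_nll(log_mu, log_gamma, y):
    """Per-example Double Poisson NLL, up to additive terms constant in
    (mu, gamma), assuming Efron's c(mu, gamma) ~= 1.

    Sketch: the network is assumed to emit log_mu and log_gamma (log link).
    """
    mu, gamma = np.exp(log_mu), np.exp(log_gamma)
    y_log_y = y * np.log(y) if y > 0 else 0.0  # convention: 0 * log 0 = 0
    return -0.5 * log_gamma + gamma * mu - gamma * (y + y * log_mu - y_log_y)
```

Note the attenuation behavior: for a poorly fit target, the loss can be lowered by predicting a smaller gamma (i.e., a larger variance).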
Bayesian Neural Networks
One could put a prior over the weights of the network (ideally an isotropic Gaussian prior). Let $\theta$ denote the neural network parameters, $p(\theta) = \mathcal{N}(0, \sigma_p^2 I)$ the prior, and $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ denote the training dataset. The log posterior is:
$$\log p(\theta \mid \mathcal{D}) = \log p(\mathcal{D} \mid \theta) + \log p(\theta) - \log Z,$$ where $Z$ is the normalizing partition function and is usually dropped during inference.
Fortunately, the negative log likelihood, $-\log p(\mathcal{D} \mid \theta)$, is already defined in Equation 1 of our paper, and the log Gaussian prior is easy to compute, $\log p(\theta) = -\frac{\lVert\theta\rVert^2}{2\sigma_p^2} + \text{const}$, where the prior hyperparameters are the mean $0$ and variance $\sigma_p^2$.
Then one could choose the preferred inference algorithm (MAP, HMC, SGLD, etc.) and estimate the posterior.
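The MAP route sketches out as minimizing the data NLL plus a weight-decay term (a sketch; `dataset_nll` is a hypothetical callable wrapping the summed Equation 1 NLL, and `prior_var` is an illustrative hyperparameter):

```python
import numpy as np

def log_gaussian_prior(theta, prior_var=1.0):
    # log N(theta; 0, prior_var * I), up to an additive constant
    return -np.sum(np.asarray(theta) ** 2) / (2.0 * prior_var)

def neg_log_posterior(theta, dataset_nll, prior_var=1.0):
    """Unnormalized negative log posterior: the summed DDPN NLL minus the log
    prior; log Z is dropped, so minimizing this yields the MAP estimate."""
    return dataset_nll(theta) - log_gaussian_prior(theta, prior_var)
```

The prior term reduces to standard L2 regularization of the weights, which is why MAP is the cheapest entry point.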
Empirically, Bayesian neural networks can outperform Deep Ensembles when using high-fidelity inference algorithms such as Hamiltonian Monte Carlo. However, in practice MCMC-based inference is impractical, and DEs often outperform less exact inference methods (e.g., SGLD, variational inference) [Izmailov et al. What Are Bayesian Neural Network Posteriors Really Like? ICML'21]. We suspect the same results hold for DDPNs.
Moreover, many recent works have directly connected Deep Ensembles to Bayesian inference by showing that DEs are a coarse approximation of the posterior, sampled at multiple modes with no local uncertainty [Fort et al. Deep Ensembles: A Loss Landscape Perspective. 2019][Wilson and Izmailov. Bayesian Deep Learning and a Probabilistic Perspective of Generalization. NeurIPS'20].
The effectiveness, simplicity and attractive theoretical properties of DEs motivated our decisions to use them in our experiments.
Evidential Approaches
DDPN could also easily be combined with evidential regression techniques [Amini et al. Deep Evidential Regression. NeurIPS'20]. Because DDPN uses a likelihood function during training, one would simply have to specify evidential priors over the parameters of the DDPN: one over the mean and one over the inverse dispersion. Then, the network would be trained to predict the parameters of the higher-order evidential distribution.
Laplace Approximation (LA)
This is perhaps the easiest since LA is typically a post-hoc technique. One would train the DDPN in the standard way described in our paper. Then, one could apply any of the post-hoc second-order covariance approximation methods described in [Daxberger et al. Laplace Redux – Effortless Bayesian Deep Learning. NeurIPS'21].
Dropout-based uncertainty
Monte Carlo dropout estimates epistemic uncertainty by randomly dropping out weights at test time, approximating the Bayesian posterior.
DDPN can easily be combined with MC dropout by 1) training the single-member DDPN to convergence, and 2) applying the MC dropout procedure with multiple forward passes through the dropped-out model [Gal and Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. ICML'16].
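The two-step recipe above might look as follows (a sketch; `stochastic_forward` stands in for a forward pass with dropout left active, and the mixture is pooled by moment matching under Efron's approximations):

```python
import numpy as np

def mc_dropout_predict(stochastic_forward, x, T=50, seed=0):
    """Run T dropout-perturbed forward passes of a trained DDPN and
    moment-match the resulting uniform mixture of Double Poissons."""
    rng = np.random.default_rng(seed)
    draws = [stochastic_forward(x, rng) for _ in range(T)]
    mus = np.array([m for m, _ in draws])
    gammas = np.array([g for _, g in draws])
    mu_star = mus.mean()
    # average aleatoric variance (Efron: mu/gamma) + epistemic spread of means
    var_star = (mus / gammas).mean() + ((mus - mu_star) ** 2).mean()
    return mu_star, var_star
```

With dropout disabled (a deterministic forward pass), this reduces to the single-member predictive mean and variance.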
However, [Lakshminarayanan et al. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. NeurIPS’17] show that MC dropout is clearly inferior to DEs.
The authors have addressed my comment and I vote to keep my original score of weak accept.
The paper introduces the Deep Double Poisson Network (DDPN), a novel neural network model for count regression that provides accurate input-conditional uncertainty quantification. The main conceptual idea is that DDPN extends deep ensembles to count regression by using the Double Poisson distribution, which allows for heteroscedastic variance in count data. This flexibility enables improved estimation of aleatoric uncertainty (inherent variability in data) and, consequently, better epistemic uncertainty (model uncertainty) estimation. The paper proves that DDPN exhibits properties similar to heteroscedastic Gaussian models. The authors introduce a loss modification to control the learnable loss attenuation mechanism, allowing for more precise uncertainty calibration. Experiments on diverse datasets show that DDPN outperforms existing count regression baselines in accuracy, calibration, and out-of-distribution detection.
update after rebuttal
I acknowledge that the authors have improved the proofs (my point 1.) but do not provide a strong argument for point 2. I will increase my score to 2, but still think the work does not reach the acceptance bar.
Questions for Authors
Given the identified flaws and weaknesses, I would likely revise my evaluation if the authors:
- Provide convincing proofs that correct the identified issues.
- Establish a stronger version of full heteroscedasticity for Double Poisson regressors without relying on moment approximations.
Claims and Evidence
Overall, the claims made in the submission are supported by clear evidence, but I found that some theoretical claims are not supported by convincing mathematical arguments (see below).
Methods and Evaluation Criteria
Yes.
Theoretical Claims
I checked the theoretical claims and found significant flaws. Some may be fixable, but others appear more critical.
- One issue concerns Proposition 3.3: the stated convergence to 0 does not seem valid under the proposed definition of the learnable attenuation loss function (Definition 3.2). A monotonically increasing function does not necessarily tend to infinity, just as a monotonically decreasing function does not necessarily tend to 0: both cases can have a constant asymptote. This flaw undermines the argument in the proof of Proposition 3.3.
- Another issue concerns the derivation of the DDPN objective. The derivation in Appendix A.1 omits the normalizing constant of the Double Poisson (DP) distribution, denoted $c(\mu, \gamma)$ at the beginning of Section 3. Since $c(\mu, \gamma)$ is not constant with respect to $(\mu, \gamma)$, the stated objective does not properly learn the parameters of a DDPN.
- Additionally, most equations in Appendix A.1 fail to hold because the maxima and minima do not align due to omitted constants between successive lines. Using $\arg\max$ and $\arg\min$ would provide more precision. Also, the parameterization of the network $f_\Theta(x_i)$ is inconsistent: while a log link function is used in the main text, this log transformation is omitted in the first line of Appendix A.1.
- Finally, it is unclear why the DDPN objective loss in Equation (1) of Section 3.1 does not include a summation over all training examples $i = 1, \dots, N$.
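To make the first issue concrete: a monotonically increasing function need not diverge, e.g.

$$
f(x) = 1 - e^{-x}, \qquad f'(x) = e^{-x} > 0, \qquad \lim_{x \to \infty} f(x) = 1 < \infty,
$$

and, symmetrically, $g(x) = 1 + e^{-x}$ is monotonically decreasing with limit $1 \neq 0$.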
Experimental Design and Analysis
Owing to the previous flaws, I did not check the soundness of the experimental analyses.
Supplementary Material
I thoroughly reviewed Sections A and C of the supplementary material.
Relation to Existing Literature
The paper builds on prior work in deep ensembles for uncertainty estimation in regression, extending heteroscedastic modeling from Gaussian outputs to count data using the Double Poisson distribution, addressing a gap in discrete uncertainty modeling and improving epistemic uncertainty quantification.
Essential References Not Discussed
Not that I am aware of.
Other Strengths and Weaknesses
I identify a weakness in Proposition 3.1, which claims that DDPN regressors are fully heteroscedastic. In reality, the proposition is derived under the moment approximations proposed by Efron (1986), where the first two moments are approximated as \mu and \mu / \gamma. Given this approximation, full heteroscedasticity is unsurprising. It would be more meaningful to establish this result without relying on Efron’s approximation. I suspect that the exact Double Poisson distribution is inherently fully heteroscedastic, and proving this directly would be a more valuable contribution.
Other Comments or Suggestions
- The second line of the displayed equation in Appendix C.6 is hard to parse: adding parentheses to the second sum on the right-hand side would help.
- There are a few (math) typos that could easily be fixed.
We thank the reviewer for the thoughtful feedback.
Monotonicity vs tend to infinity
We agree with this remark and propose to change Def. 3.2 to:
where and
Fortunately, the proof in Appendix C.3 holds under this new definition, as these limits are in fact used to show that the residual error tends towards 0 (lines 1108-1109). With this change, our proof of Prop. 3.4 also remains valid (since $\log x$ tends to infinity and $1/x$ tends to zero).
Normalizing constant
To simplify our objective, we followed previous work (see below) and assumed $c(\mu, \gamma) \approx 1$. This facilitates easier differentiation. We will make this more clear in App. A.1.
- See Fact 1 (Eqn. 2.10) of [Efron, B. "Double exponential families and their use in generalized linear regression." Journal of the American Statistical Association. 1986]
- Follow-up work has also set [Chow, N., and David Steenhard. 2009. "A flexible count data regression model using SAS Proc nlmixed”]
Max/Min in Appendix A.1
We propose two changes to the derivation of our objective to increase clarity and align with convention:
- Replace max/min with argmax/argmin
- Clarify that we maximize the likelihood over all training examples and derive the per-instance loss defined in Equation 1 (see discussion below)
The NLL becomes: $-\frac{1}{2}\log \gamma_i + \gamma_i \mu_i - \gamma_i y_i\left(1 + \log \mu_i - \log y_i\right)$ (up to terms constant in $\mu_i, \gamma_i$).
Lack of Log link in Appendix A.1
In App. A.1 we derive the training objective. In contrast, Section 3 describes how it can be used to train DDPN. The connection between the two is trivial (exponentiate the log link to evaluate Eq. 1) and is stated in Footnote 1 (pg. 4).
Lack of Summation in Equation 1
Eq. 1 expresses the loss for a single training example, $(x_i, y_i)$. We state in lines 183-184 that the loss is:
averaged across all prediction / target tuples…in the dataset
To be more explicit, we propose to change $\mathcal{L}$ to $\mathcal{L}_i$.
Prop. 3.1: DDPN regressors and full heteroscedasticity
In line with prior work, we use Efron's approximations for the first two moments in the proof of Prop. 3.1. We propose to revise the proposition to state: under mild assumptions, DDPN regressors are fully heteroscedastic (where we assume that Efron's approximations hold).
How good are Efron's approximations? We introduce the concept of moment-deviation functions (MDFs) to assess this theoretically:
Let $\{P_\theta : \theta \in \Theta\}$ be a family of distributions parametrized by $\theta$. Suppose we are given a map $g$, which outputs $\theta = g(m_1, \dots, m_n)$ s.t. the first $n$ moments of $P_\theta$ are nearly equal to target moments $m_1, \dots, m_n$. Then, for any pair $(\{P_\theta\}, g)$, $\{\varepsilon_k\}$ for $k = 1, \dots, n$ are moment-deviation functions if, for any valid targets, whenever $\varepsilon_k(m_1, \dots, m_n) = 0$, the $k$-th moment of $P_{g(m_1, \dots, m_n)}$ equals $m_k$ exactly, for all $k$.
We focus on $n=2$ (error for the mean/variance). If we can pick $g$ s.t. $\varepsilon_1$ and $\varepsilon_2$ are small, the family is flexible, as there are parameters that can roughly achieve any mean and variance. In the Gaussian case ($g(\mu_0, \sigma_0^2) = (\mu_0, \sigma_0^2)$), we have $\varepsilon_1 = \varepsilon_2 = 0$.
Proposition
Let $\mathcal{DP}$ denote the Double Poisson family. Set $\gamma_0 = \mu_0 / \sigma_0^2$. Letting $g(\mu_0, \sigma_0^2) = (\mu_0, \gamma_0)$, the MDFs for $(\mathcal{DP}, g)$ are:
$$
\begin{aligned}
\varepsilon_1(\mu_0,\sigma_0^2) &= \left|\frac{\sum_{y=0}^{\infty}s(\mu_0, \gamma_0, y)(y - \mu_0)}{\sum_{y=0}^{\infty}s(\mu_0,\gamma_0,y)}\right| \\
\varepsilon_2(\mu_0, \sigma_0^2) &= \left|\frac{d(\mu_0,\gamma_0)\gamma_0^{\frac{1}{2}}\sum_{y=0}^{\infty}s(\mu_0,\gamma_0,y)-\gamma_0\left(\sum_{y=0}^{\infty}s(\mu_0, \gamma_0,y)(y-\mu_0)\right)^2}{\gamma_0\left(\sum_{y=0}^{\infty}s(\mu_0,\gamma_0,y)\right)^2}\right|
\end{aligned}
$$
where: $h(z)=\frac{e^{-z} z^z}{z!}, r(\mu,\gamma,z)=\gamma(z-\mu +z\log\mu -z\log z), s(\mu,\gamma,z)=h(z)\exp(r(\mu,\gamma,z)),$ and $d(\mu,\gamma)=\gamma^{-1/2}\left[\sum_{y=0}^{\infty}s(\mu, \gamma, y)(\gamma(y-\mu)^2-y)+\sum_{y=0}^{\infty}s(\mu,\gamma,y)(y-\mu)\right]$.
If desired, we can provide the proof of this proposition in a follow-up response. We plot the error incurred via Efron's estimates on a grid of target means and variances, using 100th partial sums (https://anonymous.4open.science/r/ddpn-651F/deep_uncertainty/figures/artifacts/epsilon_1.png). To see epsilon_2, change the filepath to epsilon_2.png. Except for the case of small $\mu$ and high $\sigma^2$, the error is essentially zero. Thus, in most settings we can treat DDPN as fully heteroscedastic. Empirically, DDPN produces flexible, well-fit distributions (Fig. 3/4, Table 2).
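These partial sums are straightforward to evaluate numerically. A sketch of the check (our illustrative helper `dp_moments`, using truncated sums with $s$, $h$, and $r$ as defined above):

```python
import numpy as np
from scipy.special import gammaln, xlogy

def dp_moments(mu0, var0, ymax=300):
    """Truncated-sum mean/variance of the Double Poisson at Efron's parameter
    choice gamma0 = mu0 / var0, for comparison against the targets."""
    gamma0 = mu0 / var0
    y = np.arange(ymax + 1)
    # log s(mu, gamma, y) = log h(y) + r(mu, gamma, y); xlogy handles y = 0
    log_s = (-y + xlogy(y, y) - gammaln(y + 1)
             + gamma0 * (y - mu0 + xlogy(y, mu0) - xlogy(y, y)))
    w = np.exp(log_s - log_s.max())
    w /= w.sum()                      # normalize the truncated pmf
    mean = float((w * y).sum())
    var = float((w * (y - mean) ** 2).sum())
    return mean, var
```

For moderate targets (e.g., a target mean of 10 and variance of 15), the recovered moments sit close to the targets, consistent with the near-zero epsilons in the linked plots.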
Monotonicity vs tend to infinity
I agree that the proposed change to Def. 3.2 makes the proof in Appendix C.3 go through.
Normalizing constant
The authors replied:
assumed $c(\mu, \gamma) \approx 1$.
I do not think this assumption was made clear anywhere in the submitted paper; it could only be discovered by checking the earlier work by Efron.
Max/Min in Appendix A.1
We propose two changes to the derivation of our objective to increase clarity and align with convention:
Replacing max/min with argmax/argmin is not just a question of clarity or convention: the proof is simply wrong without it.
Lack of Log link in Appendix A.1
I understand that the connection with and without the log link is trivial. But as I noted, the parameterization of the network is inconsistent between the main text and the supplementary material.
Lack of Summation in Equation 1
I agree that changing $\mathcal{L}$ to $\mathcal{L}_i$ helps with clarity.
Prop. 3.1: DDPN regressors and full heteroscedasticity
Introducing moment-deviation functions to assess the error of Efron’s approximation is a nice idea. However, I still believe that demonstrating full heteroscedasticity for the exact Double Poisson distribution should be the primary goal. After all, full heteroscedasticity simply means that, for any fixed mean, the variance can span the entire interval $(0, \infty)$. I genuinely think this is an attainable property for the exact Double Poisson distribution.
Score revision
I acknowledge that the authors have improved the proofs (my point 1.) but do not provide a strong argument for point 2. I will increase my score to 2, but still think the work does not reach the acceptance bar.
We appreciate the additional thoughtful comments from the reviewer. As discussed, we will make all of the proposed improvements to the proofs (point 1) in the camera ready manuscript. With respect to point 2, we will include a discussion of moment-deviation functions (and the quality of Efron's approximations) in the appendix.
The paper focuses on outputting distributions for non-negative integer predictions (i.e., count data). To do so, the paper has a model output the parameters of a Double Poisson distribution, which admits separate mean and variance parameterizations. The paper then further utilizes ensembles to include epistemic model uncertainty. Results comparing against other predictive distributions, such as a typical Gaussian, show that the proposed predictive distribution outperforms on count tasks.
Questions for Authors
None
Claims and Evidence
Yes, the paper is quite clear on the proposed approach, the different uncertainties involved (e.g., the diff between aleatoric and epistemic, which is often confused or muddled), and the experimental setup. The reasoning for using the Double Poisson instead of a regular Poisson is clear and backed by both theory and empirical evidence.
Methods and Evaluation Criteria
Yes, the proposed Double Poisson makes sense for count data and for the goal of heteroscedastic variance (i.e., per-example variance controlled by the model). The evaluation datasets are fine and varied, and the baselines being other predictive distributions is appropriate and expected.
Theoretical Claims
The formal definitions of the distributions (e.g., 2.1, 2.2, 3.2) and the propositions appear to be correct. In general, the approach is straightforward: have a model output the parameters of the Double Poisson distribution and then optimize that distribution's NLL; this is generally the same type of approach used in modern models, just with a different distributional family.
Experimental Design and Analysis
Yes, the experimental setup, including datasets, metrics, and baselines, is appropriate and expected for the type of approach being proposed. E.g., accuracy plus a proper scoring rule is an ideal combination (often, the latter is missed), and the baselines are appropriately other predictive distributions and do not conflate that with other modeling choices.
Supplementary Material
I skimmed the appendix, particularly for the base models used per task. I did not rigorously check the derivations in the appendix, e.g., A.1.
Relation to Existing Literature
This fits well within the broader literature on uncertainty quantification. Accordingly, the appropriate papers are generally referenced. Most existing work has focused on categorical or continuous problems. This paper's novelty is in focusing on count data and using the less common Double Poisson for full heteroscedasticity.
Essential References Not Discussed
No
Other Strengths and Weaknesses
This paper's novelty is in focusing on count data and using the less common Double Poisson for full heteroscedasticity. It's not a surprising result and is fairly straightforward, but it's a useful paper to have in the literature. In particular, I'm pleased by the way in which the paper is very clear on the various uncertainty concepts and does not confuse or conflate any terms.
I am assigning a "4: Accept" to mean that it's a solid paper; a 5 would be for an exceptionally exciting result, such as showing that this pushes on SoTA in some current frontier model.
Other Comments or Suggestions
None
We sincerely thank the reviewer for the recognition of our work.
This work considers the problem of non-negative integer prediction (count regression) with heteroscedastic uncertainty by fitting neural networks to the parameters of a Double Poisson distribution. The resulting regressor is shown to be fully heteroscedastic with learnable loss attenuation, a property also enjoyed by Gaussian heteroscedastic regressors. To sharpen uncertainty quantification, ensembling is performed. These models are shown to be competitive through a variety of experiments. Reviewers were generally positive about the work; while not necessarily found to be paradigm-shifting, reviewers agreed on the soundness of the model and appreciated the comprehensive analysis and the variety of experimental findings.
The most major concerns identified during the review period have been addressed in the author rebuttal; however, some concerns remain regarding the proof of heteroscedasticity, as this was found to hold only under approximations for the moments. I agree that this is an inherent limitation of the work, but can appreciate that proving general heteroscedasticity in this case may be out of reach. I would suggest that the authors include the evidence provided in their rebuttal, but adjust their wording to acknowledge they only show "approximate heteroscedasticity" under the moment assumptions. This needs to be very clear and precisely defined. This is not imposing "mild assumptions"; this is showing that the approximate moments are unrestricted. With these inclusions, as well as those discussed throughout this review period, I believe this is a good contribution to ICML.