Reformulating Strict Monotonic Probabilities with a Generative Cost Model
We propose a generative method that solves the monotonic modeling task.
Abstract
Reviews and Discussion
The paper proposes a latent variable model capturing both a monotonic part and a part that need not be monotonic. They give some background, and then propose a loss function for variational training with neural nets.
Strengths
It is an interesting problem and their construction is quite novel and creative.
Weaknesses
The objective function and implementation leave many details missing and unexplained. The paper does not inspire confidence in the method.
Questions
Overview with some questions embedded:
Lines 054-059 could be clearer with a graphical model diagram. Edit: there is one later, nice.
Line 063 may not be quite right; p_theta(x,r) cannot be ignored in the evidence unless you drop the dependence on theta. Maybe rework the setup / explanation a bit, this shouldn’t be a big issue.
Line 101 equivalent to what?
Line 150 consider the machinery within https://arxiv.org/abs/2301.11695 and also the separate line of work on normalizing flows that all involve e.g. invertibility, monotone transformations, etc.
Line 210 (0,1) should be {0,1}
Line 189 should y and r have the same dimension?
Line 289 elementwise
Line 304 looks like it would suffer from high variance since pi(z) is fixed but p(z|x) depends on x.
Line 323 the combination of losses looks a little suspicious.
Line 323 Note that IWAE (Burda et al.) is equal to the ELBO (Jordan et al.) when the number of IWAE samples is set to one; why is adding the ELBO to IWAE sensible? Shouldn't pi be something else, with no ELBO term? Please expand; this is unconvincing.
Line 324 it seems like this doesn't really match the β-VAE setup; it would if you put a β in front of the KL in line 318. What is this doing?
Line 378 how about some ablations of alpha and beta terms? Edit: there are some in appendix C.
Table 1 does not seem to match appendix C for any value of the parameters, what is missing? Please add details.
Line >= 378 how did you set all of the parameters?
Line 698 affect
Thank you for the comprehensive review and valuable suggestions. We have revised our paper as you requested, and here are our replies (updated on 2024-11-29).
# Question 1: p_theta(x, r) is ignored in the evidence.
We ignore p_theta(x, r) because the variables x and r are always given and are sampled from their prior. Therefore, sampling from the generative model only requires sampling the remaining variables conditioned on the given x and r.
# Question 2: Should y and r have the same dimension?
No; the definition does not require them to have the same dimension, only that the monotonicity condition holds for any admissible value. The definition of monotonic conditional probability is in the background section (Section 2 - Monotonic Conditional Probability).
# Question 3: The high variance issue for importance sampling.
We modified the model for the case where the latent variable is categorical. In the original version, we assume a categorical latent variable with finitely many values and take the sampling distribution equal to the prior, which leads to an importance-sampling estimator of the evidence. This estimator is equivalent to the law of total probability when all categories are enumerated, so we realized that the importance sampling technique is not actually needed. In the new revision, we compute the evidence for the categorical latent variable exactly via the law of total probability. For more complex latent distribution assumptions, we follow the reparameterization trick as in the VAE and use a recognition model to sample the latent variable, which helps reduce the high variance of the evidence estimator.
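To make the distinction concrete, here is a minimal sketch (hypothetical code, not our implementation) of the exact evidence for a categorical latent variable via the law of total probability, next to the prior-as-proposal Monte Carlo estimate it replaces:

```python
import torch

# Hypothetical sketch (not the paper's code): evidence for a categorical
# latent z, p(y | x, r) = sum_k p(z = k | x) * p(y | z = k, x, r).
def exact_evidence(prior_logits, lik_per_category):
    """prior_logits: [B, K] logits of p(z = k | x);
    lik_per_category: [B, K] values of p(y | z = k, x, r)."""
    prior = torch.softmax(prior_logits, dim=-1)        # p(z = k | x)
    return (prior * lik_per_category).sum(dim=-1)      # exact marginal, no sampling

# The estimator it replaces: sample z from the prior and average the
# likelihoods. Unbiased, but noisy when the prior ignores y.
def sampled_evidence(prior_logits, lik_per_category, n_samples=32):
    z = torch.distributions.Categorical(logits=prior_logits).sample((n_samples,))  # [S, B]
    lik = lik_per_category.gather(1, z.t())            # [B, S]: p(y | z_s, x, r)
    return lik.mean(dim=1)                             # Monte Carlo average
```

When the latent variable is categorical and all of its values are enumerated with their weights, the Monte Carlo estimate reduces to the exact sum, which is exactly the observation that led us to drop importance sampling in this case.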
# Question 4: The combination of losses.
Thanks for the suggestion. We rethought this and realized that the combination of the log-likelihood (LL) and the ELBO is not really necessary. We also removed the extra KL divergence term from the loss, and the loss functions of GCM and GCM-VI are simplified accordingly.
Therefore, in the new formulation, we do not need the combination, and the hyperparameters alpha and beta are dropped. We also removed the ablation study on alpha and beta, and added an ablation study on the sample number and the latent dimension in Appendix E.
The experimental results under this new setting still show that GCM-VI performs best on three of the four public datasets, confirming that the combination of LL and ELBO is not necessary.
# Question 5: About IWAE and ELBO.
In the latest revision, we follow the form of the original IWAE for the GCM-VI loss, as shown in our reply to Question 4; the experimental results are also updated.
# Question 6: The loss does not match β-VAE.
To further simplify the model, we removed the additional KL divergence term. As a result, the parameter β is removed from the latest revision.
# Question 7: Table 4 (table 1 in the original paper) does not match the ablation study.
This is because the hyperparameter setting used in the experiment section was not included in the ablation study, and we also used different random seeds in the experiment part and the ablation part. So, in the latest revision, the experimental result in Table 3 (Adult ACC) is not identical to the ablation result in Table 5 (Adult ACC under the ablation settings).
# Question 8: Parameter details.
We added the hyperparameter details in Section 5 (line 480), where we report the values used for all experiments on the public datasets, and we uploaded the code to https://github.com/iclr-2025-4464/GCM; it contains the full implementation of GCM, GCM-VI, and the baseline models, as well as the parameter settings.
The paper proposes the Generative Cost Model (GCM) to enforce strict monotonicity by modeling a latent cost variable, using variational inference and importance sampling. This approach avoids architectural constraints and outperforms traditional methods on synthetic and real datasets.
Strengths
The paper seems to present a novel approach to formulating the problem of strict monotonic probability, and it provides convincing theoretical analysis and empirical results.
Weaknesses
The paper lacks empirical validation across diverse, high-dimensional real-world datasets, limiting the demonstrated generalizability of the Generative Cost Model (GCM). Moreover, the code was not provided in the supplementary materials to assess the reproducibility of the results.
Questions
- Can the authors provide additional justification or empirical validation for the assumption of conditional independence between the latent variable z and revenue r given x, as this assumption may affect model flexibility in real-world applications?
- Could the authors elaborate on the computational efficiency of the Generative Cost Model (GCM) compared to traditional monotonic models, especially when applied to high-dimensional datasets? How does the computational time compare to the benchmarks?
Thank you for the reviews and valuable suggestions. We have revised our paper as you requested, and here are our replies (updated on 2024-11-29).
# Weakness 1: Lack of validation.
We added two more real-world datasets: Diabetes and BlogFeedback. Diabetes is a large dataset with 253,680 examples, and BlogFeedback is a regression task with 8 monotonic features and 272 non-monotonic features. The experiments on both datasets show that our GCM method performs consistently well. You can find the details in Table 3 in the experiment section and more detailed results in the tables in Appendix E.
# Weakness 2: Lack of code.
We uploaded the anonymous code to https://github.com/iclr-2025-4464/GCM.
# Question 1: The assumption of conditional independence between z and r given x needs additional justification.
We rethought this assumption and found that it is not necessary. In Appendix D, we build another generative model that generates both the revenue variable and the cost variable from the latent variable, which allows our model to avoid the conditional independence assumption. We call this new co-generation method GCRM (generative cost-revenue model). We also ran experiments with GCRM and found that it performs close to the original GCM on the four public datasets, which shows promising potential.
# Question 2: Elaborate on the computational efficiency.
We measured the time consumption of the different monotonic modeling methods; the detailed data are shown in Appendix F. The inference task requires estimating the monotonic probability with x fixed while r varies over a set of candidates. GCM is slower when the number of candidates is small; however, as the number grows, GCM surpasses the baseline methods. The efficiency of GCM for a large number of candidates comes from the fact that it does not need to feed every (x, r) pair through a deep neural network as traditional methods do, which multiplies the computation time by the number of candidates. Instead, GCM only needs to run x through the deep neural network once; after obtaining the network output, it is efficient to compute the probability for each candidate r, since this step does not involve a DNN and can be solved directly using Equation (10). As a result, when the number of candidates reaches the extreme value of 1024, GCM saves up to 72% of the time cost compared to the fastest baseline model.
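As an illustration of this inference pattern, here is a hypothetical sketch with assumed shapes and module names; the paper's actual closed-form step is its Equation (10), which we approximate here by an empirical CDF over sampled costs:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: the DNN runs on x once; each candidate r is then
# scored by a cheap monotone step that needs no further DNN passes.
class CostEncoder(nn.Module):
    def __init__(self, x_dim, hidden=64, n_cost_samples=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))    # mean and log-std of the cost
        self.n_cost_samples = n_cost_samples

    def forward(self, x):                                 # x: [B, x_dim]
        mu, log_std = self.net(x).chunk(2, dim=-1)        # [B, 1] each
        eps = torch.randn(self.n_cost_samples, *mu.shape)
        return mu + log_std.exp() * eps                   # sampled costs: [S, B, 1]

def score_candidates(encoder, x, r_candidates):
    costs = encoder(x)                                    # single DNN pass over x
    # P(y = 1 | x, r) ~ P(cost <= r | x): monotone in r by construction,
    # and evaluating it for many r values is only a comparison and a mean.
    return (costs <= r_candidates.view(1, 1, -1)).float().mean(dim=0)   # [B, n_candidates]
```

This is why inference cost grows only mildly with the number of candidates: the scoring step is elementwise work over the sampled costs rather than additional network evaluations per candidate.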
This paper proposes a novel approach to model strict monotonic probabilities. Without loss of generality, it considers a binary classification formulation in which the probability of the positive label given x and r is monotonic in r, and the target is to learn this probability function. Instead of learning it directly, the paper introduces a cost variable and reformulates the probability as an integration over the cost. The paper then introduces a generative cost model to approximate the conditional distribution of the cost.
Strengths
The paper tackles an interesting problem of modeling monotonic probabilities. The problem itself is important, and the reformulation proposed by the paper is unique. Extensive experiments are conducted to support this new method.
Weaknesses
The paper lacks theoretical results on the finite sample efficiency of the proposed algorithm. For the experiment design, an important setup that requires strict monotonicity is quantile regression, where the conditional quantile should be monotonic in the quantile argument. It would be interesting to see experiments designed for it and a comparison with the existing benchmarks.
Questions
No additional questions.
Thank you for the reviews and valuable suggestions. We have revised our paper as you requested, and here are our replies (updated on 2024-11-29).
# Weakness 1: The paper lacks theoretical results on the finite sample efficiency of the proposed algorithm.
We found some difficulties in proving finite-sample efficiency. To prove it, we would need to verify whether the variance of an estimator of the parameters attains the Cramér–Rao lower bound. However, we do not have an analytical estimator of the parameters, since they are optimized with gradient descent algorithms. Furthermore, we are unable to compute the Fisher information, since it is computationally infeasible to integrate over a complex deep neural network.
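For reference, the statement we refer to is the standard Cramér–Rao bound for an unbiased estimator $\hat\theta$ of a scalar parameter $\theta$:

```latex
\operatorname{Var}\bigl(\hat{\theta}\bigr) \;\ge\; \frac{1}{I(\theta)},
\qquad
I(\theta) \;=\; \mathbb{E}_{x \sim p_\theta}\!\left[\left(\frac{\partial}{\partial \theta}\log p_\theta(x)\right)^{\!2}\right]
\;=\; -\,\mathbb{E}_{x \sim p_\theta}\!\left[\frac{\partial^{2}}{\partial \theta^{2}}\log p_\theta(x)\right],
```

and computing the Fisher information $I(\theta)$ requires an expectation over the model density, which is intractable when $p_\theta$ is parameterized by a deep network.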
If we have misunderstood your question, please let us know, and we will try to provide a proper explanation as soon as possible.
# Weakness 2: Application in quantile regression.
It is an interesting application of monotonicity. We added a quantile regression experiment based on a simulation; the details are in Section 5.1. We show a comparison of the quantile curves of all methods in Figure 3, together with the MAE of the predicted quantiles under different quantile-level settings. The test results show that GCM performs best on the MAE metric. We can also see that the traditional strict monotonic methods (MM, SMM, CMNN) suffer in approximation accuracy, as the strict monotonic structures (e.g., positive weight matrices and monotonic activations) weaken the universal approximation ability of neural networks. The non-monotonic (MLP) and weakly monotonic (Hint, PWL) methods have better approximation accuracy than the strict monotonic methods, but their quantile curves for different quantile levels are not sufficiently separated, as shown in Figure 3, due to the lack of strict monotonic constraints.
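For context, the classical quantile-regression (pinball) objective that such an experiment minimizes for a quantile level $\tau \in (0, 1)$ is:

```latex
\rho_\tau(u) \;=\; u\,\bigl(\tau - \mathbf{1}\{u < 0\}\bigr),
\qquad
\mathcal{L}(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N}\rho_\tau\bigl(y_i - \hat{y}_\tau(x_i;\theta)\bigr),
```

whose minimizer is the $\tau$-th conditional quantile; since that quantile is non-decreasing in $\tau$, a model taking $\tau$ as an input should produce predictions monotonic in $\tau$, which is exactly the property the strict monotonic methods are meant to guarantee.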
The paper models a conditional monotonic distribution as a marginalization over latent variables. The key idea is to modify the monotonic modeling problem into modeling an element-wise cumulative distribution function (over a cost variable C). To actually model the latter, the authors introduce a latent variable modeling problem and solve it via a standard importance-weighted likelihood estimate with or without an additional ELBO term. The paper presents improved results in two experiments over a variety of baselines.
Strengths
- Clean reformulation of the monotonic problem into a classification over a latent variable.
- Well-motivated techniques to solve the problem.
- Comparison against a variety of baselines.
Weaknesses
No confidence intervals in the results tables. It's not clear that the improvement achieved by GCM is large enough (even over the simple MLP) to justify the much more complicated modeling procedure.
Questions
- How wide are the confidence intervals for each of the metrics? Just do standard bootstrap if possible and report please.
- Why introduce the extra prior pi(z)? Also, why is the prior different from p(z|x)?
Thank you for the reviews and valuable suggestions. We have revised our paper as you requested, and here are our replies (updated on 2024-11-29).
# Weakness 1 & Question 1: No confidence intervals in the results tables.
We added 95% confidence intervals to the experimental results; the intervals are estimated by repeating each experiment 10 times. The results with confidence intervals can be found in Tables 1 & 3 in Section 5 and in the tables in Appendix A, C, and E.
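For transparency, a 95% interval from n = 10 repeated runs can be computed as below (a hypothetical helper for illustration, not the exact script we used):

```python
import numpy as np
from scipy import stats

# Hypothetical helper: 95% confidence interval of a metric over n repeated runs,
# using the Student-t quantile with n - 1 degrees of freedom.
def confidence_interval(scores, level=0.95):
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    mean = scores.mean()
    sem = scores.std(ddof=1) / np.sqrt(n)                 # standard error of the mean
    half_width = stats.t.ppf(0.5 + level / 2.0, df=n - 1) * sem
    return mean, half_width                               # reported as mean +/- half_width

# Usage with placeholder numbers (illustration only, not results from the paper):
runs = 0.88 + 0.002 * np.random.randn(10)
mean, hw = confidence_interval(runs)
print(f"{mean:.4f} +/- {hw:.4f}")
```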
# Weakness 2: It's not clear that the improvement achieved by GCM is large enough.
With the confidence intervals shown, the improvement of GCM over the baselines, including the MLP, is significant. The detailed experimental results are shown in Tables 1 & 3 in Section 5 and in Appendix E. As past studies on monotonic networks (e.g., papers 1 & 2) show, the improvements in AUC and ACC on the Adult and COMPAS datasets are not numerically large, due to limitations of the datasets (dataset size, random noise, etc.). In fact, a 0.003 improvement in the AUC metric is considered large in various scenarios.
We also added two more sets of experiments, on the Diabetes and BlogFeedback datasets; GCM-VI is still the best method on all datasets, and the results are reported in Appendix E.
Moreover, the MLP is not a monotonic model; we use it to show what happens when a neural network has no monotonic constraint. Given the difficulty of training a strictly monotonic neural network (for example, all weight matrices must be positive in the classic Min-Max monotonic network, see paper 3; layer normalization cannot be applied because it is not monotonic; scalability is limited, see paper 4; etc.), it is already an achievement for a monotonic model to tie with the unconstrained MLP while preserving the advantage of strictly monotonic predictions. In fact, not all monotonic models can beat the MLP, due to the rigorous monotonic constraints, as shown in Tables 1 & 3. However, in all sets of experiments the GCM model surpasses the MLP and the baseline models, as shown in Tables 1 & 3, which demonstrates the capability of our model.
paper 1: https://arxiv.org/pdf/2306.01147
paper 2: https://arxiv.org/pdf/2011.10219
paper 3: https://proceedings.neurips.cc/paper/1997/file/83adc9225e4deb67d7ce42d58fe5157c-Paper.pdf
paper 4: https://openreview.net/pdf?id=DjIsNDEOYX
# Question 2: Why introduce the extra prior pi(z)? Also, why is the prior different from p(z|x)?
In the original paper, we take the sampling distribution to be the same as the prior. In the latest revision, we remove this sampling distribution and replace the importance-sampling estimator of the evidence with the exact estimator for a categorical latent variable, which is equivalent to importance sampling when the latent variable is categorical and all of its values are enumerated. In the new revision, we also provide an estimator of the evidence for a Gaussian latent variable using the reparameterization trick instead of importance sampling, avoiding the high-variance issue. The experimental results in Section 5.1 are updated using the Gaussian latent variable. We also compare the performance of Gaussian and categorical latent variables in Appendix C.2, showing that they achieve similar performance on the four public datasets.
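A minimal sketch of the reparameterized evidence estimate described above, with all module names and shapes being illustrative assumptions rather than the paper's code:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: Gaussian latent z = mu(x) + sigma(x) * eps, eps ~ N(0, I).
# Gradients flow through mu and sigma (reparameterization trick), and sampling
# adapts to x instead of coming from a fixed proposal distribution.
class GaussianLatent(nn.Module):
    def __init__(self, x_dim, z_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))

    def sample(self, x, n_samples):
        mu, log_std = self.net(x).chunk(2, dim=-1)        # [B, z_dim] each
        eps = torch.randn(n_samples, *mu.shape)
        return mu + log_std.exp() * eps                   # [S, B, z_dim], differentiable

def evidence_estimate(latent, likelihood_fn, x, n_samples=16):
    z = latent.sample(x, n_samples)                       # reparameterized draws
    return likelihood_fn(z, x).mean(dim=0)                # Monte Carlo average over S
```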
Dear reviewers, thank you for the comprehensive reviews and valuable suggestions. We have submitted the final revision, and here are the modifications we made compared to the original paper.
1. We revised the latent distribution and simplified the loss functions of GCM and GCM-VI.
The latent distribution used in the original paper is categorical, and to handle it we used importance sampling to estimate the evidence. However, if we traverse all possible values of the categorical latent variable, the estimator is equivalent to the law of total probability, which gives the exact evidence. In this case, we can directly optimize the evidence, which yields the loss function of GCM for categorical latent variables.
For complex latent variables, for example those drawn from a Gaussian distribution, we cannot use the law of total probability (the integral over the latent variable is intractable) or importance sampling (high variance). We instead follow the classical reparameterization trick and reformulate the GCM loss accordingly. To obtain a better estimate of the latent variable, we adopt a recognition model; similar to IWAE, the variational version of GCM (GCM-VI) uses an importance-weighted loss function.
The three revised loss functions are simpler than in the original paper: we remove the hyperparameters alpha and beta used in the original linear combination, so there are fewer hyperparameters and the loss no longer involves a redundant combination of the log-likelihood and the ELBO.
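To make the structure of the revised objectives concrete, here is a hypothetical sketch of an IWAE-style GCM-VI loss with a recognition model q(z | x); the function and argument names are illustrative, not our exact formulation:

```python
import math
import torch

# Hypothetical IWAE-style objective: draw S samples from the recognition model
# q(z | x), weight each by p(y, z | x, r) / q(z | x), and take the log of the
# averaged weights, computed stably in log space.
def gcm_vi_loss(log_lik, log_prior, log_q):
    """All inputs have shape [S, B] and are evaluated at the same z samples:
    log_lik = log p(y | z, x, r), log_prior = log p(z | x), log_q = log q(z | x)."""
    log_w = log_lik + log_prior - log_q                   # log importance weights
    n_samples = log_w.shape[0]
    log_evidence = torch.logsumexp(log_w, dim=0) - math.log(n_samples)
    return -log_evidence.mean()                           # minimize the negative bound
```

With a single sample (S = 1) this bound reduces to the standard ELBO, which is the relationship between IWAE and the ELBO raised in the first review; with more samples the bound tightens, so no separate ELBO term is needed.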
We performed an ablation study on these three loss functions in Appendix C.2, and the results are as follows.
| Method | Adult ACC | COMPAS ACC | Diabetes ACC | BlogFeedback RMSE |
|---|---|---|---|---|
| GCM-Categorical | 0.8850±0.0013 | 0.6983±0.0010 | 0.8443±0.0003 | 0.0988±0.0010 |
| GCM-Gaussian | 0.8854±0.0013 | 0.6991±0.0011 | 0.8441±0.0001 | 0.0994±0.0003 |
| GCM-Gaussian-VI | 0.8858±0.0014 | 0.7011±0.0011 | 0.8442±0.0002 | 0.1005±0.0004 |
We can see that GCM-VI and GCM-Categorical perform the best; this is consistent with their objectives, since GCM-Categorical is trained with the exact evidence and GCM-VI provides a better estimate of the latent variable than the original GCM.
2. We replaced the card-gambling experiments with the quantile regression experiment.
The original rules of the card-gambling experiment were hard to fully understand. Therefore, we adopt the simpler task of quantile regression, based on simulations, in Section 5.1. The model predicts a given quantile of the target, and the objective function follows classical quantile regression. Since the quantile is by definition monotonic with respect to the quantile level, the predicted value should also be monotonic with respect to it, so this task is a natural test of monotonic methods. In our experiment, the training examples are generated by a simulation whose details are given in Section 5.1.
The test results are shown in Figure 3 and Table 1 in Section 5.1, showing that GCM achieves the best performance among all methods. The MAE metrics are as follows.
| Method | | | | | |
|---|---|---|---|---|---|
| MLP | 0.1495±0.0340 | 0.1157±0.0283 | 0.1057±0.0255 | 0.1230±0.0309 | 0.1477±0.0386 |
| MM | 0.2002±0.0572 | 0.1103±0.0320 | 0.0723±0.0245 | 0.1067±0.0346 | 0.1745±0.0495 |
| SMM | 0.2345±0.0693 | 0.1194±0.0356 | 0.0812±0.0246 | 0.1236±0.0366 | 0.1919±0.0556 |
| CMNN | 0.1768±0.0340 | 0.1119±0.0174 | 0.0823±0.0161 | 0.1007±0.0198 | 0.1480±0.0332 |
| Hint | 0.1402±0.0285 | 0.1137±0.0263 | 0.1068±0.0292 | 0.1154±0.0368 | 0.1316±0.0374 |
| PWL | 0.1793±0.0282 | 0.1476±0.0164 | 0.1394±0.0193 | 0.1524±0.0216 | 0.1698±0.0207 |
| GCM | 0.0984±0.0188 | 0.0777±0.0119 | 0.0669±0.0096 | 0.0759±0.0127 | 0.0991±0.0211 |
3. We renewed the experiments on the public datasets.
We added two public datasets, so we now have four public datasets in the experiments:
| dataset | total examples | dimension of x | dimension of r | target |
|---|---|---|---|---|
| Adult | 48,842 | 33 | 4 | classification |
| COMPAS | 7,214 | 9 | 4 | classification |
| Diabetes | 253,680 | 105 | 4 | classification |
| BlogFeedback | 52,397 | 272 | 8 | regression |
They are tested in Section 5.2; GCM-VI achieves the best performance on Adult, COMPAS, and Diabetes, while GCM performs best on BlogFeedback.
| Method | Adult ACC | COMPAS ACC | Diabetes ACC | BlogFeedback RMSE |
|---|---|---|---|---|
| MLP | 0.8837±0.0012 | 0.6955±0.0008 | 0.8431±0.0004 | 0.1042±0.0004 |
| MM | 0.8836±0.0010 | 0.6949±0.0021 | 0.8409±0.0008 | 0.1100±0.0018 |
| SMM | 0.8837±0.0011 | 0.6955±0.0020 | 0.8401±0.0013 | 0.1114±0.0008 |
| CMNN | 0.8832±0.0013 | 0.6997±0.0011 | 0.8393±0.0015 | 0.1118±0.0005 |
| Hint | 0.8846±0.0011 | 0.6861±0.0024 | 0.8407±0.0005 | 0.1118±0.0013 |
| PWL | 0.8835±0.0012 | 0.6960±0.0013 | 0.8417±0.0003 | 0.1069±0.0006 |
| GCM | 0.8854±0.0013 | 0.6991±0.0011 | 0.8441±0.0001 | 0.0994±0.0003 |
| GCM-VI | 0.8858±0.0014 | 0.7011±0.0011 | 0.8442±0.0002 | 0.1005±0.0004 |
4. We uploaded the code.
Our code is uploaded to https://github.com/iclr-2025-4464/GCM; you are welcome to try it, and feel free to give us advice.
Finally, we sincerely appreciate your reviews and look forward to hearing from you.
The paper addresses the problem of conditional monotonic distribution modeling by representing it as a marginalization over latent variables. The central idea is to reformulate the monotonic modeling challenge into modeling an element-wise cumulative distribution function (CDF) with respect to a cost variable C. To achieve this, the authors introduce a latent variable modeling framework, which is solved using standard importance-weighted likelihood estimation, optionally incorporating an ELBO term. The original paper has many issues and errors. For example, the basic conditional independence assumption was found to be unnecessary during the rebuttal period. These major changes were not fully verified in the rebuttal phase, and it was hard for the reviewers to fully review the new version again. I do not think the current version is ready for acceptance.
Additional Comments on Reviewer Discussion
None of the reviewers engaged in discussions with the authors, likely because many of the concerns raised by the reviewers stem from errors made by the authors. While the authors attempted to address these issues in the revision, the extent of the changes is too substantial for a second review. Additionally, the significance of the results remains unconvincing. For instance, the authors argue that a 0.003 improvement in the AUC metric is highly significant in various scenarios, a claim that appears questionable.
Reject